Friday, October 10, 2014

Hashtable (Introduction to Hashing)

“Hashing is a technique used for storing and retrieving information as fast as possible. It is perhaps optimal the search and is useful in implementing symbol tables”

Understanding Hashing

In simple words one can treat array as a hash table.  The common operations on hash table (ADT) are, create hash table, searches the key in hash table, insert into hash table, delete a key from hash table and delete the complete hash table. To understand the hashing in more depth, let’s consider an example.

Problem Scenario: - Given an algorithm to get the first repeated character if there are duplicate elements in it.
Solution: - The simple and brute force way of solving it is: given a string, for each character check weather that character is repeated character or not. Time complexity if this program is O (n2) with O (1) space complexity. Now, let’s find a better solution for that. What is we remember the previous characters in some array?
Assuming ASCII characters only, we know the possible number of character is 256. So, let’s create an array of sixe 256 and initialize it with all zeros. Now, for each of the input character goto the corresponding position and increment its count. Since we are using an array, it will take constant time reaching any location.­­ while scanning if the character whose count is already 1 then we can say that the character is the one which is repeating first time.

char firstRepeatedChar(char [] str){               
  int count[256]; //additional array
                for(int i=0;i<256 i="" o:p="">
                                count[i] = 0;
                               
                for(int i=0; i
                                if(count[str[i]] == 1){
                                                return str[i];
                                } else {
                                                count[str[i]]++;
                                }
                }
                return 0; // no repeated character found
}


Following question: what if the given array has numbers instead of characters, then how do we solve the problem?
Solution: In this case the number of possible values is infinite (or at least very big). Creating that big array, and storing the count is not possible. That means there is a set of universal keys and limited locations in the memory. Of we want to solve the problem, we need to somehow map all these possible keys to the possible memory locations. As a result, using a simple array is not the correct choice for solving the problem, so this process of mapping the keys to locations is called Hashing. To resolve this problem, we can use Hashtable, which will easily solve our problem (do not worry about the hashtable, I will cover the details in following section).

Components of Hashing

  • -          Hash Table
  • -          Hash functions
  • -          Collision
  • -          Collision Resolution Techniques


Hash table is generalization of Arrays, in array we use direct addressing, which is applicable when we can afford to allocate an array with one position for every possible key. So, if we have less locations and more possible keys then simple array implementation is not enough. In these cases the opinion is to use hash tables. Hash table stores the keys and their associated values. Hash table uses the hash function to may keys to their associated values.

The hash function is used to transform the key into the index, ideally the hash function should map each possible key to a unique index, but it is difficult to achieve in practice. An efficient hash function should be designed so that it distributes the index values of inserted objects uniformly across the table. It should minimize collision, easy and quick to compute, and have a high load factor for a given set of keys.

Load factor = Number of elements in hash table / Hash table size

Load factor is a decision parameter used when we want to rehash or expand the existing hash table entries.

Hash function are used to map each key to different address space but practically it is not possible to create such a hash function and the problem is called Collision. Collision is the condition where two records are stored in same hash location.

The process of finding an alternate location is called Collision Resolution. Hashing are more efficient in many cases comparative to all other data structures like search tree. From the number of collision resolution techniques, the most popular are Open Addressing & Chaining.

Direct Chaining (An array of linked list applications):
  • -          Separate Chaining

Open Addressing (Array based implementation):
  • -          Linear probing (linear search),
  • -          Quadratic probing (nonlinear search)
  • -          Double Hashing (use two hash functions)

Separate Chaining: -

When two or more records hash to the same location, then records are constituted into a single linked list called Chain




In this method each bucket is independent. The time for hash table operations is the time to find the bucket (which is constant) plus the time for list operation. In a good hash table, each bucket has zero or one entries, sometimes two or three, but rarely more than that.

Linear Probing: -

In open addressing all keys are stored in the hash table itself. This approach is also known as closed hashing. This procedure is based on probing. A collision is resolved by probing. Interval between probes is fixed at 1. Here we search the hash table sequentially starting from the original hash location. If the location is occupied, we check the next location. We wrap around from last table location to first table location if necessary.
rehash(key) = (n+1)%tablesize



One of the problem with linear probing is that table items tend to cluster together in the hash table. This caused as a group of consecutive occupied locations that are called clustering. Thus one part of table might be quite dense, even though another part has few items, this results in decrease of overall efficiency.

Quadratic Probing: -

Clustering problem can be eliminated if we use quadratic probing method. In quadratic probing, we start from the original hash location i. If a location is occupied, we check the locations i+1*1, I + 2*2, i+3*3, I+4*4 … We wrap around from the last table location to the first table location if necessary. The function for the rehashing is the following:

rehash(key) = (n+k*k)%tablesize

Algorithm to insert key in hash table:
1. Get the key k
 2. Set counter j = 0
 3. Compute hash function h[k] = k % SIZE
 4. If hashtable[h[k]] is empty
         (4.1) Insert key k at hashtable[h[k]]
         (4.2) Stop
    Else
        (4.3) The key space at hashtable[h[k]] is occupied, so we need to find the next available key space
        (4.4) Increment j
        (4.5) Compute new hash function h[k] = ( k + j * j ) % SIZE
        (4.6) Repeat Step 4 till j is equal to the SIZE of hash table
 5. The hash table is full
 6. Stop


Double Hashing:

Interval between probes is computed by another hash function. Double hashing reduces clustering in a better way. The increments for the probing sequence are computed by second hash function. The second hash function h2 should be:

h2(key) != 0 and h2 != h1



Reference: