“Hashing is a
technique used for storing and retrieving information as fast as possible. It
is perhaps optimal the search and is useful in implementing symbol tables”
Understanding Hashing
In simple words one can treat
array as a hash table. The common
operations on hash table (ADT) are, create hash table, searches the key in hash
table, insert into hash table, delete a key from hash table and delete the
complete hash table. To understand the hashing in more depth, let’s consider an
example.
Problem Scenario: - Given an algorithm to get the first repeated
character if there are duplicate elements in it.
Solution: - The simple and brute force way of solving it is: given
a string, for each character check weather that character is repeated character
or not. Time complexity if this program is O (n2) with O (1) space complexity.
Now, let’s find a better solution for that. What is we remember the previous
characters in some array?
Assuming ASCII characters only,
we know the possible number of character is 256. So, let’s create an array of
sixe 256 and initialize it with all zeros. Now, for each of the input character
goto the corresponding position and increment its count. Since we are using an
array, it will take constant time reaching any location. while scanning if
the character whose count is already 1 then we can say that the character is
the one which is repeating first time.
char firstRepeatedChar(char [] str){
int count[256]; //additional
array
for(int i=0;i<256 i="" o:p="">256>
count[i] = 0;
for(int i=0; i
if(count[str[i]]
== 1){
return
str[i];
} else {
count[str[i]]++;
}
}
return 0; // no repeated
character found
}
Following question: what if the given array has numbers instead of
characters, then how do we solve the problem?
Solution: In this case the number of possible values is infinite
(or at least very big). Creating that big array, and storing the count is not
possible. That means there is a set of universal keys and limited locations in
the memory. Of we want to solve the problem, we need to somehow map all these
possible keys to the possible memory locations. As a result, using a simple
array is not the correct choice for solving the problem, so this process of mapping the keys to
locations is called Hashing. To resolve this problem, we can use Hashtable,
which will easily solve our problem (do not worry about the hashtable, I will
cover the details in following section).
Components of Hashing
- - Hash Table
- - Hash functions
- - Collision
- - Collision Resolution Techniques
Hash table is generalization of
Arrays, in array we use direct addressing, which is applicable when we can
afford to allocate an array with one position for every possible key. So, if we
have less locations and more possible keys then simple array implementation is
not enough. In these cases the opinion is to use hash tables. Hash table stores
the keys and their associated values. Hash table uses the hash function to may
keys to their associated values.
The hash function is used to
transform the key into the index, ideally the hash function should map each
possible key to a unique index, but it is difficult to achieve in practice. An
efficient hash function should be designed so that it distributes the index
values of inserted objects uniformly across the table. It should minimize
collision, easy and quick to compute, and have a high load factor for a given
set of keys.
Load factor
= Number of elements in hash table / Hash table size
Load factor is a decision
parameter used when we want to rehash or expand the existing hash table
entries.
Hash function are used to map
each key to different address space but practically it is not possible to
create such a hash function and the problem is called Collision. Collision is the condition where two records are stored
in same hash location.
The process of finding an
alternate location is called Collision
Resolution. Hashing are more efficient in many cases comparative to all other
data structures like search tree. From the number of collision resolution
techniques, the most popular are Open
Addressing & Chaining.
Direct Chaining (An array of
linked list applications):
- - Separate Chaining
Open Addressing (Array based implementation):
- - Linear probing (linear search),
- - Quadratic probing (nonlinear search)
- - Double Hashing (use two hash functions)
Separate Chaining: -
When two or more records hash to
the same location, then records are constituted into a single linked list called
Chain
In this method each bucket is
independent. The time for hash table operations is the time to find the bucket
(which is constant) plus the time for list operation. In a good hash table,
each bucket has zero or one entries, sometimes two or three, but rarely more
than that.
Linear Probing: -
In open
addressing all keys are stored in the hash table itself. This approach is also
known as closed hashing. This procedure is based on probing. A collision is resolved by probing. Interval between probes
is fixed at 1. Here we search the hash table sequentially starting from the
original hash location. If the location is occupied, we check the next
location. We wrap around from last table location to first table location if
necessary.
rehash(key) = (n+1)%tablesize
One of the
problem with linear probing is that table items tend to cluster together in the
hash table. This caused as a group of
consecutive occupied locations that
are called clustering. Thus one part
of table might be quite dense, even though another part has few items, this
results in decrease of overall efficiency.
Quadratic Probing: -
Clustering
problem can be eliminated if we use quadratic probing method. In quadratic
probing, we start from the original hash location i. If a location is occupied,
we check the locations i+1*1, I + 2*2, i+3*3, I+4*4 … We wrap around from the
last table location to the first table location if necessary. The function for
the rehashing is the following:
rehash(key) =
(n+k*k)%tablesize
Algorithm to
insert key in hash table:
1. Get the key k
2. Set counter j = 0
3. Compute hash
function h[k] = k % SIZE
4. If
hashtable[h[k]] is empty
(4.1) Insert
key k at hashtable[h[k]]
(4.2) Stop
Else
(4.3) The key
space at hashtable[h[k]] is occupied, so we need to find the next available key
space
(4.4)
Increment j
(4.5) Compute
new hash function h[k] = ( k + j * j ) % SIZE
(4.6) Repeat
Step 4 till j is equal to the SIZE of hash table
5. The hash table is
full
6. Stop
Double Hashing:
Interval between probes is
computed by another hash function. Double hashing reduces clustering in a
better way. The increments for the probing sequence are computed by second hash
function. The second hash function h2 should be:
h2(key) != 0 and h2 != h1
Reference: