Channi Studies

[Data Structure] Hash Table 5 - Common Hash Functions 본문

Data Science/Data Structure & Algorithm

[Data Structure] Hash Table 5 - Common Hash Functions

Chan Lee 2025. 4. 10. 04:01

A hash function is fast if the hash function can minimize collisions. 

 

perfect hash fucntion maps items to buckets with no collisions.

It can be created if the number of items and all possible item keys are known beforehand.

The runtime for insert, serach and remove is O(1) with it. 

 

good hash function should uniformly distribute items into buckets.

With chaining, a good hash function results in short bucket lists and thus fast operations. 

With lienar probing, a good hash function will avoid hashing multiple items to consecutive buckets and thus minimize the average linear probing length to achieve fast opeartions. 

On average, a good hash function will achieve O(1) inserts, seraches, and removes, but in the worst-case may require O(N).

 

A hash function's performance depends on the hash table size and knowledge of the expected keys. 

Ex. The hash function key % 10 will perform poorly if the expected keys are all multiples of 10, becuase they will all collide at bucket 0. 

 

 

Let's take a look into some of the common hash functions used. 


Modulo Hash Function

modulo hash uses the raminder from division of the key by hash table size N

HashRemainder(int Key) {
    return key % N
}

 

 

Mid-square Hash Function

mid-square hash squares the key, extracts R digits from the result's middle, and returns the remainder of the middle digits divided by hash table size N. 

Ex. For a hash table with 100 entries and a key of 453, the decimal (base 10) mid-square hash function computes 453 * 453 = 205209, and returns the middle two digits 52. 

For N buckets, R must be greater than equal to [log₁₀ N] to index all buckets. 

The process of squaring and extracting middle digits reduces the likelihood of keys mapping to just a few buckets. 

 

1) For a decimal mid-square hash function, what are the middle digits for key = 40, N = 100, and R = 2?
⇒ 40 * 40 = 1600 → Taking 2 middle digits = 60 
2) For a decimal mid-square hash function, what is the bucket index for key = 110, N = 200, and R = 3?

⇒ 110 * 110 = 12100 → Taking 3 middle digits = 210 → 210 % 200 = 10

3) For a decimal mid-square hash function, what is the bucket index for key = 112, N = 1000, and R = 3?

⇒ 112 * 112 = 12544 → Taking 3 middle digits = 254

 

Mid-square hash function base 2 implementation

However, the mid-square hash function is typically implemented using binary (base 2), and not decimal (base 10). 

A decimal implementation requries converting the square of the key to a string, extracting a substring for the middle digits, and converting that substring to an integer. Meanwhile a binary implementation only requires a few shift and bitwise AND oeprations.

A binary mid-square hash function extracts the middle R bits, and returns the remainder of the middle bits divided by hash table size N, where R is greater than equal to [log₂ N].

 

 

Multiplicative String hash function

multiplicative string hash repeatedly multiples the hash value and adds the ASCII (or Unicode) value of each character in the string. 

A multiplicative hash function for string strats with a large initial value. 

For each caracter, the hash function multiplies the current hash value by a multiplier (often prime) and adds the character's value. 

Finally, the function returns the remainder of the sum divided by the hash table size N.

HashMultiplicative(string key) {
    stringHash = InitialValue
    
    for (each character strChar in key) {
        stringHash = (stringHash * HashMultiplier) + strChar
    }
    
    return stringHash % N
}

 

Daniel J. Bernstein creted a popular version of a multiplicative string hash function that uses an initial value of 5381 and a multiplier of 33. Bernstein's hash function performs well for hashing short English strings.

1) Initial Value = 0, Hash Multiplier = 1. String = BAT, Size = 1000

⇒ A hash multiplier of 1 will simply add the characters' ASCCI values

→ stringHash = initialValue + B + A + T = 0 + 66 + 65 + 84 = 215

2) Initial Value = 0, Hash Multiplier = 1. String = TAB, Size = 1000

→ stringHash = initialValue + T + A + B = 0 + 84 + 65 + 66 = 215 

3) Initial Value = 17, Hash Multiplier = 3. String = WE, Size = 1000

→ stringHash = initialValue = 17

→ stringHash = 17 * 3 + W = 51 + 87 = 138

→ stringHash = 138 * 3  +69 = 483

 

 

Direct Hash Function

direct hash function uses the item's key as the bucket index. (Ex. If the key is 937, the index is also 937.)

A hash table with a direct hash function is called a direct access table.

Given a key, a direct access table serach algorithm returns the item at index key if the bucket is not empty, and returns null if empty. 

 

Limitations of Direct Hashing

A direct access table has the advantage of no collisions: Each key is unique (by definition of a key), and each gets a unique bucket, so no collisions can occur. However, a direct access table has two main limitations:

1. All keys must be non-negative integers, but for some applications keys may be negative. 

2. The hash table's size equals the largest key value plus 1, which may be very large.