Hashtables¶

Agenda¶

Discussion: pros/cons of array-backed and linked structures
Python's other built-in DS: the dict
A naive lookup DS
Direct lookups via Hashing
Hashtables
- Collisions and the "Birthday problem"
Runtime analysis & Discussion

Discussion: pros/cons of array-backed and linked structures¶

Between the array-backed and linked list we have:

$O(1)$ indexing (array-backed)
$O(1)$ appending (array-backed, amortized; and linked)
$O(1)$ insertion/deletion without indexing (linked)
$O(\log N)$ binary search, when sorted (array-backed)

Python's other built-in DS: the `dict`¶

import timeit

def lin_search(lst, x):
    for i in range(len(lst)):
        if lst[i] == x:
            return i
    raise ValueError(x)
    
def bin_search(lst, x):
    # assumes that lst is sorted!
    low = 0
    hi  = len(lst) - 1  # last valid index (len(lst) itself would be out of range)
    while low <= hi:
        mid = (low + hi) // 2
        if lst[mid] == x:
            return mid
        elif lst[mid] < x:
            low = mid + 1
        else:
            hi  = mid - 1
    raise ValueError(x)

def time_lin_search(size):
    return timeit.timeit('lin_search(lst, random.randrange({}))'.format(size), # interpolate size into randrange
                         'import random ; from __main__ import lin_search ;'
                         'lst = [x for x in range({})]'.format(size), # interpolate size into list range
                         number=100)

def time_bin_search(size):
    return timeit.timeit('bin_search(lst, random.randrange({}))'.format(size), # interpolate size into randrange
                         'import random ; from __main__ import bin_search ;'
                         'lst = [x for x in range({})]'.format(size), # interpolate size into list range
                         number=100)

def time_dict(size):
    return timeit.timeit('dct[random.randrange({})]'.format(size), 
                         'import random ; '
                         'dct = {{x: x for x in range({})}}'.format(size),
                         number=100)

lin_search_timings = [time_lin_search(n)
                      for n in range(10, 10000, 100)]

bin_search_timings = [time_bin_search(n)
                      for n in range(10, 10000, 100)]

dict_timings = [time_dict(n)
                for n in range(10, 10000, 100)]

%matplotlib inline
import matplotlib.pyplot as plt
#plt.plot(lin_search_timings, 'ro')
plt.plot(bin_search_timings, 'gs')
plt.plot(dict_timings, 'b^')
plt.show()

A naive lookup DS¶

class LookupDS:
    def __init__(self):
        self.data = []
    
    def __setitem__(self, key, value):
        pass
    
    def __getitem__(self, key):
        pass

    def __contains__(self, key):
        pass

class LookupDS:
    def __init__(self):
        self.data = []
    
    def __setitem__(self, key, value):
        for i in range(len(self.data)):
            if self.data[i][0] == key:
                self.data[i][1] = value
                return
        else:
            self.data.append([key, value])
    
    def __getitem__(self, key):
        for k, v in self.data:
            if k == key:
                return v
        raise KeyError(key)

    def __contains__(self, key):
        try:
            _ = self[key]  # calls __getitem__; a KeyError means the key is absent
            return True
        except KeyError:
            return False

d = LookupDS()

d['hello'] = 'hola'
d['goodbye'] = 'adios'

d['hello']

'hola'

d['goodbye']

'adios'

d['hello'] = 'bonjour'

d['hello']

'bonjour'

d.data

[['hello', 'bonjour'], ['goodbye', 'adios']]

Direct lookups via Hashing¶

Hashes (a.k.a. hash codes or hash values) are simply numerical values computed for objects.

hash('hello') #the value could be different on your machine

-1298108468397806619

[hash(s) for s in ['different', 'objects', 'have', 'very', 'different', 'hashes']]

[8264025059867943528,
 -909818077496650347,
 6562135653832458469,
 -587347941624417982,
 8264025059867943528,
 5601915235208154100]

hash('aa'), hash('ab')

(-595044047162001848, -2323253008525125523)

5093 % 100

93

5093 // 100

50

for i in range(1,20):
    print(i, '% 6 => ', i%6)

1 % 6 =>  1
2 % 6 =>  2
3 % 6 =>  3
4 % 6 =>  4
5 % 6 =>  5
6 % 6 =>  0
7 % 6 =>  1
8 % 6 =>  2
9 % 6 =>  3
10 % 6 =>  4
11 % 6 =>  5
12 % 6 =>  0
13 % 6 =>  1
14 % 6 =>  2
15 % 6 =>  3
16 % 6 =>  4
17 % 6 =>  5
18 % 6 =>  0
19 % 6 =>  1

-4 % 3

2

-4 // 3

-2

hash('hello') % 85

36
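
The modulo experiments above suggest how an arbitrary (possibly negative) hash can be folded into a valid list index. A minimal sketch of that idea follows; `bucket_index` is a hypothetical helper name used only for illustration.

```python
def bucket_index(key, n_buckets):
    """Map any hashable key to an index in range(n_buckets).

    Python's % takes the sign of the divisor, so the result is
    non-negative even when hash(key) is negative.
    """
    return hash(key) % n_buckets

# Every index falls inside the bucket list, whatever the sign of the hash.
for key in ['hello', -4, 3.14, ('a', 'tuple')]:
    idx = bucket_index(key, 6)
    assert 0 <= idx < 6
    print(repr(key), '=>', idx)
```

The indices printed for strings will likely differ from run to run, since string hashes are not stable across Python sessions.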

Hashtables¶

class Hashtable:
    def __init__(self, n_buckets=1000):
        self.buckets = [None] * n_buckets
        
    def __setitem__(self, key, val):
        pass
    
    def __getitem__(self, key):
        pass
        
    def __contains__(self, key):
        try:
            _ = self[key]
            return True
        except KeyError:
            return False

class Hashtable:
    def __init__(self, n_buckets=1000):
        self.buckets = [None] * n_buckets
        
    def __setitem__(self, key, val):
        bidx = hash(key) % len(self.buckets)
        self.buckets[bidx] = val
    
    def __getitem__(self, key):
        bidx = hash(key) % len(self.buckets)
        if self.buckets[bidx] is not None:  # explicit None check keeps falsy values (0, '') retrievable
            return self.buckets[bidx]
        else:
            raise KeyError(key)
        
    def __contains__(self, key):
        try:
            _ = self[key]
            return True
        except KeyError:
            return False

ht = Hashtable()

ht['hello'] = 'hola'
ht['goodbye'] = 'adios'

ht['hello']

'hola'

ht['goodbye']

'adios'

ht.buckets

[None, ..., None,   # indices 0-380
 'hola',            # index 381
 None, ..., None,   # indices 382-516
 'adios',           # index 517
 None, ..., None]   # indices 518-999

#any problems?

ht = Hashtable(2)

ht['hello'] = 'hola'
ht['byebye'] = 'adios'

ht['hello']

'hola'

ht['byebye']

'adios'

ht['eat'] = 'comer'

ht['hello']

'comer'

ht['eat']

'comer'

ht['byebye']

'adios'

class Hashtable:
    def __init__(self, n_buckets=1000):
        self.buckets = [None] * n_buckets
        
    def __setitem__(self, key, val):
        bidx = hash(key) % len(self.buckets)
        self.buckets[bidx] = [key, val]
    
    def __getitem__(self, key):
        bidx = hash(key) % len(self.buckets)
        if self.buckets[bidx] and self.buckets[bidx][0] == key:
            return self.buckets[bidx][1]
        else:
            raise KeyError
        
    def __contains__(self, key):
        try:
            _ = self[key]
            return True
        except KeyError:
            return False

ht = Hashtable(2)

ht['hello'] = 'hola'
ht['byebye'] = 'adios'

ht['hello']

'hola'

ht['byebye']

'adios'

ht['eat']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-75-f7f3c4c7f167> in <module>()
----> 1 ht['eat']

<ipython-input-68-9b5e5e06061c> in __getitem__(self, key)
     12             return self.buckets[bidx][1]
     13         else:
---> 14             raise KeyError
     15 
     16     def __contains__(self, key):

KeyError:

On Collisions¶

The "Birthday Problem"¶

Problem statement: Given $N$ people at a party, how likely is it that at least two people will have the same birthday?

# Explanation of the "Birthday Paradox"

# What is the probability that at least two of the n people in the room share a birthday?
# P(n) = 1 - 365! / (365^n * (365-n)!)
# because the probability that NO two people share a birthday is
# 365/365 * 364/365 * 363/365 * ... * (365-(n-1))/365
# = [365! / (365-n)!] / 365^n

# Observation: P(n) > 0.5 once n reaches 23

def birthday_p(n_people):
    p_inv = 1
    for n in range(365, 365-n_people, -1):
        p_inv *= n / 365
    return 1 - p_inv

birthday_p(2)

0.002739726027397249

birthday_p(10)

0.11694817771107768

birthday_p(50)

0.9703735795779884

%matplotlib inline
import matplotlib.pyplot as plt

n_people = range(1, 80)
plt.plot(n_people, [birthday_p(n) for n in n_people])
plt.show()

General collision statistics¶

Repeat the birthday problem, but with a given number of values and "buckets" that are allotted to hold them. How likely is it that two or more values will map to the same bucket?

def collision_p(n_values, n_buckets):
    p_inv = 1
    for n in range(n_buckets, n_buckets-n_values, -1):
        p_inv *= n / n_buckets
    return 1 - p_inv

collision_p(23, 365) # same as birthday problem, for 23 people, 365 days

0.5072972343239857

collision_p(10, 100) # number of values = 10,  number of buckets = 100

0.37184349044470544

collision_p(100, 1000) # number of values = 100,  number of buckets = 1,000

0.9940410733677595

collision_p(100, 10000) # number of values = 100,  number of buckets = 10,000

0.3914340350427218
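
As a sanity check on the closed-form product, the same probability can be estimated by simulation. The sketch below is an illustration only: `simulate_collision_p`, its `trials` count, and the fixed `seed` are assumptions introduced here, not part of the original notes.

```python
import random

def simulate_collision_p(n_values, n_buckets, trials=20000, seed=0):
    """Estimate the probability that at least two of n_values uniformly
    random bucket choices coincide, by repeated random trials."""
    rng = random.Random(seed)  # fixed seed so reruns give the same estimate
    hits = 0
    for _ in range(trials):
        chosen = [rng.randrange(n_buckets) for _ in range(n_values)]
        if len(set(chosen)) < n_values:  # a duplicate bucket means a collision
            hits += 1
    return hits / trials

estimate = simulate_collision_p(23, 365)
print(estimate)  # should land close to the exact value of about 0.507
```

With more trials the estimate tightens around the exact value, at the cost of a longer run.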

# keeping number of values fixed at 100, but vary number of buckets: visualize probability of collision
%matplotlib inline
import matplotlib.pyplot as plt

n_buckets = range(100, 100001, 1000)
plt.plot(n_buckets, [collision_p(100, nb) for nb in n_buckets])
plt.show()

def avg_num_collisions(n, b):
    """Returns the expected number of collisions for n values uniformly distributed
    over a hashtable of b buckets. Based on (fairly) elementary probability theory.
    (Pay attention in MATH 474!)"""
    return n - b + b * (1 - 1/b)**n

avg_num_collisions(28, 365) #number of values = 28,  number of buckets = 365

1.011442040700615

avg_num_collisions(1000, 1000)

367.6954247709637

avg_num_collisions(1000, 10000)

48.32893558556316

Dealing with Collisions¶

To deal with collisions in a hashtable, we simply create a "chain" of key/value pairs for each bucket where collisions occur. The chain needs to be a data structure that supports quick insertion — natural choice: the linked list!

class Hashtable:
    class Node:
        def __init__(self, key, val, next=None):
            self.key = key
            self.val = val
            self.next = next
            
    def __init__(self, n_buckets=1000):
        self.buckets = [None] * n_buckets
        
    def __setitem__(self, key, val):
        bucket_idx = hash(key) % len(self.buckets)
        pass

    def __getitem__(self, key):
        bucket_idx = hash(key) % len(self.buckets)
        pass
        
    def __contains__(self, key):
        try:
            _ = self[key]
            return True
        except KeyError:
            return False

class Hashtable:
    class Node:
        def __init__(self, key, val, next=None):
            self.key = key
            self.val = val
            self.next = next
            
    def __init__(self, n_buckets=1000):
        self.buckets = [None] * n_buckets
        
    def __setitem__(self, key, val):
        bucket_idx = hash(key) % len(self.buckets)
        
        # code logic
        # get the node at the bucket_idx [] location, 
        # while not none, walk the list looking for the key, 
            # if found, set the value
        # else insert a new node at start of the list
        
        if not self.buckets[bucket_idx]:
            # chain is empty
            self.buckets[bucket_idx] = Hashtable.Node(key, val)
        else:
            n = self.buckets[bucket_idx]
            while n:
                if n.key == key: #found the key in an existing node
                    n.val = val
                    break
                n = n.next
            else: #prepend the node
                self.buckets[bucket_idx] = Hashtable.Node(key, val, 
                                                          next=self.buckets[bucket_idx])

    def __getitem__(self, key):
        bucket_idx = hash(key) % len(self.buckets)
        
        #code logic
        # get the node at the bucket_idx [] location
        # while not none, walk the list looking for the key
            # if found, return the value
        #else raise KeyError
        
        n = self.buckets[bucket_idx]
        while n:
            if n.key == key:
                return n.val
            n = n.next
        raise KeyError

    def __contains__(self, key):
        try:
            _ = self[key]
            return True
        except KeyError:
            return False

ht = Hashtable(2)

for k, v in (('a', 'apple'), ('b', 'banana'), ('c', 'cat')):
    ht[k] = v

for k in 'abc':
    print(k, '=>', ht[k])

a => apple
b => banana
c => cat

# do we really get O(1) access?

def prep_ht(size):
    ht = Hashtable(size*10)
    for x in range(size):
        ht[x] = x
    return ht

def time_ht(size):
    return timeit.timeit('ht[random.randrange({})]'.format(size), 
                         'import random ; from __main__ import prep_ht ;'
                         'ht = prep_ht({})'.format(size),
                         number=100)
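# How timeit() is used above

# statement being timed:
#     'ht[random.randrange({})]'.format(size)
#     e.g., for size=100: ht[random.randrange(100)], i.e., lookups such as ht[99], ht[10]

# setup code:
#     import random
#     from __main__ import prep_ht
#     'ht = prep_ht({})'.format(size)
#     e.g., for size=100: ht = prep_ht(100)

# number of executions = 100

# order of execution:
#    the setup code runs once
#    the timed statement is then run 100 times
# timeit() returns the total time (in seconds) taken by the 100 runs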

ht_timings = [time_ht(n)
                for n in range(10, 10000, 100)]

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(bin_search_timings, 'ro')
plt.plot(ht_timings, 'gs')
plt.plot(dict_timings, 'b^')
plt.show()

Loose ends¶

Iteration¶

class Hashtable(Hashtable):
    def __iter__(self):
        pass

class Hashtable(Hashtable):
    def __iter__(self):
        # walk the bucket list
            # walk the linked list 
                # yield keys as you go
        
        for b in self.buckets:
            while b:
                yield b.key
                b = b.next

ht = Hashtable(10)

for k, v in (('a', 'apple'), ('b', 'banana'), ('c', 'cat')):
    ht[k] = v

#test iteration
for k in ht:
    print(k, '=>', ht[k])

a => apple
b => banana
c => cat

"Load factor" and Rehashing¶

It doesn't often make sense to start with a large number of buckets, unless we know in advance that the number of keys is going to be vast — also, the user of the hashtable would typically prefer to not be bothered with implementation details (i.e., bucket count) when using the data structure.

Instead: start with a relatively small number of buckets, and if the ratio of keys to the number of buckets (known as the load factor) is above some desired threshold — which we can determine using collision probabilities — we can dynamically increase the number of buckets. This requires, however, that we rehash all keys and potentially move them into new buckets (since the hash(key) % num_buckets mapping will likely be different with more buckets).
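
One way this could look in code is sketched below. It is not the implementation from these notes: the class name `ResizingHashtable`, the `max_load` threshold of 0.75, the doubling policy, and the use of plain Python lists as chains (rather than linked nodes) are all assumptions made to keep the example short.

```python
class ResizingHashtable:
    """Chained hashtable that doubles its bucket count when the load
    factor exceeds max_load. Chains are plain Python lists for brevity."""

    def __init__(self, n_buckets=4, max_load=0.75):
        self.buckets = [[] for _ in range(n_buckets)]
        self.count = 0            # number of keys currently stored
        self.max_load = max_load  # assumed threshold, chosen for illustration

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def __setitem__(self, key, val):
        chain = self._chain(key)
        for pair in chain:
            if pair[0] == key:    # existing key: update in place
                pair[1] = val
                return
        chain.append([key, val])
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._rehash(2 * len(self.buckets))

    def __getitem__(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        raise KeyError(key)

    def _rehash(self, new_size):
        old_pairs = [pair for chain in self.buckets for pair in chain]
        self.buckets = [[] for _ in range(new_size)]
        for pair in old_pairs:    # every key is re-mapped with the new modulus
            self._chain(pair[0]).append(pair)

ht = ResizingHashtable()
for i in range(100):
    ht[i] = i * i
print(len(ht.buckets), ht.count / len(ht.buckets))
```

Each rehash touches every stored key, so it is expensive when it happens, but doubling makes it rare enough that the cost averages out over many insertions.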

Other APIs¶

FIXED __setitem__ (to update value for existing key)
__delitem__
keys & values (return iterators for keys and values)
setdefault
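
As a rough sketch of two of the APIs listed above, the following self-contained class shows one possible `__delitem__` and `setdefault` for a chained table. `ChainedTable` and its `_idx` helper are hypothetical names; the node layout mirrors the `Node` class used earlier, but this is an illustration rather than a reference solution.

```python
class ChainedTable:
    """Minimal chained hashtable, just enough to show deletion."""

    class Node:
        def __init__(self, key, val, next=None):
            self.key, self.val, self.next = key, val, next

    def __init__(self, n_buckets=8):
        self.buckets = [None] * n_buckets

    def _idx(self, key):
        return hash(key) % len(self.buckets)

    def __setitem__(self, key, val):
        i = self._idx(key)
        n = self.buckets[i]
        while n:
            if n.key == key:      # existing key: update the value
                n.val = val
                return
            n = n.next
        self.buckets[i] = ChainedTable.Node(key, val, self.buckets[i])

    def __getitem__(self, key):
        n = self.buckets[self._idx(key)]
        while n:
            if n.key == key:
                return n.val
            n = n.next
        raise KeyError(key)

    def __delitem__(self, key):
        i = self._idx(key)
        prev, n = None, self.buckets[i]
        while n:
            if n.key == key:
                if prev is None:  # deleting the head of the chain
                    self.buckets[i] = n.next
                else:             # unlink a node further down the chain
                    prev.next = n.next
                return
            prev, n = n, n.next
        raise KeyError(key)

    def setdefault(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            self[key] = default
            return default

t = ChainedTable(2)
for k in (0, 2, 4):               # with 2 buckets, these ints share bucket 0
    t[k] = str(k)
del t[2]                          # removes a node from the middle of the chain
print(t[0], t[4], t.setdefault(6, 'six'))
```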

Runtime analysis & Discussion¶

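For a hashtable with $N$ key/value entries, in the worst case (every key collides into a single chain):

Insertion: $O(N)$
Lookup: $O(N)$
Deletion: $O(N)$

With hashes spread uniformly over the buckets and a bounded load factor, the expected cost of each operation is $O(1)$.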
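
The $O(N)$ bound is reached when every key lands in the same chain, whereas well-spread hashes keep chains short. The sketch below contrasts the two situations; `CollidingKey` (a key type with a deliberately constant hash) and `chain_lengths` are hypothetical names introduced only for this demonstration.

```python
class CollidingKey:
    """Hypothetical key type whose hash is constant, forcing every
    instance into the same bucket."""
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return isinstance(other, CollidingKey) and self.name == other.name

def chain_lengths(keys, n_buckets):
    """Return how many keys land in each bucket under hash(key) % n_buckets."""
    lengths = [0] * n_buckets
    for k in keys:
        lengths[hash(k) % n_buckets] += 1
    return lengths

good = chain_lengths(range(1000), 1000)
bad = chain_lengths([CollidingKey(i) for i in range(1000)], 1000)
print(max(good), max(bad))  # prints: 1 1000
```

A lookup has to walk its chain, so the longest chain length is a direct measure of the worst single lookup.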

A few words about the HashTable Lab¶

It is often convenient to be able to iterate over the keys in a hashtable in the order in which they were first inserted. But our implementation makes this impossible because of the unpredictable nature of the hash function.

We can fix this by introducing a separate array-backed list (we call this "entries") to which we append key/value pairs as they are added to the hashtable. To facilitate rapid search, we will continue to use the original buckets + chains structure, but it will merely contain indices of entries in the first list (we call this structure "indices").

The operations will work as follows:

get will use the hashcode of the provided key to locate the appropriate bucket in indices, then walk the linked-list chain attached to that bucket, following each stored index into entries to look for a matching key.
set will conduct the same search as get to see if the provided key is already in entries; if it is, the value will be updated. If the key isn't already in the hashtable, the key/value pair will be appended to entries, and a new node holding its index will be inserted into indices (in the correct bucket).
del will conduct the same search as get; if the key is present, its entries slot will simply be set to None (so that it can be skipped during iteration), and the corresponding node will be deleted from indices.
iteration is easy: just walk through entries (an array-backed list), skipping any slots set to None.
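
The operations described above can be sketched roughly as follows. This is a simplified illustration, not the lab solution: `OrderedTable` and `_find` are hypothetical names, and each bucket's chain is a plain Python list of positions standing in for the linked-list chain the lab calls for.

```python
class OrderedTable:
    def __init__(self, n_buckets=8):
        self.indices = [[] for _ in range(n_buckets)]  # bucket -> positions in entries
        self.entries = []                               # [key, value] in insertion order

    def _find(self, key):
        """Return (bucket, position in entries); position is None if key is absent."""
        bucket = self.indices[hash(key) % len(self.indices)]
        for pos in bucket:
            if self.entries[pos][0] == key:
                return bucket, pos
        return bucket, None

    def __getitem__(self, key):
        _, pos = self._find(key)
        if pos is None:
            raise KeyError(key)
        return self.entries[pos][1]

    def __setitem__(self, key, val):
        bucket, pos = self._find(key)
        if pos is None:
            bucket.append(len(self.entries))  # record where the new entry will live
            self.entries.append([key, val])
        else:
            self.entries[pos][1] = val        # existing key keeps its original position

    def __delitem__(self, key):
        bucket, pos = self._find(key)
        if pos is None:
            raise KeyError(key)
        self.entries[pos] = None   # leave a hole so later positions stay valid
        bucket.remove(pos)

    def __iter__(self):
        for entry in self.entries:
            if entry is not None:  # skip deleted slots
                yield entry[0]

ot = OrderedTable(2)
for k in ['c', 'a', 'b']:
    ot[k] = k.upper()
ot['c'] = 'see'   # updating does not change insertion order
del ot['a']
print(list(ot))   # prints: ['c', 'b']
```

Because deleted slots are only blanked out, entries never shrinks in this sketch; compacting it would require renumbering the positions stored in indices.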

Vocabulary list¶

hashtable
hashing and hashes
collision
hash buckets & chains
birthday problem
load factor
rehashing

Addendum: On Hashability¶

Remember: a given object must always hash to the same value. This is required so that we can always map the object to the same hash bucket.

Hashcodes for collections of objects are usually computed from the hashcodes of its contents, e.g., the hash of a tuple is a function of the hashes of the objects in said tuple:

hash(('two', 'strings'))

This is useful. It allows us to use a tuple, for instance, as a key for a hashtable.

However, if the collection of objects is mutable (i.e., we can alter its contents), this means that we can potentially change its hashcode.

If we were to use such a collection as a key in a hashtable, and alter the collection after it's been assigned to a particular bucket, this leads to a serious problem: the collection may now be in the wrong bucket (as it was assigned to a bucket based on its original hashcode)!

For this reason, Python's built-in immutable types are hashable while its built-in mutable types are not. So while we can use integers, strings, and tuples (of hashable items) as keys in dictionaries, lists (which are mutable) cannot be used. Indeed, Python marks built-in mutable types as "unhashable", e.g.,

hash([1, 2, 3])

That said, Python does support hashing on instances of custom classes (which are mutable). This is because the default hash function implementation does not rely on the contents of instances of custom classes. E.g.,

class Student:
    def __init__(self, fname, lname):
        self.fname = fname
        self.lname = lname

s = Student('John', 'Doe')
hash(s)

s.fname = 'Jane'
hash(s) # same as before mutation

We can change the default behavior by providing our own hash function in __hash__, e.g.,

class Student:
    def __init__(self, fname, lname):
        self.fname = fname
        self.lname = lname
        
    def __hash__(self):
        return hash(self.fname) + hash(self.lname)

s = Student('John', 'Doe')
hash(s)

s.fname = 'Jane'
hash(s)

But be careful: instances of this class are no longer suitable for use as keys in hashtables (or dictionaries), if you intend to mutate them after using them as keys!
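
The danger can be seen directly with a built-in dict. The sketch below uses a hypothetical `Point` class with a simple content-based hash (the formula `31 * x + y` is an arbitrary choice for illustration) and shows a key becoming unreachable after mutation.

```python
class Point:
    """Hypothetical class with a content-based hash (for illustration only)."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)
    def __hash__(self):
        return 31 * self.x + self.y

p = Point(1, 2)
d = {p: 'stored'}
print(p in d)                  # True: the hash still matches where the key was stored

p.x = 5                        # mutating the key changes its hash
print(p in d)                  # False: the lookup now uses the new hash
print(any(k is p for k in d))  # True: the entry is still there, just unreachable by key
```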
