{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hashtables\n", "\n", "## Agenda\n", "\n", "- Discussion: pros/cons of array-backed and linked structures\n", "- Python's other built-in DS: the `dict`\n", "- A naive lookup DS\n", "- Direct lookups via *Hashing*\n", "- Hashtables\n", "  - Collisions and the \"Birthday problem\"\n", "- Runtime analysis & Discussion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discussion: pros/cons of array-backed and linked structures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Between array-backed and linked lists we have:\n", "\n", "1. $O(1)$ indexing (array-backed)\n", "2. $O(1)$ appending (amortized for array-backed; linked)\n", "3. $O(1)$ insertion/deletion without indexing (linked)\n", "4. $O(\\log N)$ binary search, when sorted (array-backed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python's other built-in DS: the `dict`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import timeit\n", "\n", "def lin_search(lst, x):\n", "    for i in range(len(lst)):\n", "        if lst[i] == x:\n", "            return i\n", "    raise ValueError(x)\n", "\n", "def bin_search(lst, x):\n", "    # assume that lst is sorted!!!\n", "    low = 0\n", "    hi = len(lst) - 1  # last valid index, so lst[mid] never goes out of range\n", "    while low <= hi:\n", "        mid = (low + hi) // 2\n", "        if lst[mid] < x:\n", "            low = mid + 1\n", "        elif lst[mid] > x:\n", "            hi = mid - 1\n", "        else:\n", "            return mid\n", "    raise ValueError(x)\n", "\n", "def time_lin_search(size):\n", "    return timeit.timeit('lin_search(lst, random.randrange({}))'.format(size),  # interpolate size into randrange\n", "                         'import random ; from __main__ import lin_search ;'\n", "                         'lst = [x for x in range({})]'.format(size),  # interpolate size into list range\n", "                         number=100)\n", "\n", "def time_bin_search(size):\n", "    return timeit.timeit('bin_search(lst, random.randrange({}))'.format(size),  # interpolate size 
into randrange\n", "                         'import random ; from __main__ import bin_search ;'\n", "                         'lst = [x for x in range({})]'.format(size),  # interpolate size into list range\n", "                         number=100)\n", "\n", "def time_dict(size):\n", "    return timeit.timeit('dct[random.randrange({})]'.format(size),\n", "                         'import random ; '\n", "                         'dct = {{x: x for x in range({})}}'.format(size),\n", "                         number=100)\n", "\n", "lin_search_timings = [time_lin_search(n)\n", "                      for n in range(10, 10000, 100)]\n", "\n", "bin_search_timings = [time_bin_search(n)\n", "                      for n in range(10, 10000, 100)]\n", "\n", "dict_timings = [time_dict(n)\n", "                for n in range(10, 10000, 100)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.plot(lin_search_timings, 'ro')\n", "plt.plot(bin_search_timings, 'gs')\n", "plt.plot(dict_timings, 'b^')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A naive lookup DS" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class LookupDS:\n", "    # one possible naive fill-in (an assumption): a list of [key, value] pairs, searched linearly\n", "    def __init__(self):\n", "        self.data = []\n", "\n", "    def __setitem__(self, key, value):\n", "        for pair in self.data:\n", "            if pair[0] == key:\n", "                pair[1] = value  # existing key: overwrite its value\n", "                return\n", "        self.data.append([key, value])\n", "\n", "    def __getitem__(self, key):\n", "        for k, v in self.data:\n", "            if k == key:\n", "                return v\n", "        raise KeyError(key)\n", "\n", "    def __contains__(self, key):\n", "        return any(k == key for k, _ in self.data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "l = LookupDS()\n", "l['batman'] = 'bruce wayne'\n", "l['superman'] = 'clark kent'\n", "l['spiderman'] = 'peter parker'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Direct lookups via *Hashing*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hashes (a.k.a. hash codes or hash values) are simply numerical values computed for objects." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hash('hello')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hash('batman')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hash('batmen')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[hash(s) for s in ['different', 'objects', 'have', 'very', 'different', 'hashes']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[i%100 for i in range(10, 1000, 40)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[hash(s)%100 for s in ['different', 'objects', 'have', 'very', 'different', 'hashes']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hashtables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Hashtable:\n", "    def __init__(self, n_buckets=1000):\n", "        self.buckets = [None] * n_buckets\n", "\n", "    def __setitem__(self, key, val):\n", "        # one possible fill-in (an assumption): store the pair directly; a colliding key overwrites\n", "        bucket_idx = hash(key) % len(self.buckets)\n", "        self.buckets[bucket_idx] = (key, val)\n", "\n", "    def __getitem__(self, key):\n", "        bucket_idx = hash(key) % len(self.buckets)\n", "        entry = self.buckets[bucket_idx]\n", "        if entry is None or entry[0] != key:\n", "            raise KeyError(key)\n", "        return entry[1]\n", "\n", "    def __contains__(self, key):\n", "        try:\n", "            _ = self[key]\n", "            return True\n", "        except KeyError:\n", "            return False" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ht = Hashtable(10)\n", "ht['batman'] = 'bruce wayne'\n", "ht['superman'] = 'clark kent'\n", "ht['spiderman'] = 'peter parker'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## On Collisions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The \"Birthday Problem\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Problem statement: Given $N$ people at a party, how likely is it that at least two people will have the same birthday?" 
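, "\n", "\n", "A sketch of the usual calculation, assuming 365 equally likely birthdays that are independent between people (both are simplifying assumptions): it is easier to compute the probability that all $N$ birthdays are distinct and subtract it from 1.\n", "\n", "$$P(\\text{shared birthday}) = 1 - \\prod_{k=0}^{N-1} \\frac{365 - k}{365}$$\n", "\n", "For $N = 2$ this gives $1 - \\frac{365}{365} \\cdot \\frac{364}{365} = \\frac{1}{365} \\approx 0.0027$."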
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def birthday_p(n_people):\n", "    p_inv = 1\n", "    for n in range(365, 365-n_people, -1):\n", "        p_inv *= n / 365\n", "    return 1 - p_inv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "birthday_p(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "1-364/365" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "n_people = range(1, 80)\n", "plt.plot(n_people, [birthday_p(n) for n in n_people])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### General collision statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Repeat the birthday problem, but with a given number of values and \"buckets\" that are allotted to hold them. How likely is it that two or more values will map to the same bucket?" 
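, "\n", "\n", "A sketch of the generalization, assuming each value is equally likely to land in any bucket, with $n$ values and $b$ buckets (the symbols are chosen here for illustration):\n", "\n", "$$P(\\text{collision}) = 1 - \\prod_{k=0}^{n-1} \\frac{b - k}{b}$$\n", "\n", "Setting $b = 365$ recovers the birthday problem."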
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def collision_p(n_values, n_buckets):\n", "    p_inv = 1\n", "    for n in range(n_buckets, n_buckets-n_values, -1):\n", "        p_inv *= n / n_buckets\n", "    return 1 - p_inv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collision_p(23, 365) # same as birthday problem, for 23 people" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collision_p(10, 100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collision_p(100, 1000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# keeping number of values fixed at 100, but vary number of buckets: visualize probability of collision\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "n_buckets = range(100, 100001, 1000)\n", "plt.plot(n_buckets, [collision_p(100, nb) for nb in n_buckets])\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def avg_num_collisions(n, b):\n", "    \"\"\"Returns the expected number of collisions for n values uniformly distributed\n", "    over a hashtable of b buckets. 
Based on (fairly) elementary probability theory.\n", "    (Pay attention in MATH 474!)\"\"\"\n", "    return n - b + b * (1 - 1/b)**n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "avg_num_collisions(28, 365)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "avg_num_collisions(1000, 1000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "avg_num_collisions(1000, 10000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dealing with Collisions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To deal with collisions in a hashtable, we simply create a \"chain\" of key/value pairs for each bucket where collisions occur. The chain needs to be a data structure that supports quick insertion; a natural choice is the linked list!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Hashtable:\n", "    class Node:\n", "        def __init__(self, key, val, next=None):\n", "            self.key = key\n", "            self.val = val\n", "            self.next = next\n", "\n", "    def __init__(self, n_buckets=1000):\n", "        self.buckets = [None] * n_buckets\n", "\n", "    def __setitem__(self, key, val):\n", "        # one possible fill-in (an assumption): prepend to the bucket's chain;\n", "        # this does not update an existing key (see \"Other APIs\")\n", "        bucket_idx = hash(key) % len(self.buckets)\n", "        self.buckets[bucket_idx] = Hashtable.Node(key, val, self.buckets[bucket_idx])\n", "\n", "    def __getitem__(self, key):\n", "        bucket_idx = hash(key) % len(self.buckets)\n", "        n = self.buckets[bucket_idx]\n", "        while n:  # walk the chain until the key is found\n", "            if n.key == key:\n", "                return n.val\n", "            n = n.next\n", "        raise KeyError(key)\n", "\n", "    def __contains__(self, key):\n", "        try:\n", "            _ = self[key]\n", "            return True\n", "        except KeyError:\n", "            return False" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ht = Hashtable(1)\n", "ht['batman'] = 'bruce wayne'\n", "ht['superman'] = 'clark kent'\n", "ht['spiderman'] = 'peter parker'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def prep_ht(size):\n", "    ht = Hashtable(size*10)\n", "    for x in range(size):\n", "        ht[x] = x\n", "    return ht\n", "\n", "def time_ht(size):\n", "    return timeit.timeit('ht[random.randrange({})]'.format(size),\n", "                         'import random ; from __main__ import prep_ht ;'\n", "                         'ht = prep_ht({})'.format(size),\n", "                         number=100)\n", "\n", "ht_timings = [time_ht(n)\n", "              for n in range(10, 10000, 100)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.plot(bin_search_timings, 'ro')\n", "plt.plot(ht_timings, 'gs')\n", "plt.plot(dict_timings, 'b^')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loose ends" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Iteration" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Hashtable(Hashtable):\n", "    def __iter__(self):\n", "        for n in self.buckets:  # one possible fill-in: walk each bucket's chain in bucket order\n", "            while n:\n", "                yield n.key\n", "                n = n.next" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ht = Hashtable(1)\n", "ht['batman'] = 'bruce wayne'\n", "ht['superman'] = 'clark kent'\n", "ht['spiderman'] = 'peter parker'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for k in ht:\n", "    print(k)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Key ordering" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ht = Hashtable()\n", "d = {}\n", "for x in 'apple banana cat dog elephant'.split():\n", "    d[x[0]] = x\n", "    ht[x[0]] = x" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for k in d:\n", "    print(k, '=>', d[k])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for k in ht:\n", "    print(k, '=>', ht[k])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### \"Load factor\" and Rehashing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It doesn't often make sense to start with a 
large number of buckets, unless we know in advance that the number of keys is going to be vast — also, the user of the hashtable would typically prefer to not be bothered with implementation details (i.e., bucket count) when using the data structure.\n", "\n", "Instead: start with a relatively small number of buckets, and if the ratio of keys to the number of buckets (known as the **load factor**) is above some desired threshold — which we can determine using collision probabilities — we can dynamically increase the number of buckets. This requires, however, that we *rehash* all keys and potentially move them into new buckets (since the `hash(key) % num_buckets` mapping will likely be different with more buckets)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other APIs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- FIXED `__setitem__` (to update value for existing key)\n", "- `__delitem__`\n", "- `keys` & `values` (return iterators for keys and values)\n", "- `setdefault`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Runtime analysis & Discussion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a hashtable with $N$ key/value entries:\n", "\n", "- Insertion: $O(?)$\n", "- Lookup: $O(?)$\n", "- Deletion: $O(?)$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vocabulary list\n", "\n", "- hashtable\n", "- hashing and hashes\n", "- collision\n", "- hash buckets & chains\n", "- birthday problem\n", "- load factor\n", "- rehashing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Addendum: On *Hashability*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember: *a given object must always hash to the same value*. 
This is required so that we can always map the object to the same hash bucket.\n", "\n", "Hashcodes for collections of objects are usually computed from the hashcodes of their contents, e.g., the hash of a tuple is a function of the hashes of the objects in said tuple:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hash(('two', 'strings'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is useful. It allows us to use a tuple, for instance, as a key for a hashtable.\n", "\n", "However, if the collection of objects is *mutable* (i.e., we can alter its contents), we can potentially change its hashcode.\n", "\n", "If we were to use such a collection as a key in a hashtable, and alter the collection after it's been assigned to a particular bucket, this leads to a serious problem: the collection may now be in the wrong bucket (as it was assigned to a bucket based on its original hashcode)!\n", "\n", "For this reason, only immutable types are, by default, hashable in Python. So while we can use integers, strings, and tuples as keys in dictionaries, lists (which are mutable) cannot be used. Indeed, Python marks built-in mutable types as \"unhashable\", e.g.," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hash([1, 2, 3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That said, Python does support hashing on instances of custom classes (which are mutable). This is because the default hash function implementation does not rely on the contents of instances of custom classes. 
E.g.," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Student:\n", "    def __init__(self, fname, lname):\n", "        self.fname = fname\n", "        self.lname = lname" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = Student('John', 'Doe')\n", "hash(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.fname = 'Jane'\n", "hash(s) # same as before mutation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can change the default behavior by providing our own hash function in `__hash__`, e.g.," ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Student:\n", "    def __init__(self, fname, lname):\n", "        self.fname = fname\n", "        self.lname = lname\n", "\n", "    def __hash__(self):\n", "        # hash the fields as a tuple so that field order matters\n", "        return hash((self.fname, self.lname))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = Student('John', 'Doe')\n", "hash(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.fname = 'Jane'\n", "hash(s) # differs from before: the hash now depends on the fields" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But be careful: instances of this class are no longer suitable for use as keys in hashtables (or dictionaries) if you intend to mutate them after using them as keys!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 1 }