Priority Queues

Agenda

  1. Motives
  2. Naive implementation
  3. Heaps
    • Mechanics
    • Implementation
    • Run-time Analysis
  4. Heapsort

1. Motives

Prior to stacks & queues, the sequential data structures we implemented imposed an observable total ordering on all their elements, which were also individually accessible (e.g., by index).

Stacks & queues restrict access to elements (to fixed insertion and deletion points), thereby simplifying their implementation. They don't, however, alter the order of the inserted elements.

Data structures that impose a total ordering are useful — e.g., one that maintains all elements in sorted order at all times might come in handy — but their design and implementation are necessarily somewhat complicated. We'll get to them, but before that ...

Is there a middle ground? I.e., is there a place for a data structure that restricts access to its elements, yet maintains an implied (though not necessarily total) ordering?

Priority Queue: like a queue, a priority queue has a restricted API, but each element has an implicit "priority", such that the element with the highest ("max") priority is always dequeued first, regardless of the order in which it was enqueued.

2. Naive implementation

In [141]:
class PriorityQueue:
    def __init__(self):
        self.data = []
        
    def add(self, x): #add to the list sorted by value; here Priority = value of x; What is the run time complexity?
        for i in range(len(self.data)):
            if self.data[i] > x:
                self.data.insert(i, x)
                break
        else:
            self.data.append(x)
    
    def max(self): # return the element with max value, i.e., the last element; What is the run time complexity?
        assert(self) #defined in def __bool__(self):
        return self.data[-1]

    def pop_max(self): # return the element with max value, and remove the element; What is the run time complexity?
        assert(self) 
        rv = self.data[-1]
        del self.data[-1]
        return rv
    
    def __bool__(self):
        return len(self.data) > 0

    def __len__(self):
        return len(self.data)

    def __repr__(self):
        return repr(self.data)
In [142]:
pq = PriorityQueue()
In [143]:
for x in [5, 9, 8, 2, 12, 20]:
    pq.add(x)
pq
Out[143]:
[2, 5, 8, 9, 12, 20]
In [144]:
pq.add(10)
pq
Out[144]:
[2, 5, 8, 9, 10, 12, 20]
In [145]:
pq.max()
Out[145]:
20
In [146]:
pq.pop_max()
Out[146]:
20
In [147]:
pq
Out[147]:
[2, 5, 8, 9, 10, 12]
In [148]:
pq.pop_max()
Out[148]:
12
In [149]:
pq
Out[149]:
[2, 5, 8, 9, 10]
In [150]:
import random
for _ in range(10):
    pq.add(random.randrange(100))
In [151]:
pq
Out[151]:
[2, 5, 8, 9, 10, 13, 21, 46, 56, 65, 67, 74, 84, 87, 87]
In [152]:
while pq:
    print(pq.pop_max())
87
87
84
74
67
65
56
46
21
13
10
9
8
5
2
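
On the complexity questions posed in the comments above: add scans the list and then shifts elements to make room, so it is O(N), while max and pop_max only touch the last element and are O(1). As a rough sketch, the linear scan can be replaced by a binary search using the standard-library bisect module. The helper name sorted_add is made up for illustration. Note that finding the slot becomes O(log N), but inserting into a Python list still shifts elements, so the overall cost of an add stays O(N).

```python
import bisect

def sorted_add(data, x):
    """Insert x into the already-sorted list data, keeping it sorted."""
    # insort binary-searches for the slot, then does a regular list insert.
    bisect.insort(data, x)

data = []
for x in [5, 9, 8, 2, 12, 20]:
    sorted_add(data, x)
print(data)      # sorted ascending, so the max is the last element
print(data[-1])
```

This is why the sorted-list approach cannot get below linear time per insertion, and why a different layout is worth looking for.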

3. Heaps

Mechanics

In an ordered, linear structure, inserting an element changes the positions of all of its successors, which include all elements at higher indices (positions).

Reframing the problem: how can we reduce the number of successors of elements as we move through them? (Consider analogy: we don't think of all the organisms in the world as belonging to one gigantic, linear list! How do we reduce the number we have to consider when thinking about certain characteristics?)

Use a hierarchical structure! A tree.

Since we only need to access the max value, and not the intermediate values, we will try a heap stored in an array.

Heap property (for a MAX heap): value at parent >= value at children.

Consequences: it is easy to find the max value (it is at the root), but difficult to locate other values in the heap.
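
To make the array layout concrete, here is a small sketch assuming the usual convention that the root sits at index 0 and the children of index i sit at 2i+1 and 2i+2. The function names (parent, left, right, is_max_heap) are chosen for illustration only.

```python
def parent(idx):
    return (idx - 1) // 2

def left(idx):
    return 2 * idx + 1

def right(idx):
    return 2 * idx + 2

def is_max_heap(data):
    """Return True if every non-root element is <= its parent."""
    return all(data[parent(i)] >= data[i] for i in range(1, len(data)))

print(is_max_heap([12, 8, 9, 6, 4, 2]))   # parents dominate their children
print(is_max_heap([2, 5, 8, 9, 10, 12]))  # ascending order violates the property
```

Note that the first list is a valid max-heap without being sorted: siblings (8 and 9) are in no particular order relative to each other, which is the "not necessarily total" ordering in action.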

Implementation

In [153]:
class Heap:
    def __init__(self):
        self.data = []

    def add(self, x):
        pass
    
    def max(self):
        pass

    def pop_max(self):
        pass
    
    def __bool__(self):
        return len(self.data) > 0

    def __len__(self):
        return len(self.data)

    def __repr__(self):
        return repr(self.data)
In [154]:
class Heap:
    def __init__(self):
        self.data = []

    @staticmethod
    def _parent(idx):
        return (idx-1)//2
        
    @staticmethod
    def _left(idx):
        return idx*2+1

    @staticmethod
    def _right(idx):
        return idx*2+2
    
    
    def add(self, x):
        self.data.append(x) #initially, add to the end
        idx = len(self.data) - 1 # get the index of the newly added node
        pidx = Heap._parent(idx) # get the index of its parent node 

        while idx > 0 and self.data[pidx] < self.data[idx]:
            self.data[pidx], self.data[idx] = self.data[idx], self.data[pidx] #swap 
            idx = pidx
            pidx = Heap._parent(idx)
 
    def max(self):
        return self.data[0]

    def pop_max(self):
        ret = self.data[0]
        self.data[0] = self.data[-1] # overwrite the root (the max) with the last node
        del self.data[-1] # delete the last node, which is now duplicated at the root
                          # these two steps may have broken the heap property
            
        self._heapify()  # fix the heap
        return ret
    
    def _heapify(self, idx=0):
        max_idx = idx # start from the given index (the root by default)
        
        #note that left and right subtrees are heaps
        while idx < len(self.data):
            lidx = Heap._left(idx)
            ridx = Heap._right(idx)
            
            #compare left child with current_node (i.e., parent)
            if lidx < len(self.data) and self.data[lidx] > self.data[idx]:
                max_idx = lidx
            
            #compare right child with max(parent, left child)
            if ridx < len(self.data) and self.data[ridx] > self.data[max_idx]:
                max_idx = ridx
            
            if max_idx != idx: # the max is not in the parent
                self.data[max_idx], self.data[idx] = self.data[idx], self.data[max_idx]
                idx = max_idx
            
            else: # the node is already in the right position
                break

                
    #self-reading, recursive version
    def _heapify_rec(self, idx=0):
        lidx = Heap._left(idx)
        ridx = Heap._right(idx)
        maxidx = idx
        if lidx < len(self.data) and self.data[lidx] > self.data[idx]:
            maxidx = lidx
        if ridx < len(self.data) and self.data[ridx] > self.data[maxidx]:
            maxidx = ridx
        if maxidx != idx:
            self.data[idx], self.data[maxidx] = self.data[maxidx], self.data[idx]
            self._heapify_rec(maxidx) # recurse into the subtree that received the smaller value
    
    
    def __bool__(self):
        return len(self.data) > 0

    def __len__(self):
        return len(self.data)

    def __repr__(self):
        return repr(self.data)
In [155]:
h = Heap()
In [156]:
#testing add()

for x in (8, 6 ,2, 9, 4, 12):
    h.add(x)
In [125]:
h.data
Out[125]:
[12, 8, 9, 6, 4, 2]
In [126]:
h.add(10)
In [127]:
h.data
Out[127]:
[12, 8, 10, 6, 4, 2, 9]
In [128]:
#testing max() and pop_max()
h.max()
Out[128]:
12
In [129]:
h.pop_max()
Out[129]:
12
In [130]:
h.data
Out[130]:
[10, 8, 9, 6, 4, 2]
In [133]:
import random
for _ in range(10):
    h.add(random.randrange(100))
In [134]:
h 
Out[134]:
[99, 89, 40, 67, 42, 25, 31, 13, 25, 7]
In [135]:
while h:
    print(h.pop_max())
99
89
67
42
40
31
25
25
13
7

Run-time Analysis
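
Both add and pop_max move a single element along one path between the root and a leaf, so their cost is bounded by the number of levels in the heap, which is about log2(N) for N elements. As a quick sanity check of that bound, the sketch below counts the parent hops from the last index of an N-element heap up to the root. The function name levels_to_root is an illustrative assumption, not part of the Heap class.

```python
import math

def levels_to_root(n):
    """Number of parent hops from the last index of an n-element heap to the root."""
    idx = n - 1
    hops = 0
    while idx > 0:
        idx = (idx - 1) // 2  # same parent formula as the array layout
        hops += 1
    return hops

for n in (1, 7, 8, 1000, 10**6):
    # The hop count should agree with floor(log2(n)).
    print(n, levels_to_root(n), math.floor(math.log2(n)))
```

By contrast, max only reads index 0 and is O(1).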

4. Heapsort

In [136]:
def heapsort(iterable): # O(NlogN)
    heap = Heap()
    for x in iterable: #do this N times
        heap.add(x) # O(logN)
    
    result = [] # named result to avoid shadowing the built-in sorted()
    
    while heap: # do this N times
        result.append(heap.pop_max()) # O(logN)

    result.reverse() # O(N); pop_max yields values in descending order
    return result
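
The version above needs O(N) extra space for the heap and the output list. A common variation sorts a list in place: first arrange the list as a max-heap, then repeatedly swap the root to the end of a shrinking heap region. The sketch below is one way to write that variation. The names heapsort_inplace and _sift_down are illustrative and do not belong to the Heap class above.

```python
def _sift_down(data, idx, size):
    """Restore the max-heap property below idx, considering data[:size] only."""
    while True:
        lidx, ridx = 2 * idx + 1, 2 * idx + 2
        max_idx = idx
        if lidx < size and data[lidx] > data[max_idx]:
            max_idx = lidx
        if ridx < size and data[ridx] > data[max_idx]:
            max_idx = ridx
        if max_idx == idx:  # parent already dominates both children
            return
        data[idx], data[max_idx] = data[max_idx], data[idx]
        idx = max_idx

def heapsort_inplace(data):
    """Sort the list data in ascending order, in place."""
    n = len(data)
    # Build a max-heap bottom-up, starting from the last node that has a child.
    for idx in range(n // 2 - 1, -1, -1):
        _sift_down(data, idx, n)
    # Move the current max to the end, then shrink the heap region by one.
    for end in range(n - 1, 0, -1):
        data[0], data[end] = data[end], data[0]
        _sift_down(data, 0, end)

values = [8, 6, 2, 9, 4, 12, 10]
heapsort_inplace(values)
print(values)
```

Because each max lands directly in its final position at the end of the list, no reverse step is needed here.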
    
In [137]:
import random

def pairs(iterable):
    it = iter(iterable)
    try:
        a = next(it)
    except StopIteration:
        return # empty input: there are no pairs to yield
    for b in it: # the for loop ends cleanly when the iterator is exhausted
        yield a,b
        a = b

lst = heapsort(random.random() for _ in range(1000))
all((a <= b) for a, b in pairs(lst))
Out[137]:
True
In [138]:
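import timeit
def time_heapsort(n):
    # rlst must be a list: a generator would be exhausted by the first run,
    # so any later repetition would time sorting an empty input.
    return timeit.timeit('heapsort(rlst)',
                         'from __main__ import heapsort; '
                         'import random; '
                         'rlst = [random.random() for _ in range({})]'.format(n),
                         number=1) # a single run, which sorts all n items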
In [139]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

ns = np.linspace(100, 10000, 50, dtype=np.int_)
plt.plot(ns, [time_heapsort(n) for n in ns], 'r+')
plt.show()
In [140]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

ns = np.linspace(100, 10000, 50, dtype=np.int_)
plt.plot(ns, [time_heapsort(n) for n in ns], 'r+')
plt.plot(ns, ns*np.log2(ns)*0.01/10000, 'b') # O(n log n) plot
plt.show()