      1 """Heap queue algorithm (a.k.a. priority queue).
      2 
      3 Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for
      4 all k, counting elements from 0.  For the sake of comparison,
      5 non-existing elements are considered to be infinite.  The interesting
      6 property of a heap is that a[0] is always its smallest element.
      7 
      8 Usage:
      9 
     10 heap = []            # creates an empty heap
     11 heappush(heap, item) # pushes a new item on the heap
     12 item = heappop(heap) # pops the smallest item from the heap
     13 item = heap[0]       # smallest item on the heap without popping it
     14 heapify(x)           # transforms list into a heap, in-place, in linear time
     15 item = heapreplace(heap, item) # pops and returns smallest item, and adds
     16                                # new item; the heap size is unchanged
     17 
     18 Our API differs from textbook heap algorithms as follows:
     19 
     20 - We use 0-based indexing.  This makes the relationship between the
     21   index for a node and the indexes for its children slightly less
     22   obvious, but is more suitable since Python uses 0-based indexing.
     23 
     24 - Our heappop() method returns the smallest item, not the largest.
     25 
     26 These two make it possible to view the heap as a regular Python list
     27 without surprises: heap[0] is the smallest item, and heap.sort()
     28 maintains the heap invariant!
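
For example, pushing a few values and then popping them all returns
them smallest-first:

>>> heap = []
>>> for value in [5, 3, 8, 1]:
...     heappush(heap, value)
>>> heap[0]
1
>>> [heappop(heap) for _ in range(len(heap))]
[1, 3, 5, 8]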
     29 """
     30 
     31 # Original code by Kevin O'Connor, augmented by Tim Peters and Raymond Hettinger
     32 
     33 __about__ = """Heap queues
     34 
     35 [explanation by Franois Pinard]
     36 
     37 Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for
     38 all k, counting elements from 0.  For the sake of comparison,
     39 non-existing elements are considered to be infinite.  The interesting
     40 property of a heap is that a[0] is always its smallest element.
     41 
The strange invariant above is meant to be an efficient memory
representation for a tournament.  The numbers below are `k', not a[k]:

                                   0

                  1                                 2

          3               4                5               6

      7       8       9       10      11      12      13      14

    15 16   17 18   19 20   21 22   23 24   25 26   27 28   29 30


In the tree above, each cell `k' is topping `2*k+1' and `2*k+2'.  In
a usual binary tournament we see in sports, each cell is the winner
over the two cells it tops, and we can trace the winner down the tree
to see all opponents s/he had.  However, in many computer applications
of such tournaments, we do not need to trace the history of a winner.
To be more memory efficient, when a winner is promoted, we try to
replace it by something else at a lower level, and the rule becomes
that a cell and the two cells it tops contain three different items,
but the top cell "wins" over the two topped cells.

If this heap invariant is protected at all times, index 0 is clearly
the overall winner.  The simplest algorithmic way to remove it and
find the "next" winner is to move some loser (let's say cell 30 in the
diagram above) into the 0 position, and then percolate this new 0 down
the tree, exchanging values, until the invariant is re-established.
This is clearly logarithmic on the total number of items in the tree.
By iterating over all items, you get an O(n ln n) sort.
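
A minimal sketch of that sort, written with this module's functions
(illustrative only, not part of the module's API):

    def heapsort(iterable):
        heap = list(iterable)
        heapify(heap)              # O(n) bottom-up heap construction
        return [heappop(heap) for _ in range(len(heap))]   # n pops, O(log n) each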

A nice feature of this sort is that you can efficiently insert new
items while the sort is going on, provided that the inserted items are
not "better" than the last 0'th element you extracted.  This is
especially useful in simulation contexts, where the tree holds all
incoming events, and the "win" condition means the smallest scheduled
time.  When an event schedules other events for execution, they are
scheduled into the future, so they can easily go into the heap.  So, a
heap is a good structure for implementing schedulers (this is what I
used for my MIDI sequencer :-).
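
A minimal sketch of that idea (the event names are only illustrative):

    pending = []                        # heap of (time, event) pairs
    heappush(pending, (10.0, 'stop'))
    heappush(pending, (2.5, 'start'))
    when, event = heappop(pending)      # (2.5, 'start'): the next event due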

Various structures for implementing schedulers have been extensively
studied, and heaps are good for this, as they are reasonably speedy,
the speed is almost constant, and the worst case is not much different
than the average case.  However, there are other representations which
are more efficient overall, yet the worst cases might be terrible.

Heaps are also very useful in big disk sorts.  You most probably all
know that a big sort implies producing "runs" (which are pre-sorted
sequences, whose size is usually related to the amount of CPU memory),
followed by merging passes for these runs; the merging is often
very cleverly organised[1].  It is very important that the initial
sort produces the longest runs possible.  Tournaments are a good way
to achieve that.  If, using all the memory available to hold a
tournament, you replace and percolate items that happen to fit the
current run, you'll produce runs which are twice the size of the
memory for random input, and much better for input fuzzily ordered.

Moreover, if you output the 0'th item on disk and get an input which
may not fit in the current tournament (because the value "wins" over
the last output value), it cannot fit in the heap, so the size of the
heap decreases.  The freed memory could be cleverly reused immediately
for progressively building a second heap, which grows at exactly the
same rate the first heap is melting.  When the first heap completely
vanishes, you switch heaps and start a new run.  Clever and quite
effective!
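
A rough sketch of that "replacement selection" idea, built from this
module's functions (a simplified variant that sets losers aside in a
plain list rather than a second heap; names are only illustrative):

    from itertools import islice

    def sorted_runs(records, buffer_size):
        it = iter(records)
        heap = list(islice(it, buffer_size))
        heapify(heap)
        run, held_back = [], []
        while heap:
            smallest = heappop(heap)
            run.append(smallest)
            for item in islice(it, 1):       # read at most one new input
                if smallest <= item:
                    heappush(heap, item)     # still fits in the current run
                else:
                    held_back.append(item)   # must wait for the next run
            if not heap:                     # current run is finished
                yield run
                run, heap, held_back = [], held_back, []
                heapify(heap)
        if run:
            yield run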

In a word, heaps are useful memory structures to know.  I use them in
a few applications, and I think it is good to keep a `heap' module
around. :-)

--------------------
[1] The disk balancing algorithms which are current, nowadays, are
more annoying than clever, and this is a consequence of the seeking
capabilities of the disks.  On devices which cannot seek, like big
tape drives, the story was quite different, and one had to be very
clever to ensure (far in advance) that each tape movement would be the
most effective possible (that is, would best advance the merge).  Some
tapes were even able to read backwards, and this was also used to
avoid the rewinding time.  Believe me, really good tape sorts were
quite spectacular to watch!  From all times, sorting has always been a
Great Art! :-)
"""

__all__ = ['heappush', 'heappop', 'heapify', 'heapreplace', 'merge',
           'nlargest', 'nsmallest', 'heappushpop']

def heappush(heap, item):
    """Push item onto heap, maintaining the heap invariant."""
    heap.append(item)
    _siftdown(heap, 0, len(heap)-1)

def heappop(heap):
    """Pop the smallest item off the heap, maintaining the heap invariant."""
    lastelt = heap.pop()    # raises appropriate IndexError if heap is empty
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        _siftup(heap, 0)
        return returnitem
    return lastelt

def heapreplace(heap, item):
    """Pop and return the current smallest value, and add the new item.

    This is more efficient than heappop() followed by heappush(), and can be
    more appropriate when using a fixed-size heap.  Note that the value
    returned may be larger than item!  That constrains reasonable uses of
    this routine unless written as part of a conditional replacement:

        if item > heap[0]:
            item = heapreplace(heap, item)
    """
    returnitem = heap[0]    # raises appropriate IndexError if heap is empty
    heap[0] = item
    _siftup(heap, 0)
    return returnitem

def heappushpop(heap, item):
    """Fast version of a heappush followed by a heappop."""
    if heap and heap[0] < item:
        item, heap[0] = heap[0], item
        _siftup(heap, 0)
    return item

def heapify(x):
    """Transform list into a heap, in-place, in O(len(x)) time."""
    n = len(x)
    # Transform bottom-up.  The largest index there's any point to looking at
    # is the largest with a child index in-range, so must have 2*i + 1 < n,
    # or i < (n-1)/2.  If n is even = 2*j, this is (2*j-1)/2 = j-1/2 so
    # j-1 is the largest, which is n//2 - 1.  If n is odd = 2*j+1, this is
    # (2*j+1-1)/2 = j so j-1 is the largest, and that's again n//2-1.
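    # For example, n == 10 gives parents at indices 0..4 (index 4 has child
    # index 9), and n == 9 gives parents at indices 0..3.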
    for i in reversed(range(n//2)):
        _siftup(x, i)

def _heappop_max(heap):
    """Maxheap version of a heappop."""
    lastelt = heap.pop()    # raises appropriate IndexError if heap is empty
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        _siftup_max(heap, 0)
        return returnitem
    return lastelt

def _heapreplace_max(heap, item):
    """Maxheap version of a heappop followed by a heappush."""
    returnitem = heap[0]    # raises appropriate IndexError if heap is empty
    heap[0] = item
    _siftup_max(heap, 0)
    return returnitem

def _heapify_max(x):
    """Transform list into a maxheap, in-place, in O(len(x)) time."""
    n = len(x)
    for i in reversed(range(n//2)):
        _siftup_max(x, i)

# 'heap' is a heap at all indices >= startpos, except possibly for pos.  pos
# is the index of a leaf with a possibly out-of-order value.  Restore the
# heap invariant.
def _siftdown(heap, startpos, pos):
    newitem = heap[pos]
    # Follow the path to the root, moving parents down until finding a place
    # newitem fits.
    while pos > startpos:
        parentpos = (pos - 1) >> 1
        parent = heap[parentpos]
        if newitem < parent:
            heap[pos] = parent
            pos = parentpos
            continue
        break
    heap[pos] = newitem

# The child indices of heap index pos are already heaps, and we want to make
# a heap at index pos too.  We do this by bubbling the smaller child of
# pos up (and so on with that child's children, etc) until hitting a leaf,
# then using _siftdown to move the oddball originally at index pos into place.
#
# We *could* break out of the loop as soon as we find a pos where newitem <=
# both its children, but it turns out that's not a good idea, even though
# many books write the algorithm that way.  During a heap pop, the last array
# element is sifted in, and that tends to be large, so that comparing it
# against values starting from the root usually doesn't pay (= usually doesn't
# get us out of the loop early).  See Knuth, Volume 3, where this is
# explained and quantified in an exercise.
#
# Cutting the # of comparisons is important, since these routines have no
# way to extract "the priority" from an array element, so that intelligence
# is likely to be hiding in custom comparison methods, or in array elements
# storing (priority, record) tuples.  Comparisons are thus potentially
# expensive.
#
# On random arrays of length 1000, making this change cut the number of
# comparisons made by heapify() a little, and those made by exhaustive
# heappop() a lot, in accord with theory.  Here are typical results from 3
# runs (3 just to demonstrate how small the variance is):
#
# Compares needed by heapify     Compares needed by 1000 heappops
# --------------------------     --------------------------------
# 1837 cut to 1663               14996 cut to 8680
# 1855 cut to 1659               14966 cut to 8678
# 1847 cut to 1660               15024 cut to 8703
#
# Building the heap by using heappush() 1000 times instead required
# 2198, 2148, and 2219 compares:  heapify() is more efficient, when
# you can use it.
#
# The total compares needed by list.sort() on the same lists were 8627,
# 8627, and 8632 (this should be compared to the sum of heapify() and
# heappop() compares):  list.sort() is (unsurprisingly!) more efficient
# for sorting.

def _siftup(heap, pos):
    endpos = len(heap)
    startpos = pos
    newitem = heap[pos]
    # Bubble up the smaller child until hitting a leaf.
    childpos = 2*pos + 1    # leftmost child position
    while childpos < endpos:
        # Set childpos to index of smaller child.
        rightpos = childpos + 1
        if rightpos < endpos and not heap[childpos] < heap[rightpos]:
            childpos = rightpos
        # Move the smaller child up.
        heap[pos] = heap[childpos]
        pos = childpos
        childpos = 2*pos + 1
    # The leaf at pos is empty now.  Put newitem there, and bubble it up
    # to its final resting place (by sifting its parents down).
    heap[pos] = newitem
    _siftdown(heap, startpos, pos)

def _siftdown_max(heap, startpos, pos):
    'Maxheap variant of _siftdown'
    newitem = heap[pos]
    # Follow the path to the root, moving parents down until finding a place
    # newitem fits.
    while pos > startpos:
        parentpos = (pos - 1) >> 1
        parent = heap[parentpos]
        if parent < newitem:
            heap[pos] = parent
            pos = parentpos
            continue
        break
    heap[pos] = newitem

def _siftup_max(heap, pos):
    'Maxheap variant of _siftup'
    endpos = len(heap)
    startpos = pos
    newitem = heap[pos]
    # Bubble up the larger child until hitting a leaf.
    childpos = 2*pos + 1    # leftmost child position
    while childpos < endpos:
        # Set childpos to index of larger child.
        rightpos = childpos + 1
        if rightpos < endpos and not heap[rightpos] < heap[childpos]:
            childpos = rightpos
        # Move the larger child up.
        heap[pos] = heap[childpos]
        pos = childpos
        childpos = 2*pos + 1
    # The leaf at pos is empty now.  Put newitem there, and bubble it up
    # to its final resting place (by sifting its parents down).
    heap[pos] = newitem
    _siftdown_max(heap, startpos, pos)

def merge(*iterables, key=None, reverse=False):
    '''Merge multiple sorted inputs into a single sorted output.

    Similar to sorted(itertools.chain(*iterables)) but returns a generator,
    does not pull the data into memory all at once, and assumes that each of
    the input streams is already sorted (smallest to largest).

    >>> list(merge([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
    [0, 1, 2, 3, 4, 5, 5, 7, 8, 10, 15, 20, 25]

    If *key* is not None, applies a key function to each element to determine
    its sort order.

    >>> list(merge(['dog', 'horse'], ['cat', 'fish', 'kangaroo'], key=len))
    ['dog', 'cat', 'fish', 'horse', 'kangaroo']
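
    If *reverse* is true, the inputs are expected to be sorted from largest
    to smallest, and the output is merged largest-first:

    >>> list(merge([5, 3, 1], [6, 4, 2], reverse=True))
    [6, 5, 4, 3, 2, 1]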

    '''

    h = []
    h_append = h.append

    if reverse:
        _heapify = _heapify_max
        _heappop = _heappop_max
        _heapreplace = _heapreplace_max
        direction = -1
    else:
        _heapify = heapify
        _heappop = heappop
        _heapreplace = heapreplace
        direction = 1

    if key is None:
        for order, it in enumerate(map(iter, iterables)):
            try:
                next = it.__next__
                h_append([next(), order * direction, next])
            except StopIteration:
                pass
        _heapify(h)
        while len(h) > 1:
            try:
                while True:
                    value, order, next = s = h[0]
                    yield value
                    s[0] = next()           # raises StopIteration when exhausted
                    _heapreplace(h, s)      # restore heap condition
            except StopIteration:
                _heappop(h)                 # remove empty iterator
        if h:
            # fast case when only a single iterator remains
            value, order, next = h[0]
            yield value
            yield from next.__self__
        return

    for order, it in enumerate(map(iter, iterables)):
        try:
            next = it.__next__
            value = next()
            h_append([key(value), order * direction, value, next])
        except StopIteration:
            pass
    _heapify(h)
    while len(h) > 1:
        try:
            while True:
                key_value, order, value, next = s = h[0]
                yield value
                value = next()
                s[0] = key(value)
                s[2] = value
                _heapreplace(h, s)
        except StopIteration:
            _heappop(h)
    if h:
        key_value, order, value, next = h[0]
        yield value
        yield from next.__self__


# Algorithm notes for nlargest() and nsmallest()
# ==============================================
#
# Make a single pass over the data while keeping the k most extreme values
# in a heap.  Memory consumption is limited to keeping k values in a list.
#
# Measured performance for random inputs:
#
#                                   number of comparisons
#    n inputs     k-extreme values  (average of 5 trials)   % more than min()
# -------------   ----------------  ---------------------   -----------------
#      1,000           100                  3,317               231.7%
#     10,000           100                 14,046                40.5%
#    100,000           100                105,749                 5.7%
#  1,000,000           100              1,007,751                 0.8%
# 10,000,000           100             10,009,401                 0.1%
#
# Theoretical number of comparisons for k smallest of n random inputs:
#
# Step   Comparisons                  Action
# ----   --------------------------   ---------------------------
#  1     1.66 * k                     heapify the first k-inputs
#  2     n - k                        compare remaining elements to top of heap
#  3     k * (1 + lg2(k)) * ln(n/k)   replace the topmost value on the heap
#  4     k * lg2(k) - (k/2)           final sort of the k most extreme values
#
# Combining and simplifying for a rough estimate gives:
#
#        comparisons = n + k * (log(k, 2) * log(n/k) + log(k, 2) + log(n/k))
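#
# As a rough sanity check of that estimate (using the natural log for
# log(n/k)), n = 1,000,000 and k = 100 give approximately
#
#        1,000,000 + 100 * (6.64 * 9.21 + 6.64 + 9.21)  ~=  1,007,700
#
# comparisons, in good agreement with the 1,007,751 measured above.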
#
# Computing the number of comparisons for step 3:
# -----------------------------------------------
# * For the i-th new value from the iterable, the probability of being in the
#   k most extreme values is k/i.  For example, the probability of the 101st
#   value seen being in the 100 most extreme values is 100/101.
# * If the value is a new extreme value, the cost of inserting it into the
#   heap is 1 + log(k, 2).
# * The probability times the cost gives:
#            (k/i) * (1 + log(k, 2))
# * Summing across the remaining n-k elements gives:
#            sum((k/i) * (1 + log(k, 2)) for i in range(k+1, n+1))
# * This reduces to:
#            (H(n) - H(k)) * k * (1 + log(k, 2))
# * Where H(n) is the n-th harmonic number estimated by:
#            gamma = 0.5772156649
#            H(n) = log(n, e) + gamma + 1 / (2 * n)
#   http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)#Rate_of_divergence
# * Substituting the H(n) formula:
#            comparisons = k * (1 + log(k, 2)) * (log(n/k, e) + (1/n - 1/k) / 2)
#
# Worst-case for step 3:
# ----------------------
# In the worst case, the input data is reverse-sorted so that every new element
# must be inserted in the heap:
#
#             comparisons = 1.66 * k + log(k, 2) * (n - k)
#
# Alternative Algorithms
# ----------------------
# Other algorithms were not used because they:
# 1) Took much more auxiliary memory,
# 2) Made multiple passes over the data,
# 3) Made more comparisons in common cases (small k, large n, semi-random input).
# See the more detailed comparison of approaches at:
# http://code.activestate.com/recipes/577573-compare-algorithms-for-heapqsmallest

def nsmallest(n, iterable, key=None):
    """Find the n smallest elements in a dataset.

    Equivalent to:  sorted(iterable, key=key)[:n]
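
    For example:

    >>> nsmallest(3, [5, 1, 4, 2, 3])
    [1, 2, 3]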
    """

    # Short-cut for n==1 is to use min()
    if n == 1:
        it = iter(iterable)
        sentinel = object()
        if key is None:
            result = min(it, default=sentinel)
        else:
            result = min(it, default=sentinel, key=key)
        return [] if result is sentinel else [result]

    # When n>=size, it's faster to use sorted()
    try:
        size = len(iterable)
    except (TypeError, AttributeError):
        pass
    else:
        if n >= size:
            return sorted(iterable, key=key)[:n]

    # When key is None, use simpler decoration
    if key is None:
        it = iter(iterable)
        # put the range(n) first so that zip() doesn't
        # consume one too many elements from the iterator
        result = [(elem, i) for i, elem in zip(range(n), it)]
        if not result:
            return result
        _heapify_max(result)
        top = result[0][0]
        order = n
        _heapreplace = _heapreplace_max
        for elem in it:
            if elem < top:
                _heapreplace(result, (elem, order))
                top = result[0][0]
                order += 1
        result.sort()
        return [r[0] for r in result]

    # General case, slowest method
    it = iter(iterable)
    result = [(key(elem), i, elem) for i, elem in zip(range(n), it)]
    if not result:
        return result
    _heapify_max(result)
    top = result[0][0]
    order = n
    _heapreplace = _heapreplace_max
    for elem in it:
        k = key(elem)
        if k < top:
            _heapreplace(result, (k, order, elem))
            top = result[0][0]
            order += 1
    result.sort()
    return [r[2] for r in result]

def nlargest(n, iterable, key=None):
    """Find the n largest elements in a dataset.

    Equivalent to:  sorted(iterable, key=key, reverse=True)[:n]
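
    For example:

    >>> nlargest(3, [5, 1, 4, 2, 3])
    [5, 4, 3]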
    """

    # Short-cut for n==1 is to use max()
    if n == 1:
        it = iter(iterable)
        sentinel = object()
        if key is None:
            result = max(it, default=sentinel)
        else:
            result = max(it, default=sentinel, key=key)
        return [] if result is sentinel else [result]

    # When n>=size, it's faster to use sorted()
    try:
        size = len(iterable)
    except (TypeError, AttributeError):
        pass
    else:
        if n >= size:
            return sorted(iterable, key=key, reverse=True)[:n]

    # When key is None, use simpler decoration
    if key is None:
        it = iter(iterable)
        result = [(elem, i) for i, elem in zip(range(0, -n, -1), it)]
        if not result:
            return result
        heapify(result)
        top = result[0][0]
        order = -n
        _heapreplace = heapreplace
        for elem in it:
            if top < elem:
                _heapreplace(result, (elem, order))
                top = result[0][0]
                order -= 1
        result.sort(reverse=True)
        return [r[0] for r in result]

    # General case, slowest method
    it = iter(iterable)
    result = [(key(elem), i, elem) for i, elem in zip(range(0, -n, -1), it)]
    if not result:
        return result
    heapify(result)
    top = result[0][0]
    order = -n
    _heapreplace = heapreplace
    for elem in it:
        k = key(elem)
        if top < k:
            _heapreplace(result, (k, order, elem))
            top = result[0][0]
            order -= 1
    result.sort(reverse=True)
    return [r[2] for r in result]

# If available, use C implementation
try:
    from _heapq import *
except ImportError:
    pass
try:
    from _heapq import _heapreplace_max
except ImportError:
    pass
try:
    from _heapq import _heapify_max
except ImportError:
    pass
try:
    from _heapq import _heappop_max
except ImportError:
    pass


if __name__ == "__main__":

    import doctest
    print(doctest.testmod())