Home | History | Annotate | Download | only in library
      1 :mod:`heapq` --- Heap queue algorithm
      2 =====================================
      3 
      4 .. module:: heapq
      5    :synopsis: Heap queue algorithm (a.k.a. priority queue).
      6 .. moduleauthor:: Kevin O'Connor
      7 .. sectionauthor:: Guido van Rossum <guido (a] python.org>
      8 .. sectionauthor:: Franois Pinard
      9 .. sectionauthor:: Raymond Hettinger
     10 
     11 .. versionadded:: 2.3
     12 
     13 **Source code:** :source:`Lib/heapq.py`
     14 
     15 --------------
     16 
     17 This module provides an implementation of the heap queue algorithm, also known
     18 as the priority queue algorithm.
     19 
     20 Heaps are binary trees for which every parent node has a value less than or
     21 equal to any of its children.  This implementation uses arrays for which
     22 ``heap[k] <= heap[2*k+1]`` and ``heap[k] <= heap[2*k+2]`` for all *k*, counting
     23 elements from zero.  For the sake of comparison, non-existing elements are
     24 considered to be infinite.  The interesting property of a heap is that its
     25 smallest element is always the root, ``heap[0]``.
     26 
     27 The API below differs from textbook heap algorithms in two aspects: (a) We use
     28 zero-based indexing.  This makes the relationship between the index for a node
     29 and the indexes for its children slightly less obvious, but is more suitable
     30 since Python uses zero-based indexing. (b) Our pop method returns the smallest
     31 item, not the largest (called a "min heap" in textbooks; a "max heap" is more
     32 common in texts because of its suitability for in-place sorting).
     33 
     34 These two make it possible to view the heap as a regular Python list without
     35 surprises: ``heap[0]`` is the smallest item, and ``heap.sort()`` maintains the
     36 heap invariant!
     37 
     38 To create a heap, use a list initialized to ``[]``, or you can transform a
     39 populated list into a heap via function :func:`heapify`.
     40 
     41 The following functions are provided:
     42 
     43 
     44 .. function:: heappush(heap, item)
     45 
     46    Push the value *item* onto the *heap*, maintaining the heap invariant.
     47 
     48 
     49 .. function:: heappop(heap)
     50 
     51    Pop and return the smallest item from the *heap*, maintaining the heap
     52    invariant.  If the heap is empty, :exc:`IndexError` is raised.  To access the
     53    smallest item without popping it, use ``heap[0]``.
     54 
     55 .. function:: heappushpop(heap, item)
     56 
     57    Push *item* on the heap, then pop and return the smallest item from the
     58    *heap*.  The combined action runs more efficiently than :func:`heappush`
     59    followed by a separate call to :func:`heappop`.
     60 
     61    .. versionadded:: 2.6
     62 
     63 .. function:: heapify(x)
     64 
     65    Transform list *x* into a heap, in-place, in linear time.
     66 
     67 
     68 .. function:: heapreplace(heap, item)
     69 
     70    Pop and return the smallest item from the *heap*, and also push the new *item*.
     71    The heap size doesn't change. If the heap is empty, :exc:`IndexError` is raised.
     72 
     73    This one step operation is more efficient than a :func:`heappop` followed by
     74    :func:`heappush` and can be more appropriate when using a fixed-size heap.
     75    The pop/push combination always returns an element from the heap and replaces
     76    it with *item*.
     77 
     78    The value returned may be larger than the *item* added.  If that isn't
     79    desired, consider using :func:`heappushpop` instead.  Its push/pop
     80    combination returns the smaller of the two values, leaving the larger value
     81    on the heap.
     82 
     83 
     84 The module also offers three general purpose functions based on heaps.
     85 
     86 
     87 .. function:: merge(*iterables)
     88 
     89    Merge multiple sorted inputs into a single sorted output (for example, merge
     90    timestamped entries from multiple log files).  Returns an :term:`iterator`
     91    over the sorted values.
     92 
     93    Similar to ``sorted(itertools.chain(*iterables))`` but returns an iterable, does
     94    not pull the data into memory all at once, and assumes that each of the input
     95    streams is already sorted (smallest to largest).
     96 
     97    .. versionadded:: 2.6
     98 
     99 
    100 .. function:: nlargest(n, iterable[, key])
    101 
    102    Return a list with the *n* largest elements from the dataset defined by
    103    *iterable*.  *key*, if provided, specifies a function of one argument that is
    104    used to extract a comparison key from each element in the iterable:
    105    ``key=str.lower`` Equivalent to:  ``sorted(iterable, key=key,
    106    reverse=True)[:n]``
    107 
    108    .. versionadded:: 2.4
    109 
    110    .. versionchanged:: 2.5
    111       Added the optional *key* argument.
    112 
    113 
    114 .. function:: nsmallest(n, iterable[, key])
    115 
    116    Return a list with the *n* smallest elements from the dataset defined by
    117    *iterable*.  *key*, if provided, specifies a function of one argument that is
    118    used to extract a comparison key from each element in the iterable:
    119    ``key=str.lower`` Equivalent to:  ``sorted(iterable, key=key)[:n]``
    120 
    121    .. versionadded:: 2.4
    122 
    123    .. versionchanged:: 2.5
    124       Added the optional *key* argument.
    125 
    126 The latter two functions perform best for smaller values of *n*.  For larger
    127 values, it is more efficient to use the :func:`sorted` function.  Also, when
    128 ``n==1``, it is more efficient to use the built-in :func:`min` and :func:`max`
    129 functions.  If repeated usage of these functions is required, consider turning
    130 the iterable into an actual heap.
    131 
    132 
    133 Basic Examples
    134 --------------
    135 
    136 A `heapsort <https://en.wikipedia.org/wiki/Heapsort>`_ can be implemented by
    137 pushing all values onto a heap and then popping off the smallest values one at a
    138 time::
    139 
    140    >>> def heapsort(iterable):
    141    ...     h = []
    142    ...     for value in iterable:
    143    ...         heappush(h, value)
    144    ...     return [heappop(h) for i in range(len(h))]
    145    ...
    146    >>> heapsort([1, 3, 5, 7, 9, 2, 4, 6, 8, 0])
    147    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    148 
    149 This is similar to ``sorted(iterable)``, but unlike :func:`sorted`, this
    150 implementation is not stable.
    151 
    152 Heap elements can be tuples.  This is useful for assigning comparison values
    153 (such as task priorities) alongside the main record being tracked::
    154 
    155     >>> h = []
    156     >>> heappush(h, (5, 'write code'))
    157     >>> heappush(h, (7, 'release product'))
    158     >>> heappush(h, (1, 'write spec'))
    159     >>> heappush(h, (3, 'create tests'))
    160     >>> heappop(h)
    161     (1, 'write spec')
    162 
    163 
    164 Priority Queue Implementation Notes
    165 -----------------------------------
    166 
    167 A `priority queue <https://en.wikipedia.org/wiki/Priority_queue>`_ is common use
    168 for a heap, and it presents several implementation challenges:
    169 
    170 * Sort stability:  how do you get two tasks with equal priorities to be returned
    171   in the order they were originally added?
    172 
    173 * In the future with Python 3, tuple comparison breaks for (priority, task)
    174   pairs if the priorities are equal and the tasks do not have a default
    175   comparison order.
    176 
    177 * If the priority of a task changes, how do you move it to a new position in
    178   the heap?
    179 
    180 * Or if a pending task needs to be deleted, how do you find it and remove it
    181   from the queue?
    182 
    183 A solution to the first two challenges is to store entries as 3-element list
    184 including the priority, an entry count, and the task.  The entry count serves as
    185 a tie-breaker so that two tasks with the same priority are returned in the order
    186 they were added. And since no two entry counts are the same, the tuple
    187 comparison will never attempt to directly compare two tasks.
    188 
    189 The remaining challenges revolve around finding a pending task and making
    190 changes to its priority or removing it entirely.  Finding a task can be done
    191 with a dictionary pointing to an entry in the queue.
    192 
    193 Removing the entry or changing its priority is more difficult because it would
    194 break the heap structure invariants.  So, a possible solution is to mark the
    195 existing entry as removed and add a new entry with the revised priority::
    196 
    197     pq = []                         # list of entries arranged in a heap
    198     entry_finder = {}               # mapping of tasks to entries
    199     REMOVED = '<removed-task>'      # placeholder for a removed task
    200     counter = itertools.count()     # unique sequence count
    201 
    202     def add_task(task, priority=0):
    203         'Add a new task or update the priority of an existing task'
    204         if task in entry_finder:
    205             remove_task(task)
    206         count = next(counter)
    207         entry = [priority, count, task]
    208         entry_finder[task] = entry
    209         heappush(pq, entry)
    210 
    211     def remove_task(task):
    212         'Mark an existing task as REMOVED.  Raise KeyError if not found.'
    213         entry = entry_finder.pop(task)
    214         entry[-1] = REMOVED
    215 
    216     def pop_task():
    217         'Remove and return the lowest priority task. Raise KeyError if empty.'
    218         while pq:
    219             priority, count, task = heappop(pq)
    220             if task is not REMOVED:
    221                 del entry_finder[task]
    222                 return task
    223         raise KeyError('pop from an empty priority queue')
    224 
    225 
    226 Theory
    227 ------
    228 
    229 Heaps are arrays for which ``a[k] <= a[2*k+1]`` and ``a[k] <= a[2*k+2]`` for all
    230 *k*, counting elements from 0.  For the sake of comparison, non-existing
    231 elements are considered to be infinite.  The interesting property of a heap is
    232 that ``a[0]`` is always its smallest element.
    233 
    234 The strange invariant above is meant to be an efficient memory representation
    235 for a tournament.  The numbers below are *k*, not ``a[k]``::
    236 
    237                                   0
    238 
    239                  1                                 2
    240 
    241          3               4                5               6
    242 
    243      7       8       9       10      11      12      13      14
    244 
    245    15 16   17 18   19 20   21 22   23 24   25 26   27 28   29 30
    246 
    247 In the tree above, each cell *k* is topping ``2*k+1`` and ``2*k+2``. In a usual
    248 binary tournament we see in sports, each cell is the winner over the two cells
    249 it tops, and we can trace the winner down the tree to see all opponents s/he
    250 had.  However, in many computer applications of such tournaments, we do not need
    251 to trace the history of a winner. To be more memory efficient, when a winner is
    252 promoted, we try to replace it by something else at a lower level, and the rule
    253 becomes that a cell and the two cells it tops contain three different items, but
    254 the top cell "wins" over the two topped cells.
    255 
    256 If this heap invariant is protected at all time, index 0 is clearly the overall
    257 winner.  The simplest algorithmic way to remove it and find the "next" winner is
    258 to move some loser (let's say cell 30 in the diagram above) into the 0 position,
    259 and then percolate this new 0 down the tree, exchanging values, until the
    260 invariant is re-established. This is clearly logarithmic on the total number of
    261 items in the tree. By iterating over all items, you get an O(n log n) sort.
    262 
    263 A nice feature of this sort is that you can efficiently insert new items while
    264 the sort is going on, provided that the inserted items are not "better" than the
    265 last 0'th element you extracted.  This is especially useful in simulation
    266 contexts, where the tree holds all incoming events, and the "win" condition
    267 means the smallest scheduled time.  When an event schedules other events for
    268 execution, they are scheduled into the future, so they can easily go into the
    269 heap.  So, a heap is a good structure for implementing schedulers (this is what
    270 I used for my MIDI sequencer :-).
    271 
    272 Various structures for implementing schedulers have been extensively studied,
    273 and heaps are good for this, as they are reasonably speedy, the speed is almost
    274 constant, and the worst case is not much different than the average case.
    275 However, there are other representations which are more efficient overall, yet
    276 the worst cases might be terrible.
    277 
    278 Heaps are also very useful in big disk sorts.  You most probably all know that a
    279 big sort implies producing "runs" (which are pre-sorted sequences, whose size is
    280 usually related to the amount of CPU memory), followed by a merging passes for
    281 these runs, which merging is often very cleverly organised [#]_. It is very
    282 important that the initial sort produces the longest runs possible.  Tournaments
    283 are a good way to achieve that.  If, using all the memory available to hold a
    284 tournament, you replace and percolate items that happen to fit the current run,
    285 you'll produce runs which are twice the size of the memory for random input, and
    286 much better for input fuzzily ordered.
    287 
    288 Moreover, if you output the 0'th item on disk and get an input which may not fit
    289 in the current tournament (because the value "wins" over the last output value),
    290 it cannot fit in the heap, so the size of the heap decreases.  The freed memory
    291 could be cleverly reused immediately for progressively building a second heap,
    292 which grows at exactly the same rate the first heap is melting.  When the first
    293 heap completely vanishes, you switch heaps and start a new run.  Clever and
    294 quite effective!
    295 
    296 In a word, heaps are useful memory structures to know.  I use them in a few
    297 applications, and I think it is good to keep a 'heap' module around. :-)
    298 
    299 .. rubric:: Footnotes
    300 
    301 .. [#] The disk balancing algorithms which are current, nowadays, are more annoying
    302    than clever, and this is a consequence of the seeking capabilities of the disks.
    303    On devices which cannot seek, like big tape drives, the story was quite
    304    different, and one had to be very clever to ensure (far in advance) that each
    305    tape movement will be the most effective possible (that is, will best
    306    participate at "progressing" the merge).  Some tapes were even able to read
    307    backwards, and this was also used to avoid the rewinding time. Believe me, real
    308    good tape sorts were quite spectacular to watch! From all times, sorting has
    309    always been a Great Art! :-)
    310 
    311