Home | History | Annotate | Download | only in Objects
      1 Intro
      2 -----
      3 This describes an adaptive, stable, natural mergesort, modestly called
      4 timsort (hey, I earned it <wink>).  It has supernatural performance on many
      5 kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
      6 as few as N-1), yet as fast as Python's previous highly tuned samplesort
      7 hybrid on random arrays.
      8 
      9 In a nutshell, the main routine marches over the array once, left to right,
     10 alternately identifying the next run, then merging it into the previous
     11 runs "intelligently".  Everything else is complication for speed, and some
     12 hard-won measure of memory efficiency.
     13 
     14 
     15 Comparison with Python's Samplesort Hybrid
     16 ------------------------------------------
     17 + timsort can require a temp array containing as many as N//2 pointers,
     18   which means as many as 2*N extra bytes on 32-bit boxes.  It can be
     19   expected to require a temp array this large when sorting random data; on
     20   data with significant structure, it may get away without using any extra
     21   heap memory.  This appears to be the strongest argument against it, but
     22   compared to the size of an object, 2 temp bytes worst-case (also expected-
     23   case for random data) doesn't scare me much.
     24 
     25   It turns out that Perl is moving to a stable mergesort, and the code for
     26   that appears always to require a temp array with room for at least N
     27   pointers. (Note that I wouldn't want to do that even if space weren't an
     28   issue; I believe its efforts at memory frugality also save timsort
     29   significant pointer-copying costs, and allow it to have a smaller working
     30   set.)
     31 
     32 + Across about four hours of generating random arrays, and sorting them
     33   under both methods, samplesort required about 1.5% more comparisons
     34   (the program is at the end of this file).
     35 
     36 + In real life, this may be faster or slower on random arrays than
     37   samplesort was, depending on platform quirks.  Since it does fewer
     38   comparisons on average, it can be expected to do better the more
     39   expensive a comparison function is.  OTOH, it does more data movement
     40   (pointer copying) than samplesort, and that may negate its small
     41   comparison advantage (depending on platform quirks) unless comparison
     42   is very expensive.
     43 
     44 + On arrays with many kinds of pre-existing order, this blows samplesort out
     45   of the water.  It's significantly faster than samplesort even on some
     46   cases samplesort was special-casing the snot out of.  I believe that lists
     47   very often do have exploitable partial order in real life, and this is the
     48   strongest argument in favor of timsort (indeed, samplesort's special cases
     49   for extreme partial order are appreciated by real users, and timsort goes
     50   much deeper than those, in particular naturally covering every case where
     51   someone has suggested "and it would be cool if list.sort() had a special
     52   case for this too ... and for that ...").
     53 
     54 + Here are exact comparison counts across all the tests in sortperf.py,
     55   when run with arguments "15 20 1".
     56 
     57   Column Key:
     58       *sort: random data
     59       \sort: descending data
     60       /sort: ascending data
     61       3sort: ascending, then 3 random exchanges
     62       +sort: ascending, then 10 random at the end
     63       %sort: ascending, then randomly replace 1% of elements w/ random values
     64       ~sort: many duplicates
     65       =sort: all equal
     66       !sort: worst case scenario
     67 
     68   First the trivial cases, trivial for samplesort because it special-cased
     69   them, and trivial for timsort because it naturally works on runs.  Within
     70   an "n" block, the first line gives the # of compares done by samplesort,
     71   the second line by timsort, and the third line is the percentage by
     72   which the samplesort count exceeds the timsort count:
     73 
     74       n   \sort   /sort   =sort
     75 -------  ------  ------  ------
     76   32768   32768   32767   32767  samplesort
     77           32767   32767   32767  timsort
     78           0.00%   0.00%   0.00%  (samplesort - timsort) / timsort
     79 
     80   65536   65536   65535   65535
     81           65535   65535   65535
     82           0.00%   0.00%   0.00%
     83 
     84  131072  131072  131071  131071
     85          131071  131071  131071
     86           0.00%   0.00%   0.00%
     87 
     88  262144  262144  262143  262143
     89          262143  262143  262143
     90           0.00%   0.00%   0.00%
     91 
     92  524288  524288  524287  524287
     93          524287  524287  524287
     94           0.00%   0.00%   0.00%
     95 
     96 1048576 1048576 1048575 1048575
     97         1048575 1048575 1048575
     98           0.00%   0.00%   0.00%
     99 
    100   The algorithms are effectively identical in these cases, except that
    101   timsort does one less compare in \sort.
    102 
    103   Now for the more interesting cases.  Where lg(x) is the logarithm of x to
    104   the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for
    105   the best any comparison-based sorting algorithm can do on average (across
    106   all permutations).  When a method gets significantly below that, it's
    107   either astronomically lucky, or is finding exploitable structure in the
    108   data.
    109 
    110 
    111       n   lg(n!)    *sort    3sort     +sort   %sort    ~sort     !sort
    112 -------  -------   ------   -------  -------  ------  -------  --------
    113   32768   444255   453096   453614    32908   452871   130491    469141 old
    114                    448885    33016    33007    50426   182083     65534 new
    115                     0.94% 1273.92%   -0.30%  798.09%  -28.33%   615.87% %ch from new
    116 
    117   65536   954037   972699   981940    65686   973104   260029   1004607
    118                    962991    65821    65808   101667   364341    131070
    119                     1.01% 1391.83%   -0.19%  857.15%  -28.63%   666.47%
    120 
    121  131072  2039137  2101881  2091491   131232  2092894   554790   2161379
    122                   2057533   131410   131361   206193   728871    262142
    123                     2.16% 1491.58%   -0.10%  915.02%  -23.88%   724.51%
    124 
    125  262144  4340409  4464460  4403233   262314  4445884  1107842   4584560
    126                   4377402   262437   262459   416347  1457945    524286
    127                     1.99% 1577.82%   -0.06%  967.83%  -24.01%   774.44%
    128 
    129  524288  9205096  9453356  9408463   524468  9441930  2218577   9692015
    130                   9278734   524580   524633   837947  2916107   1048574
    131                    1.88%  1693.52%   -0.03% 1026.79%  -23.92%   824.30%
    132 
    133 1048576 19458756 19950272 19838588  1048766 19912134  4430649  20434212
    134                  19606028  1048958  1048941  1694896  5832445   2097150
    135                     1.76% 1791.27%   -0.02% 1074.83%  -24.03%   874.38%
    136 
    137   Discussion of cases:
    138 
    139   *sort:  There's no structure in random data to exploit, so the theoretical
    140   limit is lg(n!).  Both methods get close to that, and timsort is hugging
    141   it (indeed, in a *marginal* sense, it's a spectacular improvement --
    142   there's only about 1% left before hitting the wall, and timsort knows
    143   darned well it's doing compares that won't pay on random data -- but so
    144   does the samplesort hybrid).  For contrast, Hoare's original random-pivot
    145   quicksort does about 39% more compares than the limit, and the median-of-3
    146   variant about 19% more.
    147 
    148   3sort, %sort, and !sort:  No contest; there's structure in this data, but
    149   not of the specific kinds samplesort special-cases.  Note that structure
    150   in !sort wasn't put there on purpose -- it was crafted as a worst case for
    151   a previous quicksort implementation.  That timsort nails it came as a
    152   surprise to me (although it's obvious in retrospect).
    153 
    154   +sort:  samplesort special-cases this data, and does a few less compares
    155   than timsort.  However, timsort runs this case significantly faster on all
    156   boxes we have timings for, because timsort is in the business of merging
    157   runs efficiently, while samplesort does much more data movement in this
    158   (for it) special case.
    159 
    160   ~sort:  samplesort's special cases for large masses of equal elements are
    161   extremely effective on ~sort's specific data pattern, and timsort just
    162   isn't going to get close to that, despite that it's clearly getting a
    163   great deal of benefit out of the duplicates (the # of compares is much less
    164   than lg(n!)).  ~sort has a perfectly uniform distribution of just 4
    165   distinct values, and as the distribution gets more skewed, samplesort's
    166   equal-element gimmicks become less effective, while timsort's adaptive
    167   strategies find more to exploit; in a database supplied by Kevin Altis, a
    168   sort on its highly skewed "on which stock exchange does this company's
    169   stock trade?" field ran over twice as fast under timsort.
    170 
    171   However, despite that timsort does many more comparisons on ~sort, and
    172   that on several platforms ~sort runs highly significantly slower under
    173   timsort, on other platforms ~sort runs highly significantly faster under
    174   timsort.  No other kind of data has shown this wild x-platform behavior,
    175   and we don't have an explanation for it.  The only thing I can think of
    176   that could transform what "should be" highly significant slowdowns into
    177   highly significant speedups on some boxes are catastrophic cache effects
    178   in samplesort.
    179 
    180   But timsort "should be" slower than samplesort on ~sort, so it's hard
    181   to count that it isn't on some boxes as a strike against it <wink>.
    182 
    183 + Here's the highwater mark for the number of heap-based temp slots (4
    184   bytes each on this box) needed by each test, again with arguments
    185   "15 20 1":
    186 
    187    2**i  *sort \sort /sort  3sort  +sort  %sort  ~sort  =sort  !sort
    188   32768  16384     0     0   6256      0  10821  12288      0  16383
    189   65536  32766     0     0  21652      0  31276  24576      0  32767
    190  131072  65534     0     0  17258      0  58112  49152      0  65535
    191  262144 131072     0     0  35660      0 123561  98304      0 131071
    192  524288 262142     0     0  31302      0 212057 196608      0 262143
    193 1048576 524286     0     0 312438      0 484942 393216      0 524287
    194 
    195   Discussion:  The tests that end up doing (close to) perfectly balanced
    196   merges (*sort, !sort) need all N//2 temp slots (or almost all).  ~sort
    197   also ends up doing balanced merges, but systematically benefits a lot from
    198   the preliminary pre-merge searches described under "Merge Memory" later.
    199   %sort approaches having a balanced merge at the end because the random
    200   selection of elements to replace is expected to produce an out-of-order
    201   element near the midpoint.  \sort, /sort, =sort are the trivial one-run
    202   cases, needing no merging at all.  +sort ends up having one very long run
    203   and one very short, and so gets all the temp space it needs from the small
    204   temparray member of the MergeState struct (note that the same would be
    205   true if the new random elements were prefixed to the sorted list instead,
    206   but not if they appeared "in the middle").  3sort approaches N//3 temp
    207   slots twice, but the run lengths that remain after 3 random exchanges
    208   clearly has very high variance.
    209 
    210 
    211 A detailed description of timsort follows.
    212 
    213 Runs
    214 ----
    215 count_run() returns the # of elements in the next run.  A run is either
    216 "ascending", which means non-decreasing:
    217 
    218     a0 <= a1 <= a2 <= ...
    219 
    220 or "descending", which means strictly decreasing:
    221 
    222     a0 > a1 > a2 > ...
    223 
    224 Note that a run is always at least 2 long, unless we start at the array's
    225 last element.
    226 
    227 The definition of descending is strict, because the main routine reverses
    228 a descending run in-place, transforming a descending run into an ascending
    229 run.  Reversal is done via the obvious fast "swap elements starting at each
    230 end, and converge at the middle" method, and that can violate stability if
    231 the slice contains any equal elements.  Using a strict definition of
    232 descending ensures that a descending run contains distinct elements.
    233 
    234 If an array is random, it's very unlikely we'll see long runs.  If a natural
    235 run contains less than minrun elements (see next section), the main loop
    236 artificially boosts it to minrun elements, via a stable binary insertion sort
    237 applied to the right number of array elements following the short natural
    238 run.  In a random array, *all* runs are likely to be minrun long as a
    239 result.  This has two primary good effects:
    240 
    241 1. Random data strongly tends then toward perfectly balanced (both runs have
    242    the same length) merges, which is the most efficient way to proceed when
    243    data is random.
    244 
    245 2. Because runs are never very short, the rest of the code doesn't make
    246    heroic efforts to shave a few cycles off per-merge overheads.  For
    247    example, reasonable use of function calls is made, rather than trying to
    248    inline everything.  Since there are no more than N/minrun runs to begin
    249    with, a few "extra" function calls per merge is barely measurable.
    250 
    251 
    252 Computing minrun
    253 ----------------
    254 If N < 64, minrun is N.  IOW, binary insertion sort is used for the whole
    255 array then; it's hard to beat that given the overheads of trying something
    256 fancier (see note BINSORT).
    257 
    258 When N is a power of 2, testing on random data showed that minrun values of
    259 16, 32, 64 and 128 worked about equally well.  At 256 the data-movement cost
    260 in binary insertion sort clearly hurt, and at 8 the increase in the number
    261 of function calls clearly hurt.  Picking *some* power of 2 is important
    262 here, so that the merges end up perfectly balanced (see next section).  We
    263 pick 32 as a good value in the sweet range; picking a value at the low end
    264 allows the adaptive gimmicks more opportunity to exploit shorter natural
    265 runs.
    266 
    267 Because sortperf.py only tries powers of 2, it took a long time to notice
    268 that 32 isn't a good choice for the general case!  Consider N=2112:
    269 
    270 >>> divmod(2112, 32)
    271 (66, 0)
    272 >>>
    273 
    274 If the data is randomly ordered, we're very likely to end up with 66 runs
    275 each of length 32.  The first 64 of these trigger a sequence of perfectly
    276 balanced merges (see next section), leaving runs of lengths 2048 and 64 to
    277 merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
    278 compares, but it's still more compares than necessary, and-- mergesort's
    279 bugaboo relative to samplesort --a lot more data movement (O(N) copies just
    280 to get 64 elements into place).
    281 
    282 If we take minrun=33 in this case, then we're very likely to end up with 64
    283 runs each of length 33, and then all merges are perfectly balanced.  Better!
    284 
    285 What we want to avoid is picking minrun such that in
    286 
    287     q, r = divmod(N, minrun)
    288 
    289 q is a power of 2 and r>0 (then the last merge only gets r elements into
    290 place, and r < minrun is small compared to N), or q a little larger than a
    291 power of 2 regardless of r (then we've got a case similar to "2112", again
    292 leaving too little work for the last merge to do).
    293 
    294 Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
    295 power of 2, or if that isn't possible, is close to, but strictly less than,
    296 a power of 2.  This is easier to do than it may sound:  take the first 6
    297 bits of N, and add 1 if any of the remaining bits are set.  In fact, that
    298 rule covers every case in this section, including small N and exact powers
    299 of 2; merge_compute_minrun() is a deceptively simple function.
    300 
    301 
    302 The Merge Pattern
    303 -----------------
    304 In order to exploit regularities in the data, we're merging on natural
    305 run lengths, and they can become wildly unbalanced.  That's a Good Thing
    306 for this sort!  It means we have to find a way to manage an assortment of
    307 potentially very different run lengths, though.
    308 
    309 Stability constrains permissible merging patterns.  For example, if we have
    310 3 consecutive runs of lengths
    311 
    312     A:10000  B:20000  C:10000
    313 
    314 we dare not merge A with C first, because if A, B and C happen to contain
    315 a common element, it would get out of order wrt its occurrence(s) in B.  The
    316 merging must be done as (A+B)+C or A+(B+C) instead.
    317 
    318 So merging is always done on two consecutive runs at a time, and in-place,
    319 although this may require some temp memory (more on that later).
    320 
    321 When a run is identified, its base address and length are pushed on a stack
    322 in the MergeState struct.  merge_collapse() is then called to see whether it
    323 should merge it with preceding run(s).  We would like to delay merging as
    324 long as possible in order to exploit patterns that may come up later, but we
    325 like even more to do merging as soon as possible to exploit that the run just
    326 found is still high in the memory hierarchy.  We also can't delay merging
    327 "too long" because it consumes memory to remember the runs that are still
    328 unmerged, and the stack has a fixed size.
    329 
    330 What turned out to be a good compromise maintains two invariants on the
    331 stack entries, where A, B and C are the lengths of the three righmost not-yet
    332 merged slices:
    333 
    334 1.  A > B+C
    335 2.  B > C
    336 
    337 Note that, by induction, #2 implies the lengths of pending runs form a
    338 decreasing sequence.  #1 implies that, reading the lengths right to left,
    339 the pending-run lengths grow at least as fast as the Fibonacci numbers.
    340 Therefore the stack can never grow larger than about log_base_phi(N) entries,
    341 where phi = (1+sqrt(5))/2 ~= 1.618.  Thus a small # of stack slots suffice
    342 for very large arrays.
    343 
    344 If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
    345 freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
    346 e.g., if the last 3 entries are
    347 
    348     A:30  B:20  C:10
    349 
    350 then B is merged with C, leaving
    351 
    352     A:30  BC:30
    353 
    354 on the stack.  Or if they were
    355 
    356     A:500  B:400:  C:1000
    357 
    358 then A is merged with B, leaving
    359 
    360     AB:900  C:1000
    361 
    362 on the stack.
    363 
    364 In both examples, the stack configuration after the merge still violates
    365 invariant #2, and merge_collapse() goes on to continue merging runs until
    366 both invariants are satisfied.  As an extreme case, suppose we didn't do the
    367 minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
    368 and 2.  Nothing would get merged until the final 2 was seen, and that would
    369 trigger 7 perfectly balanced merges.
    370 
    371 The thrust of these rules when they trigger merging is to balance the run
    372 lengths as closely as possible, while keeping a low bound on the number of
    373 runs we have to remember.  This is maximally effective for random data,
    374 where all runs are likely to be of (artificially forced) length minrun, and
    375 then we get a sequence of perfectly balanced merges (with, perhaps, some
    376 oddballs at the end).
    377 
    378 OTOH, one reason this sort is so good for partly ordered data has to do
    379 with wildly unbalanced run lengths.
    380 
    381 
    382 Merge Memory
    383 ------------
    384 Merging adjacent runs of lengths A and B in-place, and in linear time, is
    385 difficult.  Theoretical constructions are known that can do it, but they're
    386 too difficult and slow for practical use.  But if we have temp memory equal
    387 to min(A, B), it's easy.
    388 
    389 If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
    390 and then we can do the obvious merge algorithm left to right, from the temp
    391 area and B, starting the stores into where A used to live.  There's always a
    392 free area in the original area comprising a number of elements equal to the
    393 number not yet merged from the temp array (trivially true at the start;
    394 proceed by induction).  The only tricky bit is that if a comparison raises an
    395 exception, we have to remember to copy the remaining elements back in from
    396 the temp area, lest the array end up with duplicate entries from B.  But
    397 that's exactly the same thing we need to do if we reach the end of B first,
    398 so the exit code is pleasantly common to both the normal and error cases.
    399 
    400 If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
    401 much the same, except that we need to merge right to left, copying B into a
    402 temp array and starting the stores at the right end of where B used to live.
    403 
    404 A refinement:  When we're about to merge adjacent runs A and B, we first do
    405 a form of binary search (more on that later) to see where B[0] should end up
    406 in A.  Elements in A preceding that point are already in their final
    407 positions, effectively shrinking the size of A.  Likewise we also search to
    408 see where A[-1] should end up in B, and elements of B after that point can
    409 also be ignored.  This cuts the amount of temp memory needed by the same
    410 amount.
    411 
    412 These preliminary searches may not pay off, and can be expected *not* to
    413 repay their cost if the data is random.  But they can win huge in all of
    414 time, copying, and memory savings when they do pay, so this is one of the
    415 "per-merge overheads" mentioned above that we're happy to endure because
    416 there is at most one very short run.  It's generally true in this algorithm
    417 that we're willing to gamble a little to win a lot, even though the net
    418 expectation is negative for random data.
    419 
    420 
    421 Merge Algorithms
    422 ----------------
    423 merge_lo() and merge_hi() are where the bulk of the time is spent.  merge_lo
    424 deals with runs where A <= B, and merge_hi where A > B.  They don't know
    425 whether the data is clustered or uniform, but a lovely thing about merging
    426 is that many kinds of clustering "reveal themselves" by how many times in a
    427 row the winning merge element comes from the same run.  We'll only discuss
    428 merge_lo here; merge_hi is exactly analogous.
    429 
    430 Merging begins in the usual, obvious way, comparing the first element of A
    431 to the first of B, and moving B[0] to the merge area if it's less than A[0],
    432 else moving A[0] to the merge area.  Call that the "one pair at a time"
    433 mode.  The only twist here is keeping track of how many times in a row "the
    434 winner" comes from the same run.
    435 
    436 If that count reaches MIN_GALLOP, we switch to "galloping mode".  Here
    437 we *search* B for where A[0] belongs, and move over all the B's before
    438 that point in one chunk to the merge area, then move A[0] to the merge
    439 area.  Then we search A for where B[0] belongs, and similarly move a
    440 slice of A in one chunk.  Then back to searching B for where A[0] belongs,
    441 etc.  We stay in galloping mode until both searches find slices to copy
    442 less than MIN_GALLOP elements long, at which point we go back to one-pair-
    443 at-a-time mode.
    444 
    445 A refinement:  The MergeState struct contains the value of min_gallop that
    446 controls when we enter galloping mode, initialized to MIN_GALLOP.
    447 merge_lo() and merge_hi() adjust this higher when galloping isn't paying
    448 off, and lower when it is.
    449 
    450 
    451 Galloping
    452 ---------
    453 Still without loss of generality, assume A is the shorter run.  In galloping
    454 mode, we first look for A[0] in B.  We do this via "galloping", comparing
    455 A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
    456 the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1].  This takes at most
    457 roughly lg(B) comparisons, and, unlike a straight binary search, favors
    458 finding the right spot early in B (more on that later).
    459 
    460 After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
    461 consecutive elements, and a straight binary search requires exactly k-1
    462 additional comparisons to nail it (see note REGION OF UNCERTAINTY).  Then we
    463 copy all the B's up to that point in one chunk, and then copy A[0].  Note
    464 that no matter where A[0] belongs in B, the combination of galloping + binary
    465 search finds it in no more than about 2*lg(B) comparisons.
    466 
    467 If we did a straight binary search, we could find it in no more than
    468 ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
    469 comparisons no matter where A[0] belongs.  Straight binary search thus loses
    470 to galloping unless the run is quite long, and we simply can't guess
    471 whether it is in advance.
    472 
    473 If data is random and runs have the same length, A[0] belongs at B[0] half
    474 the time, at B[1] a quarter of the time, and so on:  a consecutive winning
    475 sub-run in B of length k occurs with probability 1/2**(k+1).  So long
    476 winning sub-runs are extremely unlikely in random data, and guessing that a
    477 winning sub-run is going to be long is a dangerous game.
    478 
    479 OTOH, if data is lopsided or lumpy or contains many duplicates, long
    480 stretches of winning sub-runs are very likely, and cutting the number of
    481 comparisons needed to find one from O(B) to O(log B) is a huge win.
    482 
    483 Galloping compromises by getting out fast if there isn't a long winning
    484 sub-run, yet finding such very efficiently when they exist.
    485 
    486 I first learned about the galloping strategy in a related context; see:
    487 
    488     "Adaptive Set Intersections, Unions, and Differences" (2000)
    489     Erik D. Demaine, Alejandro Lpez-Ortiz, J. Ian Munro
    490 
    491 and its followup(s).  An earlier paper called the same strategy
    492 "exponential search":
    493 
    494    "Optimistic Sorting and Information Theoretic Complexity"
    495    Peter McIlroy
    496    SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
    497    467-474, Austin, Texas, 25-27 January 1993.
    498 
    499 and it probably dates back to an earlier paper by Bentley and Yao.  The
    500 McIlroy paper in particular has good analysis of a mergesort that's
    501 probably strongly related to this one in its galloping strategy.
    502 
    503 
    504 Galloping with a Broken Leg
    505 ---------------------------
    506 So why don't we always gallop?  Because it can lose, on two counts:
    507 
    508 1. While we're willing to endure small per-merge overheads, per-comparison
    509    overheads are a different story.  Calling Yet Another Function per
    510    comparison is expensive, and gallop_left() and gallop_right() are
    511    too long-winded for sane inlining.
    512 
    513 2. Galloping can-- alas --require more comparisons than linear one-at-time
    514    search, depending on the data.
    515 
    516 #2 requires details.  If A[0] belongs before B[0], galloping requires 1
    517 compare to determine that, same as linear search, except it costs more
    518 to call the gallop function.  If A[0] belongs right before B[1], galloping
    519 requires 2 compares, again same as linear search.  On the third compare,
    520 galloping checks A[0] against B[3], and if it's <=, requires one more
    521 compare to determine whether A[0] belongs at B[2] or B[3].  That's a total
    522 of 4 compares, but if A[0] does belong at B[2], linear search would have
    523 discovered that in only 3 compares, and that's a huge loss!  Really.  It's
    524 an increase of 33% in the number of compares needed, and comparisons are
    525 expensive in Python.
    526 
    527 index in B where    # compares linear  # gallop  # binary  gallop
    528 A[0] belongs        search needs       compares  compares  total
    529 ----------------    -----------------  --------  --------  ------
    530                0                    1         1         0       1
    531 
    532                1                    2         2         0       2
    533 
    534                2                    3         3         1       4
    535                3                    4         3         1       4
    536 
    537                4                    5         4         2       6
    538                5                    6         4         2       6
    539                6                    7         4         2       6
    540                7                    8         4         2       6
    541 
    542                8                    9         5         3       8
    543                9                   10         5         3       8
    544               10                   11         5         3       8
    545               11                   12         5         3       8
    546                                         ...
    547 
    548 In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
    549 to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
    550 The advantage of galloping is unbounded as i grows, but it doesn't win at
    551 all until i=6.  Before then, it loses twice (at i=2 and i=4), and ties
    552 at the other values.  At and after i=6, galloping always wins.
    553 
    554 We can't guess in advance when it's going to win, though, so we do one pair
    555 at a time until the evidence seems strong that galloping may pay.  MIN_GALLOP
    556 is 7, and that's pretty strong evidence.  However, if the data is random, it
    557 simply will trigger galloping mode purely by luck every now and again, and
    558 it's quite likely to hit one of the losing cases next.  On the other hand,
    559 in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
    560 "should be" then.  So the MergeState struct keeps a min_gallop variable
    561 that merge_lo and merge_hi adjust:  the longer we stay in galloping mode,
    562 the smaller min_gallop gets, making it easier to transition back to
    563 galloping mode (if we ever leave it in the current merge, and at the
    564 start of the next merge).  But whenever the gallop loop doesn't pay,
    565 min_gallop is increased by one, making it harder to transition back
    566 to galloping mode (and again both within a merge and across merges).  For
    567 random data, this all but eliminates the gallop penalty:  min_gallop grows
    568 large enough that we almost never get into galloping mode.  And for cases
    569 like ~sort, min_gallop can fall to as low as 1.  This seems to work well,
    570 but in all it's a minor improvement over using a fixed MIN_GALLOP value.
    571 
    572 
    573 Galloping Complication
    574 ----------------------
    575 The description above was for merge_lo.  merge_hi has to merge "from the
    576 other end", and really needs to gallop starting at the last element in a run
    577 instead of the first.  Galloping from the first still works, but does more
    578 comparisons than it should (this is significant -- I timed it both ways). For
    579 this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
    580 functions have a "hint" argument, which is the index at which galloping
    581 should begin.  So galloping can actually start at any index, and proceed at
    582 offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.
    583 
    584 In the code as I type it's always called with either 0 or n-1 (where n is
    585 the # of elements in a run).  It's tempting to try to do something fancier,
    586 melding galloping with some form of interpolation search; for example, if
    587 we're merging a run of length 1 with a run of length 10000, index 5000 is
    588 probably a better guess at the final result than either 0 or 9999.  But
    589 it's unclear how to generalize that intuition usefully, and merging of
    590 wildly unbalanced runs already enjoys excellent performance.
    591 
    592 ~sort is a good example of when balanced runs could benefit from a better
    593 hint value:  to the extent possible, this would like to use a starting
    594 offset equal to the previous value of acount/bcount.  Doing so saves about
    595 10% of the compares in ~sort.  However, doing so is also a mixed bag,
    596 hurting other cases.
    597 
    598 
    599 Comparing Average # of Compares on Random Arrays
    600 ------------------------------------------------
    601 [NOTE:  This was done when the new algorithm used about 0.1% more compares
    602  on random data than does its current incarnation.]
    603 
    604 Here list.sort() is samplesort, and list.msort() this sort:
    605 
    606 """
    607 import random
    608 from time import clock as now
    609 
    610 def fill(n):
    611     from random import random
    612     return [random() for i in xrange(n)]
    613 
    614 def mycmp(x, y):
    615     global ncmp
    616     ncmp += 1
    617     return cmp(x, y)
    618 
    619 def timeit(values, method):
    620     global ncmp
    621     X = values[:]
    622     bound = getattr(X, method)
    623     ncmp = 0
    624     t1 = now()
    625     bound(mycmp)
    626     t2 = now()
    627     return t2-t1, ncmp
    628 
    629 format = "%5s  %9.2f  %11d"
    630 f2     = "%5s  %9.2f  %11.2f"
    631 
    632 def drive():
    633     count = sst = sscmp = mst = mscmp = nelts = 0
    634     while True:
    635         n = random.randrange(100000)
    636         nelts += n
    637         x = fill(n)
    638 
    639         t, c = timeit(x, 'sort')
    640         sst += t
    641         sscmp += c
    642 
    643         t, c = timeit(x, 'msort')
    644         mst += t
    645         mscmp += c
    646 
    647         count += 1
    648         if count % 10:
    649             continue
    650 
    651         print "count", count, "nelts", nelts
    652         print format % ("sort",  sst, sscmp)
    653         print format % ("msort", mst, mscmp)
    654         print f2     % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)
    655 
    656 drive()
    657 """
    658 
    659 I ran this on Windows and kept using the computer lightly while it was
    660 running.  time.clock() is wall-clock time on Windows, with better than
    661 microsecond resolution.  samplesort started with a 1.52% #-of-comparisons
    662 disadvantage, fell quickly to 1.48%, and then fluctuated within that small
    663 range.  Here's the last chunk of output before I killed the job:
    664 
    665 count 2630 nelts 130906543
    666  sort    6110.80   1937887573
    667 msort    6002.78   1909389381
    668             1.80         1.49
    669 
    670 We've done nearly 2 billion comparisons apiece at Python speed there, and
    671 that's enough <wink>.
    672 
    673 For random arrays of size 2 (yes, there are only 2 interesting ones),
    674 samplesort has a 50%(!) comparison disadvantage.  This is a consequence of
    675 samplesort special-casing at most one ascending run at the start, then
    676 falling back to the general case if it doesn't find an ascending run
    677 immediately.  The consequence is that it ends up using two compares to sort
    678 [2, 1].  Gratifyingly, timsort doesn't do any special-casing, so had to be
    679 taught how to deal with mixtures of ascending and descending runs
    680 efficiently in all cases.
    681 
    682 
    683 NOTES
    684 -----
    685 
    686 BINSORT
    687 A "binary insertion sort" is just like a textbook insertion sort, but instead
    688 of locating the correct position of the next item via linear (one at a time)
    689 search, an equivalent to Python's bisect.bisect_right is used to find the
    690 correct position in logarithmic time.  Most texts don't mention this
    691 variation, and those that do usually say it's not worth the bother:  insertion
    692 sort remains quadratic (expected and worst cases) either way.  Speeding the
    693 search doesn't reduce the quadratic data movement costs.
    694 
    695 But in CPython's case, comparisons are extraordinarily expensive compared to
    696 moving data, and the details matter.  Moving objects is just copying
    697 pointers.  Comparisons can be arbitrarily expensive (can invoke arbitary
    698 user-supplied Python code), but even in simple cases (like 3 < 4) _all_
    699 decisions are made at runtime:  what's the type of the left comparand?  the
    700 type of the right?  do they need to be coerced to a common type?  where's the
    701 code to compare these types?  And so on.  Even the simplest Python comparison
    702 triggers a large pile of C-level pointer dereferences, conditionals, and
    703 function calls.
    704 
    705 So cutting the number of compares is almost always measurably helpful in
    706 CPython, and the savings swamp the quadratic-time data movement costs for
    707 reasonable minrun values.
    708 
    709 
    710 LEFT OR RIGHT
    711 gallop_left() and gallop_right() are akin to the Python bisect module's
    712 bisect_left() and bisect_right():  they're the same unless the slice they're
    713 searching contains a (at least one) value equal to the value being searched
    714 for.  In that case, gallop_left() returns the position immediately before the
    715 leftmost equal value, and gallop_right() the position immediately after the
    716 rightmost equal value.  The distinction is needed to preserve stability.  In
    717 general, when merging adjacent runs A and B, gallop_left is used to search
    718 thru B for where an element from A belongs, and gallop_right to search thru A
    719 for where an element from B belongs.
    720 
    721 
    722 REGION OF UNCERTAINTY
    723 Two kinds of confusion seem to be common about the claim that after finding
    724 a k such that
    725 
    726     B[2**(k-1) - 1] < A[0] <= B[2**k - 1]
    727 
    728 then a binary search requires exactly k-1 tries to find A[0]'s proper
    729 location. For concreteness, say k=3, so B[3] < A[0] <= B[7].
    730 
    731 The first confusion takes the form "OK, then the region of uncertainty is at
    732 indices 3, 4, 5, 6 and 7:  that's 5 elements, not the claimed 2**(k-1) - 1 =
    733 3"; or the region is viewed as a Python slice and the objection is "but that's
    734 the slice B[3:7], so has 7-3 = 4 elements".  Resolution:  we've already
    735 compared A[0] against B[3] and against B[7], so A[0]'s correct location is
    736 already known wrt _both_ endpoints.  What remains is to find A[0]'s correct
    737 location wrt B[4], B[5] and B[6], which spans 3 elements.  Or in general, the
    738 slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
    739 inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has
    740     (2**k-1)-1 - 2**(k-1) + 1 =
    741     2**k-1 - 2**(k-1) =
    742     2*2**k-1 - 2**(k-1) =
    743     (2-1)*2**(k-1) - 1 =
    744     2**(k-1) - 1
    745 elements.
    746 
    747 The second confusion:  "k-1 = 2 binary searches can find the correct location
    748 among 2**(k-1) = 4 elements, but you're only applying it to 3 elements:  we
    749 could make this more efficient by arranging for the region of uncertainty to
    750 span 2**(k-1) elements."  Resolution:  that confuses "elements" with
    751 "locations".  In a slice with N elements, there are N+1 _locations_.  In the
    752 example, with the region of uncertainty B[4], B[5], B[6], there are 4
    753 locations:  before B[4], between B[4] and B[5], between B[5] and B[6], and
    754 after B[6].  In general, across 2**(k-1)-1 elements, there are 2**(k-1)
    755 locations.  That's why k-1 binary searches are necessary and sufficient.
    756