      1 :mod:`statistics` --- Mathematical statistics functions
      2 =======================================================
      3 
      4 .. module:: statistics
      5    :synopsis: mathematical statistics functions
      6 
      7 .. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
      8 .. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
      9 
     10 .. versionadded:: 3.4
     11 
     12 **Source code:** :source:`Lib/statistics.py`
     13 
     14 .. testsetup:: *
     15 
     16    from statistics import *
     17    __name__ = '<doctest>'
     18 
     19 --------------
     20 
     21 This module provides functions for calculating mathematical statistics of
     22 numeric (:class:`Real`-valued) data.
     23 
     24 .. note::
     25 
     26    Unless explicitly noted otherwise, these functions support :class:`int`,
     27    :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
     28    Behaviour with other types (whether in the numeric tower or not) is
     29    currently unsupported.  Behaviour with mixed types is also undefined and
     30    implementation-dependent.  If your input data consists of mixed types,
     31    you may be able to use :func:`map` to ensure a consistent result, e.g.
     32    ``map(float, input_data)``.
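
As a minimal sketch of the coercion suggested in the note above, the following
converts a mixed input to floats before averaging.  The name ``input_data`` and
its values are made up for illustration.

```python
from fractions import Fraction
from statistics import mean

# Hypothetical input mixing int, Fraction and float values.
input_data = [1, Fraction(1, 2), 1.5]

# Coerce every value to float so that all data points share one type.
result = mean(map(float, input_data))
print(result)
```

Coercing to float gives up the exactness of :class:`fractions.Fraction`, so this
is a trade-off rather than a general recommendation.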
     33 
     34 Averages and measures of central location
     35 -----------------------------------------
     36 
     37 These functions calculate an average or typical value from a population
     38 or sample.
     39 
     40 =======================  =============================================
     41 :func:`mean`             Arithmetic mean ("average") of data.
     42 :func:`harmonic_mean`    Harmonic mean of data.
     43 :func:`median`           Median (middle value) of data.
     44 :func:`median_low`       Low median of data.
     45 :func:`median_high`      High median of data.
     46 :func:`median_grouped`   Median, or 50th percentile, of grouped data.
     47 :func:`mode`             Mode (most common value) of discrete data.
     48 =======================  =============================================
     49 
     50 Measures of spread
     51 ------------------
     52 
     53 These functions calculate a measure of how much the population or sample
     54 tends to deviate from the typical or average values.
     55 
     56 =======================  =============================================
     57 :func:`pstdev`           Population standard deviation of data.
     58 :func:`pvariance`        Population variance of data.
     59 :func:`stdev`            Sample standard deviation of data.
     60 :func:`variance`         Sample variance of data.
     61 =======================  =============================================
     62 
     63 
     64 Function details
     65 ----------------
     66 
     67 Note: The functions do not require the data given to them to be sorted.
     68 However, for reading convenience, most of the examples show sorted sequences.
     69 
     70 .. function:: mean(data)
     71 
     72    Return the sample arithmetic mean of *data*, which can be a sequence or iterator.
     73 
     74    The arithmetic mean is the sum of the data divided by the number of data
     75    points.  It is commonly called "the average", although it is only one of many
     76    different mathematical averages.  It is a measure of the central location of
     77    the data.
     78 
     79    If *data* is empty, :exc:`StatisticsError` will be raised.
     80 
     81    Some examples of use:
     82 
     83    .. doctest::
     84 
     85       >>> mean([1, 2, 3, 4, 4])
     86       2.8
     87       >>> mean([-1.0, 2.5, 3.25, 5.75])
     88       2.625
     89 
     90       >>> from fractions import Fraction as F
     91       >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
     92       Fraction(13, 21)
     93 
     94       >>> from decimal import Decimal as D
     95       >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
     96       Decimal('0.5625')
     97 
     98    .. note::
     99 
    100       The mean is strongly affected by outliers and is not a robust estimator
    101       for central location: the mean is not necessarily a typical example of the
    102       data points.  For more robust, although less efficient, measures of
    103       central location, see :func:`median` and :func:`mode`.  (In this case,
    104       "efficient" refers to statistical efficiency rather than computational
    105       efficiency.)
    106 
    107       The sample mean gives an unbiased estimate of the true population mean,
    108       which means that, taken on average over all the possible samples,
    109       ``mean(sample)`` converges on the true mean of the entire population.  If
    110       *data* represents the entire population rather than a sample, then
    111       ``mean(data)`` is equivalent to calculating the true population mean μ.
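
To illustrate the sensitivity to outliers described above, this small sketch
replaces one value in a made-up data set with an extreme one and compares how
far the mean and the median move.  The data values are assumptions chosen for
the illustration.

```python
from statistics import mean, median

typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]  # last value replaced by an outlier

# How far each statistic moves when the outlier is introduced.
mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)
print(mean_shift, median_shift)
```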
    112 
    113 
    114 .. function:: harmonic_mean(data)
    115 
    116    Return the harmonic mean of *data*, a sequence or iterator of
    117    real-valued numbers.
    118 
    119    The harmonic mean, sometimes called the subcontrary mean, is the
    120    reciprocal of the arithmetic :func:`mean` of the reciprocals of the
    121    data. For example, the harmonic mean of three values *a*, *b* and *c*
    122    will be equivalent to ``3/(1/a + 1/b + 1/c)``.
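
The equivalence stated above can be checked directly.  This is a quick sketch
using arbitrary values for *a*, *b* and *c*; floating-point results are
compared with :func:`math.isclose` rather than ``==``.

```python
import math
from statistics import harmonic_mean

a, b, c = 2.5, 3, 10

# Reciprocal of the arithmetic mean of the reciprocals, written out by hand.
by_formula = 3 / (1/a + 1/b + 1/c)
by_library = harmonic_mean([a, b, c])

agree = math.isclose(by_formula, by_library)
print(agree)
```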
    123 
    124    The harmonic mean is a type of average, a measure of the central
    125    location of the data.  It is often appropriate when averaging quantities
    126    which are rates or ratios, such as speeds.  For example:
    127 
    128    Suppose an investor purchases an equal value of shares in each of
    129    three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
    130    What is the average P/E ratio for the investor's portfolio?
    131 
    132    .. doctest::
    133 
    134       >>> harmonic_mean([2.5, 3, 10])  # For an equal investment portfolio.
    135       3.6
    136 
    137    Using the arithmetic mean would give an average of about 5.167, which
    138    is too high.
    139 
    140    :exc:`StatisticsError` is raised if *data* is empty, or any element
    141    is less than zero.
    142 
    143    .. versionadded:: 3.6
    144 
    145 
    146 .. function:: median(data)
    147 
    148    Return the median (middle value) of numeric data, using the common "mean of
    149    middle two" method.  If *data* is empty, :exc:`StatisticsError` is raised.
    150    *data* can be a sequence or iterator.
    151 
    152    The median is a robust measure of central location, and is less affected by
    153    the presence of outliers in your data.  When the number of data points is
    154    odd, the middle data point is returned:
    155 
    156    .. doctest::
    157 
    158       >>> median([1, 3, 5])
    159       3
    160 
    161    When the number of data points is even, the median is interpolated by taking
    162    the average of the two middle values:
    163 
    164    .. doctest::
    165 
    166       >>> median([1, 3, 5, 7])
    167       4.0
    168 
    169    This is suitable when your data is discrete, and you don't mind that the
    170    median may not be an actual data point.
    171 
    172    If your data is ordinal (supports order operations) but not numeric (doesn't
    173    support addition), you should use :func:`median_low` or :func:`median_high`
    174    instead.
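
As a sketch of the ordinal case, the following uses single-letter grades, which
sort correctly as strings but cannot be averaged.  The ``grades`` list is an
invented example.

```python
from statistics import median_high, median_low

# Ordinal data: the values can be ordered, but not added or divided.
grades = ["A", "B", "B", "C", "D", "F"]

low = median_low(grades)    # smaller of the two middle values
high = median_high(grades)  # larger of the two middle values
print(low, high)
```

Both results are actual members of the data, so no arithmetic on the values is
needed.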
    175 
    176    .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
    177 
    178 
    179 .. function:: median_low(data)
    180 
    181    Return the low median of numeric data.  If *data* is empty,
    182    :exc:`StatisticsError` is raised.  *data* can be a sequence or iterator.
    183 
    184    The low median is always a member of the data set.  When the number of data
    185    points is odd, the middle value is returned.  When it is even, the smaller of
    186    the two middle values is returned.
    187 
    188    .. doctest::
    189 
    190       >>> median_low([1, 3, 5])
    191       3
    192       >>> median_low([1, 3, 5, 7])
    193       3
    194 
    195    Use the low median when your data are discrete and you prefer the median to
    196    be an actual data point rather than interpolated.
    197 
    198 
    199 .. function:: median_high(data)
    200 
    201    Return the high median of data.  If *data* is empty, :exc:`StatisticsError`
    202    is raised.  *data* can be a sequence or iterator.
    203 
    204    The high median is always a member of the data set.  When the number of data
    205    points is odd, the middle value is returned.  When it is even, the larger of
    206    the two middle values is returned.
    207 
    208    .. doctest::
    209 
    210       >>> median_high([1, 3, 5])
    211       3
    212       >>> median_high([1, 3, 5, 7])
    213       5
    214 
    215    Use the high median when your data are discrete and you prefer the median to
    216    be an actual data point rather than interpolated.
    217 
    218 
    219 .. function:: median_grouped(data, interval=1)
    220 
    221    Return the median of grouped continuous data, calculated as the 50th
    222    percentile, using interpolation.  If *data* is empty, :exc:`StatisticsError`
    223    is raised.  *data* can be a sequence or iterator.
    224 
    225    .. doctest::
    226 
    227       >>> median_grouped([52, 52, 53, 54])
    228       52.5
    229 
    230    In the following example, the data are rounded, so that each value represents
    231    the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
    232    is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc.  With the data
    233    given, the middle value falls somewhere in the class 3.5--4.5, and
    234    interpolation is used to estimate it:
    235 
    236    .. doctest::
    237 
    238       >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
    239       3.7
    240 
    241    Optional argument *interval* represents the class interval, and defaults
    242    to 1.  Changing the class interval naturally will change the interpolation:
    243 
    244    .. doctest::
    245 
    246       >>> median_grouped([1, 3, 3, 5, 7], interval=1)
    247       3.25
    248       >>> median_grouped([1, 3, 3, 5, 7], interval=2)
    249       3.5
    250 
    251    This function does not check whether the data points are at least
    252    *interval* apart.
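
The interpolation can be sketched with the usual grouped-data formula
``L + interval * (n/2 - cf) / f``, where ``L`` is the lower limit of the median
class, ``cf`` the number of values below that class and ``f`` the number of
values in it.  The helper ``grouped_median_sketch`` below is hypothetical and
is not the module's source; it is only assumed to follow the same calculation,
which the examples above are consistent with.

```python
from bisect import bisect_left, bisect_right
from statistics import median_grouped


def grouped_median_sketch(data, interval=1):
    """Illustrative re-derivation; not the module's actual source."""
    data = sorted(data)
    n = len(data)
    if n == 0:
        raise ValueError("no data")
    x = data[n // 2]                # value whose class holds the median
    lower = x - interval / 2        # lower limit of that class
    cf = bisect_left(data, x)       # number of values below the class
    f = bisect_right(data, x) - cf  # number of values inside the class
    return lower + interval * (n / 2 - cf) / f


sample = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
sketch_value = grouped_median_sketch(sample)
library_value = median_grouped(sample)
print(sketch_value, library_value)
```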
    253 
    254    .. impl-detail::
    255 
    256       Under some circumstances, :func:`median_grouped` may coerce data points to
    257       floats.  This behaviour is likely to change in the future.
    258 
    259    .. seealso::
    260 
    261       * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
    262         Larry B Wallnau (8th Edition).
    263 
    264       * The `SSMEDIAN
    265         <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
    266         function in the Gnome Gnumeric spreadsheet, including `this discussion
    267         <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
    268 
    269 
    270 .. function:: mode(data)
    271 
    272    Return the most common data point from discrete or nominal *data*.  The mode
    273    (when it exists) is the most typical value, and is a robust measure of
    274    central location.
    275 
    276    If *data* is empty, or if there is not exactly one most common value,
    277    :exc:`StatisticsError` is raised.
    278 
    279    ``mode`` assumes discrete data, and returns a single value. This is the
    280    standard treatment of the mode as commonly taught in schools:
    281 
    282    .. doctest::
    283 
    284       >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
    285       3
    286 
    287    The mode is unique in that it is the only statistic in this module which
    288    also applies to nominal (non-numeric) data:
    289 
    290    .. doctest::
    291 
    292       >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
    293       'red'
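
When the data may have several equally common values, one option is to count
the values yourself.  The helper ``all_modes`` below is a hypothetical sketch
built on :class:`collections.Counter`; it is not part of this module.

```python
from collections import Counter


def all_modes(data):
    """Return every value tied for the highest count (hypothetical helper)."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


modes = all_modes([1, 1, 2, 3, 3])
print(modes)
```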
    294 
    295 
    296 .. function:: pstdev(data, mu=None)
    297 
    298    Return the population standard deviation (the square root of the population
    299    variance).  See :func:`pvariance` for arguments and other details.
    300 
    301    .. doctest::
    302 
    303       >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
    304       0.986893273527251
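
The relationship to :func:`pvariance` can be checked with a short sketch that
reuses the data from the example above and compares floats with
:func:`math.isclose`.

```python
import math
from statistics import pstdev, pvariance

data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]

# The population standard deviation is the square root of the population variance.
same = math.isclose(pstdev(data), math.sqrt(pvariance(data)))
print(same)
```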
    305 
    306 
    307 .. function:: pvariance(data, mu=None)
    308 
    309    Return the population variance of *data*, a non-empty iterable of real-valued
    310    numbers.  Variance, or second moment about the mean, is a measure of the
    311    variability (spread or dispersion) of data.  A large variance indicates that
    312    the data is spread out; a small variance indicates it is clustered closely
    313    around the mean.
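
As a sketch of the definition, the population variance is the mean of the
squared deviations from the mean.  The hand-written calculation below is for
illustration only and makes no claim about how the module computes its result.

```python
import math
from statistics import mean, pvariance

data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
mu = mean(data)

# Mean of the squared deviations from the mean.
by_hand = sum((x - mu) ** 2 for x in data) / len(data)
matches = math.isclose(by_hand, pvariance(data))
print(by_hand, matches)
```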
    314 
    315    If the optional second argument *mu* is given, it should be the mean of
    316    *data*.  If it is missing or ``None`` (the default), the mean is
    317    automatically calculated.
    318 
    319    Use this function to calculate the variance from the entire population.  To
    320    estimate the variance from a sample, the :func:`variance` function is usually
    321    a better choice.
    322 
    323    Raises :exc:`StatisticsError` if *data* is empty.
    324 
    325    Examples:
    326 
    327    .. doctest::
    328 
    329       >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
    330       >>> pvariance(data)
    331       1.25
    332 
    333    If you have already calculated the mean of your data, you can pass it as the
    334    optional second argument *mu* to avoid recalculation:
    335 
    336    .. doctest::
    337 
    338       >>> mu = mean(data)
    339       >>> pvariance(data, mu)
    340       1.25
    341 
    342    This function does not attempt to verify that you have passed the actual mean
    343    as *mu*.  Using arbitrary values for *mu* may lead to invalid or impossible
    344    results.
    345 
    346    Decimals and Fractions are supported:
    347 
    348    .. doctest::
    349 
    350       >>> from decimal import Decimal as D
    351       >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
    352       Decimal('24.815')
    353 
    354       >>> from fractions import Fraction as F
    355       >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
    356       Fraction(13, 72)
    357 
    358    .. note::
    359 
    360       When called with the entire population, this gives the population variance
    361       σ².  When called on a sample instead, this is the biased sample variance
    362       s², also known as variance with N degrees of freedom.
    363 
    364       If you somehow know the true population mean μ, you may use this function
    365       to calculate the variance of a sample, giving the known population mean as
    366       the second argument.  Provided the data points are representative
    367       (e.g. independent and identically distributed), the result will be an
    368       unbiased estimate of the population variance.
    369 
    370 
    371 .. function:: stdev(data, xbar=None)
    372 
    373    Return the sample standard deviation (the square root of the sample
    374    variance).  See :func:`variance` for arguments and other details.
    375 
    376    .. doctest::
    377 
    378       >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
    379       1.0810874155219827
    380 
    381 
    382 .. function:: variance(data, xbar=None)
    383 
    384    Return the sample variance of *data*, an iterable of at least two real-valued
    385    numbers.  Variance, or second moment about the mean, is a measure of the
    386    variability (spread or dispersion) of data.  A large variance indicates that
    387    the data is spread out; a small variance indicates it is clustered closely
    388    around the mean.
    389 
    390    If the optional second argument *xbar* is given, it should be the mean of
    391    *data*.  If it is missing or ``None`` (the default), the mean is
    392    automatically calculated.
    393 
    394    Use this function when your data is a sample from a population. To calculate
    395    the variance from the entire population, see :func:`pvariance`.
    396 
    397    Raises :exc:`StatisticsError` if *data* has fewer than two values.
    398 
    399    Examples:
    400 
    401    .. doctest::
    402 
    403       >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
    404       >>> variance(data)
    405       1.3720238095238095
    406 
    407    If you have already calculated the mean of your data, you can pass it as the
    408    optional second argument *xbar* to avoid recalculation:
    409 
    410    .. doctest::
    411 
    412       >>> m = mean(data)
    413       >>> variance(data, m)
    414       1.3720238095238095
    415 
    416    This function does not attempt to verify that you have passed the actual mean
    417    as *xbar*.  Using arbitrary values for *xbar* can lead to invalid or
    418    impossible results.
    419 
    420    Decimal and Fraction values are supported:
    421 
    422    .. doctest::
    423 
    424       >>> from decimal import Decimal as D
    425       >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
    426       Decimal('31.01875')
    427 
    428       >>> from fractions import Fraction as F
    429       >>> variance([F(1, 6), F(1, 2), F(5, 3)])
    430       Fraction(67, 108)
    431 
    432    .. note::
    433 
    434       This is the sample variance s² with Bessel's correction, also known as
    435       variance with N-1 degrees of freedom.  Provided that the data points are
    436       representative (e.g. independent and identically distributed), the result
    437       should be an unbiased estimate of the true population variance.
    438 
    439       If you somehow know the actual population mean μ, you should pass it to the
    440       :func:`pvariance` function as the *mu* parameter to get the variance of a
    441       sample.
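
Bessel's correction amounts to dividing the sum of squared deviations by
``N - 1`` instead of ``N``, so the two variance functions differ by a factor of
``N / (N - 1)`` on the same data.  The sketch below checks this with
:class:`fractions.Fraction` values so that the comparison is exact.

```python
from fractions import Fraction as F
from statistics import pvariance, variance

data = [F(1, 6), F(1, 2), F(5, 3)]
n = len(data)

# Rescale the N-denominator result to the (N - 1)-denominator result.
rescaled = pvariance(data) * F(n, n - 1)
sample_var = variance(data)
print(rescaled == sample_var)
```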
    442 
    443 Exceptions
    444 ----------
    445 
    446 A single exception is defined:
    447 
    448 .. exception:: StatisticsError
    449 
    450    Subclass of :exc:`ValueError` for statistics-related exceptions.
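
A minimal sketch of handling the exception follows.  The wrapper ``safe_mean``
and its *default* parameter are hypothetical names introduced for this example.

```python
from statistics import StatisticsError, mean


def safe_mean(data, default=None):
    """Hypothetical wrapper that returns *default* for empty input."""
    try:
        return mean(data)
    except StatisticsError:
        return default


empty_result = safe_mean([])
caught_as_value_error = issubclass(StatisticsError, ValueError)
print(empty_result, caught_as_value_error)
```

Because :exc:`StatisticsError` is a subclass of :exc:`ValueError`, existing
``except ValueError`` handlers will also catch it.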
    451 
    452 ..
    453    # This modeline must appear within the last ten lines of the file.
    454    kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
    455