1 # We did not author this file nor mantain it. Skip linting it. 2 #pylint: skip-file 3 # Copyright (c) 1999-2008 Gary Strangman; All Rights Reserved. 4 # 5 # Permission is hereby granted, free of charge, to any person obtaining a copy 6 # of this software and associated documentation files (the "Software"), to deal 7 # in the Software without restriction, including without limitation the rights 8 # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 # copies of the Software, and to permit persons to whom the Software is 10 # furnished to do so, subject to the following conditions: 11 # 12 # The above copyright notice and this permission notice shall be included in 13 # all copies or substantial portions of the Software. 14 # 15 # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 21 # THE SOFTWARE. 22 # 23 # Comments and/or additions are welcome (send e-mail to: 24 # strang (at] nmr.mgh.harvard.edu). 25 # 26 """stats.py module 27 28 (Requires pstat.py module.) 29 30 ################################################# 31 ####### Written by: Gary Strangman ########### 32 ####### Last modified: Oct 31, 2008 ########### 33 ################################################# 34 35 A collection of basic statistical functions for python. The function 36 names appear below. 37 38 IMPORTANT: There are really *3* sets of functions. The first set has an 'l' 39 prefix, which can be used with list or tuple arguments. The second set has 40 an 'a' prefix, which can accept NumPy array arguments. These latter 41 functions are defined only when NumPy is available on the system. The third 42 type has NO prefix (i.e., has the name that appears below). Functions of 43 this set are members of a "Dispatch" class, c/o David Ascher. This class 44 allows different functions to be called depending on the type of the passed 45 arguments. Thus, stats.mean is a member of the Dispatch class and 46 stats.mean(range(20)) will call stats.lmean(range(20)) while 47 stats.mean(Numeric.arange(20)) will call stats.amean(Numeric.arange(20)). 48 This is a handy way to keep consistent function names when different 49 argument types require different functions to be called. Having 50 implementated the Dispatch class, however, means that to get info on 51 a given function, you must use the REAL function name ... that is 52 "print stats.lmean.__doc__" or "print stats.amean.__doc__" work fine, 53 while "print stats.mean.__doc__" will print the doc for the Dispatch 54 class. NUMPY FUNCTIONS ('a' prefix) generally have more argument options 55 but should otherwise be consistent with the corresponding list functions. 56 57 Disclaimers: The function list is obviously incomplete and, worse, the 58 functions are not optimized. All functions have been tested (some more 59 so than others), but they are far from bulletproof. Thus, as with any 60 free software, no warranty or guarantee is expressed or implied. :-) A 61 few extra functions that don't appear in the list below can be found by 62 interested treasure-hunters. These functions don't necessarily have 63 both list and array versions but were deemed useful 64 65 CENTRAL TENDENCY: geometricmean 66 harmonicmean 67 mean 68 median 69 medianscore 70 mode 71 72 MOMENTS: moment 73 variation 74 skew 75 kurtosis 76 skewtest (for Numpy arrays only) 77 kurtosistest (for Numpy arrays only) 78 normaltest (for Numpy arrays only) 79 80 ALTERED VERSIONS: tmean (for Numpy arrays only) 81 tvar (for Numpy arrays only) 82 tmin (for Numpy arrays only) 83 tmax (for Numpy arrays only) 84 tstdev (for Numpy arrays only) 85 tsem (for Numpy arrays only) 86 describe 87 88 FREQUENCY STATS: itemfreq 89 scoreatpercentile 90 percentileofscore 91 histogram 92 cumfreq 93 relfreq 94 95 VARIABILITY: obrientransform 96 samplevar 97 samplestdev 98 signaltonoise (for Numpy arrays only) 99 var 100 stdev 101 sterr 102 sem 103 z 104 zs 105 zmap (for Numpy arrays only) 106 107 TRIMMING FCNS: threshold (for Numpy arrays only) 108 trimboth 109 trim1 110 round (round all vals to 'n' decimals; Numpy only) 111 112 CORRELATION FCNS: covariance (for Numpy arrays only) 113 correlation (for Numpy arrays only) 114 paired 115 pearsonr 116 spearmanr 117 pointbiserialr 118 kendalltau 119 linregress 120 121 INFERENTIAL STATS: ttest_1samp 122 ttest_ind 123 ttest_rel 124 chisquare 125 ks_2samp 126 mannwhitneyu 127 ranksums 128 wilcoxont 129 kruskalwallish 130 friedmanchisquare 131 132 PROBABILITY CALCS: chisqprob 133 erfcc 134 zprob 135 ksprob 136 fprob 137 betacf 138 gammln 139 betai 140 141 ANOVA FUNCTIONS: F_oneway 142 F_value 143 144 SUPPORT FUNCTIONS: writecc 145 incr 146 sign (for Numpy arrays only) 147 sum 148 cumsum 149 ss 150 summult 151 sumdiffsquared 152 square_of_sums 153 shellsort 154 rankdata 155 outputpairedstats 156 findwithin 157 """ 158 ## CHANGE LOG: 159 ## =========== 160 ## 09-07-21 ... added capability for getting the 'proportion' out of l/amannwhitneyu (but comment-disabled) 161 ## 08-10-31 ... fixed import LinearAlgebra bug before glm fcns 162 ## 07-11-26 ... conversion for numpy started 163 ## 07-05-16 ... added Lin's Concordance Correlation Coefficient (alincc) and acov 164 ## 05-08-21 ... added "Dice's coefficient" 165 ## 04-10-26 ... added ap2t(), an ugly fcn for converting p-vals to T-vals 166 ## 04-04-03 ... added amasslinregress() function to do regression on N-D arrays 167 ## 03-01-03 ... CHANGED VERSION TO 0.6 168 ## fixed atsem() to properly handle limits=None case 169 ## improved histogram and median functions (estbinwidth) and 170 ## fixed atvar() function (wrong answers for neg numbers?!?) 171 ## 02-11-19 ... fixed attest_ind and attest_rel for div-by-zero Overflows 172 ## 02-05-10 ... fixed lchisqprob indentation (failed when df=even) 173 ## 00-12-28 ... removed aanova() to separate module, fixed licensing to 174 ## match Python License, fixed doc string & imports 175 ## 00-04-13 ... pulled all "global" statements, except from aanova() 176 ## added/fixed lots of documentation, removed io.py dependency 177 ## changed to version 0.5 178 ## 99-11-13 ... added asign() function 179 ## 99-11-01 ... changed version to 0.4 ... enough incremental changes now 180 ## 99-10-25 ... added acovariance and acorrelation functions 181 ## 99-10-10 ... fixed askew/akurtosis to avoid divide-by-zero errors 182 ## added aglm function (crude, but will be improved) 183 ## 99-10-04 ... upgraded acumsum, ass, asummult, asamplevar, avar, etc. to 184 ## all handle lists of 'dimension's and keepdims 185 ## REMOVED ar0, ar2, ar3, ar4 and replaced them with around 186 ## reinserted fixes for abetai to avoid math overflows 187 ## 99-09-05 ... rewrote achisqprob/aerfcc/aksprob/afprob/abetacf/abetai to 188 ## handle multi-dimensional arrays (whew!) 189 ## 99-08-30 ... fixed l/amoment, l/askew, l/akurtosis per D'Agostino (1990) 190 ## added anormaltest per same reference 191 ## re-wrote azprob to calc arrays of probs all at once 192 ## 99-08-22 ... edited attest_ind printing section so arrays could be rounded 193 ## 99-08-19 ... fixed amean and aharmonicmean for non-error(!) overflow on 194 ## short/byte arrays (mean of #s btw 100-300 = -150??) 195 ## 99-08-09 ... fixed asum so that the None case works for Byte arrays 196 ## 99-08-08 ... fixed 7/3 'improvement' to handle t-calcs on N-D arrays 197 ## 99-07-03 ... improved attest_ind, attest_rel (zero-division errortrap) 198 ## 99-06-24 ... fixed bug(?) in attest_ind (n1=a.shape[0]) 199 ## 04/11/99 ... added asignaltonoise, athreshold functions, changed all 200 ## max/min in array section to N.maximum/N.minimum, 201 ## fixed square_of_sums to prevent integer overflow 202 ## 04/10/99 ... !!! Changed function name ... sumsquared ==> square_of_sums 203 ## 03/18/99 ... Added ar0, ar2, ar3 and ar4 rounding functions 204 ## 02/28/99 ... Fixed aobrientransform to return an array rather than a list 205 ## 01/15/99 ... Essentially ceased updating list-versions of functions (!!!) 206 ## 01/13/99 ... CHANGED TO VERSION 0.3 207 ## fixed bug in a/lmannwhitneyu p-value calculation 208 ## 12/31/98 ... fixed variable-name bug in ldescribe 209 ## 12/19/98 ... fixed bug in findwithin (fcns needed pstat. prefix) 210 ## 12/16/98 ... changed amedianscore to return float (not array) for 1 score 211 ## 12/14/98 ... added atmin and atmax functions 212 ## removed umath from import line (not needed) 213 ## l/ageometricmean modified to reduce chance of overflows (take 214 ## nth root first, then multiply) 215 ## 12/07/98 ... added __version__variable (now 0.2) 216 ## removed all 'stats.' from anova() fcn 217 ## 12/06/98 ... changed those functions (except shellsort) that altered 218 ## arguments in-place ... cumsum, ranksort, ... 219 ## updated (and fixed some) doc-strings 220 ## 12/01/98 ... added anova() function (requires NumPy) 221 ## incorporated Dispatch class 222 ## 11/12/98 ... added functionality to amean, aharmonicmean, ageometricmean 223 ## added 'asum' function (added functionality to N.add.reduce) 224 ## fixed both moment and amoment (two errors) 225 ## changed name of skewness and askewness to skew and askew 226 ## fixed (a)histogram (which sometimes counted points <lowerlimit) 227 228 import pstat # required 3rd party module 229 import math, string, copy # required python modules 230 from types import * 231 232 __version__ = 0.6 233 234 ############# DISPATCH CODE ############## 235 236 237 class Dispatch: 238 """ 239 The Dispatch class, care of David Ascher, allows different functions to 240 be called depending on the argument types. This way, there can be one 241 function name regardless of the argument type. To access function doc 242 in stats.py module, prefix the function with an 'l' or 'a' for list or 243 array arguments, respectively. That is, print stats.lmean.__doc__ or 244 print stats.amean.__doc__ or whatever. 245 """ 246 247 def __init__(self, *tuples): 248 self._dispatch = {} 249 for func, types in tuples: 250 for t in types: 251 if t in self._dispatch.keys(): 252 raise ValueError, "can't have two dispatches on " + str(t) 253 self._dispatch[t] = func 254 self._types = self._dispatch.keys() 255 256 def __call__(self, arg1, *args, **kw): 257 if type(arg1) not in self._types: 258 raise TypeError, "don't know how to dispatch %s arguments" % type(arg1) 259 return apply(self._dispatch[type(arg1)], (arg1,) + args, kw) 260 261 ########################################################################## 262 ######################## LIST-BASED FUNCTIONS ######################## 263 ########################################################################## 264 265 ### Define these regardless 266 267 #################################### 268 ####### CENTRAL TENDENCY ######### 269 #################################### 270 271 272 def lgeometricmean(inlist): 273 """ 274 Calculates the geometric mean of the values in the passed list. 275 That is: n-th root of (x1 * x2 * ... * xn). Assumes a '1D' list. 276 277 Usage: lgeometricmean(inlist) 278 """ 279 mult = 1.0 280 one_over_n = 1.0 / len(inlist) 281 for item in inlist: 282 mult = mult * pow(item, one_over_n) 283 return mult 284 285 286 def lharmonicmean(inlist): 287 """ 288 Calculates the harmonic mean of the values in the passed list. 289 That is: n / (1/x1 + 1/x2 + ... + 1/xn). Assumes a '1D' list. 290 291 Usage: lharmonicmean(inlist) 292 """ 293 sum = 0 294 for item in inlist: 295 sum = sum + 1.0 / item 296 return len(inlist) / sum 297 298 299 def lmean(inlist): 300 """ 301 Returns the arithematic mean of the values in the passed list. 302 Assumes a '1D' list, but will function on the 1st dim of an array(!). 303 304 Usage: lmean(inlist) 305 """ 306 sum = 0 307 for item in inlist: 308 sum = sum + item 309 return sum / float(len(inlist)) 310 311 312 def lmedian(inlist, numbins=1000): 313 """ 314 Returns the computed median value of a list of numbers, given the 315 number of bins to use for the histogram (more bins brings the computed value 316 closer to the median score, default number of bins = 1000). See G.W. 317 Heiman's Basic Stats (1st Edition), or CRC Probability & Statistics. 318 319 Usage: lmedian (inlist, numbins=1000) 320 """ 321 (hist, smallest, binsize, extras) = histogram( 322 inlist, numbins, [min(inlist), max(inlist)]) # make histog 323 cumhist = cumsum(hist) # make cumulative histogram 324 for i in range(len(cumhist)): # get 1st(!) index holding 50%ile score 325 if cumhist[i] >= len(inlist) / 2.0: 326 cfbin = i 327 break 328 LRL = smallest + binsize * cfbin # get lower read limit of that bin 329 cfbelow = cumhist[cfbin - 1] 330 freq = float(hist[cfbin]) # frequency IN the 50%ile bin 331 median = LRL + ( 332 (len(inlist) / 2.0 - cfbelow) / float(freq)) * binsize # median formula 333 return median 334 335 336 def lmedianscore(inlist): 337 """ 338 Returns the 'middle' score of the passed list. If there is an even 339 number of scores, the mean of the 2 middle scores is returned. 340 341 Usage: lmedianscore(inlist) 342 """ 343 344 newlist = copy.deepcopy(inlist) 345 newlist.sort() 346 if len(newlist) % 2 == 0: # if even number of scores, average middle 2 347 index = len(newlist) / 2 # integer division correct 348 median = float(newlist[index] + newlist[index - 1]) / 2 349 else: 350 index = len(newlist) / 2 # int divsion gives mid value when count from 0 351 median = newlist[index] 352 return median 353 354 355 def lmode(inlist): 356 """ 357 Returns a list of the modal (most common) score(s) in the passed 358 list. If there is more than one such score, all are returned. The 359 bin-count for the mode(s) is also returned. 360 361 Usage: lmode(inlist) 362 Returns: bin-count for mode(s), a list of modal value(s) 363 """ 364 365 scores = pstat.unique(inlist) 366 scores.sort() 367 freq = [] 368 for item in scores: 369 freq.append(inlist.count(item)) 370 maxfreq = max(freq) 371 mode = [] 372 stillmore = 1 373 while stillmore: 374 try: 375 indx = freq.index(maxfreq) 376 mode.append(scores[indx]) 377 del freq[indx] 378 del scores[indx] 379 except ValueError: 380 stillmore = 0 381 return maxfreq, mode 382 383 #################################### 384 ############ MOMENTS ############# 385 #################################### 386 387 388 def lmoment(inlist, moment=1): 389 """ 390 Calculates the nth moment about the mean for a sample (defaults to 391 the 1st moment). Used to calculate coefficients of skewness and kurtosis. 392 393 Usage: lmoment(inlist,moment=1) 394 Returns: appropriate moment (r) from ... 1/n * SUM((inlist(i)-mean)**r) 395 """ 396 if moment == 1: 397 return 0.0 398 else: 399 mn = mean(inlist) 400 n = len(inlist) 401 s = 0 402 for x in inlist: 403 s = s + (x - mn)**moment 404 return s / float(n) 405 406 407 def lvariation(inlist): 408 """ 409 Returns the coefficient of variation, as defined in CRC Standard 410 Probability and Statistics, p.6. 411 412 Usage: lvariation(inlist) 413 """ 414 return 100.0 * samplestdev(inlist) / float(mean(inlist)) 415 416 417 def lskew(inlist): 418 """ 419 Returns the skewness of a distribution, as defined in Numerical 420 Recipies (alternate defn in CRC Standard Probability and Statistics, p.6.) 421 422 Usage: lskew(inlist) 423 """ 424 return moment(inlist, 3) / pow(moment(inlist, 2), 1.5) 425 426 427 def lkurtosis(inlist): 428 """ 429 Returns the kurtosis of a distribution, as defined in Numerical 430 Recipies (alternate defn in CRC Standard Probability and Statistics, p.6.) 431 432 Usage: lkurtosis(inlist) 433 """ 434 return moment(inlist, 4) / pow(moment(inlist, 2), 2.0) 435 436 437 def ldescribe(inlist): 438 """ 439 Returns some descriptive statistics of the passed list (assumed to be 1D). 440 441 Usage: ldescribe(inlist) 442 Returns: n, mean, standard deviation, skew, kurtosis 443 """ 444 n = len(inlist) 445 mm = (min(inlist), max(inlist)) 446 m = mean(inlist) 447 sd = stdev(inlist) 448 sk = skew(inlist) 449 kurt = kurtosis(inlist) 450 return n, mm, m, sd, sk, kurt 451 452 #################################### 453 ####### FREQUENCY STATS ########## 454 #################################### 455 456 457 def litemfreq(inlist): 458 """ 459 Returns a list of pairs. Each pair consists of one of the scores in inlist 460 and it's frequency count. Assumes a 1D list is passed. 461 462 Usage: litemfreq(inlist) 463 Returns: a 2D frequency table (col [0:n-1]=scores, col n=frequencies) 464 """ 465 scores = pstat.unique(inlist) 466 scores.sort() 467 freq = [] 468 for item in scores: 469 freq.append(inlist.count(item)) 470 return pstat.abut(scores, freq) 471 472 473 def lscoreatpercentile(inlist, percent): 474 """ 475 Returns the score at a given percentile relative to the distribution 476 given by inlist. 477 478 Usage: lscoreatpercentile(inlist,percent) 479 """ 480 if percent > 1: 481 print '\nDividing percent>1 by 100 in lscoreatpercentile().\n' 482 percent = percent / 100.0 483 targetcf = percent * len(inlist) 484 h, lrl, binsize, extras = histogram(inlist) 485 cumhist = cumsum(copy.deepcopy(h)) 486 for i in range(len(cumhist)): 487 if cumhist[i] >= targetcf: 488 break 489 score = binsize * ( 490 (targetcf - cumhist[i - 1]) / float(h[i])) + (lrl + binsize * i) 491 return score 492 493 494 def lpercentileofscore(inlist, score, histbins=10, defaultlimits=None): 495 """ 496 Returns the percentile value of a score relative to the distribution 497 given by inlist. Formula depends on the values used to histogram the data(!). 498 499 Usage: lpercentileofscore(inlist,score,histbins=10,defaultlimits=None) 500 """ 501 502 h, lrl, binsize, extras = histogram(inlist, histbins, defaultlimits) 503 cumhist = cumsum(copy.deepcopy(h)) 504 i = int((score - lrl) / float(binsize)) 505 pct = (cumhist[i - 1] + ( 506 (score - 507 (lrl + binsize * i)) / float(binsize)) * h[i]) / float(len(inlist)) * 100 508 return pct 509 510 511 def lhistogram(inlist, numbins=10, defaultreallimits=None, printextras=0): 512 """ 513 Returns (i) a list of histogram bin counts, (ii) the smallest value 514 of the histogram binning, and (iii) the bin width (the last 2 are not 515 necessarily integers). Default number of bins is 10. If no sequence object 516 is given for defaultreallimits, the routine picks (usually non-pretty) bins 517 spanning all the numbers in the inlist. 518 519 Usage: lhistogram (inlist, numbins=10, 520 defaultreallimits=None,suppressoutput=0) 521 Returns: list of bin values, lowerreallimit, binsize, extrapoints 522 """ 523 if (defaultreallimits <> None): 524 if type(defaultreallimits) not in [ListType, TupleType] or len( 525 defaultreallimits) == 1: # only one limit given, assumed to be lower one & upper is calc'd 526 lowerreallimit = defaultreallimits 527 upperreallimit = 1.000001 * max(inlist) 528 else: # assume both limits given 529 lowerreallimit = defaultreallimits[0] 530 upperreallimit = defaultreallimits[1] 531 binsize = (upperreallimit - lowerreallimit) / float(numbins) 532 else: # no limits given for histogram, both must be calc'd 533 estbinwidth = (max(inlist) - 534 min(inlist)) / float(numbins) + 1e-6 #1=>cover all 535 binsize = ((max(inlist) - min(inlist) + estbinwidth)) / float(numbins) 536 lowerreallimit = min(inlist) - binsize / 2 #lower real limit,1st bin 537 bins = [0] * (numbins) 538 extrapoints = 0 539 for num in inlist: 540 try: 541 if (num - lowerreallimit) < 0: 542 extrapoints = extrapoints + 1 543 else: 544 bintoincrement = int((num - lowerreallimit) / float(binsize)) 545 bins[bintoincrement] = bins[bintoincrement] + 1 546 except: 547 extrapoints = extrapoints + 1 548 if (extrapoints > 0 and printextras == 1): 549 print '\nPoints outside given histogram range =', extrapoints 550 return (bins, lowerreallimit, binsize, extrapoints) 551 552 553 def lcumfreq(inlist, numbins=10, defaultreallimits=None): 554 """ 555 Returns a cumulative frequency histogram, using the histogram function. 556 557 Usage: lcumfreq(inlist,numbins=10,defaultreallimits=None) 558 Returns: list of cumfreq bin values, lowerreallimit, binsize, extrapoints 559 """ 560 h, l, b, e = histogram(inlist, numbins, defaultreallimits) 561 cumhist = cumsum(copy.deepcopy(h)) 562 return cumhist, l, b, e 563 564 565 def lrelfreq(inlist, numbins=10, defaultreallimits=None): 566 """ 567 Returns a relative frequency histogram, using the histogram function. 568 569 Usage: lrelfreq(inlist,numbins=10,defaultreallimits=None) 570 Returns: list of cumfreq bin values, lowerreallimit, binsize, extrapoints 571 """ 572 h, l, b, e = histogram(inlist, numbins, defaultreallimits) 573 for i in range(len(h)): 574 h[i] = h[i] / float(len(inlist)) 575 return h, l, b, e 576 577 #################################### 578 ##### VARIABILITY FUNCTIONS ###### 579 #################################### 580 581 582 def lobrientransform(*args): 583 """ 584 Computes a transform on input data (any number of columns). Used to 585 test for homogeneity of variance prior to running one-way stats. From 586 Maxwell and Delaney, p.112. 587 588 Usage: lobrientransform(*args) 589 Returns: transformed data for use in an ANOVA 590 """ 591 TINY = 1e-10 592 k = len(args) 593 n = [0.0] * k 594 v = [0.0] * k 595 m = [0.0] * k 596 nargs = [] 597 for i in range(k): 598 nargs.append(copy.deepcopy(args[i])) 599 n[i] = float(len(nargs[i])) 600 v[i] = var(nargs[i]) 601 m[i] = mean(nargs[i]) 602 for j in range(k): 603 for i in range(n[j]): 604 t1 = (n[j] - 1.5) * n[j] * (nargs[j][i] - m[j])**2 605 t2 = 0.5 * v[j] * (n[j] - 1.0) 606 t3 = (n[j] - 1.0) * (n[j] - 2.0) 607 nargs[j][i] = (t1 - t2) / float(t3) 608 check = 1 609 for j in range(k): 610 if v[j] - mean(nargs[j]) > TINY: 611 check = 0 612 if check <> 1: 613 raise ValueError, 'Problem in obrientransform.' 614 else: 615 return nargs 616 617 618 def lsamplevar(inlist): 619 """ 620 Returns the variance of the values in the passed list using 621 N for the denominator (i.e., DESCRIBES the sample variance only). 622 623 Usage: lsamplevar(inlist) 624 """ 625 n = len(inlist) 626 mn = mean(inlist) 627 deviations = [] 628 for item in inlist: 629 deviations.append(item - mn) 630 return ss(deviations) / float(n) 631 632 633 def lsamplestdev(inlist): 634 """ 635 Returns the standard deviation of the values in the passed list using 636 N for the denominator (i.e., DESCRIBES the sample stdev only). 637 638 Usage: lsamplestdev(inlist) 639 """ 640 return math.sqrt(samplevar(inlist)) 641 642 643 def lcov(x, y, keepdims=0): 644 """ 645 Returns the estimated covariance of the values in the passed 646 array (i.e., N-1). Dimension can equal None (ravel array first), an 647 integer (the dimension over which to operate), or a sequence (operate 648 over multiple dimensions). Set keepdims=1 to return an array with the 649 same number of dimensions as inarray. 650 651 Usage: lcov(x,y,keepdims=0) 652 """ 653 654 n = len(x) 655 xmn = mean(x) 656 ymn = mean(y) 657 xdeviations = [0] * len(x) 658 ydeviations = [0] * len(y) 659 for i in range(len(x)): 660 xdeviations[i] = x[i] - xmn 661 ydeviations[i] = y[i] - ymn 662 ss = 0.0 663 for i in range(len(xdeviations)): 664 ss = ss + xdeviations[i] * ydeviations[i] 665 return ss / float(n - 1) 666 667 668 def lvar(inlist): 669 """ 670 Returns the variance of the values in the passed list using N-1 671 for the denominator (i.e., for estimating population variance). 672 673 Usage: lvar(inlist) 674 """ 675 n = len(inlist) 676 mn = mean(inlist) 677 deviations = [0] * len(inlist) 678 for i in range(len(inlist)): 679 deviations[i] = inlist[i] - mn 680 return ss(deviations) / float(n - 1) 681 682 683 def lstdev(inlist): 684 """ 685 Returns the standard deviation of the values in the passed list 686 using N-1 in the denominator (i.e., to estimate population stdev). 687 688 Usage: lstdev(inlist) 689 """ 690 return math.sqrt(var(inlist)) 691 692 693 def lsterr(inlist): 694 """ 695 Returns the standard error of the values in the passed list using N-1 696 in the denominator (i.e., to estimate population standard error). 697 698 Usage: lsterr(inlist) 699 """ 700 return stdev(inlist) / float(math.sqrt(len(inlist))) 701 702 703 def lsem(inlist): 704 """ 705 Returns the estimated standard error of the mean (sx-bar) of the 706 values in the passed list. sem = stdev / sqrt(n) 707 708 Usage: lsem(inlist) 709 """ 710 sd = stdev(inlist) 711 n = len(inlist) 712 return sd / math.sqrt(n) 713 714 715 def lz(inlist, score): 716 """ 717 Returns the z-score for a given input score, given that score and the 718 list from which that score came. Not appropriate for population calculations. 719 720 Usage: lz(inlist, score) 721 """ 722 z = (score - mean(inlist)) / samplestdev(inlist) 723 return z 724 725 726 def lzs(inlist): 727 """ 728 Returns a list of z-scores, one for each score in the passed list. 729 730 Usage: lzs(inlist) 731 """ 732 zscores = [] 733 for item in inlist: 734 zscores.append(z(inlist, item)) 735 return zscores 736 737 #################################### 738 ####### TRIMMING FUNCTIONS ####### 739 #################################### 740 741 742 def ltrimboth(l, proportiontocut): 743 """ 744 Slices off the passed proportion of items from BOTH ends of the passed 745 list (i.e., with proportiontocut=0.1, slices 'leftmost' 10% AND 'rightmost' 746 10% of scores. Assumes list is sorted by magnitude. Slices off LESS if 747 proportion results in a non-integer slice index (i.e., conservatively 748 slices off proportiontocut). 749 750 Usage: ltrimboth (l,proportiontocut) 751 Returns: trimmed version of list l 752 """ 753 lowercut = int(proportiontocut * len(l)) 754 uppercut = len(l) - lowercut 755 return l[lowercut:uppercut] 756 757 758 def ltrim1(l, proportiontocut, tail='right'): 759 """ 760 Slices off the passed proportion of items from ONE end of the passed 761 list (i.e., if proportiontocut=0.1, slices off 'leftmost' or 'rightmost' 762 10% of scores). Slices off LESS if proportion results in a non-integer 763 slice index (i.e., conservatively slices off proportiontocut). 764 765 Usage: ltrim1 (l,proportiontocut,tail='right') or set tail='left' 766 Returns: trimmed version of list l 767 """ 768 if tail == 'right': 769 lowercut = 0 770 uppercut = len(l) - int(proportiontocut * len(l)) 771 elif tail == 'left': 772 lowercut = int(proportiontocut * len(l)) 773 uppercut = len(l) 774 return l[lowercut:uppercut] 775 776 #################################### 777 ##### CORRELATION FUNCTIONS ###### 778 #################################### 779 780 781 def lpaired(x, y): 782 """ 783 Interactively determines the type of data and then runs the 784 appropriated statistic for paired group data. 785 786 Usage: lpaired(x,y) 787 Returns: appropriate statistic name, value, and probability 788 """ 789 samples = '' 790 while samples not in ['i', 'r', 'I', 'R', 'c', 'C']: 791 print '\nIndependent or related samples, or correlation (i,r,c): ', 792 samples = raw_input() 793 794 if samples in ['i', 'I', 'r', 'R']: 795 print '\nComparing variances ...', 796 # USE O'BRIEN'S TEST FOR HOMOGENEITY OF VARIANCE, Maxwell & delaney, p.112 797 r = obrientransform(x, y) 798 f, p = F_oneway(pstat.colex(r, 0), pstat.colex(r, 1)) 799 if p < 0.05: 800 vartype = 'unequal, p=' + str(round(p, 4)) 801 else: 802 vartype = 'equal' 803 print vartype 804 if samples in ['i', 'I']: 805 if vartype[0] == 'e': 806 t, p = ttest_ind(x, y, 0) 807 print '\nIndependent samples t-test: ', round(t, 4), round(p, 4) 808 else: 809 if len(x) > 20 or len(y) > 20: 810 z, p = ranksums(x, y) 811 print '\nRank Sums test (NONparametric, n>20): ', round(z, 4), round( 812 p, 4) 813 else: 814 u, p = mannwhitneyu(x, y) 815 print '\nMann-Whitney U-test (NONparametric, ns<20): ', round( 816 u, 4), round(p, 4) 817 818 else: # RELATED SAMPLES 819 if vartype[0] == 'e': 820 t, p = ttest_rel(x, y, 0) 821 print '\nRelated samples t-test: ', round(t, 4), round(p, 4) 822 else: 823 t, p = ranksums(x, y) 824 print '\nWilcoxon T-test (NONparametric): ', round(t, 4), round(p, 4) 825 else: # CORRELATION ANALYSIS 826 corrtype = '' 827 while corrtype not in ['c', 'C', 'r', 'R', 'd', 'D']: 828 print '\nIs the data Continuous, Ranked, or Dichotomous (c,r,d): ', 829 corrtype = raw_input() 830 if corrtype in ['c', 'C']: 831 m, b, r, p, see = linregress(x, y) 832 print '\nLinear regression for continuous variables ...' 833 lol = [['Slope', 'Intercept', 'r', 'Prob', 'SEestimate'], 834 [round(m, 4), round(b, 4), round(r, 4), round(p, 4), round(see, 4)] 835 ] 836 pstat.printcc(lol) 837 elif corrtype in ['r', 'R']: 838 r, p = spearmanr(x, y) 839 print '\nCorrelation for ranked variables ...' 840 print "Spearman's r: ", round(r, 4), round(p, 4) 841 else: # DICHOTOMOUS 842 r, p = pointbiserialr(x, y) 843 print '\nAssuming x contains a dichotomous variable ...' 844 print 'Point Biserial r: ', round(r, 4), round(p, 4) 845 print '\n\n' 846 return None 847 848 849 def lpearsonr(x, y): 850 """ 851 Calculates a Pearson correlation coefficient and the associated 852 probability value. Taken from Heiman's Basic Statistics for the Behav. 853 Sci (2nd), p.195. 854 855 Usage: lpearsonr(x,y) where x and y are equal-length lists 856 Returns: Pearson's r value, two-tailed p-value 857 """ 858 TINY = 1.0e-30 859 if len(x) <> len(y): 860 raise ValueError, 'Input values not paired in pearsonr. Aborting.' 861 n = len(x) 862 x = map(float, x) 863 y = map(float, y) 864 xmean = mean(x) 865 ymean = mean(y) 866 r_num = n * (summult(x, y)) - sum(x) * sum(y) 867 r_den = math.sqrt((n * ss(x) - square_of_sums(x)) * 868 (n * ss(y) - square_of_sums(y))) 869 r = (r_num / r_den) # denominator already a float 870 df = n - 2 871 t = r * math.sqrt(df / ((1.0 - r + TINY) * (1.0 + r + TINY))) 872 prob = betai(0.5 * df, 0.5, df / float(df + t * t)) 873 return r, prob 874 875 876 def llincc(x, y): 877 """ 878 Calculates Lin's concordance correlation coefficient. 879 880 Usage: alincc(x,y) where x, y are equal-length arrays 881 Returns: Lin's CC 882 """ 883 covar = lcov(x, y) * (len(x) - 1) / float(len(x)) # correct denom to n 884 xvar = lvar(x) * (len(x) - 1) / float(len(x)) # correct denom to n 885 yvar = lvar(y) * (len(y) - 1) / float(len(y)) # correct denom to n 886 lincc = (2 * covar) / ((xvar + yvar) + ((amean(x) - amean(y))**2)) 887 return lincc 888 889 890 def lspearmanr(x, y): 891 """ 892 Calculates a Spearman rank-order correlation coefficient. Taken 893 from Heiman's Basic Statistics for the Behav. Sci (1st), p.192. 894 895 Usage: lspearmanr(x,y) where x and y are equal-length lists 896 Returns: Spearman's r, two-tailed p-value 897 """ 898 TINY = 1e-30 899 if len(x) <> len(y): 900 raise ValueError, 'Input values not paired in spearmanr. Aborting.' 901 n = len(x) 902 rankx = rankdata(x) 903 ranky = rankdata(y) 904 dsq = sumdiffsquared(rankx, ranky) 905 rs = 1 - 6 * dsq / float(n * (n**2 - 1)) 906 t = rs * math.sqrt((n - 2) / ((rs + 1.0) * (1.0 - rs))) 907 df = n - 2 908 probrs = betai(0.5 * df, 0.5, df / (df + t * t)) # t already a float 909 # probability values for rs are from part 2 of the spearman function in 910 # Numerical Recipies, p.510. They are close to tables, but not exact. (?) 911 return rs, probrs 912 913 914 def lpointbiserialr(x, y): 915 """ 916 Calculates a point-biserial correlation coefficient and the associated 917 probability value. Taken from Heiman's Basic Statistics for the Behav. 918 Sci (1st), p.194. 919 920 Usage: lpointbiserialr(x,y) where x,y are equal-length lists 921 Returns: Point-biserial r, two-tailed p-value 922 """ 923 TINY = 1e-30 924 if len(x) <> len(y): 925 raise ValueError, 'INPUT VALUES NOT PAIRED IN pointbiserialr. ABORTING.' 926 data = pstat.abut(x, y) 927 categories = pstat.unique(x) 928 if len(categories) <> 2: 929 raise ValueError, 'Exactly 2 categories required for pointbiserialr().' 930 else: # there are 2 categories, continue 931 codemap = pstat.abut(categories, range(2)) 932 recoded = pstat.recode(data, codemap, 0) 933 x = pstat.linexand(data, 0, categories[0]) 934 y = pstat.linexand(data, 0, categories[1]) 935 xmean = mean(pstat.colex(x, 1)) 936 ymean = mean(pstat.colex(y, 1)) 937 n = len(data) 938 adjust = math.sqrt((len(x) / float(n)) * (len(y) / float(n))) 939 rpb = (ymean - xmean) / samplestdev(pstat.colex(data, 1)) * adjust 940 df = n - 2 941 t = rpb * math.sqrt(df / ((1.0 - rpb + TINY) * (1.0 + rpb + TINY))) 942 prob = betai(0.5 * df, 0.5, df / (df + t * t)) # t already a float 943 return rpb, prob 944 945 946 def lkendalltau(x, y): 947 """ 948 Calculates Kendall's tau ... correlation of ordinal data. Adapted 949 from function kendl1 in Numerical Recipies. Needs good test-routine.@@@ 950 951 Usage: lkendalltau(x,y) 952 Returns: Kendall's tau, two-tailed p-value 953 """ 954 n1 = 0 955 n2 = 0 956 iss = 0 957 for j in range(len(x) - 1): 958 for k in range(j, len(y)): 959 a1 = x[j] - x[k] 960 a2 = y[j] - y[k] 961 aa = a1 * a2 962 if (aa): # neither list has a tie 963 n1 = n1 + 1 964 n2 = n2 + 1 965 if aa > 0: 966 iss = iss + 1 967 else: 968 iss = iss - 1 969 else: 970 if (a1): 971 n1 = n1 + 1 972 else: 973 n2 = n2 + 1 974 tau = iss / math.sqrt(n1 * n2) 975 svar = (4.0 * len(x) + 10.0) / (9.0 * len(x) * (len(x) - 1)) 976 z = tau / math.sqrt(svar) 977 prob = erfcc(abs(z) / 1.4142136) 978 return tau, prob 979 980 981 def llinregress(x, y): 982 """ 983 Calculates a regression line on x,y pairs. 984 985 Usage: llinregress(x,y) x,y are equal-length lists of x-y coordinates 986 Returns: slope, intercept, r, two-tailed prob, sterr-of-estimate 987 """ 988 TINY = 1.0e-20 989 if len(x) <> len(y): 990 raise ValueError, 'Input values not paired in linregress. Aborting.' 991 n = len(x) 992 x = map(float, x) 993 y = map(float, y) 994 xmean = mean(x) 995 ymean = mean(y) 996 r_num = float(n * (summult(x, y)) - sum(x) * sum(y)) 997 r_den = math.sqrt((n * ss(x) - square_of_sums(x)) * 998 (n * ss(y) - square_of_sums(y))) 999 r = r_num / r_den 1000 z = 0.5 * math.log((1.0 + r + TINY) / (1.0 - r + TINY)) 1001 df = n - 2 1002 t = r * math.sqrt(df / ((1.0 - r + TINY) * (1.0 + r + TINY))) 1003 prob = betai(0.5 * df, 0.5, df / (df + t * t)) 1004 slope = r_num / float(n * ss(x) - square_of_sums(x)) 1005 intercept = ymean - slope * xmean 1006 sterrest = math.sqrt(1 - r * r) * samplestdev(y) 1007 return slope, intercept, r, prob, sterrest 1008 1009 #################################### 1010 ##### INFERENTIAL STATISTICS ##### 1011 #################################### 1012 1013 1014 def lttest_1samp(a, popmean, printit=0, name='Sample', writemode='a'): 1015 """ 1016 Calculates the t-obtained for the independent samples T-test on ONE group 1017 of scores a, given a population mean. If printit=1, results are printed 1018 to the screen. If printit='filename', the results are output to 'filename' 1019 using the given writemode (default=append). Returns t-value, and prob. 1020 1021 Usage: lttest_1samp(a,popmean,Name='Sample',printit=0,writemode='a') 1022 Returns: t-value, two-tailed prob 1023 """ 1024 x = mean(a) 1025 v = var(a) 1026 n = len(a) 1027 df = n - 1 1028 svar = ((n - 1) * v) / float(df) 1029 t = (x - popmean) / math.sqrt(svar * (1.0 / n)) 1030 prob = betai(0.5 * df, 0.5, float(df) / (df + t * t)) 1031 1032 if printit <> 0: 1033 statname = 'Single-sample T-test.' 1034 outputpairedstats(printit, writemode, 'Population', '--', popmean, 0, 0, 0, 1035 name, n, x, v, min(a), max(a), statname, t, prob) 1036 return t, prob 1037 1038 1039 def lttest_ind(a, b, printit=0, name1='Samp1', name2='Samp2', writemode='a'): 1040 """ 1041 Calculates the t-obtained T-test on TWO INDEPENDENT samples of 1042 scores a, and b. From Numerical Recipies, p.483. If printit=1, results 1043 are printed to the screen. If printit='filename', the results are output 1044 to 'filename' using the given writemode (default=append). Returns t-value, 1045 and prob. 1046 1047 Usage: lttest_ind(a,b,printit=0,name1='Samp1',name2='Samp2',writemode='a') 1048 Returns: t-value, two-tailed prob 1049 """ 1050 x1 = mean(a) 1051 x2 = mean(b) 1052 v1 = stdev(a)**2 1053 v2 = stdev(b)**2 1054 n1 = len(a) 1055 n2 = len(b) 1056 df = n1 + n2 - 2 1057 svar = ((n1 - 1) * v1 + (n2 - 1) * v2) / float(df) 1058 if not svar: 1059 svar = 1.0e-26 1060 t = (x1 - x2) / math.sqrt(svar * (1.0 / n1 + 1.0 / n2)) 1061 prob = betai(0.5 * df, 0.5, df / (df + t * t)) 1062 1063 if printit <> 0: 1064 statname = 'Independent samples T-test.' 1065 outputpairedstats(printit, writemode, name1, n1, x1, v1, min(a), max(a), 1066 name2, n2, x2, v2, min(b), max(b), statname, t, prob) 1067 return t, prob 1068 1069 1070 def lttest_rel(a, 1071 b, 1072 printit=0, 1073 name1='Sample1', 1074 name2='Sample2', 1075 writemode='a'): 1076 """ 1077 Calculates the t-obtained T-test on TWO RELATED samples of scores, 1078 a and b. From Numerical Recipies, p.483. If printit=1, results are 1079 printed to the screen. If printit='filename', the results are output to 1080 'filename' using the given writemode (default=append). Returns t-value, 1081 and prob. 1082 1083 Usage: lttest_rel(a,b,printit=0,name1='Sample1',name2='Sample2',writemode='a') 1084 Returns: t-value, two-tailed prob 1085 """ 1086 if len(a) <> len(b): 1087 raise ValueError, 'Unequal length lists in ttest_rel.' 1088 x1 = mean(a) 1089 x2 = mean(b) 1090 v1 = var(a) 1091 v2 = var(b) 1092 n = len(a) 1093 cov = 0 1094 for i in range(len(a)): 1095 cov = cov + (a[i] - x1) * (b[i] - x2) 1096 df = n - 1 1097 cov = cov / float(df) 1098 sd = math.sqrt((v1 + v2 - 2.0 * cov) / float(n)) 1099 t = (x1 - x2) / sd 1100 prob = betai(0.5 * df, 0.5, df / (df + t * t)) 1101 1102 if printit <> 0: 1103 statname = 'Related samples T-test.' 1104 outputpairedstats(printit, writemode, name1, n, x1, v1, min(a), max(a), 1105 name2, n, x2, v2, min(b), max(b), statname, t, prob) 1106 return t, prob 1107 1108 1109 def lchisquare(f_obs, f_exp=None): 1110 """ 1111 Calculates a one-way chi square for list of observed frequencies and returns 1112 the result. If no expected frequencies are given, the total N is assumed to 1113 be equally distributed across all groups. 1114 1115 Usage: lchisquare(f_obs, f_exp=None) f_obs = list of observed cell freq. 1116 Returns: chisquare-statistic, associated p-value 1117 """ 1118 k = len(f_obs) # number of groups 1119 if f_exp == None: 1120 f_exp = [sum(f_obs) / float(k)] * len(f_obs) # create k bins with = freq. 1121 chisq = 0 1122 for i in range(len(f_obs)): 1123 chisq = chisq + (f_obs[i] - f_exp[i])**2 / float(f_exp[i]) 1124 return chisq, chisqprob(chisq, k - 1) 1125 1126 1127 def lks_2samp(data1, data2): 1128 """ 1129 Computes the Kolmogorov-Smirnof statistic on 2 samples. From 1130 Numerical Recipies in C, page 493. 1131 1132 Usage: lks_2samp(data1,data2) data1&2 are lists of values for 2 conditions 1133 Returns: KS D-value, associated p-value 1134 """ 1135 j1 = 0 1136 j2 = 0 1137 fn1 = 0.0 1138 fn2 = 0.0 1139 n1 = len(data1) 1140 n2 = len(data2) 1141 en1 = n1 1142 en2 = n2 1143 d = 0.0 1144 data1.sort() 1145 data2.sort() 1146 while j1 < n1 and j2 < n2: 1147 d1 = data1[j1] 1148 d2 = data2[j2] 1149 if d1 <= d2: 1150 fn1 = (j1) / float(en1) 1151 j1 = j1 + 1 1152 if d2 <= d1: 1153 fn2 = (j2) / float(en2) 1154 j2 = j2 + 1 1155 dt = (fn2 - fn1) 1156 if math.fabs(dt) > math.fabs(d): 1157 d = dt 1158 try: 1159 en = math.sqrt(en1 * en2 / float(en1 + en2)) 1160 prob = ksprob((en + 0.12 + 0.11 / en) * abs(d)) 1161 except: 1162 prob = 1.0 1163 return d, prob 1164 1165 1166 def lmannwhitneyu(x, y): 1167 """ 1168 Calculates a Mann-Whitney U statistic on the provided scores and 1169 returns the result. Use only when the n in each condition is < 20 and 1170 you have 2 independent samples of ranks. NOTE: Mann-Whitney U is 1171 significant if the u-obtained is LESS THAN or equal to the critical 1172 value of U found in the tables. Equivalent to Kruskal-Wallis H with 1173 just 2 groups. 1174 1175 Usage: lmannwhitneyu(data) 1176 Returns: u-statistic, one-tailed p-value (i.e., p(z(U))) 1177 """ 1178 n1 = len(x) 1179 n2 = len(y) 1180 ranked = rankdata(x + y) 1181 rankx = ranked[0:n1] # get the x-ranks 1182 ranky = ranked[n1:] # the rest are y-ranks 1183 u1 = n1 * n2 + (n1 * (n1 + 1)) / 2.0 - sum(rankx) # calc U for x 1184 u2 = n1 * n2 - u1 # remainder is U for y 1185 bigu = max(u1, u2) 1186 smallu = min(u1, u2) 1187 proportion = bigu / float(n1 * n2) 1188 T = math.sqrt(tiecorrect(ranked)) # correction factor for tied scores 1189 if T == 0: 1190 raise ValueError, 'All numbers are identical in lmannwhitneyu' 1191 sd = math.sqrt(T * n1 * n2 * (n1 + n2 + 1) / 12.0) 1192 z = abs((bigu - n1 * n2 / 2.0) / sd) # normal approximation for prob calc 1193 return smallu, 1.0 - zprob(z) #, proportion 1194 1195 1196 def ltiecorrect(rankvals): 1197 """ 1198 Corrects for ties in Mann Whitney U and Kruskal Wallis H tests. See 1199 Siegel, S. (1956) Nonparametric Statistics for the Behavioral Sciences. 1200 New York: McGraw-Hill. Code adapted from |Stat rankind.c code. 1201 1202 Usage: ltiecorrect(rankvals) 1203 Returns: T correction factor for U or H 1204 """ 1205 sorted, posn = shellsort(rankvals) 1206 n = len(sorted) 1207 T = 0.0 1208 i = 0 1209 while (i < n - 1): 1210 if sorted[i] == sorted[i + 1]: 1211 nties = 1 1212 while (i < n - 1) and (sorted[i] == sorted[i + 1]): 1213 nties = nties + 1 1214 i = i + 1 1215 T = T + nties**3 - nties 1216 i = i + 1 1217 T = T / float(n**3 - n) 1218 return 1.0 - T 1219 1220 1221 def lranksums(x, y): 1222 """ 1223 Calculates the rank sums statistic on the provided scores and 1224 returns the result. Use only when the n in each condition is > 20 and you 1225 have 2 independent samples of ranks. 1226 1227 Usage: lranksums(x,y) 1228 Returns: a z-statistic, two-tailed p-value 1229 """ 1230 n1 = len(x) 1231 n2 = len(y) 1232 alldata = x + y 1233 ranked = rankdata(alldata) 1234 x = ranked[:n1] 1235 y = ranked[n1:] 1236 s = sum(x) 1237 expected = n1 * (n1 + n2 + 1) / 2.0 1238 z = (s - expected) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0) 1239 prob = 2 * (1.0 - zprob(abs(z))) 1240 return z, prob 1241 1242 1243 def lwilcoxont(x, y): 1244 """ 1245 Calculates the Wilcoxon T-test for related samples and returns the 1246 result. A non-parametric T-test. 1247 1248 Usage: lwilcoxont(x,y) 1249 Returns: a t-statistic, two-tail probability estimate 1250 """ 1251 if len(x) <> len(y): 1252 raise ValueError, 'Unequal N in wilcoxont. Aborting.' 1253 d = [] 1254 for i in range(len(x)): 1255 diff = x[i] - y[i] 1256 if diff <> 0: 1257 d.append(diff) 1258 count = len(d) 1259 absd = map(abs, d) 1260 absranked = rankdata(absd) 1261 r_plus = 0.0 1262 r_minus = 0.0 1263 for i in range(len(absd)): 1264 if d[i] < 0: 1265 r_minus = r_minus + absranked[i] 1266 else: 1267 r_plus = r_plus + absranked[i] 1268 wt = min(r_plus, r_minus) 1269 mn = count * (count + 1) * 0.25 1270 se = math.sqrt(count * (count + 1) * (2.0 * count + 1.0) / 24.0) 1271 z = math.fabs(wt - mn) / se 1272 prob = 2 * (1.0 - zprob(abs(z))) 1273 return wt, prob 1274 1275 1276 def lkruskalwallish(*args): 1277 """ 1278 The Kruskal-Wallis H-test is a non-parametric ANOVA for 3 or more 1279 groups, requiring at least 5 subjects in each group. This function 1280 calculates the Kruskal-Wallis H-test for 3 or more independent samples 1281 and returns the result. 1282 1283 Usage: lkruskalwallish(*args) 1284 Returns: H-statistic (corrected for ties), associated p-value 1285 """ 1286 args = list(args) 1287 n = [0] * len(args) 1288 all = [] 1289 n = map(len, args) 1290 for i in range(len(args)): 1291 all = all + args[i] 1292 ranked = rankdata(all) 1293 T = tiecorrect(ranked) 1294 for i in range(len(args)): 1295 args[i] = ranked[0:n[i]] 1296 del ranked[0:n[i]] 1297 rsums = [] 1298 for i in range(len(args)): 1299 rsums.append(sum(args[i])**2) 1300 rsums[i] = rsums[i] / float(n[i]) 1301 ssbn = sum(rsums) 1302 totaln = sum(n) 1303 h = 12.0 / (totaln * (totaln + 1)) * ssbn - 3 * (totaln + 1) 1304 df = len(args) - 1 1305 if T == 0: 1306 raise ValueError, 'All numbers are identical in lkruskalwallish' 1307 h = h / float(T) 1308 return h, chisqprob(h, df) 1309 1310 1311 def lfriedmanchisquare(*args): 1312 """ 1313 Friedman Chi-Square is a non-parametric, one-way within-subjects 1314 ANOVA. This function calculates the Friedman Chi-square test for repeated 1315 measures and returns the result, along with the associated probability 1316 value. It assumes 3 or more repeated measures. Only 3 levels requires a 1317 minimum of 10 subjects in the study. Four levels requires 5 subjects per 1318 level(??). 1319 1320 Usage: lfriedmanchisquare(*args) 1321 Returns: chi-square statistic, associated p-value 1322 """ 1323 k = len(args) 1324 if k < 3: 1325 raise ValueError, 'Less than 3 levels. Friedman test not appropriate.' 1326 n = len(args[0]) 1327 data = apply(pstat.abut, tuple(args)) 1328 for i in range(len(data)): 1329 data[i] = rankdata(data[i]) 1330 ssbn = 0 1331 for i in range(k): 1332 ssbn = ssbn + sum(args[i])**2 1333 chisq = 12.0 / (k * n * (k + 1)) * ssbn - 3 * n * (k + 1) 1334 return chisq, chisqprob(chisq, k - 1) 1335 1336 #################################### 1337 #### PROBABILITY CALCULATIONS #### 1338 #################################### 1339 1340 1341 def lchisqprob(chisq, df): 1342 """ 1343 Returns the (1-tailed) probability value associated with the provided 1344 chi-square value and df. Adapted from chisq.c in Gary Perlman's |Stat. 1345 1346 Usage: lchisqprob(chisq,df) 1347 """ 1348 BIG = 20.0 1349 1350 def ex(x): 1351 BIG = 20.0 1352 if x < -BIG: 1353 return 0.0 1354 else: 1355 return math.exp(x) 1356 1357 if chisq <= 0 or df < 1: 1358 return 1.0 1359 a = 0.5 * chisq 1360 if df % 2 == 0: 1361 even = 1 1362 else: 1363 even = 0 1364 if df > 1: 1365 y = ex(-a) 1366 if even: 1367 s = y 1368 else: 1369 s = 2.0 * zprob(-math.sqrt(chisq)) 1370 if (df > 2): 1371 chisq = 0.5 * (df - 1.0) 1372 if even: 1373 z = 1.0 1374 else: 1375 z = 0.5 1376 if a > BIG: 1377 if even: 1378 e = 0.0 1379 else: 1380 e = math.log(math.sqrt(math.pi)) 1381 c = math.log(a) 1382 while (z <= chisq): 1383 e = math.log(z) + e 1384 s = s + ex(c * z - a - e) 1385 z = z + 1.0 1386 return s 1387 else: 1388 if even: 1389 e = 1.0 1390 else: 1391 e = 1.0 / math.sqrt(math.pi) / math.sqrt(a) 1392 c = 0.0 1393 while (z <= chisq): 1394 e = e * (a / float(z)) 1395 c = c + e 1396 z = z + 1.0 1397 return (c * y + s) 1398 else: 1399 return s 1400 1401 1402 def lerfcc(x): 1403 """ 1404 Returns the complementary error function erfc(x) with fractional 1405 error everywhere less than 1.2e-7. Adapted from Numerical Recipies. 1406 1407 Usage: lerfcc(x) 1408 """ 1409 z = abs(x) 1410 t = 1.0 / (1.0 + 0.5 * z) 1411 ans = t * math.exp(-z * z - 1.26551223 + t * (1.00002368 + t * ( 1412 0.37409196 + t * (0.09678418 + t * (-0.18628806 + t * (0.27886807 + t * ( 1413 -1.13520398 + t * (1.48851587 + t * (-0.82215223 + t * 0.17087277))))) 1414 )))) 1415 if x >= 0: 1416 return ans 1417 else: 1418 return 2.0 - ans 1419 1420 1421 def lzprob(z): 1422 """ 1423 Returns the area under the normal curve 'to the left of' the given z value. 1424 Thus, 1425 for z<0, zprob(z) = 1-tail probability 1426 for z>0, 1.0-zprob(z) = 1-tail probability 1427 for any z, 2.0*(1.0-zprob(abs(z))) = 2-tail probability 1428 Adapted from z.c in Gary Perlman's |Stat. 1429 1430 Usage: lzprob(z) 1431 """ 1432 Z_MAX = 6.0 # maximum meaningful z-value 1433 if z == 0.0: 1434 x = 0.0 1435 else: 1436 y = 0.5 * math.fabs(z) 1437 if y >= (Z_MAX * 0.5): 1438 x = 1.0 1439 elif (y < 1.0): 1440 w = y * y 1441 x = (( 1442 ((((((0.000124818987 * w - 0.001075204047) * w + 0.005198775019) * w - 1443 0.019198292004) * w + 0.059054035642) * w - 0.151968751364) * w + 1444 0.319152932694) * w - 0.531923007300) * w + 0.797884560593) * y * 2.0 1445 else: 1446 y = y - 2.0 1447 x = ((((((( 1448 ((((((-0.000045255659 * y + 0.000152529290) * y - 0.000019538132) * y 1449 - 0.000676904986) * y + 0.001390604284) * y - 0.000794620820) * y 1450 - 0.002034254874) * y + 0.006549791214) * y - 0.010557625006) * y + 1451 0.011630447319) * y - 0.009279453341) * y + 0.005353579108) * y - 1452 0.002141268741) * y + 0.000535310849) * y + 0.999936657524 1453 if z > 0.0: 1454 prob = ((x + 1.0) * 0.5) 1455 else: 1456 prob = ((1.0 - x) * 0.5) 1457 return prob 1458 1459 1460 def lksprob(alam): 1461 """ 1462 Computes a Kolmolgorov-Smirnov t-test significance level. Adapted from 1463 Numerical Recipies. 1464 1465 Usage: lksprob(alam) 1466 """ 1467 fac = 2.0 1468 sum = 0.0 1469 termbf = 0.0 1470 a2 = -2.0 * alam * alam 1471 for j in range(1, 201): 1472 term = fac * math.exp(a2 * j * j) 1473 sum = sum + term 1474 if math.fabs(term) <= (0.001 * termbf) or math.fabs(term) < (1.0e-8 * sum): 1475 return sum 1476 fac = -fac 1477 termbf = math.fabs(term) 1478 return 1.0 # Get here only if fails to converge; was 0.0!! 1479 1480 1481 def lfprob(dfnum, dfden, F): 1482 """ 1483 Returns the (1-tailed) significance level (p-value) of an F 1484 statistic given the degrees of freedom for the numerator (dfR-dfF) and 1485 the degrees of freedom for the denominator (dfF). 1486 1487 Usage: lfprob(dfnum, dfden, F) where usually dfnum=dfbn, dfden=dfwn 1488 """ 1489 p = betai(0.5 * dfden, 0.5 * dfnum, dfden / float(dfden + dfnum * F)) 1490 return p 1491 1492 1493 def lbetacf(a, b, x): 1494 """ 1495 This function evaluates the continued fraction form of the incomplete 1496 Beta function, betai. (Adapted from: Numerical Recipies in C.) 1497 1498 Usage: lbetacf(a,b,x) 1499 """ 1500 ITMAX = 200 1501 EPS = 3.0e-7 1502 1503 bm = az = am = 1.0 1504 qab = a + b 1505 qap = a + 1.0 1506 qam = a - 1.0 1507 bz = 1.0 - qab * x / qap 1508 for i in range(ITMAX + 1): 1509 em = float(i + 1) 1510 tem = em + em 1511 d = em * (b - em) * x / ((qam + tem) * (a + tem)) 1512 ap = az + d * am 1513 bp = bz + d * bm 1514 d = -(a + em) * (qab + em) * x / ((qap + tem) * (a + tem)) 1515 app = ap + d * az 1516 bpp = bp + d * bz 1517 aold = az 1518 am = ap / bpp 1519 bm = bp / bpp 1520 az = app / bpp 1521 bz = 1.0 1522 if (abs(az - aold) < (EPS * abs(az))): 1523 return az 1524 print 'a or b too big, or ITMAX too small in Betacf.' 1525 1526 1527 def lgammln(xx): 1528 """ 1529 Returns the gamma function of xx. 1530 Gamma(z) = Integral(0,infinity) of t^(z-1)exp(-t) dt. 1531 (Adapted from: Numerical Recipies in C.) 1532 1533 Usage: lgammln(xx) 1534 """ 1535 1536 coeff = [76.18009173, -86.50532033, 24.01409822, -1.231739516, 0.120858003e-2, 1537 -0.536382e-5] 1538 x = xx - 1.0 1539 tmp = x + 5.5 1540 tmp = tmp - (x + 0.5) * math.log(tmp) 1541 ser = 1.0 1542 for j in range(len(coeff)): 1543 x = x + 1 1544 ser = ser + coeff[j] / x 1545 return -tmp + math.log(2.50662827465 * ser) 1546 1547 1548 def lbetai(a, b, x): 1549 """ 1550 Returns the incomplete beta function: 1551 1552 I-sub-x(a,b) = 1/B(a,b)*(Integral(0,x) of t^(a-1)(1-t)^(b-1) dt) 1553 1554 where a,b>0 and B(a,b) = G(a)*G(b)/(G(a+b)) where G(a) is the gamma 1555 function of a. The continued fraction formulation is implemented here, 1556 using the betacf function. (Adapted from: Numerical Recipies in C.) 1557 1558 Usage: lbetai(a,b,x) 1559 """ 1560 if (x < 0.0 or x > 1.0): 1561 raise ValueError, 'Bad x in lbetai' 1562 if (x == 0.0 or x == 1.0): 1563 bt = 0.0 1564 else: 1565 bt = math.exp(gammln(a + b) - gammln(a) - gammln(b) + a * math.log(x) + b * 1566 math.log(1.0 - x)) 1567 if (x < (a + 1.0) / (a + b + 2.0)): 1568 return bt * betacf(a, b, x) / float(a) 1569 else: 1570 return 1.0 - bt * betacf(b, a, 1.0 - x) / float(b) 1571 1572 #################################### 1573 ####### ANOVA CALCULATIONS ####### 1574 #################################### 1575 1576 1577 def lF_oneway(*lists): 1578 """ 1579 Performs a 1-way ANOVA, returning an F-value and probability given 1580 any number of groups. From Heiman, pp.394-7. 1581 1582 Usage: F_oneway(*lists) where *lists is any number of lists, one per 1583 treatment group 1584 Returns: F value, one-tailed p-value 1585 """ 1586 a = len(lists) # ANOVA on 'a' groups, each in it's own list 1587 means = [0] * a 1588 vars = [0] * a 1589 ns = [0] * a 1590 alldata = [] 1591 tmp = map(N.array, lists) 1592 means = map(amean, tmp) 1593 vars = map(avar, tmp) 1594 ns = map(len, lists) 1595 for i in range(len(lists)): 1596 alldata = alldata + lists[i] 1597 alldata = N.array(alldata) 1598 bign = len(alldata) 1599 sstot = ass(alldata) - (asquare_of_sums(alldata) / float(bign)) 1600 ssbn = 0 1601 for list in lists: 1602 ssbn = ssbn + asquare_of_sums(N.array(list)) / float(len(list)) 1603 ssbn = ssbn - (asquare_of_sums(alldata) / float(bign)) 1604 sswn = sstot - ssbn 1605 dfbn = a - 1 1606 dfwn = bign - a 1607 msb = ssbn / float(dfbn) 1608 msw = sswn / float(dfwn) 1609 f = msb / msw 1610 prob = fprob(dfbn, dfwn, f) 1611 return f, prob 1612 1613 1614 def lF_value(ER, EF, dfnum, dfden): 1615 """ 1616 Returns an F-statistic given the following: 1617 ER = error associated with the null hypothesis (the Restricted model) 1618 EF = error associated with the alternate hypothesis (the Full model) 1619 dfR-dfF = degrees of freedom of the numerator 1620 dfF = degrees of freedom associated with the denominator/Full model 1621 1622 Usage: lF_value(ER,EF,dfnum,dfden) 1623 """ 1624 return ((ER - EF) / float(dfnum) / (EF / float(dfden))) 1625 1626 #################################### 1627 ######## SUPPORT FUNCTIONS ####### 1628 #################################### 1629 1630 1631 def writecc(listoflists, file, writetype='w', extra=2): 1632 """ 1633 Writes a list of lists to a file in columns, customized by the max 1634 size of items within the columns (max size of items in col, +2 characters) 1635 to specified file. File-overwrite is the default. 1636 1637 Usage: writecc (listoflists,file,writetype='w',extra=2) 1638 Returns: None 1639 """ 1640 if type(listoflists[0]) not in [ListType, TupleType]: 1641 listoflists = [listoflists] 1642 outfile = open(file, writetype) 1643 rowstokill = [] 1644 list2print = copy.deepcopy(listoflists) 1645 for i in range(len(listoflists)): 1646 if listoflists[i] == [ 1647 '\n' 1648 ] or listoflists[i] == '\n' or listoflists[i] == 'dashes': 1649 rowstokill = rowstokill + [i] 1650 rowstokill.reverse() 1651 for row in rowstokill: 1652 del list2print[row] 1653 maxsize = [0] * len(list2print[0]) 1654 for col in range(len(list2print[0])): 1655 items = pstat.colex(list2print, col) 1656 items = map(pstat.makestr, items) 1657 maxsize[col] = max(map(len, items)) + extra 1658 for row in listoflists: 1659 if row == ['\n'] or row == '\n': 1660 outfile.write('\n') 1661 elif row == ['dashes'] or row == 'dashes': 1662 dashes = [0] * len(maxsize) 1663 for j in range(len(maxsize)): 1664 dashes[j] = '-' * (maxsize[j] - 2) 1665 outfile.write(pstat.lineincustcols(dashes, maxsize)) 1666 else: 1667 outfile.write(pstat.lineincustcols(row, maxsize)) 1668 outfile.write('\n') 1669 outfile.close() 1670 return None 1671 1672 1673 def lincr(l, cap): # to increment a list up to a max-list of 'cap' 1674 """ 1675 Simulate a counting system from an n-dimensional list. 1676 1677 Usage: lincr(l,cap) l=list to increment, cap=max values for each list pos'n 1678 Returns: next set of values for list l, OR -1 (if overflow) 1679 """ 1680 l[0] = l[0] + 1 # e.g., [0,0,0] --> [2,4,3] (=cap) 1681 for i in range(len(l)): 1682 if l[i] > cap[i] and i < len(l) - 1: # if carryover AND not done 1683 l[i] = 0 1684 l[i + 1] = l[i + 1] + 1 1685 elif l[i] > cap[i] and i == len( 1686 l) - 1: # overflow past last column, must be finished 1687 l = -1 1688 return l 1689 1690 1691 def lsum(inlist): 1692 """ 1693 Returns the sum of the items in the passed list. 1694 1695 Usage: lsum(inlist) 1696 """ 1697 s = 0 1698 for item in inlist: 1699 s = s + item 1700 return s 1701 1702 1703 def lcumsum(inlist): 1704 """ 1705 Returns a list consisting of the cumulative sum of the items in the 1706 passed list. 1707 1708 Usage: lcumsum(inlist) 1709 """ 1710 newlist = copy.deepcopy(inlist) 1711 for i in range(1, len(newlist)): 1712 newlist[i] = newlist[i] + newlist[i - 1] 1713 return newlist 1714 1715 1716 def lss(inlist): 1717 """ 1718 Squares each value in the passed list, adds up these squares and 1719 returns the result. 1720 1721 Usage: lss(inlist) 1722 """ 1723 ss = 0 1724 for item in inlist: 1725 ss = ss + item * item 1726 return ss 1727 1728 1729 def lsummult(list1, list2): 1730 """ 1731 Multiplies elements in list1 and list2, element by element, and 1732 returns the sum of all resulting multiplications. Must provide equal 1733 length lists. 1734 1735 Usage: lsummult(list1,list2) 1736 """ 1737 if len(list1) <> len(list2): 1738 raise ValueError, 'Lists not equal length in summult.' 1739 s = 0 1740 for item1, item2 in pstat.abut(list1, list2): 1741 s = s + item1 * item2 1742 return s 1743 1744 1745 def lsumdiffsquared(x, y): 1746 """ 1747 Takes pairwise differences of the values in lists x and y, squares 1748 these differences, and returns the sum of these squares. 1749 1750 Usage: lsumdiffsquared(x,y) 1751 Returns: sum[(x[i]-y[i])**2] 1752 """ 1753 sds = 0 1754 for i in range(len(x)): 1755 sds = sds + (x[i] - y[i])**2 1756 return sds 1757 1758 1759 def lsquare_of_sums(inlist): 1760 """ 1761 Adds the values in the passed list, squares the sum, and returns 1762 the result. 1763 1764 Usage: lsquare_of_sums(inlist) 1765 Returns: sum(inlist[i])**2 1766 """ 1767 s = sum(inlist) 1768 return float(s) * s 1769 1770 1771 def lshellsort(inlist): 1772 """ 1773 Shellsort algorithm. Sorts a 1D-list. 1774 1775 Usage: lshellsort(inlist) 1776 Returns: sorted-inlist, sorting-index-vector (for original list) 1777 """ 1778 n = len(inlist) 1779 svec = copy.deepcopy(inlist) 1780 ivec = range(n) 1781 gap = n / 2 # integer division needed 1782 while gap > 0: 1783 for i in range(gap, n): 1784 for j in range(i - gap, -1, -gap): 1785 while j >= 0 and svec[j] > svec[j + gap]: 1786 temp = svec[j] 1787 svec[j] = svec[j + gap] 1788 svec[j + gap] = temp 1789 itemp = ivec[j] 1790 ivec[j] = ivec[j + gap] 1791 ivec[j + gap] = itemp 1792 gap = gap / 2 # integer division needed 1793 # svec is now sorted inlist, and ivec has the order svec[i] = vec[ivec[i]] 1794 return svec, ivec 1795 1796 1797 def lrankdata(inlist): 1798 """ 1799 Ranks the data in inlist, dealing with ties appropritely. Assumes 1800 a 1D inlist. Adapted from Gary Perlman's |Stat ranksort. 1801 1802 Usage: lrankdata(inlist) 1803 Returns: a list of length equal to inlist, containing rank scores 1804 """ 1805 n = len(inlist) 1806 svec, ivec = shellsort(inlist) 1807 sumranks = 0 1808 dupcount = 0 1809 newlist = [0] * n 1810 for i in range(n): 1811 sumranks = sumranks + i 1812 dupcount = dupcount + 1 1813 if i == n - 1 or svec[i] <> svec[i + 1]: 1814 averank = sumranks / float(dupcount) + 1 1815 for j in range(i - dupcount + 1, i + 1): 1816 newlist[ivec[j]] = averank 1817 sumranks = 0 1818 dupcount = 0 1819 return newlist 1820 1821 1822 def outputpairedstats(fname, writemode, name1, n1, m1, se1, min1, max1, name2, 1823 n2, m2, se2, min2, max2, statname, stat, prob): 1824 """ 1825 Prints or write to a file stats for two groups, using the name, n, 1826 mean, sterr, min and max for each group, as well as the statistic name, 1827 its value, and the associated p-value. 1828 1829 Usage: outputpairedstats(fname,writemode, 1830 name1,n1,mean1,stderr1,min1,max1, 1831 name2,n2,mean2,stderr2,min2,max2, 1832 statname,stat,prob) 1833 Returns: None 1834 """ 1835 suffix = '' # for *s after the p-value 1836 try: 1837 x = prob.shape 1838 prob = prob[0] 1839 except: 1840 pass 1841 if prob < 0.001: 1842 suffix = ' ***' 1843 elif prob < 0.01: 1844 suffix = ' **' 1845 elif prob < 0.05: 1846 suffix = ' *' 1847 title = [['Name', 'N', 'Mean', 'SD', 'Min', 'Max']] 1848 lofl = title + [[name1, n1, round(m1, 3), round( 1849 math.sqrt(se1), 3), min1, max1], [name2, n2, round(m2, 3), round( 1850 math.sqrt(se2), 3), min2, max2]] 1851 if type(fname) <> StringType or len(fname) == 0: 1852 print 1853 print statname 1854 print 1855 pstat.printcc(lofl) 1856 print 1857 try: 1858 if stat.shape == (): 1859 stat = stat[0] 1860 if prob.shape == (): 1861 prob = prob[0] 1862 except: 1863 pass 1864 print 'Test statistic = ', round(stat, 3), ' p = ', round(prob, 3), suffix 1865 print 1866 else: 1867 file = open(fname, writemode) 1868 file.write('\n' + statname + '\n\n') 1869 file.close() 1870 writecc(lofl, fname, 'a') 1871 file = open(fname, 'a') 1872 try: 1873 if stat.shape == (): 1874 stat = stat[0] 1875 if prob.shape == (): 1876 prob = prob[0] 1877 except: 1878 pass 1879 file.write(pstat.list2string(['\nTest statistic = ', round(stat, 4), 1880 ' p = ', round(prob, 4), suffix, '\n\n'])) 1881 file.close() 1882 return None 1883 1884 1885 def lfindwithin(data): 1886 """ 1887 Returns an integer representing a binary vector, where 1=within- 1888 subject factor, 0=between. Input equals the entire data 2D list (i.e., 1889 column 0=random factor, column -1=measured values (those two are skipped). 1890 Note: input data is in |Stat format ... a list of lists ("2D list") with 1891 one row per measured value, first column=subject identifier, last column= 1892 score, one in-between column per factor (these columns contain level 1893 designations on each factor). See also stats.anova.__doc__. 1894 1895 Usage: lfindwithin(data) data in |Stat format 1896 """ 1897 1898 numfact = len(data[0]) - 1 1899 withinvec = 0 1900 for col in range(1, numfact): 1901 examplelevel = pstat.unique(pstat.colex(data, col))[0] 1902 rows = pstat.linexand(data, col, examplelevel) # get 1 level of this factor 1903 factsubjs = pstat.unique(pstat.colex(rows, 0)) 1904 allsubjs = pstat.unique(pstat.colex(data, 0)) 1905 if len(factsubjs) == len(allsubjs): # fewer Ss than scores on this factor? 1906 withinvec = withinvec + (1 << col) 1907 return withinvec 1908 1909 ######################################################### 1910 ######################################################### 1911 ####### DISPATCH LISTS AND TUPLES TO ABOVE FCNS ######### 1912 ######################################################### 1913 ######################################################### 1914 1915 ## CENTRAL TENDENCY: 1916 geometricmean = Dispatch((lgeometricmean, (ListType, TupleType)),) 1917 harmonicmean = Dispatch((lharmonicmean, (ListType, TupleType)),) 1918 mean = Dispatch((lmean, (ListType, TupleType)),) 1919 median = Dispatch((lmedian, (ListType, TupleType)),) 1920 medianscore = Dispatch((lmedianscore, (ListType, TupleType)),) 1921 mode = Dispatch((lmode, (ListType, TupleType)),) 1922 1923 ## MOMENTS: 1924 moment = Dispatch((lmoment, (ListType, TupleType)),) 1925 variation = Dispatch((lvariation, (ListType, TupleType)),) 1926 skew = Dispatch((lskew, (ListType, TupleType)),) 1927 kurtosis = Dispatch((lkurtosis, (ListType, TupleType)),) 1928 describe = Dispatch((ldescribe, (ListType, TupleType)),) 1929 1930 ## FREQUENCY STATISTICS: 1931 itemfreq = Dispatch((litemfreq, (ListType, TupleType)),) 1932 scoreatpercentile = Dispatch((lscoreatpercentile, (ListType, TupleType)),) 1933 percentileofscore = Dispatch((lpercentileofscore, (ListType, TupleType)),) 1934 histogram = Dispatch((lhistogram, (ListType, TupleType)),) 1935 cumfreq = Dispatch((lcumfreq, (ListType, TupleType)),) 1936 relfreq = Dispatch((lrelfreq, (ListType, TupleType)),) 1937 1938 ## VARIABILITY: 1939 obrientransform = Dispatch((lobrientransform, (ListType, TupleType)),) 1940 samplevar = Dispatch((lsamplevar, (ListType, TupleType)),) 1941 samplestdev = Dispatch((lsamplestdev, (ListType, TupleType)),) 1942 var = Dispatch((lvar, (ListType, TupleType)),) 1943 stdev = Dispatch((lstdev, (ListType, TupleType)),) 1944 sterr = Dispatch((lsterr, (ListType, TupleType)),) 1945 sem = Dispatch((lsem, (ListType, TupleType)),) 1946 z = Dispatch((lz, (ListType, TupleType)),) 1947 zs = Dispatch((lzs, (ListType, TupleType)),) 1948 1949 ## TRIMMING FCNS: 1950 trimboth = Dispatch((ltrimboth, (ListType, TupleType)),) 1951 trim1 = Dispatch((ltrim1, (ListType, TupleType)),) 1952 1953 ## CORRELATION FCNS: 1954 paired = Dispatch((lpaired, (ListType, TupleType)),) 1955 pearsonr = Dispatch((lpearsonr, (ListType, TupleType)),) 1956 spearmanr = Dispatch((lspearmanr, (ListType, TupleType)),) 1957 pointbiserialr = Dispatch((lpointbiserialr, (ListType, TupleType)),) 1958 kendalltau = Dispatch((lkendalltau, (ListType, TupleType)),) 1959 linregress = Dispatch((llinregress, (ListType, TupleType)),) 1960 1961 ## INFERENTIAL STATS: 1962 ttest_1samp = Dispatch((lttest_1samp, (ListType, TupleType)),) 1963 ttest_ind = Dispatch((lttest_ind, (ListType, TupleType)),) 1964 ttest_rel = Dispatch((lttest_rel, (ListType, TupleType)),) 1965 chisquare = Dispatch((lchisquare, (ListType, TupleType)),) 1966 ks_2samp = Dispatch((lks_2samp, (ListType, TupleType)),) 1967 mannwhitneyu = Dispatch((lmannwhitneyu, (ListType, TupleType)),) 1968 ranksums = Dispatch((lranksums, (ListType, TupleType)),) 1969 tiecorrect = Dispatch((ltiecorrect, (ListType, TupleType)),) 1970 wilcoxont = Dispatch((lwilcoxont, (ListType, TupleType)),) 1971 kruskalwallish = Dispatch((lkruskalwallish, (ListType, TupleType)),) 1972 friedmanchisquare = Dispatch((lfriedmanchisquare, (ListType, TupleType)),) 1973 1974 ## PROBABILITY CALCS: 1975 chisqprob = Dispatch((lchisqprob, (IntType, FloatType)),) 1976 zprob = Dispatch((lzprob, (IntType, FloatType)),) 1977 ksprob = Dispatch((lksprob, (IntType, FloatType)),) 1978 fprob = Dispatch((lfprob, (IntType, FloatType)),) 1979 betacf = Dispatch((lbetacf, (IntType, FloatType)),) 1980 betai = Dispatch((lbetai, (IntType, FloatType)),) 1981 erfcc = Dispatch((lerfcc, (IntType, FloatType)),) 1982 gammln = Dispatch((lgammln, (IntType, FloatType)),) 1983 1984 ## ANOVA FUNCTIONS: 1985 F_oneway = Dispatch((lF_oneway, (ListType, TupleType)),) 1986 F_value = Dispatch((lF_value, (ListType, TupleType)),) 1987 1988 ## SUPPORT FUNCTIONS: 1989 incr = Dispatch((lincr, (ListType, TupleType)),) 1990 sum = Dispatch((lsum, (ListType, TupleType)),) 1991 cumsum = Dispatch((lcumsum, (ListType, TupleType)),) 1992 ss = Dispatch((lss, (ListType, TupleType)),) 1993 summult = Dispatch((lsummult, (ListType, TupleType)),) 1994 square_of_sums = Dispatch((lsquare_of_sums, (ListType, TupleType)),) 1995 sumdiffsquared = Dispatch((lsumdiffsquared, (ListType, TupleType)),) 1996 shellsort = Dispatch((lshellsort, (ListType, TupleType)),) 1997 rankdata = Dispatch((lrankdata, (ListType, TupleType)),) 1998 findwithin = Dispatch((lfindwithin, (ListType, TupleType)),) 1999 2000 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2001 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2002 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2003 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2004 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2005 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2006 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2007 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2008 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2009 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2010 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2011 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2012 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2013 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2014 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2015 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2016 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2017 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2018 #============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2019 2020 try: # DEFINE THESE *ONLY* IF NUMERIC IS AVAILABLE 2021 import numpy as N 2022 import numpy.linalg as LA 2023 2024 ##################################### 2025 ######## ACENTRAL TENDENCY ######## 2026 ##################################### 2027 2028 2029 def ageometricmean(inarray, dimension=None, keepdims=0): 2030 """ 2031 Calculates the geometric mean of the values in the passed array. 2032 That is: n-th root of (x1 * x2 * ... * xn). Defaults to ALL values in 2033 the passed array. Use dimension=None to flatten array first. REMEMBER: if 2034 dimension=0, it collapses over dimension 0 ('rows' in a 2D array) only, and 2035 if dimension is a sequence, it collapses over all specified dimensions. If 2036 keepdims is set to 1, the resulting array will have as many dimensions as 2037 inarray, with only 1 'level' per dim that was collapsed over. 2038 2039 Usage: ageometricmean(inarray,dimension=None,keepdims=0) 2040 Returns: geometric mean computed over dim(s) listed in dimension 2041 """ 2042 inarray = N.array(inarray, N.float_) 2043 if dimension == None: 2044 inarray = N.ravel(inarray) 2045 size = len(inarray) 2046 mult = N.power(inarray, 1.0 / size) 2047 mult = N.multiply.reduce(mult) 2048 elif type(dimension) in [IntType, FloatType]: 2049 size = inarray.shape[dimension] 2050 mult = N.power(inarray, 1.0 / size) 2051 mult = N.multiply.reduce(mult, dimension) 2052 if keepdims == 1: 2053 shp = list(inarray.shape) 2054 shp[dimension] = 1 2055 sum = N.reshape(sum, shp) 2056 else: # must be a SEQUENCE of dims to average over 2057 dims = list(dimension) 2058 dims.sort() 2059 dims.reverse() 2060 size = N.array(N.multiply.reduce(N.take(inarray.shape, dims)), N.float_) 2061 mult = N.power(inarray, 1.0 / size) 2062 for dim in dims: 2063 mult = N.multiply.reduce(mult, dim) 2064 if keepdims == 1: 2065 shp = list(inarray.shape) 2066 for dim in dims: 2067 shp[dim] = 1 2068 mult = N.reshape(mult, shp) 2069 return mult 2070 2071 def aharmonicmean(inarray, dimension=None, keepdims=0): 2072 """ 2073 Calculates the harmonic mean of the values in the passed array. 2074 That is: n / (1/x1 + 1/x2 + ... + 1/xn). Defaults to ALL values in 2075 the passed array. Use dimension=None to flatten array first. REMEMBER: if 2076 dimension=0, it collapses over dimension 0 ('rows' in a 2D array) only, and 2077 if dimension is a sequence, it collapses over all specified dimensions. If 2078 keepdims is set to 1, the resulting array will have as many dimensions as 2079 inarray, with only 1 'level' per dim that was collapsed over. 2080 2081 Usage: aharmonicmean(inarray,dimension=None,keepdims=0) 2082 Returns: harmonic mean computed over dim(s) in dimension 2083 """ 2084 inarray = inarray.astype(N.float_) 2085 if dimension == None: 2086 inarray = N.ravel(inarray) 2087 size = len(inarray) 2088 s = N.add.reduce(1.0 / inarray) 2089 elif type(dimension) in [IntType, FloatType]: 2090 size = float(inarray.shape[dimension]) 2091 s = N.add.reduce(1.0 / inarray, dimension) 2092 if keepdims == 1: 2093 shp = list(inarray.shape) 2094 shp[dimension] = 1 2095 s = N.reshape(s, shp) 2096 else: # must be a SEQUENCE of dims to average over 2097 dims = list(dimension) 2098 dims.sort() 2099 nondims = [] 2100 for i in range(len(inarray.shape)): 2101 if i not in dims: 2102 nondims.append(i) 2103 tinarray = N.transpose(inarray, nondims + dims) # put keep-dims first 2104 idx = [0] * len(nondims) 2105 if idx == []: 2106 size = len(N.ravel(inarray)) 2107 s = asum(1.0 / inarray) 2108 if keepdims == 1: 2109 s = N.reshape([s], N.ones(len(inarray.shape))) 2110 else: 2111 idx[0] = -1 2112 loopcap = N.array(tinarray.shape[0:len(nondims)]) - 1 2113 s = N.zeros(loopcap + 1, N.float_) 2114 while incr(idx, loopcap) <> -1: 2115 s[idx] = asum(1.0 / tinarray[idx]) 2116 size = N.multiply.reduce(N.take(inarray.shape, dims)) 2117 if keepdims == 1: 2118 shp = list(inarray.shape) 2119 for dim in dims: 2120 shp[dim] = 1 2121 s = N.reshape(s, shp) 2122 return size / s 2123 2124 def amean(inarray, dimension=None, keepdims=0): 2125 """ 2126 Calculates the arithmatic mean of the values in the passed array. 2127 That is: 1/n * (x1 + x2 + ... + xn). Defaults to ALL values in the 2128 passed array. Use dimension=None to flatten array first. REMEMBER: if 2129 dimension=0, it collapses over dimension 0 ('rows' in a 2D array) only, and 2130 if dimension is a sequence, it collapses over all specified dimensions. If 2131 keepdims is set to 1, the resulting array will have as many dimensions as 2132 inarray, with only 1 'level' per dim that was collapsed over. 2133 2134 Usage: amean(inarray,dimension=None,keepdims=0) 2135 Returns: arithematic mean calculated over dim(s) in dimension 2136 """ 2137 if inarray.dtype in [N.int_, N.short, N.ubyte]: 2138 inarray = inarray.astype(N.float_) 2139 if dimension == None: 2140 inarray = N.ravel(inarray) 2141 sum = N.add.reduce(inarray) 2142 denom = float(len(inarray)) 2143 elif type(dimension) in [IntType, FloatType]: 2144 sum = asum(inarray, dimension) 2145 denom = float(inarray.shape[dimension]) 2146 if keepdims == 1: 2147 shp = list(inarray.shape) 2148 shp[dimension] = 1 2149 sum = N.reshape(sum, shp) 2150 else: # must be a TUPLE of dims to average over 2151 dims = list(dimension) 2152 dims.sort() 2153 dims.reverse() 2154 sum = inarray * 1.0 2155 for dim in dims: 2156 sum = N.add.reduce(sum, dim) 2157 denom = N.array(N.multiply.reduce(N.take(inarray.shape, dims)), N.float_) 2158 if keepdims == 1: 2159 shp = list(inarray.shape) 2160 for dim in dims: 2161 shp[dim] = 1 2162 sum = N.reshape(sum, shp) 2163 return sum / denom 2164 2165 def amedian(inarray, numbins=1000): 2166 """ 2167 Calculates the COMPUTED median value of an array of numbers, given the 2168 number of bins to use for the histogram (more bins approaches finding the 2169 precise median value of the array; default number of bins = 1000). From 2170 G.W. Heiman's Basic Stats, or CRC Probability & Statistics. 2171 NOTE: THIS ROUTINE ALWAYS uses the entire passed array (flattens it first). 2172 2173 Usage: amedian(inarray,numbins=1000) 2174 Returns: median calculated over ALL values in inarray 2175 """ 2176 inarray = N.ravel(inarray) 2177 (hist, smallest, binsize, extras) = ahistogram(inarray, numbins, 2178 [min(inarray), max(inarray)]) 2179 cumhist = N.cumsum(hist) # make cumulative histogram 2180 otherbins = N.greater_equal(cumhist, len(inarray) / 2.0) 2181 otherbins = list(otherbins) # list of 0/1s, 1s start at median bin 2182 cfbin = otherbins.index(1) # get 1st(!) index holding 50%ile score 2183 LRL = smallest + binsize * cfbin # get lower read limit of that bin 2184 cfbelow = N.add.reduce(hist[0:cfbin]) # cum. freq. below bin 2185 freq = hist[cfbin] # frequency IN the 50%ile bin 2186 median = LRL + ( 2187 (len(inarray) / 2.0 - cfbelow) / float(freq)) * binsize # MEDIAN 2188 return median 2189 2190 def amedianscore(inarray, dimension=None): 2191 """ 2192 Returns the 'middle' score of the passed array. If there is an even 2193 number of scores, the mean of the 2 middle scores is returned. Can function 2194 with 1D arrays, or on the FIRST dimension of 2D arrays (i.e., dimension can 2195 be None, to pre-flatten the array, or else dimension must equal 0). 2196 2197 Usage: amedianscore(inarray,dimension=None) 2198 Returns: 'middle' score of the array, or the mean of the 2 middle scores 2199 """ 2200 if dimension == None: 2201 inarray = N.ravel(inarray) 2202 dimension = 0 2203 inarray = N.sort(inarray, dimension) 2204 if inarray.shape[dimension] % 2 == 0: # if even number of elements 2205 indx = inarray.shape[dimension] / 2 # integer division correct 2206 median = N.asarray(inarray[indx] + inarray[indx - 1]) / 2.0 2207 else: 2208 indx = inarray.shape[dimension] / 2 # integer division correct 2209 median = N.take(inarray, [indx], dimension) 2210 if median.shape == (1,): 2211 median = median[0] 2212 return median 2213 2214 def amode(a, dimension=None): 2215 """ 2216 Returns an array of the modal (most common) score in the passed array. 2217 If there is more than one such score, ONLY THE FIRST is returned. 2218 The bin-count for the modal values is also returned. Operates on whole 2219 array (dimension=None), or on a given dimension. 2220 2221 Usage: amode(a, dimension=None) 2222 Returns: array of bin-counts for mode(s), array of corresponding modal values 2223 """ 2224 2225 if dimension == None: 2226 a = N.ravel(a) 2227 dimension = 0 2228 scores = pstat.aunique(N.ravel(a)) # get ALL unique values 2229 testshape = list(a.shape) 2230 testshape[dimension] = 1 2231 oldmostfreq = N.zeros(testshape) 2232 oldcounts = N.zeros(testshape) 2233 for score in scores: 2234 template = N.equal(a, score) 2235 counts = asum(template, dimension, 1) 2236 mostfrequent = N.where(counts > oldcounts, score, oldmostfreq) 2237 oldcounts = N.where(counts > oldcounts, counts, oldcounts) 2238 oldmostfreq = mostfrequent 2239 return oldcounts, mostfrequent 2240 2241 def atmean(a, limits=None, inclusive=(1, 1)): 2242 """ 2243 Returns the arithmetic mean of all values in an array, ignoring values 2244 strictly outside the sequence passed to 'limits'. Note: either limit 2245 in the sequence, or the value of limits itself, can be set to None. The 2246 inclusive list/tuple determines whether the lower and upper limiting bounds 2247 (respectively) are open/exclusive (0) or closed/inclusive (1). 2248 2249 Usage: atmean(a,limits=None,inclusive=(1,1)) 2250 """ 2251 if a.dtype in [N.int_, N.short, N.ubyte]: 2252 a = a.astype(N.float_) 2253 if limits == None: 2254 return mean(a) 2255 assert type(limits) in [ListType, TupleType, N.ndarray 2256 ], 'Wrong type for limits in atmean' 2257 if inclusive[0]: 2258 lowerfcn = N.greater_equal 2259 else: 2260 lowerfcn = N.greater 2261 if inclusive[1]: 2262 upperfcn = N.less_equal 2263 else: 2264 upperfcn = N.less 2265 if limits[0] > N.maximum.reduce(N.ravel(a)) or limits[1] < N.minimum.reduce( 2266 N.ravel(a)): 2267 raise ValueError, 'No array values within given limits (atmean).' 2268 elif limits[0] == None and limits[1] <> None: 2269 mask = upperfcn(a, limits[1]) 2270 elif limits[0] <> None and limits[1] == None: 2271 mask = lowerfcn(a, limits[0]) 2272 elif limits[0] <> None and limits[1] <> None: 2273 mask = lowerfcn(a, limits[0]) * upperfcn(a, limits[1]) 2274 s = float(N.add.reduce(N.ravel(a * mask))) 2275 n = float(N.add.reduce(N.ravel(mask))) 2276 return s / n 2277 2278 def atvar(a, limits=None, inclusive=(1, 1)): 2279 """ 2280 Returns the sample variance of values in an array, (i.e., using N-1), 2281 ignoring values strictly outside the sequence passed to 'limits'. 2282 Note: either limit in the sequence, or the value of limits itself, 2283 can be set to None. The inclusive list/tuple determines whether the lower 2284 and upper limiting bounds (respectively) are open/exclusive (0) or 2285 closed/inclusive (1). ASSUMES A FLAT ARRAY (OR ELSE PREFLATTENS). 2286 2287 Usage: atvar(a,limits=None,inclusive=(1,1)) 2288 """ 2289 a = a.astype(N.float_) 2290 if limits == None or limits == [None, None]: 2291 return avar(a) 2292 assert type(limits) in [ListType, TupleType, N.ndarray 2293 ], 'Wrong type for limits in atvar' 2294 if inclusive[0]: 2295 lowerfcn = N.greater_equal 2296 else: 2297 lowerfcn = N.greater 2298 if inclusive[1]: 2299 upperfcn = N.less_equal 2300 else: 2301 upperfcn = N.less 2302 if limits[0] > N.maximum.reduce(N.ravel(a)) or limits[1] < N.minimum.reduce( 2303 N.ravel(a)): 2304 raise ValueError, 'No array values within given limits (atvar).' 2305 elif limits[0] == None and limits[1] <> None: 2306 mask = upperfcn(a, limits[1]) 2307 elif limits[0] <> None and limits[1] == None: 2308 mask = lowerfcn(a, limits[0]) 2309 elif limits[0] <> None and limits[1] <> None: 2310 mask = lowerfcn(a, limits[0]) * upperfcn(a, limits[1]) 2311 2312 a = N.compress(mask, a) # squish out excluded values 2313 return avar(a) 2314 2315 def atmin(a, lowerlimit=None, dimension=None, inclusive=1): 2316 """ 2317 Returns the minimum value of a, along dimension, including only values less 2318 than (or equal to, if inclusive=1) lowerlimit. If the limit is set to None, 2319 all values in the array are used. 2320 2321 Usage: atmin(a,lowerlimit=None,dimension=None,inclusive=1) 2322 """ 2323 if inclusive: 2324 lowerfcn = N.greater 2325 else: 2326 lowerfcn = N.greater_equal 2327 if dimension == None: 2328 a = N.ravel(a) 2329 dimension = 0 2330 if lowerlimit == None: 2331 lowerlimit = N.minimum.reduce(N.ravel(a)) - 11 2332 biggest = N.maximum.reduce(N.ravel(a)) 2333 ta = N.where(lowerfcn(a, lowerlimit), a, biggest) 2334 return N.minimum.reduce(ta, dimension) 2335 2336 def atmax(a, upperlimit, dimension=None, inclusive=1): 2337 """ 2338 Returns the maximum value of a, along dimension, including only values greater 2339 than (or equal to, if inclusive=1) upperlimit. If the limit is set to None, 2340 a limit larger than the max value in the array is used. 2341 2342 Usage: atmax(a,upperlimit,dimension=None,inclusive=1) 2343 """ 2344 if inclusive: 2345 upperfcn = N.less 2346 else: 2347 upperfcn = N.less_equal 2348 if dimension == None: 2349 a = N.ravel(a) 2350 dimension = 0 2351 if upperlimit == None: 2352 upperlimit = N.maximum.reduce(N.ravel(a)) + 1 2353 smallest = N.minimum.reduce(N.ravel(a)) 2354 ta = N.where(upperfcn(a, upperlimit), a, smallest) 2355 return N.maximum.reduce(ta, dimension) 2356 2357 def atstdev(a, limits=None, inclusive=(1, 1)): 2358 """ 2359 Returns the standard deviation of all values in an array, ignoring values 2360 strictly outside the sequence passed to 'limits'. Note: either limit 2361 in the sequence, or the value of limits itself, can be set to None. The 2362 inclusive list/tuple determines whether the lower and upper limiting bounds 2363 (respectively) are open/exclusive (0) or closed/inclusive (1). 2364 2365 Usage: atstdev(a,limits=None,inclusive=(1,1)) 2366 """ 2367 return N.sqrt(tvar(a, limits, inclusive)) 2368 2369 def atsem(a, limits=None, inclusive=(1, 1)): 2370 """ 2371 Returns the standard error of the mean for the values in an array, 2372 (i.e., using N for the denominator), ignoring values strictly outside 2373 the sequence passed to 'limits'. Note: either limit in the sequence, 2374 or the value of limits itself, can be set to None. The inclusive list/tuple 2375 determines whether the lower and upper limiting bounds (respectively) are 2376 open/exclusive (0) or closed/inclusive (1). 2377 2378 Usage: atsem(a,limits=None,inclusive=(1,1)) 2379 """ 2380 sd = tstdev(a, limits, inclusive) 2381 if limits == None or limits == [None, None]: 2382 n = float(len(N.ravel(a))) 2383 limits = [min(a) - 1, max(a) + 1] 2384 assert type(limits) in [ListType, TupleType, N.ndarray 2385 ], 'Wrong type for limits in atsem' 2386 if inclusive[0]: 2387 lowerfcn = N.greater_equal 2388 else: 2389 lowerfcn = N.greater 2390 if inclusive[1]: 2391 upperfcn = N.less_equal 2392 else: 2393 upperfcn = N.less 2394 if limits[0] > N.maximum.reduce(N.ravel(a)) or limits[1] < N.minimum.reduce( 2395 N.ravel(a)): 2396 raise ValueError, 'No array values within given limits (atsem).' 2397 elif limits[0] == None and limits[1] <> None: 2398 mask = upperfcn(a, limits[1]) 2399 elif limits[0] <> None and limits[1] == None: 2400 mask = lowerfcn(a, limits[0]) 2401 elif limits[0] <> None and limits[1] <> None: 2402 mask = lowerfcn(a, limits[0]) * upperfcn(a, limits[1]) 2403 term1 = N.add.reduce(N.ravel(a * a * mask)) 2404 n = float(N.add.reduce(N.ravel(mask))) 2405 return sd / math.sqrt(n) 2406 2407 ##################################### 2408 ############ AMOMENTS ############# 2409 ##################################### 2410 2411 def amoment(a, moment=1, dimension=None): 2412 """ 2413 Calculates the nth moment about the mean for a sample (defaults to the 2414 1st moment). Generally used to calculate coefficients of skewness and 2415 kurtosis. Dimension can equal None (ravel array first), an integer 2416 (the dimension over which to operate), or a sequence (operate over 2417 multiple dimensions). 2418 2419 Usage: amoment(a,moment=1,dimension=None) 2420 Returns: appropriate moment along given dimension 2421 """ 2422 if dimension == None: 2423 a = N.ravel(a) 2424 dimension = 0 2425 if moment == 1: 2426 return 0.0 2427 else: 2428 mn = amean(a, dimension, 1) # 1=keepdims 2429 s = N.power((a - mn), moment) 2430 return amean(s, dimension) 2431 2432 def avariation(a, dimension=None): 2433 """ 2434 Returns the coefficient of variation, as defined in CRC Standard 2435 Probability and Statistics, p.6. Dimension can equal None (ravel array 2436 first), an integer (the dimension over which to operate), or a 2437 sequence (operate over multiple dimensions). 2438 2439 Usage: avariation(a,dimension=None) 2440 """ 2441 return 100.0 * asamplestdev(a, dimension) / amean(a, dimension) 2442 2443 def askew(a, dimension=None): 2444 """ 2445 Returns the skewness of a distribution (normal ==> 0.0; >0 means extra 2446 weight in left tail). Use askewtest() to see if it's close enough. 2447 Dimension can equal None (ravel array first), an integer (the 2448 dimension over which to operate), or a sequence (operate over multiple 2449 dimensions). 2450 2451 Usage: askew(a, dimension=None) 2452 Returns: skew of vals in a along dimension, returning ZERO where all vals equal 2453 """ 2454 denom = N.power(amoment(a, 2, dimension), 1.5) 2455 zero = N.equal(denom, 0) 2456 if type(denom) == N.ndarray and asum(zero) <> 0: 2457 print 'Number of zeros in askew: ', asum(zero) 2458 denom = denom + zero # prevent divide-by-zero 2459 return N.where(zero, 0, amoment(a, 3, dimension) / denom) 2460 2461 def akurtosis(a, dimension=None): 2462 """ 2463 Returns the kurtosis of a distribution (normal ==> 3.0; >3 means 2464 heavier in the tails, and usually more peaked). Use akurtosistest() 2465 to see if it's close enough. Dimension can equal None (ravel array 2466 first), an integer (the dimension over which to operate), or a 2467 sequence (operate over multiple dimensions). 2468 2469 Usage: akurtosis(a,dimension=None) 2470 Returns: kurtosis of values in a along dimension, and ZERO where all vals equal 2471 """ 2472 denom = N.power(amoment(a, 2, dimension), 2) 2473 zero = N.equal(denom, 0) 2474 if type(denom) == N.ndarray and asum(zero) <> 0: 2475 print 'Number of zeros in akurtosis: ', asum(zero) 2476 denom = denom + zero # prevent divide-by-zero 2477 return N.where(zero, 0, amoment(a, 4, dimension) / denom) 2478 2479 def adescribe(inarray, dimension=None): 2480 """ 2481 Returns several descriptive statistics of the passed array. Dimension 2482 can equal None (ravel array first), an integer (the dimension over 2483 which to operate), or a sequence (operate over multiple dimensions). 2484 2485 Usage: adescribe(inarray,dimension=None) 2486 Returns: n, (min,max), mean, standard deviation, skew, kurtosis 2487 """ 2488 if dimension == None: 2489 inarray = N.ravel(inarray) 2490 dimension = 0 2491 n = inarray.shape[dimension] 2492 mm = (N.minimum.reduce(inarray), N.maximum.reduce(inarray)) 2493 m = amean(inarray, dimension) 2494 sd = astdev(inarray, dimension) 2495 skew = askew(inarray, dimension) 2496 kurt = akurtosis(inarray, dimension) 2497 return n, mm, m, sd, skew, kurt 2498 2499 ##################################### 2500 ######## NORMALITY TESTS ########## 2501 ##################################### 2502 2503 def askewtest(a, dimension=None): 2504 """ 2505 Tests whether the skew is significantly different from a normal 2506 distribution. Dimension can equal None (ravel array first), an 2507 integer (the dimension over which to operate), or a sequence (operate 2508 over multiple dimensions). 2509 2510 Usage: askewtest(a,dimension=None) 2511 Returns: z-score and 2-tail z-probability 2512 """ 2513 if dimension == None: 2514 a = N.ravel(a) 2515 dimension = 0 2516 b2 = askew(a, dimension) 2517 n = float(a.shape[dimension]) 2518 y = b2 * N.sqrt(((n + 1) * (n + 3)) / (6.0 * (n - 2))) 2519 beta2 = (3.0 * (n * n + 27 * n - 70) * (n + 1) * 2520 (n + 3)) / ((n - 2.0) * (n + 5) * (n + 7) * (n + 9)) 2521 W2 = -1 + N.sqrt(2 * (beta2 - 1)) 2522 delta = 1 / N.sqrt(N.log(N.sqrt(W2))) 2523 alpha = N.sqrt(2 / (W2 - 1)) 2524 y = N.where(y == 0, 1, y) 2525 Z = delta * N.log(y / alpha + N.sqrt((y / alpha)**2 + 1)) 2526 return Z, (1.0 - zprob(Z)) * 2 2527 2528 def akurtosistest(a, dimension=None): 2529 """ 2530 Tests whether a dataset has normal kurtosis (i.e., 2531 kurtosis=3(n-1)/(n+1)) Valid only for n>20. Dimension can equal None 2532 (ravel array first), an integer (the dimension over which to operate), 2533 or a sequence (operate over multiple dimensions). 2534 2535 Usage: akurtosistest(a,dimension=None) 2536 Returns: z-score and 2-tail z-probability, returns 0 for bad pixels 2537 """ 2538 if dimension == None: 2539 a = N.ravel(a) 2540 dimension = 0 2541 n = float(a.shape[dimension]) 2542 if n < 20: 2543 print 'akurtosistest only valid for n>=20 ... continuing anyway, n=', n 2544 b2 = akurtosis(a, dimension) 2545 E = 3.0 * (n - 1) / (n + 1) 2546 varb2 = 24.0 * n * (n - 2) * (n - 3) / ((n + 1) * (n + 1) * (n + 3) * 2547 (n + 5)) 2548 x = (b2 - E) / N.sqrt(varb2) 2549 sqrtbeta1 = 6.0 * (n * n - 5 * n + 2) / ((n + 7) * (n + 9)) * N.sqrt( 2550 (6.0 * (n + 3) * (n + 5)) / (n * (n - 2) * (n - 3))) 2551 A = 6.0 + 8.0 / sqrtbeta1 * (2.0 / sqrtbeta1 + 2552 N.sqrt(1 + 4.0 / (sqrtbeta1**2))) 2553 term1 = 1 - 2 / (9.0 * A) 2554 denom = 1 + x * N.sqrt(2 / (A - 4.0)) 2555 denom = N.where(N.less(denom, 0), 99, denom) 2556 term2 = N.where( 2557 N.equal(denom, 0), term1, N.power( 2558 (1 - 2.0 / A) / denom, 1 / 3.0)) 2559 Z = (term1 - term2) / N.sqrt(2 / (9.0 * A)) 2560 Z = N.where(N.equal(denom, 99), 0, Z) 2561 return Z, (1.0 - zprob(Z)) * 2 2562 2563 def anormaltest(a, dimension=None): 2564 """ 2565 Tests whether skew and/OR kurtosis of dataset differs from normal 2566 curve. Can operate over multiple dimensions. Dimension can equal 2567 None (ravel array first), an integer (the dimension over which to 2568 operate), or a sequence (operate over multiple dimensions). 2569 2570 Usage: anormaltest(a,dimension=None) 2571 Returns: z-score and 2-tail probability 2572 """ 2573 if dimension == None: 2574 a = N.ravel(a) 2575 dimension = 0 2576 s, p = askewtest(a, dimension) 2577 k, p = akurtosistest(a, dimension) 2578 k2 = N.power(s, 2) + N.power(k, 2) 2579 return k2, achisqprob(k2, 2) 2580 2581 ##################################### 2582 ###### AFREQUENCY FUNCTIONS ####### 2583 ##################################### 2584 2585 def aitemfreq(a): 2586 """ 2587 Returns a 2D array of item frequencies. Column 1 contains item values, 2588 column 2 contains their respective counts. Assumes a 1D array is passed. 2589 @@@sorting OK? 2590 2591 Usage: aitemfreq(a) 2592 Returns: a 2D frequency table (col [0:n-1]=scores, col n=frequencies) 2593 """ 2594 scores = pstat.aunique(a) 2595 scores = N.sort(scores) 2596 freq = N.zeros(len(scores)) 2597 for i in range(len(scores)): 2598 freq[i] = N.add.reduce(N.equal(a, scores[i])) 2599 return N.array(pstat.aabut(scores, freq)) 2600 2601 def ascoreatpercentile(inarray, percent): 2602 """ 2603 Usage: ascoreatpercentile(inarray,percent) 0<percent<100 2604 Returns: score at given percentile, relative to inarray distribution 2605 """ 2606 percent = percent / 100.0 2607 targetcf = percent * len(inarray) 2608 h, lrl, binsize, extras = histogram(inarray) 2609 cumhist = cumsum(h * 1) 2610 for i in range(len(cumhist)): 2611 if cumhist[i] >= targetcf: 2612 break 2613 score = binsize * ( 2614 (targetcf - cumhist[i - 1]) / float(h[i])) + (lrl + binsize * i) 2615 return score 2616 2617 def apercentileofscore(inarray, score, histbins=10, defaultlimits=None): 2618 """ 2619 Note: result of this function depends on the values used to histogram 2620 the data(!). 2621 2622 Usage: apercentileofscore(inarray,score,histbins=10,defaultlimits=None) 2623 Returns: percentile-position of score (0-100) relative to inarray 2624 """ 2625 h, lrl, binsize, extras = histogram(inarray, histbins, defaultlimits) 2626 cumhist = cumsum(h * 1) 2627 i = int((score - lrl) / float(binsize)) 2628 pct = (cumhist[i - 1] + ((score - (lrl + binsize * i)) / float(binsize)) * 2629 h[i]) / float(len(inarray)) * 100 2630 return pct 2631 2632 def ahistogram(inarray, numbins=10, defaultlimits=None, printextras=1): 2633 """ 2634 Returns (i) an array of histogram bin counts, (ii) the smallest value 2635 of the histogram binning, and (iii) the bin width (the last 2 are not 2636 necessarily integers). Default number of bins is 10. Defaultlimits 2637 can be None (the routine picks bins spanning all the numbers in the 2638 inarray) or a 2-sequence (lowerlimit, upperlimit). Returns all of the 2639 following: array of bin values, lowerreallimit, binsize, extrapoints. 2640 2641 Usage: ahistogram(inarray,numbins=10,defaultlimits=None,printextras=1) 2642 Returns: (array of bin counts, bin-minimum, min-width, #-points-outside-range) 2643 """ 2644 inarray = N.ravel(inarray) # flatten any >1D arrays 2645 if (defaultlimits <> None): 2646 lowerreallimit = defaultlimits[0] 2647 upperreallimit = defaultlimits[1] 2648 binsize = (upperreallimit - lowerreallimit) / float(numbins) 2649 else: 2650 Min = N.minimum.reduce(inarray) 2651 Max = N.maximum.reduce(inarray) 2652 estbinwidth = float(Max - Min) / float(numbins) + 1e-6 2653 binsize = (Max - Min + estbinwidth) / float(numbins) 2654 lowerreallimit = Min - binsize / 2.0 #lower real limit,1st bin 2655 bins = N.zeros(numbins) 2656 extrapoints = 0 2657 for num in inarray: 2658 try: 2659 if (num - lowerreallimit) < 0: 2660 extrapoints = extrapoints + 1 2661 else: 2662 bintoincrement = int((num - lowerreallimit) / float(binsize)) 2663 bins[bintoincrement] = bins[bintoincrement] + 1 2664 except: # point outside lower/upper limits 2665 extrapoints = extrapoints + 1 2666 if (extrapoints > 0 and printextras == 1): 2667 print '\nPoints outside given histogram range =', extrapoints 2668 return (bins, lowerreallimit, binsize, extrapoints) 2669 2670 def acumfreq(a, numbins=10, defaultreallimits=None): 2671 """ 2672 Returns a cumulative frequency histogram, using the histogram function. 2673 Defaultreallimits can be None (use all data), or a 2-sequence containing 2674 lower and upper limits on values to include. 2675 2676 Usage: acumfreq(a,numbins=10,defaultreallimits=None) 2677 Returns: array of cumfreq bin values, lowerreallimit, binsize, extrapoints 2678 """ 2679 h, l, b, e = histogram(a, numbins, defaultreallimits) 2680 cumhist = cumsum(h * 1) 2681 return cumhist, l, b, e 2682 2683 def arelfreq(a, numbins=10, defaultreallimits=None): 2684 """ 2685 Returns a relative frequency histogram, using the histogram function. 2686 Defaultreallimits can be None (use all data), or a 2-sequence containing 2687 lower and upper limits on values to include. 2688 2689 Usage: arelfreq(a,numbins=10,defaultreallimits=None) 2690 Returns: array of cumfreq bin values, lowerreallimit, binsize, extrapoints 2691 """ 2692 h, l, b, e = histogram(a, numbins, defaultreallimits) 2693 h = N.array(h / float(a.shape[0])) 2694 return h, l, b, e 2695 2696 ##################################### 2697 ###### AVARIABILITY FUNCTIONS ##### 2698 ##################################### 2699 2700 def aobrientransform(*args): 2701 """ 2702 Computes a transform on input data (any number of columns). Used to 2703 test for homogeneity of variance prior to running one-way stats. Each 2704 array in *args is one level of a factor. If an F_oneway() run on the 2705 transformed data and found significant, variances are unequal. From 2706 Maxwell and Delaney, p.112. 2707 2708 Usage: aobrientransform(*args) *args = 1D arrays, one per level of factor 2709 Returns: transformed data for use in an ANOVA 2710 """ 2711 TINY = 1e-10 2712 k = len(args) 2713 n = N.zeros(k, N.float_) 2714 v = N.zeros(k, N.float_) 2715 m = N.zeros(k, N.float_) 2716 nargs = [] 2717 for i in range(k): 2718 nargs.append(args[i].astype(N.float_)) 2719 n[i] = float(len(nargs[i])) 2720 v[i] = var(nargs[i]) 2721 m[i] = mean(nargs[i]) 2722 for j in range(k): 2723 for i in range(n[j]): 2724 t1 = (n[j] - 1.5) * n[j] * (nargs[j][i] - m[j])**2 2725 t2 = 0.5 * v[j] * (n[j] - 1.0) 2726 t3 = (n[j] - 1.0) * (n[j] - 2.0) 2727 nargs[j][i] = (t1 - t2) / float(t3) 2728 check = 1 2729 for j in range(k): 2730 if v[j] - mean(nargs[j]) > TINY: 2731 check = 0 2732 if check <> 1: 2733 raise ValueError, 'Lack of convergence in obrientransform.' 2734 else: 2735 return N.array(nargs) 2736 2737 def asamplevar(inarray, dimension=None, keepdims=0): 2738 """ 2739 Returns the sample standard deviation of the values in the passed 2740 array (i.e., using N). Dimension can equal None (ravel array first), 2741 an integer (the dimension over which to operate), or a sequence 2742 (operate over multiple dimensions). Set keepdims=1 to return an array 2743 with the same number of dimensions as inarray. 2744 2745 Usage: asamplevar(inarray,dimension=None,keepdims=0) 2746 """ 2747 if dimension == None: 2748 inarray = N.ravel(inarray) 2749 dimension = 0 2750 if dimension == 1: 2751 mn = amean(inarray, dimension)[:, N.NewAxis] 2752 else: 2753 mn = amean(inarray, dimension, keepdims=1) 2754 deviations = inarray - mn 2755 if type(dimension) == ListType: 2756 n = 1 2757 for d in dimension: 2758 n = n * inarray.shape[d] 2759 else: 2760 n = inarray.shape[dimension] 2761 svar = ass(deviations, dimension, keepdims) / float(n) 2762 return svar 2763 2764 def asamplestdev(inarray, dimension=None, keepdims=0): 2765 """ 2766 Returns the sample standard deviation of the values in the passed 2767 array (i.e., using N). Dimension can equal None (ravel array first), 2768 an integer (the dimension over which to operate), or a sequence 2769 (operate over multiple dimensions). Set keepdims=1 to return an array 2770 with the same number of dimensions as inarray. 2771 2772 Usage: asamplestdev(inarray,dimension=None,keepdims=0) 2773 """ 2774 return N.sqrt(asamplevar(inarray, dimension, keepdims)) 2775 2776 def asignaltonoise(instack, dimension=0): 2777 """ 2778 Calculates signal-to-noise. Dimension can equal None (ravel array 2779 first), an integer (the dimension over which to operate), or a 2780 sequence (operate over multiple dimensions). 2781 2782 Usage: asignaltonoise(instack,dimension=0): 2783 Returns: array containing the value of (mean/stdev) along dimension, 2784 or 0 when stdev=0 2785 """ 2786 m = mean(instack, dimension) 2787 sd = stdev(instack, dimension) 2788 return N.where(sd == 0, 0, m / sd) 2789 2790 def acov(x, y, dimension=None, keepdims=0): 2791 """ 2792 Returns the estimated covariance of the values in the passed 2793 array (i.e., N-1). Dimension can equal None (ravel array first), an 2794 integer (the dimension over which to operate), or a sequence (operate 2795 over multiple dimensions). Set keepdims=1 to return an array with the 2796 same number of dimensions as inarray. 2797 2798 Usage: acov(x,y,dimension=None,keepdims=0) 2799 """ 2800 if dimension == None: 2801 x = N.ravel(x) 2802 y = N.ravel(y) 2803 dimension = 0 2804 xmn = amean(x, dimension, 1) # keepdims 2805 xdeviations = x - xmn 2806 ymn = amean(y, dimension, 1) # keepdims 2807 ydeviations = y - ymn 2808 if type(dimension) == ListType: 2809 n = 1 2810 for d in dimension: 2811 n = n * x.shape[d] 2812 else: 2813 n = x.shape[dimension] 2814 covar = N.sum(xdeviations * ydeviations) / float(n - 1) 2815 return covar 2816 2817 def avar(inarray, dimension=None, keepdims=0): 2818 """ 2819 Returns the estimated population variance of the values in the passed 2820 array (i.e., N-1). Dimension can equal None (ravel array first), an 2821 integer (the dimension over which to operate), or a sequence (operate 2822 over multiple dimensions). Set keepdims=1 to return an array with the 2823 same number of dimensions as inarray. 2824 2825 Usage: avar(inarray,dimension=None,keepdims=0) 2826 """ 2827 if dimension == None: 2828 inarray = N.ravel(inarray) 2829 dimension = 0 2830 mn = amean(inarray, dimension, 1) 2831 deviations = inarray - mn 2832 if type(dimension) == ListType: 2833 n = 1 2834 for d in dimension: 2835 n = n * inarray.shape[d] 2836 else: 2837 n = inarray.shape[dimension] 2838 var = ass(deviations, dimension, keepdims) / float(n - 1) 2839 return var 2840 2841 def astdev(inarray, dimension=None, keepdims=0): 2842 """ 2843 Returns the estimated population standard deviation of the values in 2844 the passed array (i.e., N-1). Dimension can equal None (ravel array 2845 first), an integer (the dimension over which to operate), or a 2846 sequence (operate over multiple dimensions). Set keepdims=1 to return 2847 an array with the same number of dimensions as inarray. 2848 2849 Usage: astdev(inarray,dimension=None,keepdims=0) 2850 """ 2851 return N.sqrt(avar(inarray, dimension, keepdims)) 2852 2853 def asterr(inarray, dimension=None, keepdims=0): 2854 """ 2855 Returns the estimated population standard error of the values in the 2856 passed array (i.e., N-1). Dimension can equal None (ravel array 2857 first), an integer (the dimension over which to operate), or a 2858 sequence (operate over multiple dimensions). Set keepdims=1 to return 2859 an array with the same number of dimensions as inarray. 2860 2861 Usage: asterr(inarray,dimension=None,keepdims=0) 2862 """ 2863 if dimension == None: 2864 inarray = N.ravel(inarray) 2865 dimension = 0 2866 return astdev(inarray, dimension, 2867 keepdims) / float(N.sqrt(inarray.shape[dimension])) 2868 2869 def asem(inarray, dimension=None, keepdims=0): 2870 """ 2871 Returns the standard error of the mean (i.e., using N) of the values 2872 in the passed array. Dimension can equal None (ravel array first), an 2873 integer (the dimension over which to operate), or a sequence (operate 2874 over multiple dimensions). Set keepdims=1 to return an array with the 2875 same number of dimensions as inarray. 2876 2877 Usage: asem(inarray,dimension=None, keepdims=0) 2878 """ 2879 if dimension == None: 2880 inarray = N.ravel(inarray) 2881 dimension = 0 2882 if type(dimension) == ListType: 2883 n = 1 2884 for d in dimension: 2885 n = n * inarray.shape[d] 2886 else: 2887 n = inarray.shape[dimension] 2888 s = asamplestdev(inarray, dimension, keepdims) / N.sqrt(n - 1) 2889 return s 2890 2891 def az(a, score): 2892 """ 2893 Returns the z-score of a given input score, given thearray from which 2894 that score came. Not appropriate for population calculations, nor for 2895 arrays > 1D. 2896 2897 Usage: az(a, score) 2898 """ 2899 z = (score - amean(a)) / asamplestdev(a) 2900 return z 2901 2902 def azs(a): 2903 """ 2904 Returns a 1D array of z-scores, one for each score in the passed array, 2905 computed relative to the passed array. 2906 2907 Usage: azs(a) 2908 """ 2909 zscores = [] 2910 for item in a: 2911 zscores.append(z(a, item)) 2912 return N.array(zscores) 2913 2914 def azmap(scores, compare, dimension=0): 2915 """ 2916 Returns an array of z-scores the shape of scores (e.g., [x,y]), compared to 2917 array passed to compare (e.g., [time,x,y]). Assumes collapsing over dim 0 2918 of the compare array. 2919 2920 Usage: azs(scores, compare, dimension=0) 2921 """ 2922 mns = amean(compare, dimension) 2923 sstd = asamplestdev(compare, 0) 2924 return (scores - mns) / sstd 2925 2926 ##################################### 2927 ####### ATRIMMING FUNCTIONS ####### 2928 ##################################### 2929 2930 ## deleted around() as it's in numpy now 2931 2932 def athreshold(a, threshmin=None, threshmax=None, newval=0): 2933 """ 2934 Like Numeric.clip() except that values <threshmid or >threshmax are replaced 2935 by newval instead of by threshmin/threshmax (respectively). 2936 2937 Usage: athreshold(a,threshmin=None,threshmax=None,newval=0) 2938 Returns: a, with values <threshmin or >threshmax replaced with newval 2939 """ 2940 mask = N.zeros(a.shape) 2941 if threshmin <> None: 2942 mask = mask + N.where(a < threshmin, 1, 0) 2943 if threshmax <> None: 2944 mask = mask + N.where(a > threshmax, 1, 0) 2945 mask = N.clip(mask, 0, 1) 2946 return N.where(mask, newval, a) 2947 2948 def atrimboth(a, proportiontocut): 2949 """ 2950 Slices off the passed proportion of items from BOTH ends of the passed 2951 array (i.e., with proportiontocut=0.1, slices 'leftmost' 10% AND 2952 'rightmost' 10% of scores. You must pre-sort the array if you want 2953 "proper" trimming. Slices off LESS if proportion results in a 2954 non-integer slice index (i.e., conservatively slices off 2955 proportiontocut). 2956 2957 Usage: atrimboth (a,proportiontocut) 2958 Returns: trimmed version of array a 2959 """ 2960 lowercut = int(proportiontocut * len(a)) 2961 uppercut = len(a) - lowercut 2962 return a[lowercut:uppercut] 2963 2964 def atrim1(a, proportiontocut, tail='right'): 2965 """ 2966 Slices off the passed proportion of items from ONE end of the passed 2967 array (i.e., if proportiontocut=0.1, slices off 'leftmost' or 'rightmost' 2968 10% of scores). Slices off LESS if proportion results in a non-integer 2969 slice index (i.e., conservatively slices off proportiontocut). 2970 2971 Usage: atrim1(a,proportiontocut,tail='right') or set tail='left' 2972 Returns: trimmed version of array a 2973 """ 2974 if string.lower(tail) == 'right': 2975 lowercut = 0 2976 uppercut = len(a) - int(proportiontocut * len(a)) 2977 elif string.lower(tail) == 'left': 2978 lowercut = int(proportiontocut * len(a)) 2979 uppercut = len(a) 2980 return a[lowercut:uppercut] 2981 2982 ##################################### 2983 ##### ACORRELATION FUNCTIONS ###### 2984 ##################################### 2985 2986 def acovariance(X): 2987 """ 2988 Computes the covariance matrix of a matrix X. Requires a 2D matrix input. 2989 2990 Usage: acovariance(X) 2991 Returns: covariance matrix of X 2992 """ 2993 if len(X.shape) <> 2: 2994 raise TypeError, 'acovariance requires 2D matrices' 2995 n = X.shape[0] 2996 mX = amean(X, 0) 2997 return N.dot(N.transpose(X), X) / float(n) - N.multiply.outer(mX, mX) 2998 2999 def acorrelation(X): 3000 """ 3001 Computes the correlation matrix of a matrix X. Requires a 2D matrix input. 3002 3003 Usage: acorrelation(X) 3004 Returns: correlation matrix of X 3005 """ 3006 C = acovariance(X) 3007 V = N.diagonal(C) 3008 return C / N.sqrt(N.multiply.outer(V, V)) 3009 3010 def apaired(x, y): 3011 """ 3012 Interactively determines the type of data in x and y, and then runs the 3013 appropriated statistic for paired group data. 3014 3015 Usage: apaired(x,y) x,y = the two arrays of values to be compared 3016 Returns: appropriate statistic name, value, and probability 3017 """ 3018 samples = '' 3019 while samples not in ['i', 'r', 'I', 'R', 'c', 'C']: 3020 print '\nIndependent or related samples, or correlation (i,r,c): ', 3021 samples = raw_input() 3022 3023 if samples in ['i', 'I', 'r', 'R']: 3024 print '\nComparing variances ...', 3025 # USE O'BRIEN'S TEST FOR HOMOGENEITY OF VARIANCE, Maxwell & delaney, p.112 3026 r = obrientransform(x, y) 3027 f, p = F_oneway(pstat.colex(r, 0), pstat.colex(r, 1)) 3028 if p < 0.05: 3029 vartype = 'unequal, p=' + str(round(p, 4)) 3030 else: 3031 vartype = 'equal' 3032 print vartype 3033 if samples in ['i', 'I']: 3034 if vartype[0] == 'e': 3035 t, p = ttest_ind(x, y, None, 0) 3036 print '\nIndependent samples t-test: ', round(t, 4), round(p, 4) 3037 else: 3038 if len(x) > 20 or len(y) > 20: 3039 z, p = ranksums(x, y) 3040 print '\nRank Sums test (NONparametric, n>20): ', round( 3041 z, 4), round(p, 4) 3042 else: 3043 u, p = mannwhitneyu(x, y) 3044 print '\nMann-Whitney U-test (NONparametric, ns<20): ', round( 3045 u, 4), round(p, 4) 3046 3047 else: # RELATED SAMPLES 3048 if vartype[0] == 'e': 3049 t, p = ttest_rel(x, y, 0) 3050 print '\nRelated samples t-test: ', round(t, 4), round(p, 4) 3051 else: 3052 t, p = ranksums(x, y) 3053 print '\nWilcoxon T-test (NONparametric): ', round(t, 4), round(p, 4) 3054 else: # CORRELATION ANALYSIS 3055 corrtype = '' 3056 while corrtype not in ['c', 'C', 'r', 'R', 'd', 'D']: 3057 print '\nIs the data Continuous, Ranked, or Dichotomous (c,r,d): ', 3058 corrtype = raw_input() 3059 if corrtype in ['c', 'C']: 3060 m, b, r, p, see = linregress(x, y) 3061 print '\nLinear regression for continuous variables ...' 3062 lol = [ 3063 ['Slope', 'Intercept', 'r', 'Prob', 'SEestimate'], 3064 [round(m, 4), round(b, 4), round(r, 4), round(p, 4), round(see, 4)] 3065 ] 3066 pstat.printcc(lol) 3067 elif corrtype in ['r', 'R']: 3068 r, p = spearmanr(x, y) 3069 print '\nCorrelation for ranked variables ...' 3070 print "Spearman's r: ", round(r, 4), round(p, 4) 3071 else: # DICHOTOMOUS 3072 r, p = pointbiserialr(x, y) 3073 print '\nAssuming x contains a dichotomous variable ...' 3074 print 'Point Biserial r: ', round(r, 4), round(p, 4) 3075 print '\n\n' 3076 return None 3077 3078 def dices(x, y): 3079 """ 3080 Calculates Dice's coefficient ... (2*number of common terms)/(number of terms in 3081 x + 3082 number of terms in y). Returns a value between 0 (orthogonal) and 1. 3083 3084 Usage: dices(x,y) 3085 """ 3086 import sets 3087 x = sets.Set(x) 3088 y = sets.Set(y) 3089 common = len(x.intersection(y)) 3090 total = float(len(x) + len(y)) 3091 return 2 * common / total 3092 3093 def icc(x, y=None, verbose=0): 3094 """ 3095 Calculates intraclass correlation coefficients using simple, Type I sums of 3096 squares. 3097 If only one variable is passed, assumed it's an Nx2 matrix 3098 3099 Usage: icc(x,y=None,verbose=0) 3100 Returns: icc rho, prob ####PROB IS A GUESS BASED ON PEARSON 3101 """ 3102 TINY = 1.0e-20 3103 if y: 3104 all = N.concatenate([x, y], 0) 3105 else: 3106 all = x + 0 3107 x = all[:, 0] 3108 y = all[:, 1] 3109 totalss = ass(all - mean(all)) 3110 pairmeans = (x + y) / 2. 3111 withinss = ass(x - pairmeans) + ass(y - pairmeans) 3112 withindf = float(len(x)) 3113 betwdf = float(len(x) - 1) 3114 withinms = withinss / withindf 3115 betweenms = (totalss - withinss) / betwdf 3116 rho = (betweenms - withinms) / (withinms + betweenms) 3117 t = rho * math.sqrt(betwdf / ((1.0 - rho + TINY) * (1.0 + rho + TINY))) 3118 prob = abetai(0.5 * betwdf, 0.5, betwdf / (betwdf + t * t), verbose) 3119 return rho, prob 3120 3121 def alincc(x, y): 3122 """ 3123 Calculates Lin's concordance correlation coefficient. 3124 3125 Usage: alincc(x,y) where x, y are equal-length arrays 3126 Returns: Lin's CC 3127 """ 3128 x = N.ravel(x) 3129 y = N.ravel(y) 3130 covar = acov(x, y) * (len(x) - 1) / float(len(x)) # correct denom to n 3131 xvar = avar(x) * (len(x) - 1) / float(len(x)) # correct denom to n 3132 yvar = avar(y) * (len(y) - 1) / float(len(y)) # correct denom to n 3133 lincc = (2 * covar) / ((xvar + yvar) + ((amean(x) - amean(y))**2)) 3134 return lincc 3135 3136 def apearsonr(x, y, verbose=1): 3137 """ 3138 Calculates a Pearson correlation coefficient and returns p. Taken 3139 from Heiman's Basic Statistics for the Behav. Sci (2nd), p.195. 3140 3141 Usage: apearsonr(x,y,verbose=1) where x,y are equal length arrays 3142 Returns: Pearson's r, two-tailed p-value 3143 """ 3144 TINY = 1.0e-20 3145 n = len(x) 3146 xmean = amean(x) 3147 ymean = amean(y) 3148 r_num = n * (N.add.reduce(x * y)) - N.add.reduce(x) * N.add.reduce(y) 3149 r_den = math.sqrt((n * ass(x) - asquare_of_sums(x)) * 3150 (n * ass(y) - asquare_of_sums(y))) 3151 r = (r_num / r_den) 3152 df = n - 2 3153 t = r * math.sqrt(df / ((1.0 - r + TINY) * (1.0 + r + TINY))) 3154 prob = abetai(0.5 * df, 0.5, df / (df + t * t), verbose) 3155 return r, prob 3156 3157 def aspearmanr(x, y): 3158 """ 3159 Calculates a Spearman rank-order correlation coefficient. Taken 3160 from Heiman's Basic Statistics for the Behav. Sci (1st), p.192. 3161 3162 Usage: aspearmanr(x,y) where x,y are equal-length arrays 3163 Returns: Spearman's r, two-tailed p-value 3164 """ 3165 TINY = 1e-30 3166 n = len(x) 3167 rankx = rankdata(x) 3168 ranky = rankdata(y) 3169 dsq = N.add.reduce((rankx - ranky)**2) 3170 rs = 1 - 6 * dsq / float(n * (n**2 - 1)) 3171 t = rs * math.sqrt((n - 2) / ((rs + 1.0) * (1.0 - rs))) 3172 df = n - 2 3173 probrs = abetai(0.5 * df, 0.5, df / (df + t * t)) 3174 # probability values for rs are from part 2 of the spearman function in 3175 # Numerical Recipies, p.510. They close to tables, but not exact.(?) 3176 return rs, probrs 3177 3178 def apointbiserialr(x, y): 3179 """ 3180 Calculates a point-biserial correlation coefficient and the associated 3181 probability value. Taken from Heiman's Basic Statistics for the Behav. 3182 Sci (1st), p.194. 3183 3184 Usage: apointbiserialr(x,y) where x,y are equal length arrays 3185 Returns: Point-biserial r, two-tailed p-value 3186 """ 3187 TINY = 1e-30 3188 categories = pstat.aunique(x) 3189 data = pstat.aabut(x, y) 3190 if len(categories) <> 2: 3191 raise ValueError, ('Exactly 2 categories required (in x) for ' 3192 'pointbiserialr().') 3193 else: # there are 2 categories, continue 3194 codemap = pstat.aabut(categories, N.arange(2)) 3195 recoded = pstat.arecode(data, codemap, 0) 3196 x = pstat.alinexand(data, 0, categories[0]) 3197 y = pstat.alinexand(data, 0, categories[1]) 3198 xmean = amean(pstat.acolex(x, 1)) 3199 ymean = amean(pstat.acolex(y, 1)) 3200 n = len(data) 3201 adjust = math.sqrt((len(x) / float(n)) * (len(y) / float(n))) 3202 rpb = (ymean - xmean) / asamplestdev(pstat.acolex(data, 1)) * adjust 3203 df = n - 2 3204 t = rpb * math.sqrt(df / ((1.0 - rpb + TINY) * (1.0 + rpb + TINY))) 3205 prob = abetai(0.5 * df, 0.5, df / (df + t * t)) 3206 return rpb, prob 3207 3208 def akendalltau(x, y): 3209 """ 3210 Calculates Kendall's tau ... correlation of ordinal data. Adapted 3211 from function kendl1 in Numerical Recipies. Needs good test-cases.@@@ 3212 3213 Usage: akendalltau(x,y) 3214 Returns: Kendall's tau, two-tailed p-value 3215 """ 3216 n1 = 0 3217 n2 = 0 3218 iss = 0 3219 for j in range(len(x) - 1): 3220 for k in range(j, len(y)): 3221 a1 = x[j] - x[k] 3222 a2 = y[j] - y[k] 3223 aa = a1 * a2 3224 if (aa): # neither array has a tie 3225 n1 = n1 + 1 3226 n2 = n2 + 1 3227 if aa > 0: 3228 iss = iss + 1 3229 else: 3230 iss = iss - 1 3231 else: 3232 if (a1): 3233 n1 = n1 + 1 3234 else: 3235 n2 = n2 + 1 3236 tau = iss / math.sqrt(n1 * n2) 3237 svar = (4.0 * len(x) + 10.0) / (9.0 * len(x) * (len(x) - 1)) 3238 z = tau / math.sqrt(svar) 3239 prob = erfcc(abs(z) / 1.4142136) 3240 return tau, prob 3241 3242 def alinregress(*args): 3243 """ 3244 Calculates a regression line on two arrays, x and y, corresponding to x,y 3245 pairs. If a single 2D array is passed, alinregress finds dim with 2 levels 3246 and splits data into x,y pairs along that dim. 3247 3248 Usage: alinregress(*args) args=2 equal-length arrays, or one 2D array 3249 Returns: slope, intercept, r, two-tailed prob, sterr-of-the-estimate, n 3250 """ 3251 TINY = 1.0e-20 3252 if len(args) == 1: # more than 1D array? 3253 args = args[0] 3254 if len(args) == 2: 3255 x = args[0] 3256 y = args[1] 3257 else: 3258 x = args[:, 0] 3259 y = args[:, 1] 3260 else: 3261 x = args[0] 3262 y = args[1] 3263 n = len(x) 3264 xmean = amean(x) 3265 ymean = amean(y) 3266 r_num = n * (N.add.reduce(x * y)) - N.add.reduce(x) * N.add.reduce(y) 3267 r_den = math.sqrt((n * ass(x) - asquare_of_sums(x)) * 3268 (n * ass(y) - asquare_of_sums(y))) 3269 r = r_num / r_den 3270 z = 0.5 * math.log((1.0 + r + TINY) / (1.0 - r + TINY)) 3271 df = n - 2 3272 t = r * math.sqrt(df / ((1.0 - r + TINY) * (1.0 + r + TINY))) 3273 prob = abetai(0.5 * df, 0.5, df / (df + t * t)) 3274 slope = r_num / (float(n) * ass(x) - asquare_of_sums(x)) 3275 intercept = ymean - slope * xmean 3276 sterrest = math.sqrt(1 - r * r) * asamplestdev(y) 3277 return slope, intercept, r, prob, sterrest, n 3278 3279 def amasslinregress(*args): 3280 """ 3281 Calculates a regression line on one 1D array (x) and one N-D array (y). 3282 3283 Returns: slope, intercept, r, two-tailed prob, sterr-of-the-estimate, n 3284 """ 3285 TINY = 1.0e-20 3286 if len(args) == 1: # more than 1D array? 3287 args = args[0] 3288 if len(args) == 2: 3289 x = N.ravel(args[0]) 3290 y = args[1] 3291 else: 3292 x = N.ravel(args[:, 0]) 3293 y = args[:, 1] 3294 else: 3295 x = args[0] 3296 y = args[1] 3297 x = x.astype(N.float_) 3298 y = y.astype(N.float_) 3299 n = len(x) 3300 xmean = amean(x) 3301 ymean = amean(y, 0) 3302 shp = N.ones(len(y.shape)) 3303 shp[0] = len(x) 3304 x.shape = shp 3305 print x.shape, y.shape 3306 r_num = n * (N.add.reduce(x * y, 0)) - N.add.reduce(x) * N.add.reduce(y, 0) 3307 r_den = N.sqrt((n * ass(x) - asquare_of_sums(x)) * 3308 (n * ass(y, 0) - asquare_of_sums(y, 0))) 3309 zerodivproblem = N.equal(r_den, 0) 3310 r_den = N.where(zerodivproblem, 1, r_den 3311 ) # avoid zero-division in 1st place 3312 r = r_num / r_den # need to do this nicely for matrix division 3313 r = N.where(zerodivproblem, 0.0, r) 3314 z = 0.5 * N.log((1.0 + r + TINY) / (1.0 - r + TINY)) 3315 df = n - 2 3316 t = r * N.sqrt(df / ((1.0 - r + TINY) * (1.0 + r + TINY))) 3317 prob = abetai(0.5 * df, 0.5, df / (df + t * t)) 3318 3319 ss = float(n) * ass(x) - asquare_of_sums(x) 3320 s_den = N.where(ss == 0, 1, ss) # avoid zero-division in 1st place 3321 slope = r_num / s_den 3322 intercept = ymean - slope * xmean 3323 sterrest = N.sqrt(1 - r * r) * asamplestdev(y, 0) 3324 return slope, intercept, r, prob, sterrest, n 3325 3326 ##################################### 3327 ##### AINFERENTIAL STATISTICS ##### 3328 ##################################### 3329 3330 def attest_1samp(a, popmean, printit=0, name='Sample', writemode='a'): 3331 """ 3332 Calculates the t-obtained for the independent samples T-test on ONE group 3333 of scores a, given a population mean. If printit=1, results are printed 3334 to the screen. If printit='filename', the results are output to 'filename' 3335 using the given writemode (default=append). Returns t-value, and prob. 3336 3337 Usage: attest_1samp(a,popmean,Name='Sample',printit=0,writemode='a') 3338 Returns: t-value, two-tailed prob 3339 """ 3340 if type(a) != N.ndarray: 3341 a = N.array(a) 3342 x = amean(a) 3343 v = avar(a) 3344 n = len(a) 3345 df = n - 1 3346 svar = ((n - 1) * v) / float(df) 3347 t = (x - popmean) / math.sqrt(svar * (1.0 / n)) 3348 prob = abetai(0.5 * df, 0.5, df / (df + t * t)) 3349 3350 if printit <> 0: 3351 statname = 'Single-sample T-test.' 3352 outputpairedstats(printit, writemode, 'Population', '--', popmean, 0, 0, 3353 0, name, n, x, v, N.minimum.reduce(N.ravel(a)), 3354 N.maximum.reduce(N.ravel(a)), statname, t, prob) 3355 return t, prob 3356 3357 def attest_ind(a, 3358 b, 3359 dimension=None, 3360 printit=0, 3361 name1='Samp1', 3362 name2='Samp2', 3363 writemode='a'): 3364 """ 3365 Calculates the t-obtained T-test on TWO INDEPENDENT samples of scores 3366 a, and b. From Numerical Recipies, p.483. If printit=1, results are 3367 printed to the screen. If printit='filename', the results are output 3368 to 'filename' using the given writemode (default=append). Dimension 3369 can equal None (ravel array first), or an integer (the dimension over 3370 which to operate on a and b). 3371 3372 Usage: attest_ind (a,b,dimension=None,printit=0, 3373 Name1='Samp1',Name2='Samp2',writemode='a') 3374 Returns: t-value, two-tailed p-value 3375 """ 3376 if dimension == None: 3377 a = N.ravel(a) 3378 b = N.ravel(b) 3379 dimension = 0 3380 x1 = amean(a, dimension) 3381 x2 = amean(b, dimension) 3382 v1 = avar(a, dimension) 3383 v2 = avar(b, dimension) 3384 n1 = a.shape[dimension] 3385 n2 = b.shape[dimension] 3386 df = n1 + n2 - 2 3387 svar = ((n1 - 1) * v1 + (n2 - 1) * v2) / float(df) 3388 zerodivproblem = N.equal(svar, 0) 3389 svar = N.where(zerodivproblem, 1, svar) # avoid zero-division in 1st place 3390 t = (x1 - x2) / N.sqrt(svar * 3391 (1.0 / n1 + 1.0 / n2)) # N-D COMPUTATION HERE!!!!!! 3392 t = N.where(zerodivproblem, 1.0, t) # replace NaN/wrong t-values with 1.0 3393 probs = abetai(0.5 * df, 0.5, float(df) / (df + t * t)) 3394 3395 if type(t) == N.ndarray: 3396 probs = N.reshape(probs, t.shape) 3397 if probs.shape == (1,): 3398 probs = probs[0] 3399 3400 if printit <> 0: 3401 if type(t) == N.ndarray: 3402 t = t[0] 3403 if type(probs) == N.ndarray: 3404 probs = probs[0] 3405 statname = 'Independent samples T-test.' 3406 outputpairedstats(printit, writemode, name1, n1, x1, v1, 3407 N.minimum.reduce(N.ravel(a)), 3408 N.maximum.reduce(N.ravel(a)), name2, n2, x2, v2, 3409 N.minimum.reduce(N.ravel(b)), 3410 N.maximum.reduce(N.ravel(b)), statname, t, probs) 3411 return 3412 return t, probs 3413 3414 def ap2t(pval, df): 3415 """ 3416 Tries to compute a t-value from a p-value (or pval array) and associated df. 3417 SLOW for large numbers of elements(!) as it re-computes p-values 20 times 3418 (smaller step-sizes) at which point it decides it's done. Keeps the signs 3419 of the input array. Returns 1000 (or -1000) if t>100. 3420 3421 Usage: ap2t(pval,df) 3422 Returns: an array of t-values with the shape of pval 3423 """ 3424 pval = N.array(pval) 3425 signs = N.sign(pval) 3426 pval = abs(pval) 3427 t = N.ones(pval.shape, N.float_) * 50 3428 step = N.ones(pval.shape, N.float_) * 25 3429 print 'Initial ap2t() prob calc' 3430 prob = abetai(0.5 * df, 0.5, float(df) / (df + t * t)) 3431 print 'ap2t() iter: ', 3432 for i in range(10): 3433 print i, ' ', 3434 t = N.where(pval < prob, t + step, t - step) 3435 prob = abetai(0.5 * df, 0.5, float(df) / (df + t * t)) 3436 step = step / 2 3437 print 3438 # since this is an ugly hack, we get ugly boundaries 3439 t = N.where(t > 99.9, 1000, t) # hit upper-boundary 3440 t = t + signs 3441 return t #, prob, pval 3442 3443 def attest_rel(a, 3444 b, 3445 dimension=None, 3446 printit=0, 3447 name1='Samp1', 3448 name2='Samp2', 3449 writemode='a'): 3450 """ 3451 Calculates the t-obtained T-test on TWO RELATED samples of scores, a 3452 and b. From Numerical Recipies, p.483. If printit=1, results are 3453 printed to the screen. If printit='filename', the results are output 3454 to 'filename' using the given writemode (default=append). Dimension 3455 can equal None (ravel array first), or an integer (the dimension over 3456 which to operate on a and b). 3457 3458 Usage: attest_rel(a,b,dimension=None,printit=0, 3459 name1='Samp1',name2='Samp2',writemode='a') 3460 Returns: t-value, two-tailed p-value 3461 """ 3462 if dimension == None: 3463 a = N.ravel(a) 3464 b = N.ravel(b) 3465 dimension = 0 3466 if len(a) <> len(b): 3467 raise ValueError, 'Unequal length arrays.' 3468 x1 = amean(a, dimension) 3469 x2 = amean(b, dimension) 3470 v1 = avar(a, dimension) 3471 v2 = avar(b, dimension) 3472 n = a.shape[dimension] 3473 df = float(n - 1) 3474 d = (a - b).astype('d') 3475 3476 denom = N.sqrt( 3477 (n * N.add.reduce(d * d, dimension) - N.add.reduce(d, dimension)**2) / 3478 df) 3479 zerodivproblem = N.equal(denom, 0) 3480 denom = N.where(zerodivproblem, 1, denom 3481 ) # avoid zero-division in 1st place 3482 t = N.add.reduce(d, dimension) / denom # N-D COMPUTATION HERE!!!!!! 3483 t = N.where(zerodivproblem, 1.0, t) # replace NaN/wrong t-values with 1.0 3484 probs = abetai(0.5 * df, 0.5, float(df) / (df + t * t)) 3485 if type(t) == N.ndarray: 3486 probs = N.reshape(probs, t.shape) 3487 if probs.shape == (1,): 3488 probs = probs[0] 3489 3490 if printit <> 0: 3491 statname = 'Related samples T-test.' 3492 outputpairedstats(printit, writemode, name1, n, x1, v1, 3493 N.minimum.reduce(N.ravel(a)), 3494 N.maximum.reduce(N.ravel(a)), name2, n, x2, v2, 3495 N.minimum.reduce(N.ravel(b)), 3496 N.maximum.reduce(N.ravel(b)), statname, t, probs) 3497 return 3498 return t, probs 3499 3500 def achisquare(f_obs, f_exp=None): 3501 """ 3502 Calculates a one-way chi square for array of observed frequencies and returns 3503 the result. If no expected frequencies are given, the total N is assumed to 3504 be equally distributed across all groups. 3505 @@@NOT RIGHT?? 3506 3507 Usage: achisquare(f_obs, f_exp=None) f_obs = array of observed cell freq. 3508 Returns: chisquare-statistic, associated p-value 3509 """ 3510 3511 k = len(f_obs) 3512 if f_exp == None: 3513 f_exp = N.array([sum(f_obs) / float(k)] * len(f_obs), N.float_) 3514 f_exp = f_exp.astype(N.float_) 3515 chisq = N.add.reduce((f_obs - f_exp)**2 / f_exp) 3516 return chisq, achisqprob(chisq, k - 1) 3517 3518 def aks_2samp(data1, data2): 3519 """ 3520 Computes the Kolmogorov-Smirnof statistic on 2 samples. Modified from 3521 Numerical Recipies in C, page 493. Returns KS D-value, prob. Not ufunc- 3522 like. 3523 3524 Usage: aks_2samp(data1,data2) where data1 and data2 are 1D arrays 3525 Returns: KS D-value, p-value 3526 """ 3527 j1 = 0 # N.zeros(data1.shape[1:]) TRIED TO MAKE THIS UFUNC-LIKE 3528 j2 = 0 # N.zeros(data2.shape[1:]) 3529 fn1 = 0.0 # N.zeros(data1.shape[1:],N.float_) 3530 fn2 = 0.0 # N.zeros(data2.shape[1:],N.float_) 3531 n1 = data1.shape[0] 3532 n2 = data2.shape[0] 3533 en1 = n1 * 1 3534 en2 = n2 * 1 3535 d = N.zeros(data1.shape[1:], N.float_) 3536 data1 = N.sort(data1, 0) 3537 data2 = N.sort(data2, 0) 3538 while j1 < n1 and j2 < n2: 3539 d1 = data1[j1] 3540 d2 = data2[j2] 3541 if d1 <= d2: 3542 fn1 = (j1) / float(en1) 3543 j1 = j1 + 1 3544 if d2 <= d1: 3545 fn2 = (j2) / float(en2) 3546 j2 = j2 + 1 3547 dt = (fn2 - fn1) 3548 if abs(dt) > abs(d): 3549 d = dt 3550 # try: 3551 en = math.sqrt(en1 * en2 / float(en1 + en2)) 3552 prob = aksprob((en + 0.12 + 0.11 / en) * N.fabs(d)) 3553 # except: 3554 # prob = 1.0 3555 return d, prob 3556 3557 def amannwhitneyu(x, y): 3558 """ 3559 Calculates a Mann-Whitney U statistic on the provided scores and 3560 returns the result. Use only when the n in each condition is < 20 and 3561 you have 2 independent samples of ranks. REMEMBER: Mann-Whitney U is 3562 significant if the u-obtained is LESS THAN or equal to the critical 3563 value of U. 3564 3565 Usage: amannwhitneyu(x,y) where x,y are arrays of values for 2 conditions 3566 Returns: u-statistic, one-tailed p-value (i.e., p(z(U))) 3567 """ 3568 n1 = len(x) 3569 n2 = len(y) 3570 ranked = rankdata(N.concatenate((x, y))) 3571 rankx = ranked[0:n1] # get the x-ranks 3572 ranky = ranked[n1:] # the rest are y-ranks 3573 u1 = n1 * n2 + (n1 * (n1 + 1)) / 2.0 - sum(rankx) # calc U for x 3574 u2 = n1 * n2 - u1 # remainder is U for y 3575 bigu = max(u1, u2) 3576 smallu = min(u1, u2) 3577 proportion = bigu / float(n1 * n2) 3578 T = math.sqrt(tiecorrect(ranked)) # correction factor for tied scores 3579 if T == 0: 3580 raise ValueError, 'All numbers are identical in amannwhitneyu' 3581 sd = math.sqrt(T * n1 * n2 * (n1 + n2 + 1) / 12.0) 3582 z = abs((bigu - n1 * n2 / 2.0) / sd) # normal approximation for prob calc 3583 return smallu, 1.0 - azprob(z), proportion 3584 3585 def atiecorrect(rankvals): 3586 """ 3587 Tie-corrector for ties in Mann Whitney U and Kruskal Wallis H tests. 3588 See Siegel, S. (1956) Nonparametric Statistics for the Behavioral 3589 Sciences. New York: McGraw-Hill. Code adapted from |Stat rankind.c 3590 code. 3591 3592 Usage: atiecorrect(rankvals) 3593 Returns: T correction factor for U or H 3594 """ 3595 sorted, posn = ashellsort(N.array(rankvals)) 3596 n = len(sorted) 3597 T = 0.0 3598 i = 0 3599 while (i < n - 1): 3600 if sorted[i] == sorted[i + 1]: 3601 nties = 1 3602 while (i < n - 1) and (sorted[i] == sorted[i + 1]): 3603 nties = nties + 1 3604 i = i + 1 3605 T = T + nties**3 - nties 3606 i = i + 1 3607 T = T / float(n**3 - n) 3608 return 1.0 - T 3609 3610 def aranksums(x, y): 3611 """ 3612 Calculates the rank sums statistic on the provided scores and returns 3613 the result. 3614 3615 Usage: aranksums(x,y) where x,y are arrays of values for 2 conditions 3616 Returns: z-statistic, two-tailed p-value 3617 """ 3618 n1 = len(x) 3619 n2 = len(y) 3620 alldata = N.concatenate((x, y)) 3621 ranked = arankdata(alldata) 3622 x = ranked[:n1] 3623 y = ranked[n1:] 3624 s = sum(x) 3625 expected = n1 * (n1 + n2 + 1) / 2.0 3626 z = (s - expected) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0) 3627 prob = 2 * (1.0 - azprob(abs(z))) 3628 return z, prob 3629 3630 def awilcoxont(x, y): 3631 """ 3632 Calculates the Wilcoxon T-test for related samples and returns the 3633 result. A non-parametric T-test. 3634 3635 Usage: awilcoxont(x,y) where x,y are equal-length arrays for 2 conditions 3636 Returns: t-statistic, two-tailed p-value 3637 """ 3638 if len(x) <> len(y): 3639 raise ValueError, 'Unequal N in awilcoxont. Aborting.' 3640 d = x - y 3641 d = N.compress(N.not_equal(d, 0), d) # Keep all non-zero differences 3642 count = len(d) 3643 absd = abs(d) 3644 absranked = arankdata(absd) 3645 r_plus = 0.0 3646 r_minus = 0.0 3647 for i in range(len(absd)): 3648 if d[i] < 0: 3649 r_minus = r_minus + absranked[i] 3650 else: 3651 r_plus = r_plus + absranked[i] 3652 wt = min(r_plus, r_minus) 3653 mn = count * (count + 1) * 0.25 3654 se = math.sqrt(count * (count + 1) * (2.0 * count + 1.0) / 24.0) 3655 z = math.fabs(wt - mn) / se 3656 z = math.fabs(wt - mn) / se 3657 prob = 2 * (1.0 - zprob(abs(z))) 3658 return wt, prob 3659 3660 def akruskalwallish(*args): 3661 """ 3662 The Kruskal-Wallis H-test is a non-parametric ANOVA for 3 or more 3663 groups, requiring at least 5 subjects in each group. This function 3664 calculates the Kruskal-Wallis H and associated p-value for 3 or more 3665 independent samples. 3666 3667 Usage: akruskalwallish(*args) args are separate arrays for 3+ conditions 3668 Returns: H-statistic (corrected for ties), associated p-value 3669 """ 3670 assert len(args) == 3, 'Need at least 3 groups in stats.akruskalwallish()' 3671 args = list(args) 3672 n = [0] * len(args) 3673 n = map(len, args) 3674 all = [] 3675 for i in range(len(args)): 3676 all = all + args[i].tolist() 3677 ranked = rankdata(all) 3678 T = tiecorrect(ranked) 3679 for i in range(len(args)): 3680 args[i] = ranked[0:n[i]] 3681 del ranked[0:n[i]] 3682 rsums = [] 3683 for i in range(len(args)): 3684 rsums.append(sum(args[i])**2) 3685 rsums[i] = rsums[i] / float(n[i]) 3686 ssbn = sum(rsums) 3687 totaln = sum(n) 3688 h = 12.0 / (totaln * (totaln + 1)) * ssbn - 3 * (totaln + 1) 3689 df = len(args) - 1 3690 if T == 0: 3691 raise ValueError, 'All numbers are identical in akruskalwallish' 3692 h = h / float(T) 3693 return h, chisqprob(h, df) 3694 3695 def afriedmanchisquare(*args): 3696 """ 3697 Friedman Chi-Square is a non-parametric, one-way within-subjects 3698 ANOVA. This function calculates the Friedman Chi-square test for 3699 repeated measures and returns the result, along with the associated 3700 probability value. It assumes 3 or more repeated measures. Only 3 3701 levels requires a minimum of 10 subjects in the study. Four levels 3702 requires 5 subjects per level(??). 3703 3704 Usage: afriedmanchisquare(*args) args are separate arrays for 2+ conditions 3705 Returns: chi-square statistic, associated p-value 3706 """ 3707 k = len(args) 3708 if k < 3: 3709 raise ValueError, ('\nLess than 3 levels. Friedman test not ' 3710 'appropriate.\n') 3711 n = len(args[0]) 3712 data = apply(pstat.aabut, args) 3713 data = data.astype(N.float_) 3714 for i in range(len(data)): 3715 data[i] = arankdata(data[i]) 3716 ssbn = asum(asum(args, 1)**2) 3717 chisq = 12.0 / (k * n * (k + 1)) * ssbn - 3 * n * (k + 1) 3718 return chisq, achisqprob(chisq, k - 1) 3719 3720 ##################################### 3721 #### APROBABILITY CALCULATIONS #### 3722 ##################################### 3723 3724 def achisqprob(chisq, df): 3725 """ 3726 Returns the (1-tail) probability value associated with the provided chi-square 3727 value and df. Heavily modified from chisq.c in Gary Perlman's |Stat. Can 3728 handle multiple dimensions. 3729 3730 Usage: achisqprob(chisq,df) chisq=chisquare stat., df=degrees of freedom 3731 """ 3732 BIG = 200.0 3733 3734 def ex(x): 3735 BIG = 200.0 3736 exponents = N.where(N.less(x, -BIG), -BIG, x) 3737 return N.exp(exponents) 3738 3739 if type(chisq) == N.ndarray: 3740 arrayflag = 1 3741 else: 3742 arrayflag = 0 3743 chisq = N.array([chisq]) 3744 if df < 1: 3745 return N.ones(chisq.shape, N.float) 3746 probs = N.zeros(chisq.shape, N.float_) 3747 probs = N.where( 3748 N.less_equal(chisq, 0), 1.0, probs) # set prob=1 for chisq<0 3749 a = 0.5 * chisq 3750 if df > 1: 3751 y = ex(-a) 3752 if df % 2 == 0: 3753 even = 1 3754 s = y * 1 3755 s2 = s * 1 3756 else: 3757 even = 0 3758 s = 2.0 * azprob(-N.sqrt(chisq)) 3759 s2 = s * 1 3760 if (df > 2): 3761 chisq = 0.5 * (df - 1.0) 3762 if even: 3763 z = N.ones(probs.shape, N.float_) 3764 else: 3765 z = 0.5 * N.ones(probs.shape, N.float_) 3766 if even: 3767 e = N.zeros(probs.shape, N.float_) 3768 else: 3769 e = N.log(N.sqrt(N.pi)) * N.ones(probs.shape, N.float_) 3770 c = N.log(a) 3771 mask = N.zeros(probs.shape) 3772 a_big = N.greater(a, BIG) 3773 a_big_frozen = -1 * N.ones(probs.shape, N.float_) 3774 totalelements = N.multiply.reduce(N.array(probs.shape)) 3775 while asum(mask) <> totalelements: 3776 e = N.log(z) + e 3777 s = s + ex(c * z - a - e) 3778 z = z + 1.0 3779 # print z, e, s 3780 newmask = N.greater(z, chisq) 3781 a_big_frozen = N.where(newmask * N.equal(mask, 0) * a_big, s, 3782 a_big_frozen) 3783 mask = N.clip(newmask + mask, 0, 1) 3784 if even: 3785 z = N.ones(probs.shape, N.float_) 3786 e = N.ones(probs.shape, N.float_) 3787 else: 3788 z = 0.5 * N.ones(probs.shape, N.float_) 3789 e = 1.0 / N.sqrt(N.pi) / N.sqrt(a) * N.ones(probs.shape, N.float_) 3790 c = 0.0 3791 mask = N.zeros(probs.shape) 3792 a_notbig_frozen = -1 * N.ones(probs.shape, N.float_) 3793 while asum(mask) <> totalelements: 3794 e = e * (a / z.astype(N.float_)) 3795 c = c + e 3796 z = z + 1.0 3797 # print '#2', z, e, c, s, c*y+s2 3798 newmask = N.greater(z, chisq) 3799 a_notbig_frozen = N.where(newmask * N.equal(mask, 0) * (1 - a_big), 3800 c * y + s2, a_notbig_frozen) 3801 mask = N.clip(newmask + mask, 0, 1) 3802 probs = N.where( 3803 N.equal(probs, 1), 1, N.where( 3804 N.greater(a, BIG), a_big_frozen, a_notbig_frozen)) 3805 return probs 3806 else: 3807 return s 3808 3809 def aerfcc(x): 3810 """ 3811 Returns the complementary error function erfc(x) with fractional error 3812 everywhere less than 1.2e-7. Adapted from Numerical Recipies. Can 3813 handle multiple dimensions. 3814 3815 Usage: aerfcc(x) 3816 """ 3817 z = abs(x) 3818 t = 1.0 / (1.0 + 0.5 * z) 3819 ans = t * N.exp(-z * z - 1.26551223 + t * (1.00002368 + t * ( 3820 0.37409196 + t * (0.09678418 + t * (-0.18628806 + t * ( 3821 0.27886807 + t * (-1.13520398 + t * (1.48851587 + t * ( 3822 -0.82215223 + t * 0.17087277))))))))) 3823 return N.where(N.greater_equal(x, 0), ans, 2.0 - ans) 3824 3825 def azprob(z): 3826 """ 3827 Returns the area under the normal curve 'to the left of' the given z value. 3828 Thus, 3829 for z<0, zprob(z) = 1-tail probability 3830 for z>0, 1.0-zprob(z) = 1-tail probability 3831 for any z, 2.0*(1.0-zprob(abs(z))) = 2-tail probability 3832 Adapted from z.c in Gary Perlman's |Stat. Can handle multiple dimensions. 3833 3834 Usage: azprob(z) where z is a z-value 3835 """ 3836 3837 def yfunc(y): 3838 x = ((((((( 3839 ((((((-0.000045255659 * y + 0.000152529290) * y - 0.000019538132) * y 3840 - 0.000676904986) * y + 0.001390604284) * y - 0.000794620820) * y 3841 - 0.002034254874) * y + 0.006549791214) * y - 0.010557625006) * y + 3842 0.011630447319) * y - 0.009279453341) * y + 0.005353579108) * y - 3843 0.002141268741) * y + 0.000535310849) * y + 0.999936657524 3844 return x 3845 3846 def wfunc(w): 3847 x = ((((((((0.000124818987 * w - 0.001075204047) * w + 0.005198775019) * w 3848 - 0.019198292004) * w + 0.059054035642) * w - 0.151968751364) * 3849 w + 0.319152932694) * w - 0.531923007300) * w + 3850 0.797884560593) * N.sqrt(w) * 2.0 3851 return x 3852 3853 Z_MAX = 6.0 # maximum meaningful z-value 3854 x = N.zeros(z.shape, N.float_) # initialize 3855 y = 0.5 * N.fabs(z) 3856 x = N.where(N.less(y, 1.0), wfunc(y * y), yfunc(y - 2.0)) # get x's 3857 x = N.where(N.greater(y, Z_MAX * 0.5), 1.0, x) # kill those with big Z 3858 prob = N.where(N.greater(z, 0), (x + 1) * 0.5, (1 - x) * 0.5) 3859 return prob 3860 3861 def aksprob(alam): 3862 """ 3863 Returns the probability value for a K-S statistic computed via ks_2samp. 3864 Adapted from Numerical Recipies. Can handle multiple dimensions. 3865 3866 Usage: aksprob(alam) 3867 """ 3868 if type(alam) == N.ndarray: 3869 frozen = -1 * N.ones(alam.shape, N.float64) 3870 alam = alam.astype(N.float64) 3871 arrayflag = 1 3872 else: 3873 frozen = N.array(-1.) 3874 alam = N.array(alam, N.float64) 3875 arrayflag = 1 3876 mask = N.zeros(alam.shape) 3877 fac = 2.0 * N.ones(alam.shape, N.float_) 3878 sum = N.zeros(alam.shape, N.float_) 3879 termbf = N.zeros(alam.shape, N.float_) 3880 a2 = N.array(-2.0 * alam * alam, N.float64) 3881 totalelements = N.multiply.reduce(N.array(mask.shape)) 3882 for j in range(1, 201): 3883 if asum(mask) == totalelements: 3884 break 3885 exponents = (a2 * j * j) 3886 overflowmask = N.less(exponents, -746) 3887 frozen = N.where(overflowmask, 0, frozen) 3888 mask = mask + overflowmask 3889 term = fac * N.exp(exponents) 3890 sum = sum + term 3891 newmask = N.where( 3892 N.less_equal( 3893 abs(term), (0.001 * termbf)) + N.less( 3894 abs(term), 1.0e-8 * sum), 1, 0) 3895 frozen = N.where(newmask * N.equal(mask, 0), sum, frozen) 3896 mask = N.clip(mask + newmask, 0, 1) 3897 fac = -fac 3898 termbf = abs(term) 3899 if arrayflag: 3900 return N.where( 3901 N.equal(frozen, -1), 1.0, frozen) # 1.0 if doesn't converge 3902 else: 3903 return N.where( 3904 N.equal(frozen, -1), 1.0, frozen)[0] # 1.0 if doesn't converge 3905 3906 def afprob(dfnum, dfden, F): 3907 """ 3908 Returns the 1-tailed significance level (p-value) of an F statistic 3909 given the degrees of freedom for the numerator (dfR-dfF) and the degrees 3910 of freedom for the denominator (dfF). Can handle multiple dims for F. 3911 3912 Usage: afprob(dfnum, dfden, F) where usually dfnum=dfbn, dfden=dfwn 3913 """ 3914 if type(F) == N.ndarray: 3915 return abetai(0.5 * dfden, 0.5 * dfnum, dfden / (1.0 * dfden + dfnum * F)) 3916 else: 3917 return abetai(0.5 * dfden, 0.5 * dfnum, dfden / float(dfden + dfnum * F)) 3918 3919 def abetacf(a, b, x, verbose=1): 3920 """ 3921 Evaluates the continued fraction form of the incomplete Beta function, 3922 betai. (Adapted from: Numerical Recipies in C.) Can handle multiple 3923 dimensions for x. 3924 3925 Usage: abetacf(a,b,x,verbose=1) 3926 """ 3927 ITMAX = 200 3928 EPS = 3.0e-7 3929 3930 arrayflag = 1 3931 if type(x) == N.ndarray: 3932 frozen = N.ones(x.shape, 3933 N.float_) * -1 #start out w/ -1s, should replace all 3934 else: 3935 arrayflag = 0 3936 frozen = N.array([-1]) 3937 x = N.array([x]) 3938 mask = N.zeros(x.shape) 3939 bm = az = am = 1.0 3940 qab = a + b 3941 qap = a + 1.0 3942 qam = a - 1.0 3943 bz = 1.0 - qab * x / qap 3944 for i in range(ITMAX + 1): 3945 if N.sum(N.ravel(N.equal(frozen, -1))) == 0: 3946 break 3947 em = float(i + 1) 3948 tem = em + em 3949 d = em * (b - em) * x / ((qam + tem) * (a + tem)) 3950 ap = az + d * am 3951 bp = bz + d * bm 3952 d = -(a + em) * (qab + em) * x / ((qap + tem) * (a + tem)) 3953 app = ap + d * az 3954 bpp = bp + d * bz 3955 aold = az * 1 3956 am = ap / bpp 3957 bm = bp / bpp 3958 az = app / bpp 3959 bz = 1.0 3960 newmask = N.less(abs(az - aold), EPS * abs(az)) 3961 frozen = N.where(newmask * N.equal(mask, 0), az, frozen) 3962 mask = N.clip(mask + newmask, 0, 1) 3963 noconverge = asum(N.equal(frozen, -1)) 3964 if noconverge <> 0 and verbose: 3965 print 'a or b too big, or ITMAX too small in Betacf for ', noconverge, ' elements' 3966 if arrayflag: 3967 return frozen 3968 else: 3969 return frozen[0] 3970 3971 def agammln(xx): 3972 """ 3973 Returns the gamma function of xx. 3974 Gamma(z) = Integral(0,infinity) of t^(z-1)exp(-t) dt. 3975 Adapted from: Numerical Recipies in C. Can handle multiple dims ... but 3976 probably doesn't normally have to. 3977 3978 Usage: agammln(xx) 3979 """ 3980 coeff = [76.18009173, -86.50532033, 24.01409822, -1.231739516, 3981 0.120858003e-2, -0.536382e-5] 3982 x = xx - 1.0 3983 tmp = x + 5.5 3984 tmp = tmp - (x + 0.5) * N.log(tmp) 3985 ser = 1.0 3986 for j in range(len(coeff)): 3987 x = x + 1 3988 ser = ser + coeff[j] / x 3989 return -tmp + N.log(2.50662827465 * ser) 3990 3991 def abetai(a, b, x, verbose=1): 3992 """ 3993 Returns the incomplete beta function: 3994 3995 I-sub-x(a,b) = 1/B(a,b)*(Integral(0,x) of t^(a-1)(1-t)^(b-1) dt) 3996 3997 where a,b>0 and B(a,b) = G(a)*G(b)/(G(a+b)) where G(a) is the gamma 3998 function of a. The continued fraction formulation is implemented 3999 here, using the betacf function. (Adapted from: Numerical Recipies in 4000 C.) Can handle multiple dimensions. 4001 4002 Usage: abetai(a,b,x,verbose=1) 4003 """ 4004 TINY = 1e-15 4005 if type(a) == N.ndarray: 4006 if asum(N.less(x, 0) + N.greater(x, 1)) <> 0: 4007 raise ValueError, 'Bad x in abetai' 4008 x = N.where(N.equal(x, 0), TINY, x) 4009 x = N.where(N.equal(x, 1.0), 1 - TINY, x) 4010 4011 bt = N.where(N.equal(x, 0) + N.equal(x, 1), 0, -1) 4012 exponents = (gammln(a + b) - gammln(a) - gammln(b) + a * N.log(x) + b * 4013 N.log(1.0 - x)) 4014 # 746 (below) is the MAX POSSIBLE BEFORE OVERFLOW 4015 exponents = N.where(N.less(exponents, -740), -740, exponents) 4016 bt = N.exp(exponents) 4017 if type(x) == N.ndarray: 4018 ans = N.where( 4019 N.less(x, (a + 1) / (a + b + 2.0)), bt * abetacf(a, b, x, verbose) / 4020 float(a), 1.0 - bt * abetacf(b, a, 1.0 - x, verbose) / float(b)) 4021 else: 4022 if x < (a + 1) / (a + b + 2.0): 4023 ans = bt * abetacf(a, b, x, verbose) / float(a) 4024 else: 4025 ans = 1.0 - bt * abetacf(b, a, 1.0 - x, verbose) / float(b) 4026 return ans 4027 4028 ##################################### 4029 ####### AANOVA CALCULATIONS ####### 4030 ##################################### 4031 4032 import numpy.linalg, operator 4033 LA = numpy.linalg 4034 4035 def aglm(data, para): 4036 """ 4037 Calculates a linear model fit ... anova/ancova/lin-regress/t-test/etc. Taken 4038 from: 4039 Peterson et al. Statistical limitations in functional neuroimaging 4040 I. Non-inferential methods and statistical models. Phil Trans Royal Soc 4041 Lond B 354: 1239-1260. 4042 4043 Usage: aglm(data,para) 4044 Returns: statistic, p-value ??? 4045 """ 4046 if len(para) <> len(data): 4047 print 'data and para must be same length in aglm' 4048 return 4049 n = len(para) 4050 p = pstat.aunique(para) 4051 x = N.zeros((n, len(p))) # design matrix 4052 for l in range(len(p)): 4053 x[:, l] = N.equal(para, p[l]) 4054 b = N.dot( 4055 N.dot( 4056 LA.inv(N.dot( 4057 N.transpose(x), x)), # i.e., b=inv(X'X)X'Y 4058 N.transpose(x)), 4059 data) 4060 diffs = (data - N.dot(x, b)) 4061 s_sq = 1. / (n - len(p)) * N.dot(N.transpose(diffs), diffs) 4062 4063 if len(p) == 2: # ttest_ind 4064 c = N.array([1, -1]) 4065 df = n - 2 4066 fact = asum(1.0 / asum(x, 0)) # i.e., 1/n1 + 1/n2 + 1/n3 ... 4067 t = N.dot(c, b) / N.sqrt(s_sq * fact) 4068 probs = abetai(0.5 * df, 0.5, float(df) / (df + t * t)) 4069 return t, probs 4070 4071 def aF_oneway(*args): 4072 """ 4073 Performs a 1-way ANOVA, returning an F-value and probability given 4074 any number of groups. From Heiman, pp.394-7. 4075 4076 Usage: aF_oneway (*args) where *args is 2 or more arrays, one per 4077 treatment group 4078 Returns: f-value, probability 4079 """ 4080 na = len(args) # ANOVA on 'na' groups, each in it's own array 4081 means = [0] * na 4082 vars = [0] * na 4083 ns = [0] * na 4084 alldata = [] 4085 tmp = map(N.array, args) 4086 means = map(amean, tmp) 4087 vars = map(avar, tmp) 4088 ns = map(len, args) 4089 alldata = N.concatenate(args) 4090 bign = len(alldata) 4091 sstot = ass(alldata) - (asquare_of_sums(alldata) / float(bign)) 4092 ssbn = 0 4093 for a in args: 4094 ssbn = ssbn + asquare_of_sums(N.array(a)) / float(len(a)) 4095 ssbn = ssbn - (asquare_of_sums(alldata) / float(bign)) 4096 sswn = sstot - ssbn 4097 dfbn = na - 1 4098 dfwn = bign - na 4099 msb = ssbn / float(dfbn) 4100 msw = sswn / float(dfwn) 4101 f = msb / msw 4102 prob = fprob(dfbn, dfwn, f) 4103 return f, prob 4104 4105 def aF_value(ER, EF, dfR, dfF): 4106 """ 4107 Returns an F-statistic given the following: 4108 ER = error associated with the null hypothesis (the Restricted model) 4109 EF = error associated with the alternate hypothesis (the Full model) 4110 dfR = degrees of freedom the Restricted model 4111 dfF = degrees of freedom associated with the Restricted model 4112 """ 4113 return ((ER - EF) / float(dfR - dfF) / (EF / float(dfF))) 4114 4115 def outputfstats(Enum, Eden, dfnum, dfden, f, prob): 4116 Enum = round(Enum, 3) 4117 Eden = round(Eden, 3) 4118 dfnum = round(Enum, 3) 4119 dfden = round(dfden, 3) 4120 f = round(f, 3) 4121 prob = round(prob, 3) 4122 suffix = '' # for *s after the p-value 4123 if prob < 0.001: 4124 suffix = ' ***' 4125 elif prob < 0.01: 4126 suffix = ' **' 4127 elif prob < 0.05: 4128 suffix = ' *' 4129 title = [['EF/ER', 'DF', 'Mean Square', 'F-value', 'prob', '']] 4130 lofl = title + [[Enum, dfnum, round(Enum / float(dfnum), 3), f, prob, suffix 4131 ], [Eden, dfden, round(Eden / float(dfden), 3), '', '', '']] 4132 pstat.printcc(lofl) 4133 return 4134 4135 def F_value_multivariate(ER, EF, dfnum, dfden): 4136 """ 4137 Returns an F-statistic given the following: 4138 ER = error associated with the null hypothesis (the Restricted model) 4139 EF = error associated with the alternate hypothesis (the Full model) 4140 dfR = degrees of freedom the Restricted model 4141 dfF = degrees of freedom associated with the Restricted model 4142 where ER and EF are matrices from a multivariate F calculation. 4143 """ 4144 if type(ER) in [IntType, FloatType]: 4145 ER = N.array([[ER]]) 4146 if type(EF) in [IntType, FloatType]: 4147 EF = N.array([[EF]]) 4148 n_um = (LA.det(ER) - LA.det(EF)) / float(dfnum) 4149 d_en = LA.det(EF) / float(dfden) 4150 return n_um / d_en 4151 4152 ##################################### 4153 ####### ASUPPORT FUNCTIONS ######## 4154 ##################################### 4155 4156 def asign(a): 4157 """ 4158 Usage: asign(a) 4159 Returns: array shape of a, with -1 where a<0 and +1 where a>=0 4160 """ 4161 a = N.asarray(a) 4162 if ((type(a) == type(1.4)) or (type(a) == type(1))): 4163 return a - a - N.less(a, 0) + N.greater(a, 0) 4164 else: 4165 return N.zeros(N.shape(a)) - N.less(a, 0) + N.greater(a, 0) 4166 4167 def asum(a, dimension=None, keepdims=0): 4168 """ 4169 An alternative to the Numeric.add.reduce function, which allows one to 4170 (1) collapse over multiple dimensions at once, and/or (2) to retain 4171 all dimensions in the original array (squashing one down to size. 4172 Dimension can equal None (ravel array first), an integer (the 4173 dimension over which to operate), or a sequence (operate over multiple 4174 dimensions). If keepdims=1, the resulting array will have as many 4175 dimensions as the input array. 4176 4177 Usage: asum(a, dimension=None, keepdims=0) 4178 Returns: array summed along 'dimension'(s), same _number_ of dims if keepdims=1 4179 """ 4180 if type(a) == N.ndarray and a.dtype in [N.int_, N.short, N.ubyte]: 4181 a = a.astype(N.float_) 4182 if dimension == None: 4183 s = N.sum(N.ravel(a)) 4184 elif type(dimension) in [IntType, FloatType]: 4185 s = N.add.reduce(a, dimension) 4186 if keepdims == 1: 4187 shp = list(a.shape) 4188 shp[dimension] = 1 4189 s = N.reshape(s, shp) 4190 else: # must be a SEQUENCE of dims to sum over 4191 dims = list(dimension) 4192 dims.sort() 4193 dims.reverse() 4194 s = a * 1.0 4195 for dim in dims: 4196 s = N.add.reduce(s, dim) 4197 if keepdims == 1: 4198 shp = list(a.shape) 4199 for dim in dims: 4200 shp[dim] = 1 4201 s = N.reshape(s, shp) 4202 return s 4203 4204 def acumsum(a, dimension=None): 4205 """ 4206 Returns an array consisting of the cumulative sum of the items in the 4207 passed array. Dimension can equal None (ravel array first), an 4208 integer (the dimension over which to operate), or a sequence (operate 4209 over multiple dimensions, but this last one just barely makes sense). 4210 4211 Usage: acumsum(a,dimension=None) 4212 """ 4213 if dimension == None: 4214 a = N.ravel(a) 4215 dimension = 0 4216 if type(dimension) in [ListType, TupleType, N.ndarray]: 4217 dimension = list(dimension) 4218 dimension.sort() 4219 dimension.reverse() 4220 for d in dimension: 4221 a = N.add.accumulate(a, d) 4222 return a 4223 else: 4224 return N.add.accumulate(a, dimension) 4225 4226 def ass(inarray, dimension=None, keepdims=0): 4227 """ 4228 Squares each value in the passed array, adds these squares & returns 4229 the result. Unfortunate function name. :-) Defaults to ALL values in 4230 the array. Dimension can equal None (ravel array first), an integer 4231 (the dimension over which to operate), or a sequence (operate over 4232 multiple dimensions). Set keepdims=1 to maintain the original number 4233 of dimensions. 4234 4235 Usage: ass(inarray, dimension=None, keepdims=0) 4236 Returns: sum-along-'dimension' for (inarray*inarray) 4237 """ 4238 if dimension == None: 4239 inarray = N.ravel(inarray) 4240 dimension = 0 4241 return asum(inarray * inarray, dimension, keepdims) 4242 4243 def asummult(array1, array2, dimension=None, keepdims=0): 4244 """ 4245 Multiplies elements in array1 and array2, element by element, and 4246 returns the sum (along 'dimension') of all resulting multiplications. 4247 Dimension can equal None (ravel array first), an integer (the 4248 dimension over which to operate), or a sequence (operate over multiple 4249 dimensions). A trivial function, but included for completeness. 4250 4251 Usage: asummult(array1,array2,dimension=None,keepdims=0) 4252 """ 4253 if dimension == None: 4254 array1 = N.ravel(array1) 4255 array2 = N.ravel(array2) 4256 dimension = 0 4257 return asum(array1 * array2, dimension, keepdims) 4258 4259 def asquare_of_sums(inarray, dimension=None, keepdims=0): 4260 """ 4261 Adds the values in the passed array, squares that sum, and returns the 4262 result. Dimension can equal None (ravel array first), an integer (the 4263 dimension over which to operate), or a sequence (operate over multiple 4264 dimensions). If keepdims=1, the returned array will have the same 4265 NUMBER of dimensions as the original. 4266 4267 Usage: asquare_of_sums(inarray, dimension=None, keepdims=0) 4268 Returns: the square of the sum over dim(s) in dimension 4269 """ 4270 if dimension == None: 4271 inarray = N.ravel(inarray) 4272 dimension = 0 4273 s = asum(inarray, dimension, keepdims) 4274 if type(s) == N.ndarray: 4275 return s.astype(N.float_) * s 4276 else: 4277 return float(s) * s 4278 4279 def asumdiffsquared(a, b, dimension=None, keepdims=0): 4280 """ 4281 Takes pairwise differences of the values in arrays a and b, squares 4282 these differences, and returns the sum of these squares. Dimension 4283 can equal None (ravel array first), an integer (the dimension over 4284 which to operate), or a sequence (operate over multiple dimensions). 4285 keepdims=1 means the return shape = len(a.shape) = len(b.shape) 4286 4287 Usage: asumdiffsquared(a,b) 4288 Returns: sum[ravel(a-b)**2] 4289 """ 4290 if dimension == None: 4291 inarray = N.ravel(a) 4292 dimension = 0 4293 return asum((a - b)**2, dimension, keepdims) 4294 4295 def ashellsort(inarray): 4296 """ 4297 Shellsort algorithm. Sorts a 1D-array. 4298 4299 Usage: ashellsort(inarray) 4300 Returns: sorted-inarray, sorting-index-vector (for original array) 4301 """ 4302 n = len(inarray) 4303 svec = inarray * 1.0 4304 ivec = range(n) 4305 gap = n / 2 # integer division needed 4306 while gap > 0: 4307 for i in range(gap, n): 4308 for j in range(i - gap, -1, -gap): 4309 while j >= 0 and svec[j] > svec[j + gap]: 4310 temp = svec[j] 4311 svec[j] = svec[j + gap] 4312 svec[j + gap] = temp 4313 itemp = ivec[j] 4314 ivec[j] = ivec[j + gap] 4315 ivec[j + gap] = itemp 4316 gap = gap / 2 # integer division needed 4317 # svec is now sorted input vector, ivec has the order svec[i] = vec[ivec[i]] 4318 return svec, ivec 4319 4320 def arankdata(inarray): 4321 """ 4322 Ranks the data in inarray, dealing with ties appropritely. Assumes 4323 a 1D inarray. Adapted from Gary Perlman's |Stat ranksort. 4324 4325 Usage: arankdata(inarray) 4326 Returns: array of length equal to inarray, containing rank scores 4327 """ 4328 n = len(inarray) 4329 svec, ivec = ashellsort(inarray) 4330 sumranks = 0 4331 dupcount = 0 4332 newarray = N.zeros(n, N.float_) 4333 for i in range(n): 4334 sumranks = sumranks + i 4335 dupcount = dupcount + 1 4336 if i == n - 1 or svec[i] <> svec[i + 1]: 4337 averank = sumranks / float(dupcount) + 1 4338 for j in range(i - dupcount + 1, i + 1): 4339 newarray[ivec[j]] = averank 4340 sumranks = 0 4341 dupcount = 0 4342 return newarray 4343 4344 def afindwithin(data): 4345 """ 4346 Returns a binary vector, 1=within-subject factor, 0=between. Input 4347 equals the entire data array (i.e., column 1=random factor, last 4348 column = measured values. 4349 4350 Usage: afindwithin(data) data in |Stat format 4351 """ 4352 numfact = len(data[0]) - 2 4353 withinvec = [0] * numfact 4354 for col in range(1, numfact + 1): 4355 rows = pstat.linexand(data, col, pstat.unique(pstat.colex(data, 1))[0] 4356 ) # get 1 level of this factor 4357 if len(pstat.unique(pstat.colex(rows, 0))) < len( 4358 rows): # if fewer subjects than scores on this factor 4359 withinvec[col - 1] = 1 4360 return withinvec 4361 4362 ######################################################### 4363 ######################################################### 4364 ###### RE-DEFINE DISPATCHES TO INCLUDE ARRAYS ######### 4365 ######################################################### 4366 ######################################################### 4367 4368 ## CENTRAL TENDENCY: 4369 geometricmean = Dispatch( 4370 (lgeometricmean, (ListType, TupleType)), (ageometricmean, (N.ndarray,))) 4371 harmonicmean = Dispatch( 4372 (lharmonicmean, (ListType, TupleType)), (aharmonicmean, (N.ndarray,))) 4373 mean = Dispatch((lmean, (ListType, TupleType)), (amean, (N.ndarray,))) 4374 median = Dispatch((lmedian, (ListType, TupleType)), (amedian, (N.ndarray,))) 4375 medianscore = Dispatch( 4376 (lmedianscore, (ListType, TupleType)), (amedianscore, (N.ndarray,))) 4377 mode = Dispatch((lmode, (ListType, TupleType)), (amode, (N.ndarray,))) 4378 tmean = Dispatch((atmean, (N.ndarray,))) 4379 tvar = Dispatch((atvar, (N.ndarray,))) 4380 tstdev = Dispatch((atstdev, (N.ndarray,))) 4381 tsem = Dispatch((atsem, (N.ndarray,))) 4382 4383 ## VARIATION: 4384 moment = Dispatch((lmoment, (ListType, TupleType)), (amoment, (N.ndarray,))) 4385 variation = Dispatch( 4386 (lvariation, (ListType, TupleType)), (avariation, (N.ndarray,))) 4387 skew = Dispatch((lskew, (ListType, TupleType)), (askew, (N.ndarray,))) 4388 kurtosis = Dispatch( 4389 (lkurtosis, (ListType, TupleType)), (akurtosis, (N.ndarray,))) 4390 describe = Dispatch( 4391 (ldescribe, (ListType, TupleType)), (adescribe, (N.ndarray,))) 4392 4393 ## DISTRIBUTION TESTS 4394 4395 skewtest = Dispatch( 4396 (askewtest, (ListType, TupleType)), (askewtest, (N.ndarray,))) 4397 kurtosistest = Dispatch( 4398 (akurtosistest, (ListType, TupleType)), (akurtosistest, (N.ndarray,))) 4399 normaltest = Dispatch( 4400 (anormaltest, (ListType, TupleType)), (anormaltest, (N.ndarray,))) 4401 4402 ## FREQUENCY STATS: 4403 itemfreq = Dispatch( 4404 (litemfreq, (ListType, TupleType)), (aitemfreq, (N.ndarray,))) 4405 scoreatpercentile = Dispatch( 4406 (lscoreatpercentile, (ListType, TupleType)), (ascoreatpercentile, 4407 (N.ndarray,))) 4408 percentileofscore = Dispatch( 4409 (lpercentileofscore, (ListType, TupleType)), (apercentileofscore, 4410 (N.ndarray,))) 4411 histogram = Dispatch( 4412 (lhistogram, (ListType, TupleType)), (ahistogram, (N.ndarray,))) 4413 cumfreq = Dispatch( 4414 (lcumfreq, (ListType, TupleType)), (acumfreq, (N.ndarray,))) 4415 relfreq = Dispatch( 4416 (lrelfreq, (ListType, TupleType)), (arelfreq, (N.ndarray,))) 4417 4418 ## VARIABILITY: 4419 obrientransform = Dispatch( 4420 (lobrientransform, (ListType, TupleType)), (aobrientransform, 4421 (N.ndarray,))) 4422 samplevar = Dispatch( 4423 (lsamplevar, (ListType, TupleType)), (asamplevar, (N.ndarray,))) 4424 samplestdev = Dispatch( 4425 (lsamplestdev, (ListType, TupleType)), (asamplestdev, (N.ndarray,))) 4426 signaltonoise = Dispatch((asignaltonoise, (N.ndarray,)),) 4427 var = Dispatch((lvar, (ListType, TupleType)), (avar, (N.ndarray,))) 4428 stdev = Dispatch((lstdev, (ListType, TupleType)), (astdev, (N.ndarray,))) 4429 sterr = Dispatch((lsterr, (ListType, TupleType)), (asterr, (N.ndarray,))) 4430 sem = Dispatch((lsem, (ListType, TupleType)), (asem, (N.ndarray,))) 4431 z = Dispatch((lz, (ListType, TupleType)), (az, (N.ndarray,))) 4432 zs = Dispatch((lzs, (ListType, TupleType)), (azs, (N.ndarray,))) 4433 4434 ## TRIMMING FCNS: 4435 threshold = Dispatch((athreshold, (N.ndarray,)),) 4436 trimboth = Dispatch( 4437 (ltrimboth, (ListType, TupleType)), (atrimboth, (N.ndarray,))) 4438 trim1 = Dispatch((ltrim1, (ListType, TupleType)), (atrim1, (N.ndarray,))) 4439 4440 ## CORRELATION FCNS: 4441 paired = Dispatch((lpaired, (ListType, TupleType)), (apaired, (N.ndarray,))) 4442 lincc = Dispatch((llincc, (ListType, TupleType)), (alincc, (N.ndarray,))) 4443 pearsonr = Dispatch( 4444 (lpearsonr, (ListType, TupleType)), (apearsonr, (N.ndarray,))) 4445 spearmanr = Dispatch( 4446 (lspearmanr, (ListType, TupleType)), (aspearmanr, (N.ndarray,))) 4447 pointbiserialr = Dispatch( 4448 (lpointbiserialr, (ListType, TupleType)), (apointbiserialr, (N.ndarray,))) 4449 kendalltau = Dispatch( 4450 (lkendalltau, (ListType, TupleType)), (akendalltau, (N.ndarray,))) 4451 linregress = Dispatch( 4452 (llinregress, (ListType, TupleType)), (alinregress, (N.ndarray,))) 4453 4454 ## INFERENTIAL STATS: 4455 ttest_1samp = Dispatch( 4456 (lttest_1samp, (ListType, TupleType)), (attest_1samp, (N.ndarray,))) 4457 ttest_ind = Dispatch( 4458 (lttest_ind, (ListType, TupleType)), (attest_ind, (N.ndarray,))) 4459 ttest_rel = Dispatch( 4460 (lttest_rel, (ListType, TupleType)), (attest_rel, (N.ndarray,))) 4461 chisquare = Dispatch( 4462 (lchisquare, (ListType, TupleType)), (achisquare, (N.ndarray,))) 4463 ks_2samp = Dispatch( 4464 (lks_2samp, (ListType, TupleType)), (aks_2samp, (N.ndarray,))) 4465 mannwhitneyu = Dispatch( 4466 (lmannwhitneyu, (ListType, TupleType)), (amannwhitneyu, (N.ndarray,))) 4467 tiecorrect = Dispatch( 4468 (ltiecorrect, (ListType, TupleType)), (atiecorrect, (N.ndarray,))) 4469 ranksums = Dispatch( 4470 (lranksums, (ListType, TupleType)), (aranksums, (N.ndarray,))) 4471 wilcoxont = Dispatch( 4472 (lwilcoxont, (ListType, TupleType)), (awilcoxont, (N.ndarray,))) 4473 kruskalwallish = Dispatch( 4474 (lkruskalwallish, (ListType, TupleType)), (akruskalwallish, (N.ndarray,))) 4475 friedmanchisquare = Dispatch( 4476 (lfriedmanchisquare, (ListType, TupleType)), (afriedmanchisquare, 4477 (N.ndarray,))) 4478 4479 ## PROBABILITY CALCS: 4480 chisqprob = Dispatch( 4481 (lchisqprob, (IntType, FloatType)), (achisqprob, (N.ndarray,))) 4482 zprob = Dispatch((lzprob, (IntType, FloatType)), (azprob, (N.ndarray,))) 4483 ksprob = Dispatch((lksprob, (IntType, FloatType)), (aksprob, (N.ndarray,))) 4484 fprob = Dispatch((lfprob, (IntType, FloatType)), (afprob, (N.ndarray,))) 4485 betacf = Dispatch((lbetacf, (IntType, FloatType)), (abetacf, (N.ndarray,))) 4486 betai = Dispatch((lbetai, (IntType, FloatType)), (abetai, (N.ndarray,))) 4487 erfcc = Dispatch((lerfcc, (IntType, FloatType)), (aerfcc, (N.ndarray,))) 4488 gammln = Dispatch((lgammln, (IntType, FloatType)), (agammln, (N.ndarray,))) 4489 4490 ## ANOVA FUNCTIONS: 4491 F_oneway = Dispatch( 4492 (lF_oneway, (ListType, TupleType)), (aF_oneway, (N.ndarray,))) 4493 F_value = Dispatch( 4494 (lF_value, (ListType, TupleType)), (aF_value, (N.ndarray,))) 4495 4496 ## SUPPORT FUNCTIONS: 4497 incr = Dispatch((lincr, (ListType, TupleType, N.ndarray)),) 4498 sum = Dispatch((lsum, (ListType, TupleType)), (asum, (N.ndarray,))) 4499 cumsum = Dispatch((lcumsum, (ListType, TupleType)), (acumsum, (N.ndarray,))) 4500 ss = Dispatch((lss, (ListType, TupleType)), (ass, (N.ndarray,))) 4501 summult = Dispatch( 4502 (lsummult, (ListType, TupleType)), (asummult, (N.ndarray,))) 4503 square_of_sums = Dispatch( 4504 (lsquare_of_sums, (ListType, TupleType)), (asquare_of_sums, (N.ndarray,))) 4505 sumdiffsquared = Dispatch( 4506 (lsumdiffsquared, (ListType, TupleType)), (asumdiffsquared, (N.ndarray,))) 4507 shellsort = Dispatch( 4508 (lshellsort, (ListType, TupleType)), (ashellsort, (N.ndarray,))) 4509 rankdata = Dispatch( 4510 (lrankdata, (ListType, TupleType)), (arankdata, (N.ndarray,))) 4511 findwithin = Dispatch( 4512 (lfindwithin, (ListType, TupleType)), (afindwithin, (N.ndarray,))) 4513 4514 ###################### END OF NUMERIC FUNCTION BLOCK ##################### 4515 4516 ###################### END OF STATISTICAL FUNCTIONS ###################### 4517 4518 except ImportError: 4519 pass 4520