Simple statistics with SciPy

Simple statistics with SciPy

Introduction

Scipy, and Numpy, provide a host of functions for performing statistical calculations. This article will describe ways of performing a few simple statistical calculations, as an introduction to using Scipy.

Scipy package is organized into several sub-packages. Before using any of these sub-packages, it must be explicitly imported. For example, to use functions in the scipy.stats package, we must execute:

from scipy import stats

Scipy is built on top of Numpy and uses Numpy arrays and data types. Hence we can use all the array manipulation and indexing methods provided by Numpy. Scipy imports all functions in the Numpy package, and several commonly used functions from sub-packages, into the top level namespace. For example, the Numpy array function is available as scipy.array. Similarly, the function scipy.var is the same as numpy.var. This means that we don’t have to explicitly import Numpy.

In this article we will use the matplotlib library for creating plots.

By convention, the top level Scipy package is imported as sp and the Numpy package, if needed, is imported as np.

import scipy as sp
import numpy as np

Matplotlib is imported in the following manner:

import matplotlib as mpl
from matplotlib import pyplot as plt

To get help on any Scipy function or sub-package we can use the Scipy function info. For example, scipy.info(scipy.stats).

Descriptive statistics

To illustrate basic functions we will use some pseudo-random numbers from a Gaussian or Normal distribution. The function scipy.randn can be used to generate random numbers from a standard Gaussian. This function is the same as the numpy.random.randn function.

>>> s = sp.randn(100) # Hundred random numbers from a standard Gaussian
>>> print len(s)
100

Since the value returned is a Numpy array we can use its methods to find descriptive statistics for the data.

>>> print("Mean : {0:8.6f}".format(s.mean()))
Mean           : -0.105170
>>> print("Minimum : {0:8.6f}".format(s.min()))
Minimum        : -3.091940
>>> print("Maximum : {0:8.6f}".format(s.max()))
Maximum        : 2.193828
>>> print("Variance : {0:8.6f}".format(s.var()))
Variance       : 0.863475
>>> print("Std. deviation : {0:8.6f}".format(s.std()))
Std. deviation : 0.929233

We can also use Numpy functions for performing the same calculations.

>>> print("Mean : {0:8.6f}".format(sp.mean(s)))
Mean           : -0.105170
>>> print("Variance : {0:8.6f}".format(sp.var(s)))
Variance       : 0.863475
>>> print("Std. deviation : {0:8.6f}".format(sp.std(s)))
Std. deviation : 0.929233

The median can be calculated using:

>>> print("Median : {0:8.6f}".format(sp.median(s)))
Median          : 0.001611

Note that both variance calculations above use N in the denominator i.e., they are biased estimators of the variance of the parent distribution. But when we are merely trying to describe the data, these are the appropriate equations to use.

The scipy.var function takes a keyword parameter ddof that can be used to change the denominator. The denominator is set as N - ddof and by default ddof = 0; set ddof = 1.0 to find the unbiased estimator of the variance of the parent distribution.

>>> print("Variance with N in denominator = {0:8.6f}".format(sp.var(s)))
Variance (N in denominator)      =  0.863475
>>> print("Variance with N - 1 in denominator = {0:8.6f}".format(sp.var(s,ddof=1)))
Variance (N - 1 in denominator)  =  0.872197

The scipy.stats sub-package has a function describe that will provide most of the above numbers. In this case, the variance has N - 1 in the denominator.

>>> from scipy import stats
>>> n, min_max, mean, var, skew, kurt = stats.describe(s)
>>> print("Number of elements: {0:d}".format(n))
Number of elements: 100
>>> print("Minimum: {0:8.6f} Maximum: {1:8.6f}".format(min_max[0], min_max[1]))
Minimum: -3.091940 Maximum: 2.193828
>>> print("Mean: {0:8.6f}".format(mean))
Mean: -0.105170
>>> print("Variance: {0:8.6f}".format(var))
Variance: 0.872197
>>> print("Skew : {0:8.6f}".format(skew))
Skew : -0.146500
>>> print("Kurtosis: {0:8.6f}".format(kurt))
Kurtosis: 0.117884

Probability distributions

Scipy has functions that deal with several common probability distributions. Currently there are 81 continuous probability distributions and 10 discrete distributions. These are defined in the scipy.stats sub-package. This package also defines several statistical functions.

Examples of continuous probability distributions are:

  • norm : Normal or Gaussian
  • chi2 : Chi-squared
  • t : Student’s T
  • uniform : Uniform

Examples of discrete probability distributions are:

  • binom : Binomial
  • poisson : Poisson

Examples of statistical functions:

  • mode : Modal value
  • moment : central moment
  • describe: descriptive statistics
  • histogram: histogram of data

There are two ways of using probability distribution functions.

  1. Generate a “frozen” distribution object and then work with the methods of this object. A “frozen” distribution is one with its parameters set to specific values. For example a Gaussian with mean = 3.5 and standard deviation = 2.0.

    To create a “frozen” Gaussian or Normal distribution with mean = 3.5 and standard deviation = 2.0 we can use.

    >>> n = stats.norm(loc=3.5, scale=2.0)

    Then to draw a random number from this distribution we can execute:

    >>> n.rvs()
    5.21244643857

    For a normal distribution the keyword parameter loc defines the mean and the keyword parameter scale defines the standard deviation. For other distributions these will correspond to appropriate parameters of the distribution; the parameters needed by a distribution is specified in the docstring of the distribution, which can be viewed with the Python help function. For example, to view the docstring for the Poisson distribution we can use help(stats.poisson) or scipy.info(stats.poisson). In the iPython shell we can use stats.poisson?.

  2. Use functions in the appropriate class by always passing the parameters that define the distribution, when calling functions associated with the distribution.

    For example, to draw a random number from a Gaussian or Normal distribution with mean = 3.5 and standard deviation = 2.0 we can use:

    >>> stats.norm.rvs(loc=3.5, scale=2.0)
    3.58150292503

Given a probability distribution function we are interested in obtaining properties of the distribution. Functions are available for calculating the following important properties.

Probability density function (PDF) and probability mass function (PMF)

For continuous variates the probability density function (PDF) is proportional to the probability of the variate being in a small interval about the given value. The probability of the variate being in a finite interval is the integral of the PDF over the interval.

The value of the PDF at any value of the variate can be obtained using the function pdf of the concerned distribution.

>>> # PDF of Gaussian of mean = 0.0 and std. deviation = 1.0 at 0.
>>> stats.norm.pdf(0, loc=0.0, scale=1.0)
0.3989422804014327

We can also pass an array of values to this function, to get the PDF at the specified values of the variate:

>>> stats.norm.pdf([-0.1, 0.0, 0.1], loc=0.0, scale=1.0)
array([ 0.39695255,  0.39894228,  0.39695255])

For discrete variates the probability mass function (PMF) gives the probability of the variate having a value x.

For example, for a binomial distribution with p = 0.5 and number of trials n = 10 we can calculate the PMF using the following:

>>> tries = range(11) # 0 to 10
>>> print(stats.binom.pmf(tries, 10, 0.5))
[ 0.00097656  0.00976563  0.04394531  0.1171875   0.20507813  0.24609375
0.20507813  0.1171875   0.04394531  0.00976563  0.00097656]

The following function plots the PMF of a binomial distribution with n tries and probability of success p.

def binom_pmf(n=4, p=0.5):
    # There are n+1 possible number of "successes": 0 to n.
    x = range(n+1)
    y = stats.binom.pmf(x, n, p)
    plt.plot(x,y,"o", color="black")

    # Format x-axis and y-axis.
    plt.axis([-(max(x)-min(x))*0.05, max(x)*1.05, -0.01, max(y)*1.10])
    plt.xticks(x)
    plt.title("Binomial distribution PMF for tries = {0} & p ={1}".format(
            n,p))
    plt.xlabel("Variate")
    plt.ylabel("Probability")

    plt.draw()

The plot for the binomial distribution with n = 0 and p = 0.5 is show below.

PMF for Binomial distribution

Cumulative density function (CDF)

CDF gives the probability that the variate has a value less than or equal to the given value.

>>> stats.norm.cdf(0.0, loc=0.0, scale=1.0)
0.5

The following function plots the CDF for a Gaussian distribution of given mean and standard deviation.

def norm_cdf(mean=0.0, std=1.0):
    # 50 numbers between -3σ and 3σ
    x = sp.linspace(-3*std, 3*std, 50)
    # CDF at these values
    y = stats.norm.cdf(x, loc=mean, scale=std)

    plt.plot(x,y, color="black")
    plt.xlabel("Variate")
    plt.ylabel("Cumulative Probability")
    plt.title("CDF for Gaussian of mean = {0} & std. deviation = {1}".format(
               mean, std))
    plt.draw()

CDF for the Gaussian of mean = 0.0 and standard deviation = 1.0 is shown below.

CDF for Gaussian distribution

Percent point function (PPF) or inverse cumulative function

PPF is the inverse of the CDF. That is, PPF gives the value of the variate for which the cumulative probability has the given value.

>>> stats.norm.ppf(0.5, loc=0.0, scale=1.0)
0.0

The following function plots the PPF for a Gaussian distribution.

def norm_ppf(mean=0.0, std=1.0):
    # 100 numbers between 0 and 1.0 i.e., probabilities.
    x = sp.linspace(0, 1.0, 100)
    # PPF at these values
    y = stats.norm.ppf(x, loc=mean, scale=std)

    plt.plot(x,y, color="black")
    plt.xlabel("Cumulative Probability")
    plt.ylabel("Variate")

    plt.title("PPF for Gaussian of mean = {0} & std. deviation = {1}".format(
               mean, std))
    plt.draw()

Example plot of PPF for Gaussian of mean = 0.0 and std. deviation = 1.0.

PPF for Gaussian distribution

Survival function (SF)

Survival function gives the probability that the variate has a value greater than the given value; SF = 1 - CDF.

>>> stats.norm.sf(0.0, loc=0.0, scale=1.0)
0.5

A function for plotting the SF for a Gaussian with a given mean and std. deviation is shown below. An example plot for mean = 0.0 and std. deviation = 1.0 is also shown.

def norm_sf(mean=0.0, std=1.0):
    # 50 numbers between -3σ and 3σ
    x = sp.linspace(-3*std, 3*std, 50)
    # SF at these values
    y = stats.norm.sf(x, loc=mean, scale=std)

    plt.plot(x,y, color="black")
    plt.xlabel("Variate")
    plt.ylabel("Probability")
    plt.title("SF for Gaussian of mean = {0} & std. deviation = {1}".format(
               mean, std))
    plt.draw()

SF for Gaussian distribution.

Inverse survival function (ISF)

ISF is the inverse of the survival function. It gives the value of the variate for which the survival function has the given value.

>>> stats.norm.isf(0.5, loc=0.0, scale=1.0)
0.0

Function to plot the ISF, and an example plot, for a Gaussian is given below.

def norm_isf(mean=0.0, std=1.0):
    # 100 numbers between 0 and 1.0
    x = sp.linspace(0, 1.0, 100)
    # PPF at these values
    y = stats.norm.isf(x, loc=mean, scale=std)

    plt.plot(x,y, color="black")
    plt.xlabel("Probability")
    plt.ylabel("Variate")

    plt.title("ISF for Gaussian of mean = {0} & std. deviation = {1}".format(
               mean, std))
    plt.draw()

ISF for Gaussian distribution.

Random variates

A random value or several random values can be drawn from a distribution using the rvs method of the appropriate class.

>>> # 100 random values from a Poisson distribution with mu = 1.0
>>> stats.norm.rvs(loc=0.0, scale=1.0, size=100)
>>> # 100 random values from a Poisson distribution with mu = 1.0
>>> stats.poisson.rvs(1.0, size=100)

The following plot shows the histogram of events recorded in a series of 2 second intervals. The data is simulated by drawing 100 random events from a Poisson distribution with mean μ = 1.69. Also plotted is the Probability Mass Function (PMF) of the Poisson distribution with mean μ = 1.69. Poisson is a discrete distribution and the solid line is just a smooth line drawn through the points of the Poisson distribution PMF. The mean μ and the standard deviation σ = sqrt(μ) are marked.

Poisson simulation

The code used to generate the plot is given below.

import scipy as sp
from scipy import stats
from matplotlib import pyplot as plt
from scipy import interpolate

def simulate_poisson():
    # Mean is 1.69
    mu = 1.69
    sigma = sp.sqrt(mu)
    mu_plus_sigma = mu + sigma

    # Draw random samples from the Poisson distribution, to simulate
    # the observed events per 2 second interval.
    counts = stats.poisson.rvs(mu, size=100)

    # Bins for the histogram: only the last bin is closed on both
    # sides. We need one more bin than the maximum value of count, so
    # that the maximum value goes in its own bin instead of getting
    # added to the previous bin.
    # [0,1), [1, 2), ..., [max(counts), max(counts)+1]
    bins = range(0, max(counts)+2)

    # Plot histogram.
    plt.hist(counts, bins=bins, align="left", histtype="step", color="black")

    # Create Poisson distribution for given mu.
    x = range(0,10)
    prob = stats.poisson.pmf(x, mu)*100 

    # Plot the PMF.
    plt.plot(x, prob, "o", color="black")

    # Draw a smooth curve through the PMF.
    l = sp.linspace(0,11,100)
    s = interpolate.spline(x, prob, l)
    plt.plot(l,s,color="black")

    plt.xlabel("Number of counts per 2 seconds")
    plt.ylabel("Number of occurrences (Poisson)")

    # Interpolated probability at x = μ + σ; for marking σ in the graph.
    xx = sp.searchsorted(l,mu_plus_sigma) - 1
    v = ((s[xx+1] -  s[xx])/(l[xx+1]-l[xx])) * (mu_plus_sigma - l[xx])
    v += s[xx]

    ax = plt.gca()
    # Reset axis range and ticks.
    ax.axis([-0.5,10, 0, 40])
    ax.set_xticks(range(1,10), minor=True)
    ax.set_yticks(range(0,41,8))
    ax.set_yticks(range(4,41,8), minor=True)

    # Draw arrow and then place an opaque box with μ in it.
    ax.annotate("", xy=(mu,29), xycoords="data", xytext=(mu, 13),
                textcoords="data", arrowprops=dict(arrowstyle="->",
                                                   connectionstyle="arc3"))
    bbox_props = dict(boxstyle="round", fc="w", ec="w")
    ax.text(mu, 21, r"$\mu$", va="center", ha="center",
            size=15, bbox=bbox_props)

    # Draw arrow and then place an opaque box with σ in it.
    ax.annotate("", xy=(mu,v), xytext=(mu_plus_sigma,v),
                arrowprops=dict(arrowstyle="<->", connectionstyle="arc3"))
    bbox_props = dict(boxstyle="round", fc="w", ec="w")
    ax.text(mu+(sigma/2.0), v, r"$\sigma$", va="center", ha="center",
            size=15, bbox=bbox_props)

    # Refresh plot and save figure.
    plt.draw()
    plt.savefig("simulate_poisson.png")

simulate_poisson()

More information

For more information on using Scipy consult the documentation at http://docs.scipy.org. The Scipy cookbook has several examples of using various sub-packages in Scipy; for example, fitting and evaluating polynomials. The Scipy mailing list is a good place to ask any questions on using Scipy. Planet Scipy has links to several Scipy related blogs.

Advertisements
This entry was posted in Python and tagged , . Bookmark the permalink.

15 Responses to Simple statistics with SciPy

  1. Pingback: [Link] Statistics with Python

  2. Nin says:

    Very nice job! This kind of examples were exactly what i was looking for!

  3. Chris says:

    Great job, thanks very much!

  4. E. John says:

    Beautiful. Thanks.

  5. Great overview, thanks!

  6. For statistics and associated graphics, the R environment has far more built-in tools than Python. R and its >5000 CRAN add-on packages is public domain with >100,000 statistical functions and several independent graphical systems … far larger than any other stat software, and its freely available (http://www.r-project.org). Professional data scientists recommend both R and Python for different purposes; see discussions at https://asaip.psu.edu/forums/software-forum. Here is an R script that reproduces the Poisson random variates plot shown above:

    mu <- 1.69 # user-supplied intensity
    counts <- rpois(100, mu) # Poisson random numbers
    temp <- hist(counts, breaks=seq(-0.5,9.5), plot=F) # save histogram values
    plot(temp$mids-0.5, temp$counts, type='s', xlim=c(-0.5,10), ylim=c(0,max(temp$counts)*1.05),
    + main='', xlab='Number of counts per 2 seconds', ylab='Number of occurrences (Poisson)')
    + # step plot
    points(temp$mids, temp$counts, pch=20) # add points
    lines(spline(temp$mids, temp$counts, n=80)) # add spline curve
    ytext1 <- temp$counts[trunc(mu)+2] # annotate with mean and arrow
    text(mu, ytext1/1.5, expression(mu))
    arrows(mu, ytext1/1.5+1, mu, ytext1-1, length=0.1)
    lines(c(mu,mu), c(ytext1/4,ytext1/1.5-1))
    ytext2 <- temp$counts[trunc(mu)+3] # annotate with sigma and arrows
    text(mu+sqrt(mu)/2, ytext2/1.5, expression(sigma))
    arrows(mu+sqrt(mu)/2+0.2, ytext2/1.5, mu+sqrt(mu), ytext2/1.5, length=0.07)
    arrows(mu+sqrt(mu)/2-0.2, ytext2/1.5, mu, ytext2/1.5, length=0.07)
    dev.copy2pdf(file='simulate_poisson.pdf') # save on disk

  7. Kumar says:

    Very nice introduction. Thanks

  8. nemackikutak says:

    Very nice! Perfect examples!

  9. Anthony The Koala says:

    This is about using the norm.pdf and computing the probability between two zvalues. I am having trouble because I get 1000 times the probability.

    Import the various packages

    import numpy as np
    from numpy import arange
    from matplotlib import pyplot
    from scipy.stats import norm
    

    First I generate the values of x between -3 and 3

    x_axis = arange(-3,3,0.001)

    Then I generate the gaussian pdf with mean of 0 and stddev of 1 associated with the particular x_axis

    y_axis = norm.pdf(x_axis,0,1
    

    I want to find the probability of z between -1 and 1

    selected_values = y_axis[(x_axis >= -1 ) & (x_axis <= 1)]   

    We that the total area of a gaussian distribution with mean = 0 and stdev = 1 between z=-1 and z=1 is approximately 0.682.
    But I get the total area to be 682, 1000 times greater than the expected value.

    sum(selected_values)
    682.68    # instead of 0.682
    

    We know that the area between z=+-2 is 0.95 approx and the area between z=+-3 to be 0.99 approx.
    BUT when I compute the respective probabilities, I get 953 and 997

    #Computing the area between z=+-2 expect area to be 0.953
    selected_values = y_axis[(x_axis >= -2 ) & (x_axis = -3 ) & (x_axis <= 3)]   
    sum(selected_values)
    997 #instead of 0.997
    

    Question I get similar numbers to the approximate area under the guassian curve BUT the magnitude of the calculations is out by 1000. i.e. instead of 0.682, 0.953 and 0.997 I get 682, 953 and 997.

  10. Anthony The Koala says:

    Apologies for the last part, but the question remains the same is why I get my results to be 1000 times the well-known approxiamations to z =+-2 and z=+-3

    #for z=+-2 std devs
    selected_values = y_axis[(x_axis >= -2 ) & (x_axis = -3 ) & (x_axis <= 3)]   
    sum(selected_values)
    997 #instead of 0.997
    

    I also forgot to say thanks for the previous post.

    Thanks
    Anthony of Sydney

  11. Anthony The Koala says:

    Apologies again, it appears that if I have more than a few lines of code, somehow the format gobbles it up and produces non-understandibie code. Broke my code into two parts..

    #For z = +-2, computing the area under the z curve
    selected_values = y_axis[(x_axis >= -2 ) & (x_axis <= 2)]   
    sum(selected_values)
    953.7  #instead of 0.9537
    

    AND

     
    #for z=+- 3 std devs computing the area under the z curve
    selected_values = y_axis[(x_axis >= -3 ) & (x_axis <= 3)]   
    sum(selected_values)
    997 #instead of 0.997
    

    My result is 1000 times the magnitude of the well known approximations for z=+-2 and z=+-3. Similarly my result was 10000 times the mangnitude of the well known approximations for z=+-1.

    Sorry and thanks,
    Anthony of Sydney

  12. Prasanth says:

    Hello,

    For a continuous probability distribution probability is area under curve and not just the sum of the pdf. An approximate way to calculate area is to multiply each value of pdf with the distance between the x values. In the example you give the distance between x values is 0.001 and this will give the correct answer.

    Regards,
    Prasanth

    • Anthony The Koala says:

      Dear Prasanth,
      Thank you for this answer, it is appreciated. I knew there was a factor of 1000 because of the distance between the x-values. You put it very nicely. I knew from elementary statistics at uni that it was the area under the curve. Again you put it very nicely.
      Many thanks again,
      Anthony of Sydney NSW

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s