Simple statistics with SciPy

Introduction

Scipy and Numpy provide a host of functions for performing statistical calculations. This article describes how to perform a few simple statistical calculations, as an introduction to using Scipy.

The Scipy package is organized into several sub-packages. A sub-package must be explicitly imported before it can be used. For example, to use the functions in the scipy.stats sub-package, we must execute:

from scipy import stats

Scipy is built on top of Numpy and uses Numpy arrays and data types. Hence we can use all the array manipulation and indexing methods provided by Numpy. Scipy imports all functions in the Numpy package, and several commonly used functions from sub-packages, into the top level namespace. For example, the Numpy array function is available as scipy.array. Similarly, the function scipy.var is the same as numpy.var. This means that we don’t have to explicitly import Numpy.

In this article we will use the matplotlib library for creating plots.

By convention, the top level Scipy package is imported as sp and the Numpy package, if needed, is imported as np.

import scipy as sp
import numpy as np
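
As a quick illustration of this shared namespace, the following assumes a SciPy version, like the ones used in this article, that still re-exports Numpy at the top level (newer releases have removed some of these aliases):

>>> sp.array([1.0, 2.0, 3.0])   # the same function as np.array
array([ 1.,  2.,  3.])
>>> sp.var is np.var            # the very same function object
True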

Matplotlib is imported in the following manner:

import matplotlib as mpl
from matplotlib import pyplot as plt

To get help on any Scipy function or sub-package we can use the Scipy function info. For example, scipy.info(scipy.stats).

Descriptive statistics

To illustrate basic functions we will use some pseudo-random numbers from a Gaussian or Normal distribution. The function scipy.randn can be used to generate random numbers from a standard Gaussian. This function is the same as the numpy.random.randn function.

>>> s = sp.randn(100) # Hundred random numbers from a standard Gaussian
>>> print(len(s))
100

Since the value returned is a Numpy array we can use its methods to find descriptive statistics for the data.

>>> print("Mean : {0:8.6f}".format(s.mean()))
Mean           : -0.105170
>>> print("Minimum : {0:8.6f}".format(s.min()))
Minimum        : -3.091940
>>> print("Maximum : {0:8.6f}".format(s.max()))
Maximum        : 2.193828
>>> print("Variance : {0:8.6f}".format(s.var()))
Variance       : 0.863475
>>> print("Std. deviation : {0:8.6f}".format(s.std()))
Std. deviation : 0.929233

We can also perform the same calculations using the Numpy functions, which are available in the top level Scipy namespace.

>>> print("Mean : {0:8.6f}".format(sp.mean(s)))
Mean           : -0.105170
>>> print("Variance : {0:8.6f}".format(sp.var(s)))
Variance       : 0.863475
>>> print("Std. deviation : {0:8.6f}".format(sp.std(s)))
Std. deviation : 0.929233

The median can be calculated using:

>>> print("Median : {0:8.6f}".format(sp.median(s)))
Median          : 0.001611

Note that both variance calculations above use N in the denominator, i.e., they are biased estimators of the variance of the parent distribution. When we are merely trying to describe the data, however, these are the appropriate equations to use.

The scipy.var function takes a keyword parameter ddof that can be used to change the denominator. The denominator is set to N - ddof, and by default ddof = 0; set ddof = 1 to obtain the unbiased estimator of the variance of the parent distribution.

>>> print("Variance with N in denominator = {0:8.6f}".format(sp.var(s)))
Variance (N in denominator)      =  0.863475
>>> print("Variance with N - 1 in denominator = {0:8.6f}".format(sp.var(s,ddof=1)))
Variance (N - 1 in denominator)  =  0.872197

The scipy.stats sub-package has a function describe that will provide most of the above numbers. In this case, the variance has N - 1 in the denominator.

>>> from scipy import stats
>>> n, min_max, mean, var, skew, kurt = stats.describe(s)
>>> print("Number of elements: {0:d}".format(n))
Number of elements: 100
>>> print("Minimum: {0:8.6f} Maximum: {1:8.6f}".format(min_max[0], min_max[1]))
Minimum: -3.091940 Maximum: 2.193828
>>> print("Mean: {0:8.6f}".format(mean))
Mean: -0.105170
>>> print("Variance: {0:8.6f}".format(var))
Variance: 0.872197
>>> print("Skew : {0:8.6f}".format(skew))
Skew : -0.146500
>>> print("Kurtosis: {0:8.6f}".format(kurt))
Kurtosis: 0.117884

Probability distributions

Scipy has functions that deal with several common probability distributions. Currently there are 81 continuous probability distributions and 10 discrete distributions. These are defined in the scipy.stats sub-package. This package also defines several statistical functions.

Examples of continuous probability distributions are:

  • norm : Normal or Gaussian
  • chi2 : Chi-squared
  • t : Student’s T
  • uniform : Uniform

Examples of discrete probability distributions are:

  • binom : Binomial
  • poisson : Poisson

Examples of statistical functions:

  • mode : Modal value
  • moment : central moment
  • describe: descriptive statistics
  • histogram: histogram of data
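
For instance, the central moment function can be applied directly to the sample s generated earlier; the second central moment is the (biased) variance we computed above:

>>> print("2nd central moment : {0:8.6f}".format(stats.moment(s, moment=2)))
2nd central moment : 0.863475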

There are two ways of using probability distribution functions.

  1. Generate a “frozen” distribution object and then work with the methods of this object. A “frozen” distribution is one with its parameters set to specific values. For example a Gaussian with mean = 3.5 and standard deviation = 2.0.

    To create a “frozen” Gaussian or Normal distribution with mean = 3.5 and standard deviation = 2.0 we can use:

    >>> n = stats.norm(loc=3.5, scale=2.0)

    Then to draw a random number from this distribution we can execute:

    >>> n.rvs()
    5.21244643857

    For a normal distribution the keyword parameter loc defines the mean and the keyword parameter scale defines the standard deviation. For other distributions these correspond to appropriate parameters of the distribution; the parameters needed by a distribution are specified in the docstring of the distribution, which can be viewed with the Python help function. For example, to view the docstring for the Poisson distribution we can use help(stats.poisson) or scipy.info(stats.poisson). In the IPython shell we can use stats.poisson?. A short sketch of other methods of the frozen object is given after this list.

  2. Use the functions of the appropriate distribution class directly, passing the parameters that define the distribution with every call.

    For example, to draw a random number from a Gaussian or Normal distribution with mean = 3.5 and standard deviation = 2.0 we can use:

    >>> stats.norm.rvs(loc=3.5, scale=2.0)
    3.58150292503
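
Once a distribution has been frozen, its methods use the stored parameters, so loc and scale do not need to be passed again. A short sketch using the frozen Gaussian n created in the first item above (the cdf and ppf methods are described in the sections that follow):

>>> n.cdf(3.5)          # half the probability lies below the mean
0.5
>>> n.ppf(0.5)          # ... and the median equals the mean
3.5
>>> n.rvs(size=3)       # three random draws; values differ on each run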

Given a probability distribution, we are often interested in obtaining its properties. Functions are available for calculating the following important quantities.

Probability density function (PDF) and probability mass function (PMF)

For continuous variates the probability density function (PDF) is proportional to the probability of the variate being in a small interval about the given value. The probability of the variate being in a finite interval is the integral of the PDF over the interval.

The value of the PDF at any value of the variate can be obtained using the function pdf of the concerned distribution.

>>> # PDF of Gaussian of mean = 0.0 and std. deviation = 1.0 at 0.
>>> stats.norm.pdf(0, loc=0.0, scale=1.0)
0.3989422804014327

We can also pass an array of values to this function, to get the PDF at the specified values of the variate:

>>> stats.norm.pdf([-0.1, 0.0, 0.1], loc=0.0, scale=1.0)
array([ 0.39695255,  0.39894228,  0.39695255])
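
As noted above, the probability of the variate lying in a finite interval is the integral of the PDF over that interval. This can be checked numerically with scipy.integrate.quad; for a standard Gaussian, the probability of lying within one standard deviation of the mean is about 0.68:

>>> from scipy import integrate
>>> prob, err = integrate.quad(lambda x: stats.norm.pdf(x, loc=0.0, scale=1.0), -1.0, 1.0)
>>> print("P(-1 < X < 1) = {0:8.6f}".format(prob))
P(-1 < X < 1) = 0.682689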

For discrete variates the probability mass function (PMF) gives the probability of the variate having a value x.

For example, for a binomial distribution with p = 0.5 and number of trials n = 10 we can calculate the PMF using the following:

>>> tries = range(11) # 0 to 10
>>> print(stats.binom.pmf(tries, 10, 0.5))
[ 0.00097656  0.00976563  0.04394531  0.1171875   0.20507813  0.24609375
0.20507813  0.1171875   0.04394531  0.00976563  0.00097656]
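
Since these eleven values cover every possible number of successes, they sum to one, up to floating point rounding:

>>> print("{0:8.6f}".format(sum(stats.binom.pmf(tries, 10, 0.5))))
1.000000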

The following function plots the PMF of a binomial distribution with n tries and probability of success p.

def binom_pmf(n=4, p=0.5):
    # There are n+1 possible number of "successes": 0 to n.
    x = range(n+1)
    y = stats.binom.pmf(x, n, p)
    plt.plot(x,y,"o", color="black")

    # Format x-axis and y-axis.
    plt.axis([-(max(x)-min(x))*0.05, max(x)*1.05, -0.01, max(y)*1.10])
    plt.xticks(x)
    plt.title("Binomial distribution PMF for tries = {0} & p ={1}".format(
            n,p))
    plt.xlabel("Variate")
    plt.ylabel("Probability")

    plt.draw()
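
The function only draws on the current figure; to display the plot interactively we can call it and then plt.show(), assuming an interactive matplotlib backend:

binom_pmf(n=4, p=0.5)
plt.show()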

The plot for the binomial distribution with the default values n = 4 and p = 0.5 is shown below.

PMF for Binomial distribution

Cumulative distribution function (CDF)

The CDF gives the probability that the variate has a value less than or equal to the given value.

>>> stats.norm.cdf(0.0, loc=0.0, scale=1.0)
0.5

The following function plots the CDF for a Gaussian distribution of given mean and standard deviation.

def norm_cdf(mean=0.0, std=1.0):
    # 50 numbers between mean - 3σ and mean + 3σ
    x = sp.linspace(mean - 3*std, mean + 3*std, 50)
    # CDF at these values
    y = stats.norm.cdf(x, loc=mean, scale=std)

    plt.plot(x,y, color="black")
    plt.xlabel("Variate")
    plt.ylabel("Cumulative Probability")
    plt.title("CDF for Gaussian of mean = {0} & std. deviation = {1}".format(
               mean, std))
    plt.draw()

CDF for the Gaussian of mean = 0.0 and standard deviation = 1.0 is shown below.

CDF for Gaussian distribution

Percent point function (PPF) or inverse cumulative function

PPF is the inverse of the CDF. That is, PPF gives the value of the variate for which the cumulative probability has the given value.

>>> stats.norm.ppf(0.5, loc=0.0, scale=1.0)
0.0
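
The PPF is also useful for finding the limits of an interval that contains a given amount of probability. For example, the central 95% of a standard Gaussian lies between approximately -1.96 and 1.96:

>>> stats.norm.ppf([0.025, 0.975], loc=0.0, scale=1.0)
array([-1.95996398,  1.95996398])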

The following function plots the PPF for a Gaussian distribution.

def norm_ppf(mean=0.0, std=1.0):
    # 100 numbers between 0 and 1.0 i.e., probabilities.
    x = sp.linspace(0, 1.0, 100)
    # PPF at these values
    y = stats.norm.ppf(x, loc=mean, scale=std)

    plt.plot(x,y, color="black")
    plt.xlabel("Cumulative Probability")
    plt.ylabel("Variate")

    plt.title("PPF for Gaussian of mean = {0} & std. deviation = {1}".format(
               mean, std))
    plt.draw()

Example plot of PPF for Gaussian of mean = 0.0 and std. deviation = 1.0.

PPF for Gaussian distribution

Survival function (SF)

Survival function gives the probability that the variate has a value greater than the given value; SF = 1 - CDF.

>>> stats.norm.sf(0.0, loc=0.0, scale=1.0)
0.5
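
For example, the probability that a standard Gaussian variate exceeds one standard deviation above the mean, computed directly with sf and as 1 - CDF:

>>> print("{0:8.6f}".format(stats.norm.sf(1.0, loc=0.0, scale=1.0)))
0.158655
>>> print("{0:8.6f}".format(1.0 - stats.norm.cdf(1.0, loc=0.0, scale=1.0)))
0.158655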

A function for plotting the SF for a Gaussian with a given mean and std. deviation is shown below. An example plot for mean = 0.0 and std. deviation = 1.0 is also shown.

def norm_sf(mean=0.0, std=1.0):
    # 50 numbers between mean - 3σ and mean + 3σ
    x = sp.linspace(mean - 3*std, mean + 3*std, 50)
    # SF at these values
    y = stats.norm.sf(x, loc=mean, scale=std)

    plt.plot(x,y, color="black")
    plt.xlabel("Variate")
    plt.ylabel("Probability")
    plt.title("SF for Gaussian of mean = {0} & std. deviation = {1}".format(
               mean, std))
    plt.draw()

SF for Gaussian distribution.

Inverse survival function (ISF)

ISF is the inverse of the survival function. It gives the value of the variate for which the survival function has the given value.

>>> stats.norm.isf(0.5, loc=0.0, scale=1.0)
0.0
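
For example, the value that a standard Gaussian variate exceeds with probability 0.05 (i.e., its 95th percentile):

>>> print("{0:8.6f}".format(stats.norm.isf(0.05, loc=0.0, scale=1.0)))
1.644854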

Function to plot the ISF, and an example plot, for a Gaussian is given below.

def norm_isf(mean=0.0, std=1.0):
    # 100 numbers between 0 and 1.0
    x = sp.linspace(0, 1.0, 100)
    # ISF at these values
    y = stats.norm.isf(x, loc=mean, scale=std)

    plt.plot(x,y, color="black")
    plt.xlabel("Probability")
    plt.ylabel("Variate")

    plt.title("ISF for Gaussian of mean = {0} & std. deviation = {1}".format(
               mean, std))
    plt.draw()

ISF for Gaussian distribution.

Random variates

A random value or several random values can be drawn from a distribution using the rvs method of the appropriate class.

>>> # 100 random values from a Gaussian with mean = 0.0 and std. deviation = 1.0
>>> stats.norm.rvs(loc=0.0, scale=1.0, size=100)
>>> # 100 random values from a Poisson distribution with mu = 1.0
>>> stats.poisson.rvs(1.0, size=100)
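
Because the draws are random, the values change on every run, but the statistics of a large sample should approach the parameters of the distribution. A quick, illustrative check:

>>> sample = stats.poisson.rvs(1.0, size=10000)
>>> sample.mean()   # close to mu = 1.0; the exact value differs on each run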

The following plot shows a histogram of the number of events recorded in a series of 2-second intervals. The data are simulated by drawing 100 random values from a Poisson distribution with mean μ = 1.69. Also plotted is the probability mass function (PMF) of the Poisson distribution with mean μ = 1.69, scaled by the number of samples so that it can be compared with the counts. The Poisson distribution is discrete; the solid line is simply a smooth curve drawn through the points of the PMF. The mean μ and the standard deviation σ = sqrt(μ) are marked.

Poisson simulation

The code used to generate the plot is given below.

import scipy as sp
from scipy import stats
from matplotlib import pyplot as plt
from scipy import interpolate

def simulate_poisson():
    # Mean is 1.69
    mu = 1.69
    sigma = sp.sqrt(mu)
    mu_plus_sigma = mu + sigma

    # Draw random samples from the Poisson distribution, to simulate
    # the observed events per 2 second interval.
    counts = stats.poisson.rvs(mu, size=100)

    # Bins for the histogram: only the last bin is closed on both
    # sides. We need one more bin than the maximum value of count, so
    # that the maximum value goes in its own bin instead of getting
    # added to the previous bin.
    # [0,1), [1, 2), ..., [max(counts), max(counts)+1]
    bins = range(0, max(counts)+2)

    # Plot histogram.
    plt.hist(counts, bins=bins, align="left", histtype="step", color="black")

    # PMF of the Poisson distribution for the given mu, scaled by the
    # number of samples (100) so it is comparable to the histogram counts.
    x = range(0, 10)
    prob = stats.poisson.pmf(x, mu) * 100

    # Plot the PMF.
    plt.plot(x, prob, "o", color="black")

    # Draw a smooth curve through the PMF.
    l = sp.linspace(0,11,100)
    s = interpolate.spline(x, prob, l)
    plt.plot(l,s,color="black")

    plt.xlabel("Number of counts per 2 seconds")
    plt.ylabel("Number of occurrences (Poisson)")

    # Interpolated probability at x = μ + σ; for marking σ in the graph.
    xx = sp.searchsorted(l,mu_plus_sigma) - 1
    v = ((s[xx+1] -  s[xx])/(l[xx+1]-l[xx])) * (mu_plus_sigma - l[xx])
    v += s[xx]

    ax = plt.gca()
    # Reset axis range and ticks.
    ax.axis([-0.5,10, 0, 40])
    ax.set_xticks(range(1,10), minor=True)
    ax.set_yticks(range(0,41,8))
    ax.set_yticks(range(4,41,8), minor=True)

    # Draw arrow and then place an opaque box with μ in it.
    ax.annotate("", xy=(mu,29), xycoords="data", xytext=(mu, 13),
                textcoords="data", arrowprops=dict(arrowstyle="->",
                                                   connectionstyle="arc3"))
    bbox_props = dict(boxstyle="round", fc="w", ec="w")
    ax.text(mu, 21, r"$\mu$", va="center", ha="center",
            size=15, bbox=bbox_props)

    # Draw arrow and then place an opaque box with σ in it.
    ax.annotate("", xy=(mu,v), xytext=(mu_plus_sigma,v),
                arrowprops=dict(arrowstyle="<->", connectionstyle="arc3"))
    bbox_props = dict(boxstyle="round", fc="w", ec="w")
    ax.text(mu+(sigma/2.0), v, r"$\sigma$", va="center", ha="center",
            size=15, bbox=bbox_props)

    # Refresh plot and save figure.
    plt.draw()
    plt.savefig("simulate_poisson.png")

simulate_poisson()

More information

For more information on using Scipy consult the documentation at http://docs.scipy.org. The Scipy cookbook has several examples of using various sub-packages in Scipy; for example, fitting and evaluating polynomials. The Scipy mailing list is a good place to ask any questions on using Scipy. Planet Scipy has links to several Scipy related blogs.


8 Responses to Simple statistics with SciPy

  1. Pingback: [Link] Statistics with Python

  2. Nin says:

    Very nice job! This kind of examples were exactly what i was looking for!

  3. Chris says:

    Great job, thanks very much!

  4. E. John says:

    Beautiful. Thanks.

  5. Great overview, thanks!

  6. For statistics and associated graphics, the R environment has far more built-in tools than Python. R and its >5000 CRAN add-on packages is public domain with >100,000 statistical functions and several independent graphical systems … far larger than any other stat software, and its freely available (http://www.r-project.org). Professional data scientists recommend both R and Python for different purposes; see discussions at https://asaip.psu.edu/forums/software-forum. Here is an R script that reproduces the Poisson random variates plot shown above:

    mu <- 1.69 # user-supplied intensity
    counts <- rpois(100, mu) # Poisson random numbers
    temp <- hist(counts, breaks=seq(-0.5,9.5), plot=F) # save histogram values
    plot(temp$mids-0.5, temp$counts, type='s', xlim=c(-0.5,10), ylim=c(0,max(temp$counts)*1.05),
         main='', xlab='Number of counts per 2 seconds', ylab='Number of occurrences (Poisson)') # step plot
    points(temp$mids, temp$counts, pch=20) # add points
    lines(spline(temp$mids, temp$counts, n=80)) # add spline curve
    ytext1 <- temp$counts[trunc(mu)+2] # annotate with mean and arrow
    text(mu, ytext1/1.5, expression(mu))
    arrows(mu, ytext1/1.5+1, mu, ytext1-1, length=0.1)
    lines(c(mu,mu), c(ytext1/4,ytext1/1.5-1))
    ytext2 <- temp$counts[trunc(mu)+3] # annotate with sigma and arrows
    text(mu+sqrt(mu)/2, ytext2/1.5, expression(sigma))
    arrows(mu+sqrt(mu)/2+0.2, ytext2/1.5, mu+sqrt(mu), ytext2/1.5, length=0.07)
    arrows(mu+sqrt(mu)/2-0.2, ytext2/1.5, mu, ytext2/1.5, length=0.07)
    dev.copy2pdf(file='simulate_poisson.pdf') # save on disk

  7. Kumar says:

    Very nice introduction. Thanks
