Bayesian data analysis with PyMC3


Thomas Wiecki


Quantopian Inc.



About me

  • PhD candidate at Brown studying decision making using Bayesian modeling.
  • Quantitative Researcher at Quantopian Inc: Building the world's first algorithmic trading platform in the web browser.

Why should you care about Bayesian Data Analysis?

In [1]:
from IPython.display import Image
import prettyplotlib as ppl
from prettyplotlib import plt
import numpy as np
from matplotlib import rc
rc('font',**{'family':'sans-serif','sans-serif':['Helvetica'], 'size': 22})
rc('xtick', labelsize=14) 
rc('ytick', labelsize=14)
## for Palatino and other serif fonts use:
#rc('font',**{'family':'serif','serif':['Palatino']})
rc('text', usetex=True)
%matplotlib inline
In [2]:
Image('backbox_ml.png')
Out[2]:
  • Blackbox models not good at conveying what they have learned.
In [4]:
Image('openbox_pp.png')
Out[4]:

Probabilistic Programming

  • Model unknown causes of a phenomenon as random variables.
  • Write a programmatic story of how unknown causes result in observable data.
  • Use Bayes formula to invert generative model to infer unknown causes.

Random Variables as Probability Distributions

  • Represents our beliefs about an unknown state.
  • Probability distribution assigns a probability to each possible state.
  • Not a single number (e.g. most likely state).

Coin-flipping experiment.

  • Given multiple flips, what is the probability of getting heads?
  • Maximum Likelihood answer:

$$\frac{\# \text{heads}}{\text{total throws}}$$

  • However:

$$\frac{50}{100} = \frac{1}{2}$$

  • The same point estimate results from 1 head in 2 throws, yet our confidence should be very different (see the quick check below).
  • Clearly something is missing!
  • Quantification of uncertainty.
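A quick illustrative check: the maximum likelihood estimate is identical for 5 heads in 10 throws and 500 heads in 1000 throws, while a Beta posterior under a flat prior (the Bayesian treatment developed below) shows very different uncertainty.

In []:
from scipy import stats

for heads, throws in [(5, 10), (500, 1000)]:
    ml_estimate = heads / float(throws)           # maximum likelihood: #heads / total throws
    tails = throws - heads
    posterior = stats.beta(1 + heads, 1 + tails)  # posterior under a uniform Beta(1, 1) prior
    lo, hi = posterior.interval(0.95)
    print "%d/%d: ML estimate %.2f, 95%% interval (%.2f, %.2f)" % (heads, throws, ml_estimate, lo, hi)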

Moreover...

  • Consider a single flip which comes up heads:

$$ P(\text{heads}) = \frac{1}{1} = 1 $$

  • Again, this doesn't seem right.
  • Incorporate prior knowledge.
In [2]:
from scipy import stats
# grid of equally spaced hypotheses for the chance of heads
x_coin = np.linspace(0, 1, 101)

$$ \text{Express the probability of heads as a random variable } \theta$$

In [4]:
import prettyplotlib as ppl
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel=r'Hypothesis for chance of heads', 
            ylabel=r'Probability of hypothesis', 
            title=r'Prior probability distribution after no coin tosses')
ppl.plot(ax, x_coin, stats.beta(2, 2).pdf(x_coin), linewidth=3.)
ax.set_xticklabels([r'0\%', r'20\%', r'40\%', r'60\%', r'80\%', r'100\%']);
# fig.savefig('coin1.png')
In [98]:
stats.beta(2,2).rvs(1)
Out[98]:
array([ 0.37833632])
In [21]:
import random
import numpy as np

def choose_coin():
    return stats.beta(2, 2).rvs(1) # random.uniform(0,1) # np.random.normal(0,1) # 
# pylab.hist([choose_coin() for dummy in range(100000)], normed=True, bins=100)
successes = []
returns = 0.
for i in range(10000):
    prob_heads = choose_coin()
    results = [random.uniform(0,1) < prob_heads for dummy in range(10)]
    if results.count(True) == 9: 
        successes.append(prob_heads)
        if random.uniform(0,1) < prob_heads:
            returns += -1
        else:
            returns += 1
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
r = ax.hist(np.array(successes), normed=True, bins=20)
print len(successes)
print "Average return", returns / len(successes)
719
Average return -0.613351877608

$$ \theta \sim \text{Beta}(2, 2) $$ $$ P(\theta) = \text{Beta}(2, 2) $$

In [108]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Hypothesis for chance of heads', 
            ylabel='Probability of hypothesis', 
            title='Posterior probability distribution after first heads')
ppl.plot(ax, x_coin, stats.beta(3, 2).pdf(x_coin), linewidth=3.)  # posterior after one heads: Beta(3, 2)
ax.set_xticklabels([r'0\%', r'20\%', r'40\%', r'60\%', r'80\%', r'100\%']);
plt.savefig('coin2.png')

$$ P(\theta | h=1) = \text{Beta}(3, 2) $$

In [10]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Hypothesis for chance of heads', 
            ylabel='Probability of hypothesis', 
            title='Posterior probability distribution after 1 head, 1 tail')
ppl.plot(ax, x_coin, stats.beta(3, 3).pdf(x_coin), linewidth=3.)
ax.set_xticklabels(['0\%', '20\%', '40\%', '60\%', '80\%', '100\%']);
fig.savefig('coin3.png')
In [11]:
Image('coin3.png')
Out[11]:

$$ P(\theta | [h=1, t=1]) = \text{Beta}(3, 3) $$

In [12]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Hypothesis for chance of heads', 
            ylabel='Probability of hypothesis', 
            title='Posterior probability distribution after 20 heads and 20 tails')
ppl.plot(ax, x_coin, stats.beta(22, 22).pdf(x_coin), linewidth=3.)
ax.set_xticklabels(['0\%', '20\%', '40\%', '60\%', '80\%', '100\%']);
fig.savefig('coin4.png')
In [13]:
Image('coin4.png')
Out[13]:

$$ P(\theta | [h=20, t=20]) = \text{Beta}(22, 22) $$

Bayes Formula

$$P(\theta| \text{data}) \propto P(\theta) \ast P(\text{data} |\theta)$$ $$\text{posterior} \propto \text{prior} \ast \text{likelihood}$$

$\theta$: Parameters of the model (chance of getting heads).
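As a minimal sketch of this proportionality, the posterior for the coin example can be evaluated numerically on a grid of $\theta$ values, multiplying the Beta(2, 2) prior from above by the likelihood of the observed flips (for a single parameter this direct computation is feasible):

In []:
from scipy import stats
import numpy as np

theta = np.linspace(0, 1, 101)                  # grid of hypotheses for the chance of heads
prior = stats.beta(2, 2).pdf(theta)             # prior used in the coin example above
heads, tails = 1, 0                             # data: a single heads
likelihood = theta**heads * (1 - theta)**tails  # binomial likelihood (up to a constant)
posterior = prior * likelihood                  # posterior is proportional to prior * likelihood
posterior /= np.trapz(posterior, theta)         # normalize so it integrates to 1
# matches the analytic result Beta(2 + heads, 2 + tails) = Beta(3, 2)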

  • Except in simple cases, the posterior is impossible to compute analytically.
  • Blackbox approximation algorithm: Markov chain Monte Carlo (MCMC) instead draws samples from the posterior.
In [14]:
from scipy import stats
fig = ppl.plt.figure(figsize=(14, 6))
ax1 = fig.add_subplot(121, title='What we want', ylim=(0, .5), xlabel=r'\theta', ylabel=r'P(\theta)')
ppl.plot(ax1, np.linspace(-4, 4, 100), stats.norm.pdf(np.linspace(-4, 4, 100)), linewidth=4.)
ax2 = fig.add_subplot(122, title='What we get', xlim=(-4, 4), ylim=(0, 1800), xlabel=r'\theta', ylabel='\# of samples')
ax2.hist(np.random.randn(10000), bins=20);

Approximating the posterior with MCMC sampling

In [15]:
Image('wantget.png')
Out[15]:
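As an illustrative sketch of what drawing samples from the posterior means, a random-walk Metropolis sampler for the coin posterior can be written in a few lines (PyMC3, introduced next, provides far more efficient samplers such as NUTS):

In []:
import numpy as np
from scipy import stats

def log_post(theta, heads=20, tails=20):
    # unnormalized log posterior: Beta(2, 2) prior plus binomial likelihood
    if theta <= 0 or theta >= 1:
        return -np.inf
    return stats.beta(2, 2).logpdf(theta) + heads * np.log(theta) + tails * np.log(1 - theta)

samples = []
theta = 0.5
for i in range(5000):
    proposal = theta + 0.1 * np.random.randn()  # propose a small random step
    # accept with probability min(1, posterior(proposal) / posterior(current))
    if np.log(np.random.rand()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)
# a histogram of `samples` approximates the Beta(22, 22) posterior shown earlier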

PyMC3

  • Probabilistic Programming framework written in Python.
  • Allows for construction of probabilistic models using intuitive syntax.
  • Features advanced MCMC samplers.
  • Fast: Just-in-time compiled by Theano.
  • Extensible: easily incorporates custom MCMC algorithms and unusual probability distributions.

Linear Models

  • Assumes a linear relationship between two variables.
  • E.g. the relationship between the price of gold and the price of gold miners.
In [16]:
size = 200
true_intercept = 1
true_slope = 2

x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * x
# add noise
y = true_regression_line + np.random.normal(scale=.5, size=size)

data = dict(x=x, y=y)
In [17]:
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel='Value of gold', ylabel='Value of gold miners', title='Synthetic data and underlying model')
ppl.scatter(ax, x, y, label='sampled data')
ppl.plot(ax, x, true_regression_line, label='true regression line', linewidth=4.)
ax.legend(loc=2);
fig.savefig('synth_data.png')
In [18]:
Image('synth_data.png')
Out[18]:

Linear Regression

$$ y_i = \alpha + \beta \ast x_i + \epsilon $$

with $$ \epsilon \sim \mathcal{N}(0, \sigma^2) $$

Probabilistic Reformulation

$$ y_i \sim \mathcal{N}(\alpha + \beta \ast x_i, \sigma^2) $$

Priors

$$ \alpha \sim \mathcal{N}(0, 20^2) $$ $$ \beta \sim \mathcal{N}(0, 20^2) $$ $$ \sigma \sim \mathcal{U}(0, 20) $$

Constructing the model in PyMC3

In [19]:
import pymc as pm

with pm.Model() as model: # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    alpha = pm.Normal('alpha', mu=0, sd=20)
    beta = pm.Normal('beta', mu=0, sd=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)
    
    # Define linear regression
    y_est = alpha + beta * x
    
    # Define likelihood
    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)
    
    # Inference!
    start = pm.find_MAP() # Find starting value by optimization
    step = pm.NUTS(state=start) # Instantiate MCMC sampling algorithm
    trace = pm.sample(2000, step, start=start, progressbar=False) # draw 2000 posterior samples using NUTS sampling

Convenience function glm()

In [20]:
with pm.Model() as model:
    # specify glm and pass in data. The resulting linear model, its likelihood and
    # all its parameters are automatically added to our model.
    pm.glm.glm('y ~ x', data)
    step = pm.NUTS() # Instantiate MCMC sampling algorithm
    trace = pm.sample(2000, step, progressbar=False) # draw 2000 posterior samples using NUTS sampling

Posterior

In [21]:
fig = pm.traceplot(trace, lines={'alpha': 1, 'beta': 2, 'sigma': .5});
In [22]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x, y, label='data')
from pymc import glm
glm.plot_posterior_predictive(trace, samples=100, 
                              label='posterior predictive regression lines')
ppl.plot(ax, x, true_regression_line, label='true regression line', linewidth=5.)
ax.legend(loc=0);
fig.savefig('ppc1.png')
In [22]:
 
In [23]:
Image('ppc1.png')
Out[23]:

Robust Regression

In [24]:
# Add outliers
x_out = np.append(x, [.1, .15, .2, .25, .25])
y_out = np.append(y, [8, 6, 9, 7, 9])

data_out = dict(x=x_out, y=y_out)
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111,  xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x_out, y_out, label='data')
In [25]:
with pm.Model() as model:
    glm.glm('y ~ x', data_out)
    trace = pm.sample(2000, pm.NUTS(), progressbar=False)
In [26]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111,  xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x_out, y_out, label='data')
glm.plot_posterior_predictive(trace, samples=100, 
                              label='posterior predictive regression lines')
ppl.plot(ax, x, true_regression_line, 
         label='true regression line', linewidth=5.)

plt.legend(loc=0);
fig.savefig('ppc2.png')
In [27]:
Image('ppc2.png')
Out[27]:
In [28]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
normal_dist = pm.Normal.dist(mu=0, sd=1)
t_dist = pm.T.dist(mu=0, lam=1, nu=1)
x_eval = np.linspace(-8, 8, 300)
ppl.plot(ax, x_eval, pm.theano.tensor.exp(normal_dist.logp(x_eval)).eval(), label='Normal', linewidth=2.)
ppl.plot(ax, x_eval, pm.theano.tensor.exp(t_dist.logp(x_eval)).eval(), label='Student T', linewidth=2.)
plt.xlabel('x')
plt.ylabel('Probability density')
plt.legend();
fig.savefig('t-dist.png')

Fit strongly biased by outliers

  • The Normal distribution has very light tails.
  • It is therefore very sensitive to outliers.
  • Instead, use a Student T distribution with heavier tails.
In [29]:
Image('t-dist.png')
Out[29]:
In [30]:
with pm.Model() as model_robust:
    family = pm.glm.families.T()
    pm.glm.glm('y ~ x', data_out, family=family)
    
    trace_robust = pm.sample(2000, pm.NUTS(), progressbar=False)
In [31]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x_out, y_out)
glm.plot_posterior_predictive(trace_robust, samples=100,
                              label='posterior predictive regression lines')
ppl.plot(ax, x, true_regression_line, 
         label='true regression line', linewidth=5.)
plt.legend();
fig.savefig('ppc3.png')
In [32]:
Image('ppc3.png')
Out[32]:

Real-world example: Algorithmic Trading

  • Pairs trading is a well-known technique that plays two stocks against each other.
  • For this to work, the stocks must be correlated (cointegrated).
  • One common example is the price of gold (GLD) and the price of gold mining operations (GDX).
In [33]:
import zipline
import pytz
from datetime import datetime
fig = plt.figure(figsize=(8, 4))

prices = zipline.data.load_from_yahoo(stocks=['GLD', 'GDX'], 
                                 end=datetime(2013, 8, 1, 0, 0, 0, 0, pytz.utc)).dropna()[:1000]
prices.plot();
In [34]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111, xlabel='Price GDX in \$', ylabel='Price GLD in \$')
colors = np.linspace(0.1, 1, len(prices))
mymap = plt.get_cmap("winter")
sc = ax.scatter(prices.GDX, prices.GLD, c=colors, cmap=mymap, lw=0)
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in prices[::len(prices)//10].index]);
fig.savefig('price_corr.png')
In [35]:
Image('price_corr.png')
Out[35]:

A naive model assumes a constant linear regression.

In [36]:
with pm.Model() as model_reg:
    family = pm.glm.families.Normal()
    pm.glm.glm('GLD ~ GDX', prices, family=family)
    trace_reg = pm.sample(2000, pm.NUTS(), progressbar=False)

Hm... kinda unsatisfying...

In [37]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111, xlabel='Price GDX in \$', ylabel='Price GLD in \$', 
            title='Posterior predictive regression lines')
sc = ax.scatter(prices.GDX, prices.GLD, c=colors, cmap=mymap, lw=0)
glm.plot_posterior_predictive(trace_reg, samples=100, 
                              label='posterior predictive regression lines',
                              lm=lambda x, sample: sample['Intercept'] + sample['GDX'] * x,
                              eval=np.linspace(prices.GDX.min(), prices.GDX.max(), 100))
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in prices[::len(prices)//10].index]);
ax.legend(loc=0);
fig.savefig('ppc4.png')
In [38]:
Image('ppc4.png')
Out[38]:
  • Clearly the regression between GDX and GLD changes over time.
  • But it does so gradually.
  • Can we build a model that allows for gradual changes in the coefficients?
  • YES!

Improved model

  • Assumes that the intercept and slope follow a random walk.
  • At each time point, the coefficients can move a step away from their previous values.
  • This allows the coefficients to track the regression as it changes over time.

$$ \alpha_t \sim \mathcal{N}(\alpha_{t-1}, \sigma_\alpha^2) $$ $$ \beta_t \sim \mathcal{N}(\beta_{t-1}, \sigma_\beta^2) $$

In [39]:
from pymc.distributions.timeseries import *
from theano.tensor import repeat

$$\text{Priors for }\sigma_{\alpha}\text{ and }\sigma_{\beta}$$

In [40]:
model_randomwalk = pm.Model()
with model_randomwalk:
    # std of random walk, best sampled in log space.
    sigma_alpha, log_sigma_alpha = model_randomwalk.TransformedVar(
                            'sigma_alpha', 
                            pm.Exponential.dist(1./.02, testval = .1), 
                            pm.logtransform
    )
    sigma_beta, log_sigma_beta = model_randomwalk.TransformedVar(
                            'sigma_beta', 
                            pm.Exponential.dist(1./.02, testval = .1),
                            pm.logtransform
    )

Define regression coefficients to follow a random walk.

In [41]:
# To make the model simpler, we will apply the same coefficient for 50 data points at a time
subsample_alpha = 50
subsample_beta = 50

with model_randomwalk:
    alpha = GaussianRandomWalk('alpha', sigma_alpha**-2, 
                               shape=len(prices) / subsample_alpha)
    beta = GaussianRandomWalk('beta', sigma_beta**-2, 
                              shape=len(prices) / subsample_beta)
    
    # Make coefficients have the same length as prices
    alpha_r = repeat(alpha, subsample_alpha)
    beta_r = repeat(beta, subsample_beta)    

Define regression and likelihood

In [42]:
with model_randomwalk:
    # Define regression
    regression = alpha_r + beta_r * prices.GDX.values
    
    # Assume prices are Normally distributed, the mean comes from the regression.
    sd = pm.Uniform('sd', 0, 20)
    likelihood = pm.Normal('y', 
                           mu=regression, 
                           sd=sd, 
                           observed=prices.GLD.values)

Inference!

In [43]:
from scipy import optimize
with model_randomwalk:
    # First optimize random walk
    start = pm.find_MAP(vars=[alpha, beta], fmin=optimize.fmin_l_bfgs_b)
    
    # Sample
    step = pm.NUTS(scaling=start)
    trace_rw = pm.sample(2000, step, start=start, progressbar=False)

Intercept changes over time.

In [44]:
fig = plt.figure(figsize=(8, 6))
ax = plt.subplot(111, xlabel='time', ylabel='alpha', title='Change of alpha over time.')
ppl.plot(ax, trace_rw[-1000:][alpha].T, 'r', alpha=.05);
ax.set_xticklabels([str(p.date()) for p in prices[::len(prices)//5].index]);
fig.savefig('rwalk_alpha.png')
In [45]:
Image('rwalk_alpha.png')
Out[45]:

Slope changes over time.

In [46]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='time', ylabel='beta', title='Change of beta over time')
ppl.plot(ax, trace_rw[-1000:][beta].T, 'b', alpha=.05);
ax.set_xticklabels([str(p.date()) for p in prices[::len(prices)//5].index]);
fig.savefig('rwalk_beta.png')
In [47]:
Image('rwalk_beta.png')
Out[47]:

Regression slowly adapts to best fit current data

In [48]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Price GDX in \$', ylabel='Price GLD in \$', 
            title='Posterior predictive regression lines')

colors = np.linspace(0.1, 1, len(prices))
colors_sc = np.linspace(0.1, 1, len(trace_rw[-500::10]['alpha'].T))
mymap = plt.get_cmap('winter')
mymap_sc = plt.get_cmap('winter')

xi = np.linspace(prices.GDX.min(), prices.GDX.max(), 50)
for i, (alpha, beta) in enumerate(zip(trace_rw[-500::10]['alpha'].T, trace_rw[-500::10]['beta'].T)):
    for a, b in zip(alpha, beta):
        ax.plot(xi, a + b*xi, alpha=.05, lw=1, c=mymap_sc(colors_sc[i]))
        
sc = ax.scatter(prices.GDX, prices.GLD, label='data', cmap=mymap, c=colors)
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in prices[::len(prices)//10].index]);
fig.savefig('ppc5.png')
In [49]:
Image('ppc5.png')
Out[49]:

Conclusions

  • Probabilistic Programming allows you to tell a generative story.
  • Blackbox inference algorithms allow estimation of complex models.
  • PyMC3 puts advanced samplers at your fingertips.

Outstanding Issues

Scalability

  • Variational Inference
  • see also Max Welling's work for scaling MCMC

Usability

  • still too difficult to use
  • wanted: library on top of PyMC3 with common models

Further reading