Bayesian data analysis with PyMC3


Thomas Wiecki


Quantopian Inc.



About me

  • PhD candidate at Brown studying decision making using Bayesian modeling.
  • Quantitative Researcher at Quantopian Inc: Building the world's first algorithmic trading platform in the web browser.

Why should you care about Bayesian Data Analysis?

In [1]:
from IPython.display import Image
import prettyplotlib as ppl
from prettyplotlib import plt
import numpy as np
from matplotlib import rc
rc('font',**{'family':'sans-serif','sans-serif':['Helvetica'], 'size': 22})
rc('xtick', labelsize=14) 
rc('ytick', labelsize=14)
## for Palatino and other serif fonts use:
#rc('font',**{'family':'serif','serif':['Palatino']})
rc('text', usetex=True)
%matplotlib inline
In [2]:
Image('backbox_ml.png')
Out[2]:
  • Blackbox models not good at conveying what they have learned.
In [4]:
Image('openbox_pp.png')
Out[4]:

Probabilistic Programming

  • Model unknown causes of a phenomenon as random variables.
  • Write a programmatic story of how unknown causes result in observable data.
  • Use Bayes formula to invert generative model to infer unknown causes.

Random Variables as Probability Distributions

  • Represents our beliefs about an unknown state.
  • Probability distribution assigns a probability to each possible state.
  • Not a single number (e.g. most likely state).

Coin-flipping experiment.

  • Given multiple flips, what is the probability of getting heads?
  • Maximum Likelihood answer:

$$\frac{\# \text{heads}}{\text{total throws}}$$

  • However:

$$\frac{50}{100} = \frac{1}{2}$$

  • The same point estimate results from 1 head in 2 throws, yet our confidence should be very different (see the quick check below).
  • Clearly something is missing!
  • Quantification of uncertainty.
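A quick illustrative check: the maximum likelihood estimate is identical for 5 heads in 10 throws and 500 heads in 1000 throws, while a Beta posterior under a flat prior (the Bayesian treatment developed below) shows very different uncertainty.

In []:
from scipy import stats

for heads, throws in [(5, 10), (500, 1000)]:
    ml_estimate = heads / float(throws)           # maximum likelihood: #heads / total throws
    tails = throws - heads
    posterior = stats.beta(1 + heads, 1 + tails)  # posterior under a uniform Beta(1, 1) prior
    lo, hi = posterior.interval(0.95)
    print "%d/%d: ML estimate %.2f, 95%% interval (%.2f, %.2f)" % (heads, throws, ml_estimate, lo, hi)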

Moreover...

  • Consider a single flip which comes up heads:

$$ P(\text{heads}) = \frac{1}{1} = 1 $$

  • Again, this doesn't seem right.
  • Incorporate prior knowledge.
In [2]:
from scipy import stats
# grid of equally spaced hypotheses for the chance of heads
x_coin = np.linspace(0, 1, 101)

$$ \text{Express the probability of heads as a random variable } \theta$$

In [4]:
import prettyplotlib as ppl
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel=r'Hypothesis for chance of heads', 
            ylabel=r'Probability of hypothesis', 
            title=r'Prior probability distribution after no coin tosses')
ppl.plot(ax, x_coin, stats.beta(2, 2).pdf(x_coin), linewidth=3.)
ax.set_xticklabels([r'0\%', r'20\%', r'40\%', r'60\%', r'80\%', r'100\%']);
# fig.savefig('coin1.png')
In [98]:
stats.beta(2,2).rvs(1)
Out[98]:
array([ 0.37833632])
In [21]:
import random
import numpy as np

def choose_coin():
    return stats.beta(2, 2).rvs(1) # random.uniform(0,1) # np.random.normal(0,1) # 
# pylab.hist([choose_coin() for dummy in range(100000)], normed=True, bins=100)
successes = []
returns = 0.
for i in range(10000):
    prob_heads = choose_coin()
    results = [random.uniform(0,1) < prob_heads for dummy in range(10)]
    if results.count(True) == 9: 
        successes.append(prob_heads)
        if random.uniform(0,1) < prob_heads:
            returns += -1
        else:
            returns += 1
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
r = ax.hist(np.array(successes), normed=True, bins=20)
print len(successes)
print "Average return", returns / len(successes)
719
Average return -0.613351877608

$$ \theta \sim \text{Beta}(2, 2) $$ $$ P(\theta) = \text{Beta}(2, 2) $$

In [108]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Hypothesis for chance of heads', 
            ylabel='Probability of hypothesis', 
            title='Posterior probability distribution after first heads')
ppl.plot(ax, x_coin, stats.beta(3, 2).pdf(x_coin), linewidth=3.)  # posterior after one heads: Beta(3, 2)
ax.set_xticklabels([r'0\%', r'20\%', r'40\%', r'60\%', r'80\%', r'100\%']);
plt.savefig('coin2.png')

$$ P(\theta | h=1) = \text{Beta}(3, 2) $$

In [10]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Hypothesis for chance of heads', 
            ylabel='Probability of hypothesis', 
            title='Posterior probability distribution after 1 head, 1 tail')
ppl.plot(ax, x_coin, stats.beta(3, 3).pdf(x_coin), linewidth=3.)
ax.set_xticklabels(['0\%', '20\%', '40\%', '60\%', '80\%', '100\%']);
fig.savefig('coin3.png')
In [11]:
Image('coin3.png')
Out[11]:

$$ P(\theta | [h=1, t=1]) = \text{Beta}(3, 3) $$

In [12]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Hypothesis for chance of heads', 
            ylabel='Probability of hypothesis', 
            title='Posterior probability distribution after 20 heads and 20 tails')
ppl.plot(ax, x_coin, stats.beta(22, 22).pdf(x_coin), linewidth=3.)
ax.set_xticklabels(['0\%', '20\%', '40\%', '60\%', '80\%', '100\%']);
fig.savefig('coin4.png')
In [13]:
Image('coin4.png')
Out[13]:

$$ P(\theta | [h=20, t=20]) = \text{Beta}(22, 22) $$

Bayes Formula

$$P(\theta| \text{data}) \propto P(\theta) \ast P(\text{data} |\theta)$$ $$\text{posterior} \propto \text{prior} \ast \text{likelihood}$$

$\theta$: Parameters of the model (chance of getting heads).
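As a minimal sketch of this proportionality, the posterior for the coin example can be evaluated numerically on a grid of $\theta$ values, multiplying the Beta(2, 2) prior from above by the likelihood of the observed flips (for a single parameter this direct computation is feasible):

In []:
from scipy import stats
import numpy as np

theta = np.linspace(0, 1, 101)                  # grid of hypotheses for the chance of heads
prior = stats.beta(2, 2).pdf(theta)             # prior used in the coin example above
heads, tails = 1, 0                             # data: a single heads
likelihood = theta**heads * (1 - theta)**tails  # binomial likelihood (up to a constant)
posterior = prior * likelihood                  # posterior is proportional to prior * likelihood
posterior /= np.trapz(posterior, theta)         # normalize so it integrates to 1
# matches the analytic result Beta(2 + heads, 2 + tails) = Beta(3, 2)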

  • Except in simple cases, the posterior is impossible to compute analytically.
  • Blackbox approximation algorithm: Markov chain Monte Carlo (MCMC) instead draws samples from the posterior.
In [14]:
from scipy import stats
fig = ppl.plt.figure(figsize=(14, 6))
ax1 = fig.add_subplot(121, title='What we want', ylim=(0, .5), xlabel=r'\theta', ylabel=r'P(\theta)')
ppl.plot(ax1, np.linspace(-4, 4, 100), stats.norm.pdf(np.linspace(-4, 4, 100)), linewidth=4.)
ax2 = fig.add_subplot(122, title='What we get', xlim=(-4, 4), ylim=(0, 1800), xlabel=r'\theta', ylabel='\# of samples')
ax2.hist(np.random.randn(10000), bins=20);

Approximating the posterior with MCMC sampling

In [15]:
Image('wantget.png')
Out[15]:
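As an illustrative sketch of what drawing samples from the posterior means, a random-walk Metropolis sampler for the coin posterior can be written in a few lines (PyMC3, introduced next, provides far more efficient samplers such as NUTS):

In []:
import numpy as np
from scipy import stats

def log_post(theta, heads=20, tails=20):
    # unnormalized log posterior: Beta(2, 2) prior plus binomial likelihood
    if theta <= 0 or theta >= 1:
        return -np.inf
    return stats.beta(2, 2).logpdf(theta) + heads * np.log(theta) + tails * np.log(1 - theta)

samples = []
theta = 0.5
for i in range(5000):
    proposal = theta + 0.1 * np.random.randn()  # propose a small random step
    # accept with probability min(1, posterior(proposal) / posterior(current))
    if np.log(np.random.rand()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)
# a histogram of `samples` approximates the Beta(22, 22) posterior shown earlier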

PyMC3

  • Probabilistic Programming framework written in Python.
  • Allows for construction of probabilistic models using intuitive syntax.
  • Features advanced MCMC samplers.
  • Fast: Just-in-time compiled by Theano.
  • Extensible: easily incorporates custom MCMC algorithms and unusual probability distributions.

Linear Models

  • Assumes a linear relationship between two variables.
  • E.g. the relationship between the price of gold and the price of gold miners.
In [16]:
size = 200
true_intercept = 1
true_slope = 2

x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * x
# add noise
y = true_regression_line + np.random.normal(scale=.5, size=size)

data = dict(x=x, y=y)
In [17]:
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel='Value of gold', ylabel='Value of gold miners', title='Synthetic data and underlying model')
ppl.scatter(ax, x, y, label='sampled data')
ppl.plot(ax, x, true_regression_line, label='true regression line', linewidth=4.)
ax.legend(loc=2);
fig.savefig('synth_data.png')
In [18]:
Image('synth_data.png')
Out[18]:

Linear Regression

$$ y_i = \alpha + \beta \ast x_i + \epsilon $$

with $$ \epsilon \sim \mathcal{N}(0, \sigma^2) $$

Probabilistic Reformulation

$$ y_i \sim \mathcal{N}(\alpha + \beta \ast x_i, \sigma^2) $$

Priors

$$ \alpha \sim \mathcal{N}(0, 20^2) $$ $$ \beta \sim \mathcal{N}(0, 20^2) $$ $$ \sigma \sim \mathcal{U}(0, 20) $$

Constructing the model in PyMC3

In [19]:
import pymc as pm

with pm.Model() as model: # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    alpha = pm.Normal('alpha', mu=0, sd=20)
    beta = pm.Normal('beta', mu=0, sd=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)
    
    # Define linear regression
    y_est = alpha + beta * x
    
    # Define likelihood
    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)
    
    # Inference!
    start = pm.find_MAP() # Find starting value by optimization
    step = pm.NUTS(state=start) # Instantiate MCMC sampling algorithm
    trace = pm.sample(2000, step, start=start, progressbar=False) # draw 2000 posterior samples using NUTS sampling

Convenience function glm()

In [20]:
with pm.Model() as model:
    # specify glm and pass in data. The resulting linear model, its likelihood and
    # all its parameters are automatically added to our model.
    pm.glm.glm('y ~ x', data)
    step = pm.NUTS() # Instantiate MCMC sampling algorithm
    trace = pm.sample(2000, step, progressbar=False) # draw 2000 posterior samples using NUTS sampling

Posterior

In [21]:
fig = pm.traceplot(trace, lines={'alpha': 1, 'beta': 2, 'sigma': .5});
In [22]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x, y, label='data')
from pymc import glm
glm.plot_posterior_predictive(trace, samples=100, 
                              label='posterior predictive regression lines')
ppl.plot(ax, x, true_regression_line, label='true regression line', linewidth=5.)
ax.legend(loc=0);
fig.savefig('ppc1.png')
In [22]:
 
In [23]:
Image('ppc1.png')
Out[23]:

Robust Regression

In [24]:
# Add outliers
x_out = np.append(x, [.1, .15, .2, .25, .25])
y_out = np.append(y, [8, 6, 9, 7, 9])

data_out = dict(x=x_out, y=y_out)
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111,  xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x_out, y_out, label='data')
In [25]:
with pm.Model() as model:
    glm.glm('y ~ x', data_out)
    trace = pm.sample(2000, pm.NUTS(), progressbar=False)
In [26]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111,  xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x_out, y_out, label='data')
glm.plot_posterior_predictive(trace, samples=100, 
                              label='posterior predictive regression lines')
ppl.plot(ax, x, true_regression_line, 
         label='true regression line', linewidth=5.)

plt.legend(loc=0);
fig.savefig('ppc2.png')
In [27]:
Image('ppc2.png')
Out[27]:
In [28]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
normal_dist = pm.Normal.dist(mu=0, sd=1)
t_dist = pm.T.dist(mu=0, lam=1, nu=1)
x_eval = np.linspace(-8, 8, 300)
ppl.plot(ax, x_eval, pm.theano.tensor.exp(normal_dist.logp(x_eval)).eval(), label='Normal', linewidth=2.)
ppl.plot(ax, x_eval, pm.theano.tensor.exp(t_dist.logp(x_eval)).eval(), label='Student T', linewidth=2.)
plt.xlabel('x')
plt.ylabel('Probability density')
plt.legend();
fig.savefig('t-dist.png')

Fit strongly biased by outliers

  • The Normal distribution has very light tails.
  • It is therefore very sensitive to outliers.
  • Instead, use a Student T distribution with heavier tails.
In [29]:
Image('t-dist.png')
Out[29]:
In [30]:
with pm.Model() as model_robust:
    family = pm.glm.families.T()
    pm.glm.glm('y ~ x', data_out, family=family)
    
    trace_robust = pm.sample(2000, pm.NUTS(), progressbar=False)
In [31]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, xlabel='Value of gold', ylabel='Value of gold miners', title='Posterior predictive regression lines')
ppl.scatter(ax, x_out, y_out)
glm.plot_posterior_predictive(trace_robust, samples=100,
                              label='posterior predictive regression lines')
ppl.plot(ax, x, true_regression_line, 
         label='true regression line', linewidth=5.)
plt.legend();
fig.savefig('ppc3.png')
In [32]:
Image('ppc3.png')
Out[32]:

Real-world example: Algorithmic Trading

  • Pairs trading is a well-known technique that plays two stocks against each other.
  • For this to work, the stocks must be correlated (cointegrated).
  • One common example is the price of gold (GLD) and the price of gold mining operations (GDX).
In [33]:
import zipline
import pytz
from datetime import datetime
fig = plt.figure(figsize=(8, 4))

prices = zipline.data.load_from_yahoo(stocks=['GLD', 'GDX'], 
                                 end=datetime(2013, 8, 1, 0, 0, 0, 0, pytz.utc)).dropna()[:1000]
prices.plot();
In [34]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111, xlabel='Price GDX in \$', ylabel='Price GLD in \$')
colors = np.linspace(0.1, 1, len(prices))
mymap = plt.get_cmap("winter")
sc = ax.scatter(prices.GDX, prices.GLD, c=colors, cmap=mymap, lw=0)
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in prices[::len(prices)//10].index]);
fig.savefig('price_corr.png')
In [35]:
Image('price_corr.png')
Out[35]:

A naive model assumes a constant linear regression.

In [36]:
with pm.Model() as model_reg:
    family = pm.glm.families.Normal()
    pm.glm.glm('GLD ~ GDX', prices, family=family)
    trace_reg = pm.sample(2000, pm.NUTS(), progressbar=False)

Hm... kinda unsatisfying...

In [37]:
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111, xlabel='Price GDX in \$', ylabel='Price GLD in \$', 
            title='Posterior predictive regression lines')
sc = ax.scatter(prices.GDX, prices.GLD, c=colors, cmap=mymap, lw=0)
glm.plot_posterior_predictive(trace_reg, samples=100, 
                              label='posterior predictive regression lines',
                              lm=lambda x, sample: sample['Intercept'] + sample['GDX'] * x,
                              eval=np.linspace(prices.GDX.min(), prices.GDX.max(), 100))
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in prices[::len(prices)//10].index]);
ax.legend(loc=0);
fig.savefig('ppc4.png')
In [38]:
Image('ppc4.png')
Out[38]:
  • Clearly the regression between GDX and GLD changes over time.
  • But it does so gradually.
  • Can we build a model that allows for gradual changes in the coefficients?
  • YES!

Improved model

  • Assumes that the intercept and slope follow a random walk.
  • At each time point, the coefficients can move a step away from their previous values.
  • This allows the coefficients to track the regression as it changes over time.

$$ \alpha_t \sim \mathcal{N}(\alpha_{t-1}, \sigma_\alpha^2) $$ $$ \beta_t \sim \mathcal{N}(\beta_{t-1}, \sigma_\beta^2) $$

In [39]:
from pymc.distributions.timeseries import *
from theano.tensor import repeat

$$\text{Priors for }\sigma_{\alpha}\text{ and }\sigma_{\beta}$$

In [40]:
model_randomwalk = pm.Model()
with model_randomwalk:
    # std of random walk, best sampled in log space.
    sigma_alpha, log_sigma_alpha = model_randomwalk.TransformedVar(
                            'sigma_alpha', 
                            pm.Exponential.dist(1./.02, testval = .1), 
                            pm.logtransform
    )
    sigma_beta, log_sigma_beta = model_randomwalk.TransformedVar(
                            'sigma_beta', 
                            pm.Exponential.dist(1./.02, testval = .1),
                            pm.logtransform
    )

Define regression coefficients to follow a random walk.

In [41]:
# To make the model simpler, we will apply the same coefficient for 50 data points at a time
subsample_alpha = 50
subsample_beta = 50

with model_randomwalk:
    alpha = GaussianRandomWalk('alpha', sigma_alpha**-2, 
                               shape=len(prices) / subsample_alpha)
    beta = GaussianRandomWalk('beta', sigma_beta**-2, 
                              shape=len(prices) / subsample_beta)
    
    # Make coefficients have the same length as prices
    alpha_r = repeat(alpha, subsample_alpha)
    beta_r = repeat(beta, subsample_beta)    

Define regression and likelihood

In [42]:
with model_randomwalk:
    # Define regression
    regression = alpha_r + beta_r * prices.GDX.values
    
    # Assume prices are Normally distributed, the mean comes from the regression.
    sd = pm.Uniform('sd', 0, 20)
    likelihood = pm.Normal('y', 
                           mu=regression, 
                           sd=sd, 
                           observed=prices.GLD.values)

Inference!

In [43]:
from scipy import optimize
with model_randomwalk:
    # First optimize random walk
    start = pm.find_MAP(vars=[alpha, beta], fmin=optimize.fmin_l_bfgs_b)
    
    # Sample
    step = pm.NUTS(scaling=start)
    trace_rw = pm.sample(2000, step, start=start, progressbar=False)

Intercept changes over time.

In [44]:
fig = plt.figure(figsize=(8, 6))
ax = plt.subplot(111, xlabel='time', ylabel='alpha', title='Change of alpha over time.')
ppl.plot(ax, trace_rw[-1000:][alpha].T, 'r', alpha=.05);
ax.set_xticklabels([str(p.date()) for p in prices[::len(prices)//5].index]);
fig.savefig('rwalk_alpha.png')
In [45]:
Image('rwalk_alpha.png')
Out[45]:

Slope changes over time.

In [46]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='time', ylabel='beta', title='Change of beta over time')
ppl.plot(ax, trace_rw[-1000:][beta].T, 'b', alpha=.05);
ax.set_xticklabels([str(p.date()) for p in prices[::len(prices)//5].index]);
fig.savefig('rwalk_beta.png')
In [47]:
Image('rwalk_beta.png')
Out[47]:

Regression slowly adapts to best fit current data

In [48]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Price GDX in \$', ylabel='Price GLD in \$', 
            title='Posterior predictive regression lines')

colors = np.linspace(0.1, 1, len(prices))
colors_sc = np.linspace(0.1, 1, len(trace_rw[-500::10]['alpha'].T))
mymap = plt.get_cmap('winter')
mymap_sc = plt.get_cmap('winter')

xi = np.linspace(prices.GDX.min(), prices.GDX.max(), 50)
for i, (alpha, beta) in enumerate(zip(trace_rw[-500::10]['alpha'].T, trace_rw[-500::10]['beta'].T)):
    for a, b in zip(alpha, beta):
        ax.plot(xi, a + b*xi, alpha=.05, lw=1, c=mymap_sc(colors_sc[i]))
        
sc = ax.scatter(prices.GDX, prices.GLD, label='data', cmap=mymap, c=colors)
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in prices[::len(prices)//10].index]);
fig.savefig('ppc5.png')
In [49]:
Image('ppc5.png')
Out[49]:

Conclusions

  • Probabilistic Programming allows you to tell a generative story.
  • Blackbox inference algorithms allow estimation of complex models.
  • PyMC3 puts advanced samplers at your fingertips.

Outstanding Issues

Scalability

  • Variational Inference
  • see also Max Welling's work for scaling MCMC

Usability

  • still too difficult to use
  • wanted: library on top of PyMC3 with common models

Further reading