{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The best of both worlds: Hierarchical Linear Regression in PyMC3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(c) Thomas Wiecki & Danne Elbers 2020\n", "\n", "The power of Bayesian modelling really clicked for me when I was first introduced to hierarchical modelling. In this blog post we will highlight the advantage of hierarchical Bayesian modelling over non-hierarchical Bayesian modelling. Hierarchical modelling is especially advantageous for multi-level data: its 'shrinkage effect', explained below, makes the most of all available information.\n", "\n", "Having multiple sets of measurements comes up all the time; in psychology, for example, you test multiple subjects on the same task. You then might want to estimate a model that describes the behavior as a set of parameters relating to mental functioning. Often we are interested in individual differences in these parameters but also assume that subjects share similarities (being human and all). Software from our lab, [HDDM](http://ski.cog.brown.edu/hddm_docs/), allows hierarchical Bayesian estimation of a widely used decision-making model, but here we will use a more classical example, hierarchical linear regression, to predict radon levels in houses.\n", "\n", "This is the third blog post on Bayesian modeling in PyMC3; the previous two are:\n", "\n", " * [The Inference Button: Bayesian GLMs made easy with PyMC3](https://twiecki.github.io/blog/2013/08/12/bayesian-glms-1/)\n", " * [This world is far from Normal(ly distributed): Bayesian Robust Regression in PyMC3](https://twiecki.github.io/blog/2013/08/27/bayesian-glms-2/) \n", "\n", "## The data set\n", "Gelman et al.'s (2007) radon dataset is a classic for hierarchical modeling. In this dataset, the amount of the radioactive gas radon was measured in households across all counties of several states. 
Radon gas is known to be the leading cause of lung cancer in non-smokers. It is believed to enter the house through the basement. Moreover, its concentration is thought to differ regionally due to different types of soil.\n", "\n", "Here we'll investigate these differences and try to predict radon levels by county and by where in the house radon was measured. In this example we'll look at Minnesota, a state that contains 85 counties, with the number of measurements per county ranging from 2 to 80.\n", "\n", "![radon](http://www.bestinspectionsllc.com/wp-content/uploads/2016/09/how-radon-enters-a-house-768x678.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we'll load the data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pymc3 as pm\n", "import pandas as pd\n", "\n", "# Load the radon measurements\n", "data = pd.read_csv('radon.csv')\n", "\n", "# Unique county names and the county code of each observation\n", "county_names = data.county.unique()\n", "county_idx = data['county_code'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The relevant part of the data we will model looks as follows:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
 | county | log_radon | floor |
---|---|---|---|
0 | AITKIN | 0.832909 | 1.0 |
1 | AITKIN | 0.832909 | 0.0 |
2 | AITKIN | 1.098612 | 0.0 |
3 | AITKIN | 0.095310 | 0.0 |
4 | ANOKA | 1.163151 | 0.0 |