Bayesian Hierarchical Linear Modeling

(With Stan)

What is HLM?

"Hierarchical/Multilevel modeling is a generalization of linear and generalized linear modeling in which regression coefficients are themselves given a model, whose parameters are also estimated from data" (Gelman 2006)

What about Stan?

A flexible probabilistic programming language for statistical modeling, along with a suite of inference tools for fitting models, that is robust, scalable, and efficient

  • DSL for full (HMC, NUTS) & approximate Bayesian inference, written in C++

  • Interfaces for R, Python, Julia, MATLAB & the command line (plus MCMC analysis and visual summaries with shinyStan); a minimal example follows below
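To make the "DSL" concrete, here is a minimal toy Stan program (an illustrative sketch, not from the talk) that estimates the mean and standard deviation of a sample:

```stan
data {
  int<lower=1> N;            // sample size
  vector[N] y;               // observations
}
parameters {
  real mu;                   // mean to estimate
  real<lower=0> sigma;       // sd to estimate
}
model {
  y ~ normal(mu, sigma);     // likelihood; priors left flat (improper uniform)
}
```

The model is compiled to C++ and fit with HMC/NUTS from any of the interfaces above.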

#Hype

A Tale of Two Extremes

Example: Modeling standardized test scores of students, with data on each individual as well as which classroom each individual is in.

  • student identifier $i$

  • classroom identifier $k$

  • standardized test scores $y_{ik}$

Traditional regression would allow modeling this in one of two ways:

Complete Pooling:

Lump everyone together and ignore classroom distinction

Estimate $\hat \theta = \bar y = \frac{ \sum_{i} \sum_{k} y_{ik} }{ \sum_{k} n_{k}}$

No Pooling:

Treat each classroom separately

Estimate $\hat \theta_{k} = \bar y_{.k}$

Limitations

  • With no pooling, we lose information across classrooms since many classrooms may be similar to each other

  • With complete pooling, we lose information based on classroom which may be very valuable

  • So we need a compromise between the two extremes

HLM to the Rescue

  • Natural hierarchy/structure makes HLM ideal

  • Partial Pooling

  • Amount of pooling is determined by how similar the data are across groups

Brief Overview of Bayesian Statistics

$P(A, B) = P(A) \times P(B | A)$

$P(A, B) = P(B) \times P(A | B)$

$P(A) \times P(B | A) = P(B) \times P(A | B)$

$\textit{Bayes' Rule}$: $P(B | A) = \frac{P(B) \times P(A | B)}{P(A)}$

Bayes' Rule

$P(B | A) \propto P(B) \times P(A | B)$

"$posterior$" $\propto$ "$prior$" $\times$ "$likelihood$"

Back to our example...

prior: $\theta_{k} \sim N( \mu, \tau^2)$
likelihood: $\bar y_{.k} \mid \theta_{k} \sim N(\theta_{k}, \frac{\sigma^2}{n_{k}})$
posterior: normal prior $\times$ normal likelihood $\implies$ normal posterior $N(\theta_{k}^{*}, V_{k}^{*})$

Estimate $\hat \theta_{k} = \theta_{k}^{*}$, the posterior mean
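For completeness, the standard normal–normal conjugate update gives the posterior in closed form (note it also supplies the posterior variance):

$\theta_{k} \mid \bar y_{.k} \sim N\left( \frac{ \frac{n_{k} \bar y_{.k}}{\sigma^2} + \frac{\mu}{\tau^2} }{ \frac{n_{k}}{\sigma^2} + \frac{1}{\tau^2} },\ \left( \frac{n_{k}}{\sigma^2} + \frac{1}{\tau^2} \right)^{-1} \right)$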

Partial Pooling and the Shrinkage Factor

$\hat \theta_{k} = \frac{ \frac{n_{k} \bar y_{.k}}{\sigma^2} + \frac{\mu}{\tau^2} }{ \frac{n_{k}}{\sigma^2} + \frac{1}{\tau^2} } = (1 - B_{k}) \bar y_{.k} + B_{k} \mu$
where $B_{k} = \frac{ \frac{1}{\tau^2} }{ \frac{n_{k}}{\sigma^2} + \frac{1}{\tau^2} } = \frac{ \frac{\sigma^2}{n_{k}} }{ \tau^2 + \frac{\sigma^2}{n_{k}} }$ is called the $\textit{shrinkage factor}$

If $\tau^2 \gg \frac{\sigma^2}{n_{k}}$, then $B_{k} \to 0$ and $\hat \theta_{k} \to \bar y_{.k}$ (No Pooling)

If $\tau^2 \ll \frac{\sigma^2}{n_{k}}$, then $B_{k} \to 1$ and $\hat \theta_{k} \to \mu$ (Complete Pooling)
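In Stan, this partial pooling falls out automatically once the hierarchy is written down. A minimal sketch of the classroom model (illustrative only; the data names K, ybar, and se are assumptions, and $\mu$, $\tau$ are estimated rather than fixed):

```stan
data {
  int<lower=1> K;            // number of classrooms
  vector[K] ybar;            // observed classroom mean scores
  vector<lower=0>[K] se;     // standard errors, sigma / sqrt(n_k)
}
parameters {
  real mu;                   // pooled mean across classrooms
  real<lower=0> tau;         // between-classroom sd
  vector[K] theta;           // per-classroom true scores
}
model {
  theta ~ normal(mu, tau);   // prior: pulls each theta_k toward mu
  ybar ~ normal(theta, se);  // likelihood: observed classroom means
}
```

The posterior mean of each theta[k] lands between $\bar y_{.k}$ and $\mu$, weighted exactly by the shrinkage factor above.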

Shrinkage toward Pooled Mean ($\mu$)

Using HLM for Predictions

Example: Predict --------- ---

Using HLM for Predictions

  • 'complete pooling' won't work

  • not enough data for 'no pooling'

  • must use HLM

Setting up the HLM

--------- --- Model

linear model $\implies$ $\hat \theta = \alpha + X\beta$

  • $X$ is predictor/covariate matrix

  • $\beta$ is vector of regression coefficients

  • $\alpha$ is intercept term

identifiers:

  • country $j$

  • region $l$

  • vertical $k$

  • industry $m$

Started from the bottom...

$y_{ijklm}$ = --------- --- for advertisable $i$
$y_{ijklm} \sim \text{Student-}t(\nu, \alpha_{jklm} + X_{i}\beta_{jklm}, \sigma)$

$\nu$ and $\sigma$ are the degrees-of-freedom and scale parameters, respectively, both estimated in Stan

Model for Regression Coefficients $\beta_{jklm}$

  • coefficients (and intercepts) vary by country/vertical

  • coefficient = 'country effect' + 'vertical effect'

$\beta_{jklm} = \gamma_{jl} + \gamma_{km}$

  • and in turn, the 'country effect' is made up of a 'country' and 'region' component

  • likewise the 'vertical effect' is made up of a 'vertical' and 'industry' component

$\gamma_{jl} = \delta_{j} + \delta_{l}$
$\gamma_{km} = \delta_{k} + \delta_{m}$
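Substituting back, each coefficient is the sum of four additive components:

$\beta_{jklm} = \gamma_{jl} + \gamma_{km} = \delta_{j} + \delta_{l} + \delta_{k} + \delta_{m}$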

Models for Country and Vertical components



Country ($j$):
$\delta_{j} \sim MVN(0, \sigma_{j})$

Vertical ($k$):
$\delta_{k} \sim MVN(0, \sigma_{k})$

Models for Region and Industry components



Region ($l$):
$\delta_{l} = \kappa_{l} + \mu_{l}$ where $\mu_{l} \sim MVN(0, \sigma_{l})$

Industry ($m$):
$\delta_{m} = \kappa_{m} + \mu_{m}$ where $\mu_{m} \sim MVN(0, \sigma_{m})$

Summary of --------- --- Model



Transformed Parameters:
$\alpha_{jklm}$, $\beta_{jklm}$, $\gamma_{jl}$, $\gamma_{km}$, $\delta_{l}$, $\delta_{m}$

Parameters:
$\delta_{j}$, $\delta_{k}$, $\kappa_{l}$, $\kappa_{m}$, $\mu_{l}$, $\mu_{m}$, $\sigma$, $\nu$

Hyperparameters:
$\sigma_{j}$, $\sigma_{k}$, $\sigma_{l}$, $\sigma_{m}$

Everything estimated in Stan

Summary of --------- --- Model



Weakly informative priors/hyperpriors on the parameters/hyperparameters:

$\nu \sim \Gamma(2, 0.1)$

$\sigma, \sigma_{j}, \sigma_{k}, \sigma_{l}, \sigma_{m} \sim \text{Cauchy}(0, 2.5)$ (half-Cauchy in effect, since the scales are constrained positive)

Stan Code

Almost perfect transliteration of model diagram:
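The original code was shown as a figure; what follows is a simplified sketch in the same spirit, keeping only the country and vertical layers (the region/industry components follow the same pattern). All variable names here are assumptions, not the original code:

```stan
data {
  int<lower=1> N;                          // advertisables
  int<lower=1> P;                          // predictors (incl. spend)
  int<lower=1> J;                          // countries
  int<lower=1> K;                          // verticals
  matrix[N, P] X;                          // covariate matrix
  vector[N] y;                             // response
  array[N] int<lower=1, upper=J> country;  // country index per row
  array[N] int<lower=1, upper=K> vertical; // vertical index per row
}
parameters {
  real alpha;                              // intercept
  matrix[P, J] delta_j;                    // country components
  matrix[P, K] delta_k;                    // vertical components
  vector<lower=0>[P] sigma_j;              // country-component scales
  vector<lower=0>[P] sigma_k;              // vertical-component scales
  real<lower=0> sigma;                     // residual scale
  real<lower=1> nu;                        // Student-t degrees of freedom
}
model {
  vector[N] mu_n;

  // priors / hyperpriors, matching the slide above
  nu ~ gamma(2, 0.1);
  sigma ~ cauchy(0, 2.5);
  sigma_j ~ cauchy(0, 2.5);
  sigma_k ~ cauchy(0, 2.5);

  // hierarchical priors on the components
  for (j in 1:J)
    col(delta_j, j) ~ normal(0, sigma_j);
  for (k in 1:K)
    col(delta_k, k) ~ normal(0, sigma_k);

  // likelihood: beta = country component + vertical component
  for (n in 1:N)
    mu_n[n] = alpha + X[n] * (col(delta_j, country[n]) + col(delta_k, vertical[n]));
  y ~ student_t(nu, mu_n, sigma);
}
```

Adding the region and industry layers means two more component matrices with their own scales, summed into the coefficient exactly as in $\beta_{jklm} = \delta_{j} + \delta_{l} + \delta_{k} + \delta_{m}$.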

Making Predictions

  • can predict at different spend levels by varying the --------- spend covariate (holding all others constant); see the sketch after this list

  • the full posterior distribution of predicted values provides credible ranges at each level

  • result is a spend- ------ curve for each ------
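A hypothetical sketch of that prediction step, assuming the model sketch above is extended with a spend grid in its data block (G, X_grid, j_pred, and k_pred are all assumed additions, not names from the talk):

```stan
generated quantities {
  // posterior predictive draws along a grid of G spend levels,
  // for one assumed country (j_pred) and vertical (k_pred)
  vector[G] y_pred;
  for (g in 1:G)
    y_pred[g] = student_t_rng(
        nu,
        alpha + X_grid[g] * (col(delta_j, j_pred) + col(delta_k, k_pred)),
        sigma);
}
```

Each posterior draw traces out one full curve, so quantiles across draws give the credible range at every spend level.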

Pros & Cons of HLM

Advantages:

  • improved predictive accuracy over traditional linear regression

  • works well with sparse data (even with more parameters than data points)

  • range of credible values given by posterior

Disadvantages:

  • MCMC is computationally expensive and not embarrassingly parallel (only individual chains can be run in parallel)

  • not "off the shelf"