Introduction

This tutorial provides a basic introduction to the programming language Stan. Stan is a free and open-source probabilistic programming language built for Bayesian statistical inference, supporting MCMC sampling, optimisation and variational inference. It has received much attention as a faster platform for Bayesian inference than alternatives such as JAGS and WinBUGS, largely through its use of Hamiltonian Monte Carlo (HMC) sampling.

In this tutorial, we will learn how to implement a Bayesian analysis in Stan, from defining the model to reading the outputs. First, we define the model and prior information as code blocks in Stan, then submit the model for compilation and sampling from R. Afterwards, we explore a selection of simple model-fitting examples applicable to practical situations, including sports, education and disease spread. Finally, we will learn about the advantages and limitations of the Stan language.

In our experience, it can take a while to set up RStan, so we recommend setting it up beforehand. You will need to:

  1. Install R for Windows version 3.4.0 or later.
  2. Install RStudio for Windows version 1.2.x or later.
  3. Create a local folder to store the RStan package, its dependencies and all other packages used in this tutorial (e.g. Documents/Stan-tutorial/r-libraries).
  4. Open RStudio and use the .libPaths() function to point the RStudio session to the created folder:
.libPaths("path/to/directory")
  5. Follow the instructions given on the RStan-Getting-Started page to install and load RStan.

Other packages required for this tutorial are: tidyverse, bayesplot, deSolve.

Stan models can also be run from other platforms such as Python, MATLAB, Julia, Stata, Mathematica, or Scala. However, the examples discussed in this tutorial are run using RStan (the R interface to Stan) on the Windows operating system. We do not cover the theory of Bayesian statistics or MCMC convergence, but expect users to have a basic understanding of both.

If you would like to use HPC during this tutorial, the procedure is the same as for any other programming language; you may use the BRAG HPC guide for this setup. We do not cover how to set up Stan on Mac or any other operating system.

Downloading the tutorial files

To start the tutorial:

  1. Go to the github repository brag-stan-guide.
  2. Click the green “Clone or Download” button above, and then “Download ZIP”.
  3. Unzip the repository in your desired location.
  4. Open the file: brag-stan-guide.Rproj.

Defining a model in Stan

Stan models are specified in several blocks. The main ones are:

  • data
  • parameters
  • transformed parameters
  • model
  • generated quantities

To illustrate the use of these blocks, let’s set up a model for a coin-flip experiment.

library(rstan)
library(knitr)
flip_data <- c(0,1,0,1,1,0,0,1,1,0)

We first need to tell Stan what our data look like. Every quantity in Stan is declared as a particular type (int or real), and bounds are given where appropriate. To declare a collection of values, such as the array Y below, we give its length in square brackets ([]) after the variable name.

data {
  int<lower=0> n_trials; // number of coin-flip trials
  int<lower=0,upper=1> Y[n_trials]; // data (heads = 1, tails = 0)
}

We now specify the parameters of the model. Suppose for illustration that instead of the standard Bernoulli parameter \(\theta \in \left[0,1\right]\) we are interested in the logit-probability parameter \(\phi\): \[ \phi = \text{logit}{\left(\theta\right)} = \text{log}{\left(\frac{\theta}{1-\theta}\right)} \] We declare \(\phi\) in the parameter block:

parameters {
  real phi; // logit-probability as parameter of interest
}

The transformed parameters block declares and defines any values which are functions of the parameters. Stan’s bernoulli function doesn’t accept the logit-probability \(\phi\), so we will define \(\theta\) here to use it when defining the model. We use upper and lower bounds to precisely define the values which \(\theta\) can take:

transformed parameters {
  real<lower=0,upper=1> theta;
  theta = exp(phi)/(1+exp(phi)); // probability as the inverse-logit of phi
}

We are now ready to define the model. This is the generating process of the data, based on the parameter values. The prior distributions of the parameters are also defined in this block. Stan is quite nice in that the sampling distribution of a vector can be defined in a single line, in addition to the traditional for-loop approach familiar from JAGS and WinBUGS.

model {
  phi ~ normal(0,10^4); // prior for phi is a wide (SD = 10^4) normal distribution centered at 0.

  Y ~ bernoulli(theta); // define the likelihood with vector statement.
  
  // You can also define using a for loop, looping over each trial:
  // for(trial in 1:n_trials){
  //   Y[trial] ~ bernoulli(theta); // define 
  // }
}

Finally, the generated quantities block holds any quantities generated after sampling has occurred, such as predicted values for new data, or transformations of the parameter(s) not needed in the model block. Here we will generate the relative odds, as well as a predicted value for a new trial.

generated quantities {
  int<lower=0,upper=1> Y_dash; // Prediction for new trial
  real<lower=0> relative_odds; // Relative odds (how likely is heads compared to tails?)
  relative_odds = theta/(1-theta);
  Y_dash = bernoulli_rng(theta);
}

The above blocks have been saved in a text file called coin_flip.stan. By using the .stan file type, RStudio knows it is a Stan model and applies the correct formatting to it. It also provides a syntax-checking function, so it can tell you when you’ve left off a “;” at the end of a line.

To run the code, we can either feed it straight into the stan() function:

cf_result.stanfit <- stan(
  file = "models/coin_flip.stan",
  data = list(
    n_trials = length(flip_data),
    Y = flip_data
  ))

or we can compile the model first:

cf.stanmodel <- stan_model(
  file = "models/coin_flip.stan"
)

and then use the sampling() function:

cf_result.stanfit <- sampling(
  object = cf.stanmodel,
  data = list(
    n_trials = length(flip_data),
    Y = flip_data
  ))

If the underlying file models/coin_flip.stan doesn’t change, then Stan will only compile the model the first time stan() or stan_model() are called.
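
In both cases, the result is a stanfit object, which we can summarise with print(). A quick check might look like the following (the exact numbers will depend on your run):

print(cf_result.stanfit, pars = c("phi", "theta", "relative_odds", "Y_dash"))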

There are two more blocks which may be of use:

  • functions: for defining your own functions (see the Stan Reference Manual, Chapter 9, for details).
  • transformed data: for declaring and defining intermediate values which are functions of your data. For example, for positively skewed data, you might model the log of the data instead of the raw data:

transformed data {
  real log_Y[n_trials];
  for (n in 1:n_trials)
    log_Y[n] = log(Y[n]); // illustrative only: not meaningful for the binary coin-flip data
}

Example: Fitting Basketball Data

In this example, we use basketball shooting data to illustrate the use of a Bayesian logistic regression model.

Loading the R libraries we need:

library("ggplot2")
library("rstan")
library("bayesplot")
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
source('source-code/0 utils.R') # to plot the basketball court
set.seed("314413") #making the results reproducible

Data

We import the shooting data from the file called basketball.csv into a data frame called df and we check the structure of the variables.

df <- read.csv("data/basketball.csv")
#str(df)
df$outcome <- factor(as.character(df$biscorecounter))

The problem

We formulate the hypothesis that some of the independent variables or predictors play a crucial role in shooting effectiveness (success or failure). For instance, we know that the distance from the basket of the player taking the shot (dis) is relevant to the outcome. We assess the impact of the predictors on the response variable outcome.

A summary of four covariates and the response variable is shown below:

Variable   Type     Description
outcome    binary   scoring (1), missing (0)
dis        numeric  distance from the basket (in feet)
rad        numeric  angle (in radians); zero corresponds to the line from basket to basket
postime    integer  possession time, from 0 to 25 seconds
dis_trav   numeric  distance travelled (in feet)

We visually inspect the distribution of the four covariates in successful (1) and missed shots (0) using box plots.

par(mfrow = c(2,2), oma = c(2,4,3,1.5) , mar = c(1.5,2,1.5,1.5))
boxplot(dis ~ outcome, data = df, main= "distance", ylab= "distance") 
boxplot(rad ~ outcome, data = df, main="angle", ylab="angle") 
boxplot(postime ~ outcome,data = df, main="possession time", ylab="possession time") 
boxplot(dis_trav ~ outcome, data = df, main="distance travelled", ylab="distance travelled") 

P_180 + 
  geom_jitter(data = df, aes(x = y-10, y = -x+6, 
                          shape = outcome,
                          col = outcome), alpha= 0.2) + 
            theme_bw() + xlim(0,47) 
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

Bayesian model

We define the model in Stan language as follows. Since the response is binary, we use the Bernoulli likelihood (bernoulli_logit function). We use a logistic regression model with an intercept \(\beta_1\) and slopes \(\beta_2\), \(\beta_3\), etc.
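
In other words, for shot \(i\) the model assumes \[ \text{outcome}_i \sim \text{Bernoulli}(p_i), \qquad \text{logit}(p_i) = \beta_1 + \beta_2 \, \text{dis}_i + \beta_3 \, \text{rad}_i + \beta_4 \, \text{postime}_i + \beta_5 \, \text{dis\_trav}_i. \]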

model <- '
  data {
    int<lower=0> N;
    vector[N] dis;
    vector[N] rad;
    vector[N] postime;
    vector[N] dis_trav;
    int<lower=0,upper=1> outcome[N];
    
  }
  parameters {
    real beta[5];
  }
  model {
    outcome ~ bernoulli_logit(beta[1] + beta[2] * dis + beta[3] * rad + beta[4] * postime + beta[5] * dis_trav);
  
  //priors
  beta ~ normal(0, 100);  
  }
  '

In rjags/BUGS the model would be defined as follows:

model_jags = "
model {
  for (i in 1 : N) {
    outcome[i] ~ dbern(p[i])   #likelihood
    logit(p[i]) <- beta[1] + beta[2] * dis[i] + beta[3] * rad[i] + beta[4] * postime[i] + beta[5] * dis_trav[i]
  }
  
  # Priors
  for (j in 1 : 5) {
    beta[j] ~ dnorm(0, 1 / 100 ^ 2)
  }

}"

We then create a list containing the number of observations, the predictors, and the response.

data <- list(N = nrow(df), # total numb of shots
             outcome = as.numeric(as.character(df$outcome)),
             dis= df$dis,
             rad = df$rad,
             postime = df$postime,
             dis_trav = df$dis_trav
) 

To fit the model in Stan, we have to supply the model code and the data list created above. We also need to set other arguments, e.g. the number of chains, iterations, etc.

fit <- stan(model_code = model,
                 model_name = "model",
                 pars = 'beta', 
                 data = data,
                 iter = 5000,   
                 warmup = 2000, 
                 thin = 2, 
                 chains = 3, 
                 verbose = F,
                 seed = 2019,
                 refresh = max(5000/10, 1)
) 
stats <- summary(fit)
stats <- stats$summary
kable(stats[1:5,])
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[1] 0.6684490 0.0034865 0.2229762 0.2348632 0.5144865 0.6658918 0.8156175 1.1086425 4090.059 0.9999769
beta[2] -0.0513403 0.0001541 0.0101798 -0.0713097 -0.0581530 -0.0512236 -0.0443981 -0.0313309 4362.449 1.0000693
beta[3] -0.1544514 0.0012310 0.0788210 -0.3090133 -0.2077561 -0.1535710 -0.1018914 -0.0042478 4099.631 1.0000707
beta[4] -0.0758065 0.0004416 0.0280942 -0.1314366 -0.0948278 -0.0759443 -0.0564092 -0.0224400 4046.669 0.9998445
beta[5] 0.0038214 0.0000501 0.0031118 -0.0023219 0.0017390 0.0038116 0.0058796 0.0099789 3854.495 0.9997607
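
Since the model is on the logit scale, the slopes are log-odds ratios; exponentiating the posterior draws gives odds ratios per unit increase of each predictor. A small post-processing sketch using the fit above (the object name odds_ratios is just for illustration):

#extract posterior draws of the coefficients (a matrix: iterations x 5)
beta_draws <- rstan::extract(fit, pars = "beta")$beta

#posterior median odds ratio per unit increase of each predictor
odds_ratios <- exp(apply(beta_draws, 2, median))
names(odds_ratios) <- c("intercept", "dis", "rad", "postime", "dis_trav")
round(odds_ratios, 3)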

We then perform diagnostics of the chains by visually inspecting the trace and the posterior density plots.

color_scheme_set("blue")
posterior <- as.array(fit)
mcmc_dens_overlay(posterior, pars = c('beta[1]', 'beta[2]', 
                                              'beta[3]', 'beta[4]',
                                              'beta[5]'))

mcmc_trace(posterior, pars = c('beta[1]', 'beta[2]', 
                                       'beta[3]', 'beta[4]',
                                       'beta[5]')) 

mcmc_acf(posterior, pars = c('beta[1]', 'beta[2]', 
                                    'beta[3]', 'beta[4]',
                                    'beta[5]'), lags = 10)

Example: Fitting to Eight School Data

This is a hierarchical Bayesian model, used to study data from an educational experiment described in Gelman et al. (2013). The purpose of the study was to examine the effects of a test preparation program conducted in eight different high schools in New Jersey. A separate randomised experiment was conducted in each school, and the administrators of each school implemented the program in their own way. The model is described as: \[ \begin{align*} y_j &\sim \text{Normal}(\theta_j, \sigma_j^{2}), \\ \theta_j &\sim \text{Normal}(\mu, \tau^{2}), \\ \mu &\sim \text{Normal}(0, 5), \\ \tau &\sim \text{HalfCauchy}(0,5), \end{align*} \] where \(j = 1,\ldots, 8\). The measurements \(y_j\) and uncertainties \(\sigma_j\) are the estimates and standard errors from separate regressions performed for each school, as shown in the following table.

School Estimated effect, \(y_j\) Standard error of effect, \(\sigma_j\)
A 28 15
B 8 10
C -3 16
D 7 11
E -1 9
F 1 11
G 18 10
H 12 18

This is a well known example for Stan users. We use this example here to illustrate two things. First, to use the Stan program to generate data from the model, which may be used for prior predictive analysis or to validate the probabilistic model. Second, to study the effect of centered and non-centered parametrisation of the model on MCMC diagnostics. We have already saved the data as schools_data.RData in the data folder.

Sometimes we may wish to generate fake data from our probabilistic model to validate the model. For this, parameters are drawn from the priors, data are simulated, and the parameters are then re-estimated from the model. The problem with this approach is that we usually have to maintain two pieces of code: the data-generation model and the estimation model. If we make a change in one, we must reflect the change in the other. This may not be an issue for simple models, but it can be for complex models.

We can fix this issue by making use of Stan’s generated quantities block and using one piece of code for both purposes. We do this by evaluating the likelihood conditionally. If the likelihood is not evaluated, then the posterior predictive draws are the same as the prior predictive draws - that is, the fake data.

We use the following Stan program to simulate fake data from the eight school hierarchical model, illustrating the use of one piece of code for simulation and estimation.

//saved as 8school_centered.stan

data {
  int<lower=0> J;         //number of schools 
  real y[J];              // estimated treatment effects
  real<lower=0> sigma[J]; // sd. error of effect estimates 
  int<lower =0, upper =1> run_estimation; //a switch to evaluate likelihood (=1) or not (=0)
}
parameters {
  real mu; 
  real<lower=0> tau;
  real theta[J];
}
model {
  //priors
  mu ~ normal(0,5);
  tau ~ cauchy(0,5);
  theta ~ normal(mu, tau);
  
  //evaluate the likelihood conditionally
  if(run_estimation == 1){
    y ~ normal(theta, sigma);
  }
}
generated quantities{
  real y_sim[J]; //array to hold simulated data
 
    for(j in 1:J) {
       y_sim[j] =  normal_rng(theta[j], sigma[j]);
    }
}

The value assigned to the variable run_estimation determines whether the likelihood is evaluated. If the value given to this variable is \(0\), then the likelihood is not computed, and the samples returned by y_sim, defined in the generated quantities block, are the fake data. If the value of run_estimation is \(1\), then the likelihood is evaluated and the samples returned by y_sim are the posterior predictions. Let us generate some data using this Stan program.

# first load rstan
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# then load the school data
load('data/schools_data.RData')

#prepare data to pass to Stan program
stan_data <- list(J = 8,         
             y =  schools_data$y,
             sigma = schools_data$sigma, 
             run_estimation = 0  #switch off likelihood
            )

schools_sim_data <-  stan("models/8school_centered.stan",  #the Stan program
                         data = stan_data,        #data to be used by Stan program
                         chains = 1,               #number of MCMC chains
                         iter = 1000,               #number of iterations per chain
                         control = list(adapt_delta = 0.99, max_treedepth = 15)
                         #warmup = 1000
                      )
#save(schools_sim_data, file = "results/schools_sim_data.RData", envir = .GlobalEnv)

# extract simulated data 
fake_data <- as.data.frame(extract(schools_sim_data, pars = 'y_sim'))  

Your Turn

Use the fake data corresponding to one of the iterations (e.g. the 1st iteration) to validate the model. For this, you will need to assign the value \(1\) to the run_estimation variable.
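
If you want a hint, one possible workflow is sketched below (assuming the objects created above; the names y_fake, stan_data_val and schools_validation are only for illustration):

#pick the fake data from one iteration (e.g. the 1st) of the prior predictive run
y_fake <- as.numeric(fake_data[1, ])

#refit the model to the fake data with the likelihood switched on
stan_data_val <- list(J = 8,
                      y = y_fake,
                      sigma = schools_data$sigma,
                      run_estimation = 1)

schools_validation <- stan("models/8school_centered.stan",
                           data = stan_data_val,
                           chains = 4,
                           iter = 2000)

#compare the posteriors of mu, tau and theta with the parameter values drawn
#in that iteration (these can be extracted from schools_sim_data)
print(schools_validation, pars = c("mu", "tau", "theta"))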

Note that the above Stan program provides a centered parametrisation of the hierarchical model. Let us fit actual data using the above Stan program.

# prepare data to pass to Stan program
stan_data <- list(J = 8,         
             y =  schools_data$y,
             sigma = schools_data$sigma, 
             run_estimation = 1  #switch on the likelihood
            )

schools_fit <-  stan("models/8school_centered.stan",  #the Stan program
                         data = stan_data,        #data to be used by Stan program
                         chains = 4,               #number of MCMC chains
                         iter = 2500,               #number of iterations per chain
                         #control = list(adapt_delta = 0.99, max_treedepth = 15)
                         warmup = 1000
                      )
## Warning: There were 74 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess

There is a warning message regarding some divergent transitions. Divergences in HMC arise when the Hamiltonian transition encounters regions of extremely large curvature. Stan uses a heuristic to quickly identify these misbehaving trajectories, and hence label divergences, without having to wait for them to run all the way to infinity. Sometimes this issue can be resolved by adjusting the step size of the Hamiltonian transition: the smaller the step size, the more accurate the trajectory and the less likely it is to be mislabelled as a divergence. To do this, we can increase the adapt_delta parameter from its default value of \(0.8\) closer to its maximum value of \(1\).
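
For example, we could refit the same centered model with a larger adapt_delta (a sketch; schools_fit_ad is just an illustrative name, and sampling will be slower because smaller steps are taken):

schools_fit_ad <- stan("models/8school_centered.stan",
                       data = stan_data,                   #run_estimation = 1, as above
                       chains = 4,
                       iter = 2500,
                       warmup = 1000,
                       control = list(adapt_delta = 0.99)) #default is 0.8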

In some cases, divergences can be fixed by increasing the adapt_delta parameter (which decreases the step size), but in other cases the model needs to be reparametrised to overcome them. To visualise where divergences occur, we can use pairwise plots.

pairs(schools_fit, pars = c("theta[1]","tau"), las = 1)

We can see that most divergences (red spots) occur when \(\tau\) is near \(0\). Let us now use a non-centered parametrisation of the model to see if the divergences are reduced. In the non-centered parametrisation, we do not fit the hierarchical parameter (\(\theta\)) directly; instead, we fit an auxiliary parameter \(\hat{\theta}_{j} \sim Normal(0,1)\), from which we can recover the hierarchical parameter using \(\theta_{j} = \mu + \tau \hat{\theta}_{j}\). We recover the parameter in the transformed parameters block.

//saved as 8school_non_centered.stan
data {
  int<lower=0> J;         //number of schools 
  real y[J];              // estimated treatment effects
  real<lower=0> sigma[J]; // sd. error of effect estimates 
  int<lower =0, upper =1> run_estimation; //a switch to evaluate likelihood (=1) or not (=0)
}
parameters {
  real mu; 
  real<lower=0> tau;
  real theta_hat[J];
}
 transformed parameters {
   real theta[J];
   for (j in 1:J)
        theta[j] = mu + tau * theta_hat[j];
 }
model {
  //priors
  mu ~ normal(0,5);
  tau ~ cauchy(0,5);
  theta_hat ~ normal(0,1);
 //evaluate the likelihood conditionally
  if(run_estimation == 1){
    y ~ normal(theta, sigma);
  }
}
generated quantities{
  real y_sim[J]; //array to hold simulated data
 
    for(j in 1:J) {
       y_sim[j] =  normal_rng(theta[j], sigma[j]);
    }
}

Let us run the new model.

schools_fit_ncp <-  stan("models/8school_non_centered.stan",  #the Stan program
                         data = stan_data,        #data to be used by Stan program
                         chains = 4,               #number of MCMC chains
                         iter = 2500,               #number of iterations per chain
                         #control = list(adapt_delta = 0.99, max_treedepth = 15)
                         warmup = 1000
                      )
#save(schools_fit_ncp, file = "results/schools_fit_ncp.RData", envir = .GlobalEnv)

No divergences are produced.
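
We can confirm this by counting the post-warmup divergent transitions directly from the sampler diagnostics, using rstan's get_sampler_params() (a quick check; count_divergences is just a small helper defined here):

count_divergences <- function(fit) {
  #one matrix of sampler diagnostics per chain; sum the 'divergent__' column
  sampler_params <- get_sampler_params(fit, inc_warmup = FALSE)
  sum(sapply(sampler_params, function(chain) sum(chain[, "divergent__"])))
}

count_divergences(schools_fit)      #centered parametrisation: many divergences
count_divergences(schools_fit_ncp)  #non-centered parametrisation: should be 0 (or very few)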

Example: Fitting an Ordinary Differential Equation Model

In this example, we fit a simple Susceptible-Infective-Recovered (SIR) epidemic model to simulated data. The SIR epidemic model is used to describe the spread of self-limiting diseases, such as measles, within a human population. The model comprises three types of individuals: susceptible, infected and recovered. Susceptible individuals become infected when in contact with infected individuals. Infected individuals remain infectious for a period of time and then recover with lifelong immunity. For the example considered here, let \(S(t)\), \(I(t)\) and \(R(t)\) be the proportion of susceptible, infected and recovered individuals, respectively, at time \(t\). We assume that the population is closed, so that \(S(t) +I(t)+R(t) =1\). The dynamics of this SIR epidemic are described by the following system of ordinary differential equations (ODEs). \[ \begin{align} \frac{dS}{dt} = & \, - \beta S I \nonumber\\ \frac{dI}{dt} = & \, \beta S I - \gamma I\nonumber\\ \frac{dR}{dt} = & \, \gamma I \nonumber \end{align} \] Parameters \(\beta\) and \(\gamma\) correspond to the infection and recovery rates, respectively. Let us first visualise the dynamics of these equations, using given parameter values and initial conditions. In practice, these are not known, but can be estimated from data. We use the deSolve R package to solve the system.

library(deSolve)
library(dplyr)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())  #executes multiple Markov chains in parallel
#-----------------------------------
#Follow these two steps to prepare the data to feed into R's 'ode()' function (see the 'ode' function help page for more details)
#1. define the SIR ODE system as a function 
SIR_model <- function(times, y, pars) {
  with(as.list(c(pars, y)), {
    
    dS <- - beta * y[1] * y[2]
    dI <- beta * y[1] * y[2] - gamma * y[2]
    dR <- gamma * y[2]

    res <- c(dS,dI,dR)
    list(res)
  })
}

#2. define initial condition of ODEs and parameters 
I0 <- 0.01    # initial fraction infected
S0 <- 1 - I0 # initial fraction susceptible
R0 <- 0
pars <- c(beta = 0.5, gamma = 0.1) #parameters 
inits <- c(S0, I0, R0)              #initial condition of ODE
times <- 0:60                       #time span of the epidemic (days)

# Solve the ODEs using 'ode' function
SIR_dynamics <- ode(y = inits, times=times, func = SIR_model, parms=pars, method="ode45")

#store the output in a data frame for later use
SIR_dynamics <- data.frame(SIR_dynamics)
colnames(SIR_dynamics) <- c("time", "S", "I", "R")

Let us now visualise the dynamic of these equations.

plot(NA,NA, xlim = c(0, 60), ylim=c(0, 1), xlab = "Time (days)", ylab="Proportion of population")
lines(SIR_dynamics$S ~ SIR_dynamics$time, col="black")
lines(SIR_dynamics$I ~ SIR_dynamics$time, col="blue")
lines(SIR_dynamics$R ~ SIR_dynamics$time, col="red")
legend(x = 40, y = 0.9, legend = c("Susc.", "Inf.", "Recov."),
       col = c("black",  "blue", "red"), lty = c(1, 1,1), bty="n")

This gives us the true epidemic dynamics of the population. In practice, it may not be possible to determine the proportion infected on each day, nor to determine the disease status of all individuals at a given time. Rather, we can take a sample of individuals from the population at various time points and find out how many are infected. We expect binomially distributed error in these estimates, with the probability of success given by the infected compartment of the ODEs. The sampling can be done as follows.

n_obs <- 30     # number of days sampled 
n_sample <- 50   # number of  individuals sampled per day

# Choose which days the samples are taken. 
ts <- sort(sample(1:60, n_obs, replace=F))

# Extract the "true" fraction of the population infected at each of the sampled days:
 prop_inf <- SIR_dynamics[SIR_dynamics$time %in% ts, 3]

# Generate binomially distributed data.
sample_y <- rbinom(n_obs, n_sample, prop_inf)

Now that we have some simulated data, let us estimate the parameters \(\beta\) and \(\gamma\) and the initial conditions of the ODE using the following Stan program, which is a slightly modified version of the one available here. The Stan model is already saved in a text file called SIR_fit.stan in the models folder.

  functions {
     real[] SIR_ode(real t,  //time
     real[] y,               //state
     real[] params,          //parameters
     real[] x_r,             //real data
     int[] x_i) {            //integer data
  
     real dydt[3];
   
     real beta = params[1];
     real gamma = params[2];
     
     dydt[1] = - beta * y[1] * y[2];
     dydt[2] = beta * y[1] * y[2] - gamma * y[2];
     dydt[3] = gamma * y[2];
  
  return dydt;
 }
}
data {
  int<lower = 1> n_obs;      // number of days sampled
  int<lower = 1> n_sample;   // number of individuals sampled at each time point.
  int y[n_obs];              // observed data
  real t0;                   // initial time point  
  real ts[n_obs];            // time points that were sampled
  int<lower = 1> n_pred;     // number of predicted data points
  real t_pred[n_pred];        // time points at which predictions are made
}
transformed data {
  real x_r[0];
  int x_i[0];
}
parameters {
  real<lower = 0> params[2];      // the two model parameters, constrained to be positive
  real<lower = 0, upper = 1> I0;  //initial fraction of infectives, constrained to be between 0 and 1
}

transformed parameters{
  real y_hat[n_obs, 3];     // output from the ODE solver
  real y0[3];                // initial conditions of ODE
  
   y0[1] = 1-I0;
   y0[2] = I0;
   y0[3] = 0;
  
    // solve ODE for observed time points
  y_hat = integrate_ode_rk45(SIR_ode, y0, t0, ts, params, x_r, x_i);
  
}
model {
  //priors
  params ~ normal(0,2); 
  I0 ~ normal(0.5, 0.5);  
  
 //likelihood 
     y ~ binomial(n_sample, y_hat[, 2]); 
}
generated quantities {
  real y_pred[n_pred, 3];  //prediction from ode  
 
  // Generate predicts  over the entire time span:
  y_pred = integrate_ode_rk45(SIR_ode, y0, t0, t_pred, params, x_r, x_i);
}

Note that the functions block defines the system of ODEs as a function (SIR_ode) with a specific signature. The function takes in a time t (a real value, to allow for time-dependent parameters), a system state y (real array), system parameters params (real array), real data (x_r) and integer data (x_i). The SIR_ode function returns the array of derivatives of the system with respect to time, evaluated at time t and state y. The SIR equations do not have time-dependent parameters, nor do they use real or integer data, so these are not used. Nevertheless, these arguments must be included in the system function.

The data block defines a real value t0, which is passed to Stan’s ODE solver integrate_ode_rk45; we set this value to \(0\) for this example. It also defines the number of predicted data points, n_pred, and the prediction time points, t_pred, both of which are used in the generated quantities block.

The transformed data block consists of two empty data arrays. These empty arrays could instead be supplied directly in the call to the ODE solver as:

y_hat = integrate_ode_rk45(SIR_ode, y0, t0, ts, params, x_r, x_i,
                   rep_array(0.0, 0), rep_array(0, 0));

We include them in the transformed data block for illustration.

The solutions to the SIR equations are defined as transformed parameters. This allows them to be used in the model block and inspected in the output. We use the Runge-Kutta solver, integrate_ode_rk45(), to solve the system of ODEs; it is intended for non-stiff systems. To compute solutions, this function requires the ODE function (defined as SIR_ode), initial state (y0), initial time (t0), requested solution times (ts), parameters (params), and real and integer data (x_r and x_i, respectively). The value assigned to t0 must be less than the values assigned to ts. For stiff ODEs, the integrate_ode_bdf() function is used instead, with the same arguments as integrate_ode_rk45(). Both integration functions are limited as to the origin of the variables used in their arguments: ts, x_r and x_i must be expressions that involve only data or transformed data variables, while the initial state y0 and the parameters params are the only arguments which may involve parameters.

The model block defines priors and the likelihood. Finally, we compute predictions, using the fitted values, in the generated quantities block. Rather than computing predictions at the sampled time points, we make predictions for the entire epidemic time span.

To fit the model, we need to assign values for the variables defined in the data block.

data <- list(n_obs = n_obs,              #number of days sampled
              n_sample = n_sample,      #number of individuals sampled
              y = sample_y,             #observed data from binomial distribution
              t0 = 0,                   #initial time for ODE
              ts = ts,                 #time points at which observations are made
              n_pred = length(1:60),   #number of predictions
              t_pred = c(1:60)        #times at which predictions are made
         )
#save(data, file = "data/SIR_data.RData", envir = .GlobalEnv)

We have saved the data as SIR_data.RData in the data folder. You may wish to use the saved data for model fitting. Let us first test the Stan program using a shorter chain.

test <- stan("models/SIR_fit.stan",     #the path and name of the Stan program
            data = data,               #list containing data
             chains = 1,               #number of MCMC chains
            iter = 10,                #number of iterations per chain
            control = list(adapt_delta = 0.8, max_treedepth = 10)
            )

No errors are produced for this test run. We can now fit the model using multiple chains with more iterations.

SIR_fit <- stan(fit = test,
           data =  data,
            chains = 4,
           warmup = 1000,
           iter = 2500
           )
#save(SIR_fit, file = "results/SIR_fit.RData", envir = .GlobalEnv)

We have saved the result of this fit as SIR_fit.RData in the results folder. We should now check MCMC convergence and whether the parameters used to simulate the data can be recovered. The traceplot() and print() functions can be used for these checks.

traceplot(SIR_fit, pars = c("params", "I0", 'lp__'),  inc_warmup =TRUE)
print(SIR_fit, pars = c("params", "y0"), prob = c(0.025, 0.5, 0.975),digits=3)

## Inference for Stan model: SIR_fit.
## 4 chains, each with iter=2500; warmup=1000; thin=1; 
## post-warmup draws per chain=1500, total post-warmup draws=6000.
## 
##            mean se_mean    sd  2.5%   50% 97.5% n_eff  Rhat
## params[1] 0.444   0.001 0.036 0.378 0.442 0.518  2476 1.001
## params[2] 0.103   0.000 0.006 0.093 0.103 0.115  3567 1.000
## y0[1]     0.981   0.000 0.007 0.965 0.982 0.992  2186 1.001
## y0[2]     0.019   0.000 0.007 0.008 0.018 0.035  2186 1.001
## y0[3]     0.000     NaN 0.000 0.000 0.000 0.000   NaN   NaN
## 
## Samples were drawn using NUTS(diag_e) at Tue Sep 24 10:34:44 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The trace plots show that the chains mixed well and that all chains converge to the same posterior. Output from the print function shows the posterior summary along with the Rhat value for each parameter. The median estimates are \(\beta = 0.442\), \(\gamma = 0.103\), \(S(0) = 0.982\), \(I(0) = 0.018\) and \(R(0) = 0\), which are close to the values used to generate the data. The Rhat statistic compares the between-chain and within-chain variances and should be approximately \(1\pm 0.1\) if the chains have converged, which is the case for this fit.
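
For reference, one common (non-split) form of the statistic from Gelman et al. is \[ \hat{R} = \sqrt{\frac{\widehat{\mathrm{var}}^{+}}{W}}, \qquad \widehat{\mathrm{var}}^{+} = \frac{N-1}{N} W + \frac{1}{N} B, \] where \(W\) is the mean within-chain variance, \(B\) is the between-chain variance and \(N\) is the length of each chain; as noted in the output above, Stan reports the split-chain version of this quantity.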

Let us now plot posterior predictions to see how well the model fits observed data.

#extract the predicted posterior samples 
post <- extract(SIR_fit, pars= ('y_pred'))

#compute the 95% posterior credible interval 
cred_median <- apply(post$y_pred[,,2], 2, median)
cred_low <- apply(post$y_pred[,,2], 2, quantile, probs=c(0.025))
cred_high <- apply(post$y_pred[,,2], 2, quantile, probs=c(0.975))

#create two data frames; one for observed data and one for predicted data
df_obs <- data.frame(y = sample_y/n_sample, time =  data$ts)
df_pred <- data.frame(cred_median, cred_low, cred_high, time =  data$t_pred)

#plot the credible interval with observed data
ggplot(df_obs, aes(x=time, y=y)) +
  geom_point(col="black", shape = 19, size = 1.5) +
  geom_line(data = df_pred, aes(x=time, y=cred_median), color = "blue") +
  geom_line(data = df_pred, aes(x=time, y=cred_high), color = "blue", linetype=3) +
  geom_line(data = df_pred, aes(x= time, y=cred_low), color = "blue", linetype=3) +
   labs(x = "Time (days)", y = "Proportion infected")

The Stan model can recreate the observed dynamics reasonably well. In this example, we described the SIR system of ODEs using fractions of the population, instead of numbers. This allowed us to estimate initial fractions of the ODEs, as well as ODEs parameters (\(\beta\) and \(\gamma\)), as Stan can sample only continuous parameters. Currently, Stan cannot handle integer parameters.

Exercise: Fitting a time series model in Stan

data("nhtemp")
plot(nhtemp,ylab = expression(Temperature ~(degree~"F")),xlab = "Year")

The nhtemp data-set in R is a time series of temperatures (measured in \(^\circ\)F) collected in New Haven, Connecticut over 60 years from 1912-1971. The proposed model for the process is that the temperature depends on the previous year’s temperature as well as a global temperature trend, which is assumed to rise linearly each year: \[ y_i = \alpha + \beta y_{i-1} +\gamma i + \epsilon_i, \] where \[ \epsilon_i \sim \text{Normal}{\left(0,\sigma^2\right)}. \]

The parameters are interpreted as follows:

  • \(\alpha\) is an “intercept” temperature (it does not correspond exactly to the expected temperature at \(i=0\); ignoring the trend, the stationary mean is \(\alpha/\left(1-\beta\right)\)\(^\circ\)F)
  • \(\beta\) is the first order auto-correlation parameter. If \(\beta\) is 0, then there is no auto-correlation, whereas if \(\beta\) is less than or greater than 0, then the current value \(y_i\) will be dependent on the previous value \(y_{i-1}\).
  • \(\gamma\) is the linear trend, where each year the average temperature increases by \(\gamma\)\(^\circ\)F.

Create a Stan model and use MCMC sampling in R to:

  1. Estimate the first order auto-correlation in the model.
  2. Estimate the long term linear trend.
  3. Forecast the temperature for the next 20 years (1972 - 1991) after the data-set.

Use this code as a template for your Stan model:

data {
  int<lower=0> N;
  int<lower=2> N_new;
  vector[N] y;
}

parameters {
  FILL
}

model {
  for (n in 2:N){
    y[n] ~ normal(FILL, FILL);
  }
}

generated quantities{
  real y_new[N_new];
  y_new[1] = normal_rng(FILL, FILL);
  for (n in 2:N_new){
    y_new[n] = normal_rng(FILL, FILL);
  }
}

To learn more about fitting a time series in Stan, check out the Time Series chapter in the Stan User’s Guide.

Hamiltonian Monte Carlo: the engine under the hood

Stan uses Hamiltonian Monte Carlo (HMC) in order to generate samples from the posterior distribution of our model parameters. HMC tends to be much more efficient than other types of Markov Chain Monte Carlo (MCMC) sampling, such as Random Walk Metropolis-Hastings (RW-MH) and Gibbs sampling, especially when the “shape” of the posterior isn’t a nice Gaussian peak. To understand why HMC outperforms these other methods, we have to understand why the other methods fail, so let’s have a refresher as to how they work.

Random Walk Metropolis-Hastings

The RW-MH method is a well-used method for generating samples to approximate the posterior distribution, and it is a very simple recipe to follow. From the current point:

  1. Pick (i.e. sample) a distance and direction in the parameter space.
  2. Move in a straight line in the sampled direction for the sampled distance to get to the new point.
  3. Calculate the ratio of the posterior density at the new point to that at the current point.
  4. Accept or reject: move to the new point with probability equal to this ratio (capped at 1), otherwise stay at the current point. This is repeated until the MCMC chain has satisfactorily converged to the target distribution.

The distribution for sampling the distance and direction is called the kernel. If the kernel is symmetric, meaning that the probability of moving forward a given distance is equal to the probability of moving backward, then the acceptance probability is just equal to the ratio of posteriors. If the kernel is not symmetric, then the acceptance probability needs to incorporate this in order to satisfy a necessary property called detailed balance.
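
To make the recipe concrete, here is a bare-bones random-walk Metropolis sampler in R with a symmetric Gaussian kernel (a toy sketch; the banana-shaped log-density below is only an illustrative target, not one of the models in this tutorial):

#toy banana-shaped log-density (illustrative target only)
log_target <- function(x) {
  -0.5 * (x[1]^2 / 10 + (x[2] + 0.03 * x[1]^2 - 3)^2)
}

rw_metropolis <- function(n_iter, init, proposal_sd) {
  draws <- matrix(NA_real_, nrow = n_iter, ncol = length(init))
  current <- init
  for (i in seq_len(n_iter)) {
    proposal <- current + rnorm(length(current), 0, proposal_sd) #symmetric kernel
    log_ratio <- log_target(proposal) - log_target(current)     #ratio new/current
    if (log(runif(1)) < log_ratio) current <- proposal           #accept, else stay
    draws[i, ] <- current
  }
  draws
}

rw_draws <- rw_metropolis(n_iter = 5000, init = c(0, 0), proposal_sd = 1)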

An interactive animation of RW-MH can be found here. Select “RandomWalkMH” for the algorithm and “banana” for the target distribution. The goal of the MCMC is to have satisfactory convergence of the sample distribution, given by the histograms on the bottom and left margins, to the target distribution, given by the corresponding densities. Run the chain for a while, and try to diagnose why it might take RW-MH a long time to converge. Consider that the perfect sampler would just be drawing from the target distribution itself, so if the current point is in the left tail, the next sample would be just as likely to appear in the right tail as in the left. You can also change the kernel to sample smaller or larger distances using the “Proposal \(\sigma\)” algorithm option.

Gibbs Sampling

Gibbs Sampling is a special case of the general class of Metropolis-Hastings algorithms, with the very nice property that samples are always accepted. It can be used when we know the conditional posterior distribution of each parameter, conditioning on the values of the other parameters. The recipe for getting a new sample from the current sample is as follows:

  1. Sample a new value of parameter 1 from its conditional posterior, holding the values of the other parameters at their current values.
  2. Sample a new value of parameter 2 from its conditional posterior, holding parameter 1 at its new value and the other parameters at their current values.
  3. Repeat for all parameters until all have new values.

To understand this graphically, consider the banana distribution from the previous example, where we have two parameters. To sample a new value of the horizontal parameter, holding the vertical parameter constant, we are essentially drawing a horizontal line at the height of the vertical parameter’s current value, and sampling the horizontal parameter from that line. To sample a new value of the vertical parameter, then we are drawing a vertical line at the horizontal parameter’s current value, and sample the vertical parameter from the line. Try and visualise this for a few iterations of sampling. Can you see why it might be hard for a Gibbs sampler to efficiently explore the target distribution?
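
As a concrete (if much simpler) illustration, here is a Gibbs sampler in R for a standard bivariate normal with correlation \(\rho\), where both full conditionals are known in closed form (a toy sketch, not the banana target above):

#Gibbs sampler for a standard bivariate normal with correlation rho
gibbs_bvn <- function(n_iter, rho, init = c(0, 0)) {
  draws <- matrix(NA_real_, nrow = n_iter, ncol = 2)
  x <- init[1]; y <- init[2]
  cond_sd <- sqrt(1 - rho^2)
  for (i in seq_len(n_iter)) {
    x <- rnorm(1, mean = rho * y, sd = cond_sd) #sample x | y
    y <- rnorm(1, mean = rho * x, sd = cond_sd) #sample y | x (using the new x)
    draws[i, ] <- c(x, y)
  }
  draws
}

gibbs_draws <- gibbs_bvn(n_iter = 5000, rho = 0.9)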

Why are these samplers so slow?

Hopefully you’ve now had a think, and tried to visualise why these two algorithms can become inefficient, taking a long time to converge to the target distribution. The big reason is that the shape of the kernel distributions of the two algorithms doesn’t fit well with the shape of the posterior. RW-MH selects any direction with the same probability, but some directions point to high-density areas while others point to very low-density areas, and the result is a lot of rejections. We can sample smaller distances and improve the acceptance rate, but this can also make the algorithm less efficient, because it then takes a long time to explore the target density. The Gibbs sampler always accepts new samples, but because it can only travel horizontally and vertically, it also takes a long time to get from the bend of the banana to the tails.

HMC: an “intelligent” sampler

In the end neither of these algorithms are smart enough to sample the banana-shaped target density. RW-MH doesn’t consider the target density’s shape, and Gibbs samplers can only move vertically and horizontally. The genius of HMC is that it uses the curvature of the target density itself to inform the kernel distribution, allowing for both a high acceptance rate and good exploration of the target density. The recipe for HMC is as follows:

  1. Pick/sample a distance and direction, i.e. a momentum.
  2. Move for a short time, then evaluate the slope/gradient, and change “momentum” based on current momentum and slope.
  3. Repeat a number of times to arrive at a new point.
  4. Calculate the ratio of the posterior density at the new point to that at the current point.
  5. Accept or reject: move to the new point with probability equal to the ratio (and kernel density ratio).

To visualise HMC, consider the target density as being like a 2 dimensional surface in 3D space, and invert it so the highest probability points are depressions. Consider that a ball on the surface would roll down into the depression, but if it was first “flicked” in some random direction, then it would roll around the surface, slowing down as it moved upwards to low probability regions, and speeding up as it rolls down to the high probability regions. In our idealised world there is no friction between the ball and the surface, so the ball never settles in the high probability regions but rolls around forever. This is essentially how HMC picks its new point, with a few caveats:

  1. Instead of the target density, the “surface” is the logarithm of the target density.
  2. We don’t actually need to roll our ball around forever, so we stop after some finite amount of “time”.
  3. Computers generally can’t evaluate the trajectory of the ball to infinitesimal time-steps, so instead of a smooth trajectory it is approximated by a series of small linear steps.
  4. Because the trajectory is an approximation, in order to satisfy detailed balance we need to have an accept-reject step. The good news is that for HMC the acceptance probability is the ratio of the posterior density values (new/old) multiplied by a relatively simple function of the initial momentum (see the sketch after this list).
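
For the curious, a single HMC step can be written in a few lines of R using a leapfrog integrator. The sketch below assumes a standard normal target, so the gradient of the log-density is simply \(-x\) (a toy illustration, not how Stan is implemented internally):

#toy target: standard normal log-density and its gradient
log_post      <- function(x) -0.5 * sum(x^2)
grad_log_post <- function(x) -x

hmc_step <- function(current, step_size = 0.1, n_steps = 20) {
  x <- current
  p <- rnorm(length(x))                       #sample ("flick") a momentum
  p_init <- p

  #leapfrog: half-step momentum, alternating full steps, final half-step momentum
  p <- p + 0.5 * step_size * grad_log_post(x)
  for (s in seq_len(n_steps)) {
    x <- x + step_size * p
    if (s < n_steps) p <- p + step_size * grad_log_post(x)
  }
  p <- p + 0.5 * step_size * grad_log_post(x)

  #accept-reject based on the change in total "energy"
  log_accept <- (log_post(x) - 0.5 * sum(p^2)) -
    (log_post(current) - 0.5 * sum(p_init^2))
  if (log(runif(1)) < log_accept) x else current
}

#run a short chain starting from x = 0
hmc_draws <- numeric(1000)
x <- 0
for (i in 1:1000) { x <- hmc_step(x); hmc_draws[i] <- x }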

Don’t get it? Surfaces, balls, momentum… I thought we were fitting statistical models? A visualisation might help here! The same interactive animation as before (link) has a Hamiltonian Monte Carlo setting. Change the “Algorithm” option to “HamiltonianMC”, tick “Animate Proposal” in the Visualisation Options, and watch HMC select new points by moving along a trajectory. Notice how it can get from one side of the banana to the other in relatively few steps compared to RW-MH; this is because the kernel is informed by the target density itself.

Limitations of HMC

While Hamiltonian Monte Carlo represents a quantum leap in efficiency of MCMC compared to RW-MH and Gibbs sampling, it does have some issues that you should be aware of.

  1. HMC is not (in itself) a solution for multimodal problems when the modes are far apart. HMC moves along a trajectory, so if there is a wide region of low density between the modes, the particle is unlikely to have the momentum to travel between them, so it rarely will. Unfortunately, teleportation is not an option.
  2. Tuning is required. The number of steps and the amount of “time” per step (the step size) affect the efficiency of the algorithm. A smaller step size makes the trajectory more accurate and increases the acceptance rate, but more steps (and hence more computation) are then needed to explore the target density, so it does not necessarily increase efficiency; a larger number of steps improves exploration per iteration, but at a higher computational cost. Fortunately, Stan adaptively tunes the step size during the warm-up period and chooses the number of steps automatically, so you generally won’t have to worry about this.
  3. HMC is more computationally intensive, so it is generally written in a lower level language like C++. Again, Stan has you covered, so as long as you can write your model in the Stan language, it can handle the rest.

The No-U-Turns Sampler (NUTS)

Stan actually uses a more sophisticated algorithm than plain HMC, but HMC is still at its core. NUTS improves on the efficiency of HMC by stopping trajectories when they start to turn back on themselves. The details of the sampler are beyond the scope of this tutorial, but it alleviates the tuning issues mentioned previously in addition to improving efficiency. The only major related consideration for simple uses of Stan is the max_treedepth tuning parameter: increasing it allows longer trajectories, which can improve acceptance probability and exploration per iteration, but it can result in a substantial increase in computation time.

When TO and NOT TO use Stan

Stan is a powerful and flexible sampling package, however it is suited more to some situations than others. Stan will be useful to you if:

  • You need greater flexibility in the model and/or prior specification. If you can write it down in Stan’s language, Stan will handle the rest!
  • You need greater efficiency than other samplers. Stan is much faster than other sampling packages such as JAGS or WinBUGS.

Some situations where Stan may NOT be helpful:

  • Discrete parameters: An assumption of HMC is that the parameter space is continuous, so Stan cannot sample discrete parameters directly (they must be marginalised out of the model). If your model relies heavily on discrete parameters, JAGS or WinBUGS may be a better option.
  • Missing values: Imputation of missing values is a little more complicated in Stan, since the data block assumes that all elements are observed. See the Missing Data chapter (3.1) of the Stan documentation for more details.
  • Variable time intervals (ODEs): Unlike MATLAB, Stan’s Ordinary Differential Equation (ODE) solver will only allow for fixed time intervals.
  • Divergences: Certain posterior geometries can be too “sharp” for Stan to explore, resulting in excessive divergences. If this occurs, reparameterisation or changing the model may be necessary. The more sophisticated Riemannian Manifold HMC method may also be useful, however this would need to be hand-coded.

Resources

You can find a more complete overview of Hamiltonian Monte Carlo in this ArXiv paper by Radford Neal, or for an in-depth overview of the geometrical concepts and theory, try Michael Betancourt’s ArXiv paper.

For linear and generalised linear models, you can use the computational power of Stan without ever leaving the comfort of traditional R formulas by using the brms R package.
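
For example, something like the basketball logistic regression from earlier could be written as a one-line formula (a sketch, assuming brms is installed and the data frame df from that example; outcome01 is a helper column created here):

library(brms)
df$outcome01 <- as.numeric(as.character(df$outcome)) #0/1 numeric response
fit_brms <- brm(outcome01 ~ dis + rad + postime + dis_trav,
                data = df,
                family = bernoulli(),
                chains = 3, iter = 5000, warmup = 2000)
summary(fit_brms)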

The Stan website contains a huge amount of information, including examples, installation notes for the R implementation of Stan as well as others, documentation including the reference manual, and a forum for questions and debugging.

In our experience the developers at Stan have been very friendly, so if you find yourself stuck and can’t resolve the issue via the forum or reference manual, you can contact the developers to help you out.

If you want to learn more about Bayesian statistical inference, a good book to try is Bayesian Data Analysis by Andrew Gelman. If you are QUT staff or HDR, then you can find the most recent edition online via the library.

Acknowledgements

The basketball dataset was provided by Wade Hobbs from Australian Institute of Sport (AIS).

How to reference this workshop

Aminath Shausan, Edgar Santos Fernandez, Daniel Kennedy (2019), Bayesian Research and Applications Group, QUT. https://danwkenn.github.io/brag-stan-guide/

Appendices

Solution to exercise

The following Stan model allows us to estimate the model parameters and forecast the temperature for subsequent years:

data {
  int<lower=0> N;
  int<lower=2> N_new;
  vector[N] y;
}

parameters {
  real gamma;
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  for (n in 2:N)
    y[n] ~ normal(alpha + beta * y[n-1] + gamma * n, sigma);
}

generated quantities{
  real y_new[N_new];
  y_new[1] = normal_rng(alpha + beta * y[N] + gamma * (N + 1), sigma);
  for (n in 2:N_new){
    y_new[n] = normal_rng(alpha + beta * y_new[n-1] + gamma * (n + N), sigma);
  }
}

We now compile the model and sample for 5000 iterations on 4 parallel chains (here we load a pre-sampled fit to save time):

ts_ar1.stanfit <- readRDS(file = "data/pre-sampled-ts_ar1.RDS")

We can now use the fabulous ggmcmc package to plot our MCMC chains and make sure that they have converged:

library(ggmcmc)
ggs_traceplot(ggs(ts_ar1.stanfit),family = c("gamma|alpha|beta|sigma"))

ggs_autocorrelation(ggs(ts_ar1.stanfit),family = c("gamma|alpha|beta|sigma"))

ggs_density(ggs(ts_ar1.stanfit),family = c("gamma|alpha|beta|sigma"))

We can get numerical summaries for the parameters as well:

print(ts_ar1.stanfit,pars = c("gamma","alpha","beta","sigma"))
## Inference for Stan model: ts_ar1.
## 4 chains, each with iter=5000; warmup=1000; thin=1; 
## post-warmup draws per chain=4000, total post-warmup draws=16000.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## gamma  0.03    0.00 0.01  0.01  0.03  0.03  0.04  0.05  7175    1
## alpha 44.64    0.09 6.83 31.13 40.07 44.56 49.30 57.86  5353    1
## beta   0.11    0.00 0.14 -0.16  0.02  0.11  0.20  0.38  5311    1
## sigma  1.14    0.00 0.11  0.95  1.06  1.13  1.21  1.38  6644    1
## 
## Samples were drawn using NUTS(diag_e) at Tue Sep 24 15:41:42 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

We therefore find there is compelling evidence for a linear trend, with temperatures rising by 0.03 (95% CI: 0.01, 0.05)\(^\circ\)F per year. There is also very little evidence of auto-correlation once the linear trend is taken into account, as the credible interval for \(\beta\) covers 0 (-0.16, 0.38).

We can extract and plot our forecasted temperatures using the As.mcmc.list() function. There may be a simpler way of converting the MCMC samples to a data frame, but here we use the do.call() and lapply() functions:

predictions <- As.mcmc.list(ts_ar1.stanfit,pars = "y_new")
predictions <- 
  do.call(
    rbind.data.frame, 
    lapply(1:4, function(x){return(cbind(predictions[[x]],chain = x,iter = 1:nrow(predictions[[x]])))}))

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
predictions <- predictions %>% melt(id = c("chain","iter"))
# Convert the year to a numeric variable:
levels(predictions$variable) <- 1971 + 1:length(levels(predictions$variable))
predictions$variable <- as.numeric(as.character(predictions$variable))

# Calculate the mean, upper and lower quartiles, and bounds for a 90% Prediction Interval.
prediction_summary <- 
predictions %>% group_by(variable) %>% 
  summarise(mean = mean(value),
            upper_q = quantile(value,0.75),
            lower_q = quantile(value,0.25),
            upper_ci = quantile(value,0.95),
            lower_ci = quantile(value,0.05))

# Plot the real data:
plot <- ggplot() + geom_line(data = data.frame(y = as.vector(nhtemp), x = 1912:1971),aes(x=x,y=y))

# Plot the mean prediction
plot <- plot + geom_line(data = prediction_summary, mapping = aes(x = variable, y = mean))

# Plot the upper and lower quartiles as a ribbon:
plot <- plot + geom_ribbon(
  data = prediction_summary, 
  mapping = aes(x = variable, ymin = lower_q, ymax = upper_q),alpha = 0.3, fill = "red")

# Plot the 90% Prediction interval as a lighter ribbon.
plot <- plot + geom_ribbon(
  data = prediction_summary, 
  mapping = aes(x = variable, ymin = lower_ci, ymax = upper_ci),alpha = 0.2, fill = "red")
plot <- plot + labs(y = expression(Temperature ~(degree~"F")),x = "Year")
plot