This tutorial provides a basic introduction to the programming language Stan. Stan is a free and open-source probabilistic modelling language specifically used for Bayesian statistical inference, including MCMC optimisation and variational inference. It has received much attention and as a faster platform for Bayesian inference compared to others such as JAGS and WinBUGS, through the use of Hamiltonian Monte Carlo (HMC) sampling.
In this tutorial, we will learn how to implement Bayesian analysis in Stan from defining the model to reading outputs. First, we define the model and prior information as code blocks in Stan, then submit the model for computation and sampling in R. Afterwards, we will explore a selection of simple model fit examples applicable to practical situations, including sports, education and disease spread. Finally, we will learn about the advantages and limitations of Stan language.
From our experience, it may take a while to setup RStan, so we recommend you to set up RStan beforehand. You will need to:
.libPaths("path/to/directory")
Other packages required for this tutorial are: tidyverse, bayesplot, deSolve.
Stan models can be compiled in other platforms such as Python, MATLAB, Julia, Stata, Mathematica, or Scala. However, the examples discussed in this tutorial are run using RStan (the R interface to Stan) on Windows operating system. We do not cover theory of Bayesian statistics nor MCMC convergence, but expect users to have a basic understanding of these theories.
If you would like to use HPC during this tutorial, the procedure to follow is same as any other programming language. You may use the BRAG HPC guide for this setup. We do not cover how to setup Stan on Mac or any other operating system.
To start the tutorial:
brag-stan-guide.Rproj
.Stan models are specified in several blocks. The main ones are:
data
parameters
transformed parameters
model
generated quantities
To illustrate the use of these blocks, let’s set up a model for a coin-flip experiment.
library(rstan)
library(knitr)
flip_data <- c(0,1,0,1,1,0,0,1,1,0)
We first need to tell Stan what our data look like, and every quantity in Stan is declared as a particular type of quantity (int
, or real
) and bounds are given if appropriate. To specify a vector, we can use the square brackets ([]
) with the length of the vector inside the brackets.
data {
int<lower=0> n_trials; // number of coin-flip trials
int<lower=0,upper=1> Y[n_trials]; // data (heads = 1, tails = 0)
}
We now specify the parameters of the model. Suppose for illustration that instead of the standard Bernoulli parameter \(\theta \in \left[0,1\right]\) we are interested in the logit-probability parameter \(\phi\): \[
\phi = \text{logit}{\left(\theta\right)} = \text{log}{\left(\frac{\theta}{1-\theta}\right)}
\] We declare \(\phi\) in the parameter
block:
parameters {
real phi; // logit-probability as parameter of interest
}
The transformed parameter
block declares and defines any values which are a function of the parameters. Stan’s bernoulli
function doesn’t accept the logit-probability \(\phi\), so we will define \(\theta\) here to use it when defining the model. We use upper and lower bounds to precisely define the values which \(\theta\) can have:
transformed parameters {
real<lower=0,upper=1> theta;
theta = exp(phi)/(1+exp(phi)); // probability as a function of logit_prob.
}
We are now ready to define the model. This is the generating process of the data, based on the parameter values. The prior distributions of the parameters are also defined in this block. Stan is quite nice in that the sampling distribution of a vector can be defined in a single line, as well as the traditional for-loop method from JAGS and WinBUGS.
model {
phi ~ normal(0,10^4); // prior for phi is a wide (SD = 10^4) normal distribution centered at 0.
Y ~ bernoulli(theta); // define the likelihood with vector statement.
// You can also define using a for loop, looping over each trial:
// for(trial in 1:n_trials){
// Y[trial] ~ bernoulli(theta); // define
// }
}
Finally, any quantities generated after sampling has occurred, such as predicted values for new data, or transformations of the parameter(s) not necessary for the model{}
block. We will here generate the relative odds, as well as a predicted value for a new trial.
generated quantities {
int<lower=0,upper=1> Y_dash; // Prediction for new trial
real<lower=0> relative_odds; // Relative odds (how likely is heads compared to tails?)
relative_odds = prob/(1-prob);
Y_dash = bernoulli_rng(prob);
}
The above blocks have been saved in a text file called coin_flip.stan
. By using the .stan
file type, Rstudio knows it is a Stan model and applies the correct formatting to it. It also provides a syntax-checking function, so it can tell you when you’ve left off a “;
” and the end of lines.
To run the code, we can either feed it straight into the stan()
function:
cf_result.stanfit <- stan(
file = "models/coin_flip.stan",
data = list(
n_trials = length(flip_data),
Y = flip_data
))
or we can compile the model first:
cf.stanmodel <- stan_model(
file = "models/coin_flip.stan"
)
and then use the sampling()
function:
cf_result.stanfit <- sampling(
object = cf.stanmodel,
data = list(
n_trials = length(flip_data),
Y = flip_data
))
If the underlying file models/coin_flip.stan
doesn’t change, then Stan will only compile the model the first time stan()
or stan_model()
are called.
There are two more blocks which may be of use: - functions
: For defining your own functions (see Stan Reference Manual Chapter 9 for details) - transformed data
: For declaring and defining intermediate values which are functions of your data. For example, for positive skewed data, you might model the log of the data instead of the raw data:
transformed data {
real log_Y[n_trials];
log_Y = log(Y)
}
In this example, we use basketball shooting data to illustrate the use of a Bayesian logistic regression model.
Loading the R libraries we need:
library("ggplot2")
library("rstan")
library("bayesplot")
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
source('source-code/0 utils.R') # to plot the basketball court
set.seed("314413") #making the results reproducible
We import the shooting data from the file called basketball.csv
into a data frame called df
and we check the structure of the variables.
df <- read.csv("data/basketball.csv")
#str(df)
df$outcome <- factor(as.character(df$biscorecounter))
We formulate the hypothesis that some of the independent variables or predictors play a crucial role in the shooting effectiveness (success or fail). For instance, we known that the distance from the basket of the player taking the shot (dis
) is relevant to the outcome. We assess the impact of the predictors on the response variable outcome
.
A summary of four covariates and the response variable is shown below:
Variable | type | Description |
---|---|---|
outcome | binary | scoring (1), missing(0) |
dis | numeric | distance from the basket (in foot) |
rad | numeric | angle (in radians). Zero corresponds to the line from basket to basket |
postime | integer | possession time from 0 to 25 seconds |
dis_trav | numeric | distance traveled (in foot) |
We visually inspect the distribution of the four covariates in successful (1) and missed shots (0) using box plots.
par(mfrow = c(2,2), oma = c(2,4,3,1.5) , mar = c(1.5,2,1.5,1.5))
boxplot(dis ~ outcome, data = df, main= "distance", ylab= "distance")
boxplot(rad ~ outcome, data = df, main="angle", ylab="angle")
boxplot(postime ~ outcome,data = df, main="possession time", ylab="possession time")
boxplot(dis_trav ~ outcome, data = df, main="distance travelled", ylab="distance travelled")
P_180 +
geom_jitter(data = df, aes(x = y-10, y = -x+6,
shape = outcome,
col = outcome), alpha= 0.2) +
theme_bw() + xlim(0,47)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
We define the model in Stan language as follows. Since the response is binary, we use the Bernoulli likelihood (bernoulli_logit
function). We use a logistic regression model with an intercept \(\beta_1\) and slopes \(\beta_2\), \(\beta_3\), etc.
model <- '
data {
int<lower=0> N;
vector[N] dis;
vector[N] rad;
vector[N] postime;
vector[N] dis_trav;
int<lower=0,upper=1> outcome[N];
}
parameters {
real beta[5];
}
model {
outcome ~ bernoulli_logit(beta[1] + beta[2] * dis + beta[3] * rad + beta[4] * postime + beta[5] * dis_trav);
//priors
beta ~ normal(0, 100);
}
'
In rjags/BUGS the model would be defined as follows:
model_jags = "
model {
for (i in 1 : N) {
outcome[i] ~ dbern(p[i]) #likelihood
logit(p[i]) <- beta[1] + beta[2] * dis[i] + beta[3] * rad[i] + beta[4] * postime[i] + beta[5] * dis_trav[i]
}
# Priors
for (j in 1 : 5) {
beta[j] ~ dnorm(0, 1 / 100 ^ 2)
}
}"
We then create a list containing the number of observations, the predictors, and the response.
data <- list(N = nrow(df), # total numb of shots
outcome = as.numeric(as.character(df$outcome)),
dis= df$dis,
rad = df$rad,
postime = df$postime,
dis_trav = df$dis_trav
)
To fit the model in stan
, we have to specify the model, the data list created above. We also need the other parameters e.g. the number of chains and iterations, etc.
fit <- stan(model_code = model,
model_name = "model",
pars = 'beta',
data = data,
iter = 5000,
warmup = 2000,
thin = 2,
chains = 3,
verbose = F,
seed = 2019,
refresh = max(5000/10, 1)
)
stats <- summary(fit)
stats <- stats$summary
kable(stats[1:5,])
mean | se_mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | n_eff | Rhat | |
---|---|---|---|---|---|---|---|---|---|---|
beta[1] | 0.6684490 | 0.0034865 | 0.2229762 | 0.2348632 | 0.5144865 | 0.6658918 | 0.8156175 | 1.1086425 | 4090.059 | 0.9999769 |
beta[2] | -0.0513403 | 0.0001541 | 0.0101798 | -0.0713097 | -0.0581530 | -0.0512236 | -0.0443981 | -0.0313309 | 4362.449 | 1.0000693 |
beta[3] | -0.1544514 | 0.0012310 | 0.0788210 | -0.3090133 | -0.2077561 | -0.1535710 | -0.1018914 | -0.0042478 | 4099.631 | 1.0000707 |
beta[4] | -0.0758065 | 0.0004416 | 0.0280942 | -0.1314366 | -0.0948278 | -0.0759443 | -0.0564092 | -0.0224400 | 4046.669 | 0.9998445 |
beta[5] | 0.0038214 | 0.0000501 | 0.0031118 | -0.0023219 | 0.0017390 | 0.0038116 | 0.0058796 | 0.0099789 | 3854.495 | 0.9997607 |
Performing then a diagnostic of the chains by visually inspecting the trace and the posterior density plots.
color_scheme_set("blue")
posterior <- as.array(fit)
mcmc_dens_overlay(posterior, pars = c('beta[1]', 'beta[2]',
'beta[3]', 'beta[4]',
'beta[5]'))
mcmc_trace(posterior, pars = c('beta[1]', 'beta[2]',
'beta[3]', 'beta[4]',
'beta[5]'))
mcmc_acf(posterior, pars = c('beta[1]', 'beta[2]',
'beta[3]', 'beta[4]',
'beta[5]'), lags = 10)
This is a hierarchical Bayesian model, which is used to study an educational experiment data which is described in Gelman et al. (2013). The purpose of the study was to examine the effects of a test preparation program conducted in eight different high schools in New Jersey. A separate randomised experiment was conducted in each school, and the administrators of each school implemented the program in their own way. The model is described as : \[ \begin{align*} y_j \sim Normal(\theta_j, \sigma_j^{2}), \\ \theta_j \sim Normal(\mu, \tau^{2}), \\ \mu \sim Normal(0, 5), \\ \tau \sim HalfCauchy(0,5), \end{align*} \] where \(j = 1,..., 8\), The measurements \(y_j\) and uncertainties \(\sigma_j\) are the estimates and standard errors from separate regressions performed for each school, as shown in in the following table.
School | Estimated effect, \(y_j\) | Standard error of effect, \(\sigma_j\) |
---|---|---|
A | 28 | 15 |
B | 8 | 10 |
C | -3 | 16 |
D | 7 | 11 |
E | -1 | 9 |
F | 1 | 11 |
G | 18 | 10 |
H | 12 | 18 |
This is a well known example for Stan users. We use this example here to illustrate two things. First, to use the Stan program to generate data from the model, which may be used for prior predictive analysis or to validate the probabilistic model. Second, to study the effect of centered and non-centered parametrisation of the model on MCMC diagnostics. We have already saved the data as schools_data.RData
in the data
folder.
Sometimes we may wish to generate fake data from our probabilistic model to validate our model. For this, parameters are drawn from the priors, and then estimated afterwards from the model. The problem with this approach is that usually we have to consider two bits of code; data generation model and the estimation model. If we make a change in one model, we must reflect the change in the other model. This may not be an issue for simple models, but may be an issue for complex models.
We can fix this issue by making use of Stan’s generated quantities
block, and using one piece of code for both purposes. We do this by evaluating the likelihood conditionally. If the likelihood is not evaluated, then the posterior predictive draws are same as the prior predictive draws - that is the fake data.
We use the following Stan program to simulate fake data from the eight school hierarchical model, illustrating the use of one piece of code for simulation and estimation.
//saved as 8school_centered.stan
data {
int<lower=0> J; //number of schools
real y[J]; // estimated treatment effects
real<lower=0> sigma[J]; // sd. error of effect estimates
int<lower =0, upper =1> run_estimation; //a switch to evaluate likelihood (=1) or not (=0)
}
parameters {
real mu;
real<lower=0> tau;
real theta[J];
}
model {
//priors
mu ~ normal(0,5);
tau ~ cauchy(0,5);
theta ~ normal(mu, tau);
//evaluate the likelihood conditionally
if(run_estimation == 1){
y ~ normal(theta, sigma);
}
}
generated quantities{
real y_sim[J]; //array to hold simulated data
for(j in 1:J) {
y_sim[j] = normal_rng(theta[j], sigma[j]);
}
}
The value assigned to the variable run_estimation
determines if the likelihood is evaluated or not. If the value given to this variable is \(0\), then the likelihood is not computed, and the samples returned by y_sim
, defined in the generated block
, are the fake data. If the value of run_estimation
is \(1\), then the likelihood is evaluated and the samples returned by y_sim
are the posterior predictions. Let us generate some data using this Stan program.
# first load rstan
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
# then load the school data
load('data/schools_data.RData')
#prepare data to pass to Stan program
stan_data <- list(J = 8,
y = schools_data$y,
sigma = schools_data$sigma,
run_estimation = 0 #switch off likelihood
)
schools_sim_data <- stan("models/8school_centered.stan", #the Stan program
data = stan_data, #data to be used by Stan program
chains = 1, #number of MCMC chains
iter = 1000, #number if iterations per chain
control = list(adapt_delta = 0.99, max_treedepth = 15)
#warmup = 1000
)
#save(schools_sim_data, file = "results/schools_sim_data.RData", envir = .GlobalEnv)
# extract simulated data
fake_data <- as.data.frame(extract(schools_sim_data, pars = 'y_sim'))
Use the fake data corresponding to one of the iteration (eg. 1st iteration) to validate the model. For this, you will need to assign the value \(1\) to run_estimation
variable.
Note that the above Stan program provides a centered parametrisation of the hierarchical model. Let us fit actual data using the above Stan program.
# prepare data to pass to Stan program
stan_data <- list(J = 8,
y = schools_data$y,
sigma = schools_data$sigma,
run_estimation = 1 #switch off likelihood
)
schools_fit <- stan("models/8school_centered.stan", #the Stan program
data = stan_data, #data to be used by Stan program
chains = 4, #number of MCMC chains
iter = 2500, #number if iterations per chain
#control = list(adapt_delta = 0.99, max_treedepth = 15)
warmup = 1000
)
## Warning: There were 74 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
There is a warning message regarding some divergent transitions. Divergences in HMC arise when the Hamiltonian transition encounters regions of extremely large curvature. Stan uses a heuristic to quickly identify these misbehaving trajectories, and hence label divergences, without having to wait for them to run all the way to infinity. Sometimes this issue can be resolved by adjust the step size, of the Hamiltonian transition. The smaller the step size the more accurate the trajectory and the less likely it will be mislabeled as a divergence. In order to do this, we can increase the adapt_delta
parameter from its default value of \(0.8\) closer to its maximum value of \(1\).
In some cases, divergences can be fixed by increasing the adapt_delta
parameter, but in other cases the model needs to be reparametrised to overcome divergences. If adapt_delta
is increased the step size is decreased. To visualise where divergences occur we can use pairwise plots.
pairs(schools_fit, pars = c("theta[1]","tau"), las = 1)
We can see that most divergences (red spots) occurs when \(\tau\) is near \(0\). Let us now use a non-centered parametrisation of the model to see if divergences are reduced. In non-centered parametrisation, we do not try to fit the hierarchical parameter (\(\theta\)) directly, but we fit an extra parameter \(\hat{\theta}_{j} \sim Normal(0,1)\), from which we can recover the hierarchical parameter using \(\theta_{j} = \mu + \tau * \hat{\theta}_{j}\). We recover the parameter in the transformed parameters
block.
//saved as 8school_non_centered.stan
data {
int<lower=0> J; //number of schools
real y[J]; // estimated treatment effects
real<lower=0> sigma[J]; // sd. error of effect estimates
int<lower =0, upper =1> run_estimation; //a switch to evaluate likelihood (=1) or not (=0)
}
parameters {
real mu;
real<lower=0> tau;
real theta_hat[J];
}
transformed parameters {
real theta[J];
for (j in 1:J)
theta[j] = mu + tau * theta_hat[j];
}
model {
//priors
mu ~ normal(0,5);
tau ~ cauchy(0,5);
theta_hat ~ normal(0,1);
//evaluate the likelihood conditionally
if(run_estimation == 1){
y ~ normal(theta, sigma);
}
}
generated quantities{
real y_sim[J]; //array to hold simulated data
for(j in 1:J) {
y_sim[j] = normal_rng(theta[j], sigma[j]);
}
}
Let us run the new model.
schools_fit_ncp <- stan("models/8school_non_centered.stan", #the Stan program
data = stan_data, #data to be used by Stan program
chains = 4, #number of MCMC chains
iter = 2500, #number if iterations per chain
#control = list(adapt_delta = 0.99, max_treedepth = 15)
warmup = 1000
)
#save(schools_fit_ncp, file = "results/schools_fit_ncp.RData", envir = .GlobalEnv)
No divergences are produced.
In this example, we fit a simple Susceptible-Infective-Recovered (SIR) epidemic model to simulated data. The SIR epidemic model is used to describe the spread of self limiting diseases, such as measles, within a human population. The model comprises of three types of individuals; susceptible, infected and recovered. Susceptible individuals become infected when in contact with infected individuals. Infected individuals remain infectious for a period of time and then recover with life long immunity. For the example considered here, let \(S(t)\), \(I(t)\) and \(R(t)\) be the proportion of susceptible, infected and recovered individuals, respectively, at time \(t\). We assume that the population is closed, so that \(S(t) +I(t)+R(t) =1\). The dynamics of this SIR epidemic is described by the following system of ordinary differential equations (ODEs). \[
\begin{align}
\frac{dS}{dt} = & \, - \beta S I \nonumber\\
\frac{dI}{dt} = & \, \beta S I - \gamma I\nonumber\\
\frac{dR}{dt} = & \, \gamma I \nonumber
\end{align}
\] Parameters \(\beta\) and \(\gamma\) correspond to the infection and recovery rates, respectively. Let us first visualise the dynamic of these equations, using given parameter values and initial conditions. In practice, these are not know, but can be estimated from data. We use the deSolve package developed for R
interface to solve the system.
library(deSolve)
library(dplyr)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores()) #executes multiple Markov chains in parallel
#-----------------------------------
#Follow the following two steps to repare data to feed into R's 'ode()' function (see the 'ode' function help page for more details)
#1. define the SIR ODE system as a function
SIR_model <- function(times, y, pars) {
with(as.list(c(pars, y)), {
dS <- - beta * y[1] * y[2]
dI <- beta * y[1] * y[2] - gamma * y[2]
dR <- gamma * y[2]
res <- c(dS,dI,dR)
list(res)
})
}
#2. define initial condition of ODEs and parameters
I0 <- 0.01 # initial fraction infected
S0 <- 1 - I0 # initial fraction susceptible
R0 <- 0
pars <- c(beta = 0.5, gamma = 0.1) #parameters
inits <- c(S0, I0, R0) #initial condition of ODE
times <- 0:60 #time span of the epidemic (days)
# Solve the ODEs using 'ode' function
SIR_dynamics <- ode(y = inits, times=times, func = SIR_model, parms=pars, method="ode45")
#store the output in a data frame for later use
SIR_dynamics <- data.frame(SIR_dynamics)
colnames(SIR_dynamics) <- c("time", "S", "I", "R")
Let us now visualise the dynamic of these equations.
plot(NA,NA, xlim = c(0, 60), ylim=c(0, 1), xlab = "Time (days)", ylab="Proportion of population")
lines(SIR_dynamics$S ~ SIR_dynamics$time, col="black")
lines(SIR_dynamics$I ~ SIR_dynamics$time, col="blue")
lines(SIR_dynamics$R ~ SIR_dynamics$time, col="red")
legend(x = 40, y = 0.9, legend = c("Susc.", "Inf.", "Recov."),
col = c("black", "blue", "red"), lty = c(1, 1,1), bty="n")
This gives us the true epidemic dynamics of the population. In practice, it may not be possible to determine the proportion infected at each day, nor to determine disease status of all individuals at a given time. Rather, we can take a sample of individuals from the population at various time points and find out how many are infected. We expect binomially distributed error in these estimates, with probability of success given by the infected dynamic of the ODEs. The sampling can be done as follows.
n_obs <- 30 # number of days sampled
n_sample <- 50 # number of individuals sampled per day
# Choose which days the samples are taken.
ts <- sort(sample(1:60, n_obs, replace=F))
# Extract the "true" fraction of the population infected at each of the sampled days:
prop_inf <- SIR_dynamics[SIR_dynamics$time %in% ts, 3]
# Generate binomially distributed data.
sample_y <- rbinom(n_obs, n_sample, prop_inf)
Now that we have some simulated data, let us estimate parameters \(\beta\), \(\gamma\) and initial conditions of the ODE using the following Stan program, which is a slightly different version available here. The Stan model is already saved in a text file called SIR_fit.stan
, in the models
folder.
functions {
real[] SIR_ode(real t, //time
real[] y, //state
real[] params, //parameters
real[] x_r, //real data
int[] x_i) { //integer data
real dydt[3];
real beta = params[1];
real gamma = params[2];
dydt[1] = - beta * y[1] * y[2];
dydt[2] = beta * y[1] * y[2] - gamma * y[2];
dydt[3] = gamma * y[2];
return dydt;
}
}
data {
int<lower = 1> n_obs; // number of days sampled
int<lower = 1> n_sample; // number of individuals sampled at each time point.
int y[n_obs]; // observed data
real t0; // initial time point
real ts[n_obs]; // time points that were sampled
int<lower = 1> n_pred; // number of predicted data points
real t_pred[n_pred]; // time points at which predictions are made
}
transformed data {
real x_r[0];
int x_i[0];
}
parameters {
real<lower = 0> params[2]; // the two model parameters, constrained to be positive
real<lower = 0, upper = 1> I0; //initial fraction of infectives, constrained to be between 0 and 1
}
transformed parameters{
real y_hat[n_obs, 3]; // output from the ODE solver
real y0[3]; // initial conditions of ODE
y0[1] = 1-I0;
y0[2] = I0;
y0[3] = 0;
// solve ODE for observed time points
y_hat = integrate_ode_rk45(SIR_ode, y0, t0, ts, params, x_r, x_i);
}
model {
//priors
params ~ normal(0,2);
I0 ~ normal(0.5, 0.5);
//likelihood
y ~ binomial(n_sample, y_hat[, 2]);
}
generated quantities {
real y_pred[n_pred, 3]; //prediction from ode
// Generate predicts over the entire time span:
y_pred = integrate_ode_rk45(SIR_ode, y0, t0, t_pred, params, x_r, x_i);
}
Note that the functions
block defines the system of ODEs as a function (SIR_ode
) with specific signature. The function takes in a time t
(a real value, for time dependent parameters), a system state y
(real array), system parameters params
(real array), real data (x_r
) and integer data (x_i
). The SIR_ode
function returns the array of derivatives of the system with respect to time, evaluated at time t
and state y
. The SIR
equations do not have time dependent parameters nor use real or integer data, so these are not used. Nevertheless, these arguments must be included in the system function.
The data
block defines a real value t0
, which is passed to Stan’s ODE solver integrate_ode_rk45
. We set this value as \(0\) for this example. It also defines number of predicted data points, n_pred
, and the predicted time points, t_pred
, both are used in the generated quantities block
.
The transformed data
block consists of two empty data arrays. These empty arrays can also be defined in the ODE solver as:
y_hat = integrate_ode_rk45(SIR_ode, y0, t0, ts, params, x_r, x_i,
rep_array(0.0, 0), rep_array(0, 0));
We include these in the transformed block
for illustration.
The solutions to the SIR
equations are defined as transformed parameters
. This allows them to be used in the model
block and inspected in the output. We use the Runge-Kutta solver, integrate_ode_rk45()
, to solve the system of ODEs. The Runge-Kutta solver is used to solve non-stiff systems. To compute solutions, this function requires the ODE function (defined as SIR_ode
), initial state (y0
), initial time (t0
), requested solution times (ts
), parameters (params
) and real and integer data (x_r
, x_i
, respectively). The value assigned to t0
must be less than the values assigned for ts
. For stiff ODEs
, the solver used is integrate_ode_bdf()
function, with same arguments as used for the integrate_ode_rk45()
. Both integration functions are limited as to the origin of variables used in their arguments. The arguments ts
, x_r
and x_i
must be expressions that only involve data or transformed data variables. The initial state y0
and parameters params
are the only arguments which may involve parameters.
The model
block defines priors and the likelihood. Finally, we compute predictions, using the fitted values, in the generated quantities
block. Rather than computing predictions at the sampled time points, we make predictions for the entire epidemic time span.
To fit the model, we need to assign values for the variables defined in the data
block.
data <- list(n_obs = n_obs, #number of days sampled
n_sample = n_sample, #number of individuals sampled
y = sample_y, #observed data from binomial distribution
t0 = 0, #initial time for ODE
ts = ts, #time points at which observations are made
n_pred = length(1:60), #number of predictions
t_pred = c(1:60) #times at which predictions are made
)
#save(data, file = "data/SIR_data.RData", envir = .GlobalEnv)
We have saved the data as SIR_data.RData
in the data
folder. You may wish to use the saved data for model fitting. Let us first test the Stan program using a shorter chain.
test <- stan("models/SIR_fit.stan", #the path and name of the Stan program
data = data, #list containing data
chains = 1, #number of MCMC chains
iter = 10, #number if iterations per chain
control = list(adapt_delta = 0.8, max_treedepth = 10)
)
No errors are produced for this test run. We can now fit the model using multiple chains with more iterations.
SIR_fit <- stan(fit = test,
data = data,
chains = 4,
warmup = 1000,
iter = 2500
)
#save(SIR_fit, file = "results/SIR_fit.RData", envir = .GlobalEnv)
We have saved result of this fit as SIR_fit.RData
, in the results
folder. We must now check some MCMC convergences and if parameters used to simulate data can be recovered. The traceplot()
and print()
functions can be used for these checks.
traceplot(SIR_fit, pars = c("params", "I0", 'lp__'), inc_warmup =TRUE)
print(SIR_fit, pars = c("params", "y0"), prob = c(0.025, 0.5, 0.975),digits=3)
## Inference for Stan model: SIR_fit.
## 4 chains, each with iter=2500; warmup=1000; thin=1;
## post-warmup draws per chain=1500, total post-warmup draws=6000.
##
## mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
## params[1] 0.444 0.001 0.036 0.378 0.442 0.518 2476 1.001
## params[2] 0.103 0.000 0.006 0.093 0.103 0.115 3567 1.000
## y0[1] 0.981 0.000 0.007 0.965 0.982 0.992 2186 1.001
## y0[2] 0.019 0.000 0.007 0.008 0.018 0.035 2186 1.001
## y0[3] 0.000 NaN 0.000 0.000 0.000 0.000 NaN NaN
##
## Samples were drawn using NUTS(diag_e) at Tue Sep 24 10:34:44 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The trace plots show that the chains mixed well and all chains converging to same posterior. Output from the print
function shows posterior summary along with Rhat
value for each parameter. The median estimate are \(\beta = 0.442\) and \(\gamma = 0.103\), \(S(0) = 0.982\), \(I(0) = 0.018\) and \(R(0) = 0\), which are close to the values used to generate data. The Rhat
statistic is the ratio of between-chain variance to within-chain variance and should be approximately \(1\pm 0.1\) if the chains have converged, which is true for this fit.
Let us now plot posterior predictions to see how well the model fits observed data.
#extract the predicted posterior samples
post <- extract(SIR_fit, pars= ('y_pred'))
#compute the 95% posterior credible interval
cred_median <- apply(post$y_pred[,,2], 2, median)
cred_low <- apply(post$y_pred[,,2], 2, quantile, probs=c(0.025))
cred_high <- apply(post$y_pred[,,2], 2, quantile, probs=c(0.975))
#create two data frames; one for observed data and one for predicted data
df_obs <- data.frame(y = sample_y/n_sample, time = data$ts)
df_pred <- data.frame(cred_median, cred_low, cred_high, time = data$t_pred)
#plot the credible interval with observed data
ggplot(df_obs, aes(x=time, y=y)) +
geom_point(col="black", shape = 19, size = 1.5) +
geom_line(data = df_pred, aes(x=time, y=cred_median), color = "blue") +
geom_line(data = df_pred, aes(x=time, y=cred_high), color = "blue", linetype=3) +
geom_line(data = df_pred, aes(x= time, y=cred_low), color = "blue", linetype=3) +
labs(x = "Time (days)", y = "Proportion infected")
The Stan model can recreate the observed dynamics reasonably well. In this example, we described the SIR
system of ODEs using fractions of the population, instead of numbers. This allowed us to estimate initial fractions of the ODEs, as well as ODEs parameters (\(\beta\) and \(\gamma\)), as Stan can sample only continuous parameters. Currently, Stan cannot handle integer parameters.
## [1] FALSE
data("nhtemp")
plot(nhtemp,ylab = expression(Temperature ~(degree~"F")),xlab = "Year")
The nhtemp
data-set in R is a time series of temperatures (measured in \(^\circ\)F) collected in New Haven, Connecticut over 60 years from 1912-1971. The proposed model for the process is that the temperature depends on the previous year’s temperature as well global temperatures, which are assumed to rise linearly each year: \[
y_i = \alpha + \beta y_{i-1} +\gamma i + \epsilon_i,
\] where \[
\epsilon_i \sim \text{Normal}{\left(0,\sigma^2\right)}.
\]
Explanation for each parameter is:
Create a Stan model and use MCMC sampling in R to:
Use this code as a template for your Stan model:
data {
int<lower=0> N;
int<lower=2> N_new;
vector[N] y;
}
parameters {
FILL
}
model {
for (n in 2:N){
y[n] ~ normal(FILL, FILL);
}
}
generated quantities{
real y_new[N_new];
y_new[1] = normal_rng(FILL, FILL);
for (n in 2:N_new){
y_new[n] = normal_rng(FILL, FILL);
}
}
To learn more about fitting a time series in Stan, check out the Time Series chapter in the Stan User’s Guide.
Stan uses Hamiltonian Monte Carlo (HMC) in order to generate samples from the posterior distribution of our model parameters. HMC tends to be much more efficient than other types of Markov Chain Monte Carlo (MCMC) sampling, such as Random Walk Metropolis-Hastings (RW-MH) and Gibbs sampling, especially when the “shape” of the posterior isn’t a nice Gaussian peak. To understand why HMC outperforms these other methods, we have to understand why the other methods fail, so let’s have a refresher as to how they work.
The RW-MH method is a well-used method for generating samples to approximate the posterior distribution, and it is a very simple recipe to follow. From the current point:
The distribution for sampling the distance and direction is called the kernel. If the kernel is symmetric, meaning that the probability of moving forward a given distance is equal to the probability of moving backward, then the acceptance probability is just equal to the ratio of posteriors. If the kernel is not symmetric, then the acceptance probability needs to incorporate this in order to satisfy a necessary property called detailed balance.
An interactive animation of RW-MH can be found here. Select “RandomWalkMH” for the algorithm and “banana” for the target distribution. The goal of the MCMC is have a satisfactory convergence of the sample distribution, given by the histograms on the bottom and left margins, to the target distribution, given by the corresponding densities. Run the chain for a while, and try to diagnose why it might take RW-MH a long time to converge. Consider that the perfect sampler would just be drawing from the target distribution itself, so if the current point is in the left tail, it is just would be just as likely to appear in the right tail for the next sample as the left. You can also change the kernel to sample smaller or larger distances using the “Proposal \(\sigma\)” algorithm option.
Gibbs Sampling is a special case of the general class of Metropolis-Hastings algorithms, with the very nice property that samples are always accepted. It can be used when we know the conditional posterior distribution of each parameter, conditioning on the values of the other parameters. The recipe for getting a new sample from the current sample is as follows:
To understand this graphically, consider the banana distribution from the previous example, where we have two parameters. To sample a new value of the horizontal parameter, holding the vertical parameter constant, we are essentially drawing a horizontal line at the height of the vertical parameter’s current value, and sampling the horizontal parameter from that line. To sample a new value of the vertical parameter, then we are drawing a vertical line at the horizontal parameter’s current value, and sample the vertical parameter from the line. Try and visualise this for a few iterations of sampling. Can you see why it might be hard for a Gibbs sampler to efficiently explore the target distribution?
Hopefully you’ve now had a think, and tried to visualise why these two algorithms can become inefficient, taking a long time to converge to the target distribution. The big reason is that the shape of the kernel distributions of the two algorithms doesn’t fit well with the shape of the posterior. The RW-MH selects the any direction with the same probability, but we can see that some directions point to high density areas while others point to very low density areas, and the result is a lot of rejections. We can sample smaller distances and improve the acceptance rate, but this can also make the algorithm less efficient, because it takes a long time to explore the target density. The Gibbs sampler always accepts new samples, but because it can only travel horizontally and vertically, it also takes a long time to get from the bend of the banana to the tails.
In the end neither of these algorithms are smart enough to sample the banana-shaped target density. RW-MH doesn’t consider the target density’s shape, and Gibbs samplers can only move vertically and horizontally. The genius of HMC is that it uses the curvature of the target density itself to inform the kernel distribution, allowing for both a high acceptance rate and good exploration of the target density. The recipe for HMC is as follows:
To visualise HMC, consider the target density as being like a 2 dimensional surface in 3D space, and invert it so the highest probability points are depressions. Consider that a ball on the surface would roll down into the depression, but if it was first “flicked” in some random direction, then it would roll around the surface, slowing down as it moved upwards to low probability regions, and speeding up as it rolls down to the high probability regions. In our idealised world there is no friction between the ball and the surface, so the ball never settles in the high probability regions but rolls around forever. This is essentially how HMC picks its new point, with a few caveats:
Don’t get it? surfaces, balls, momentums… I thought we were fitting statistical models? A visualisation might help here! The same interactive animation as before (link) has a Hamiltonian Monte Carlo setting. Change the “Algorithm” option to “HamiltonianMC”, tick “Animate Proposal” in the Visuallisation Options, and watch the HMC select new points by moving along a trajectory. Notice how it can get from one side of the banana to the other in relatively few steps compared to the RW-MH, and it’s because the kernel is informed by the target density itself.
While Hamiltonian Monte Carlo represents a quantum leap in efficiency of MCMC compared to RW-MH and Gibbs sampling, it does have some issues that you should be aware of.
Stan actually uses a more sophisticated algorithm than HMC, but HMC is still at its core. NUTS improves on the efficiency of HMC by stopping trajectories when they start to turn back on themselves. The details of the sampler are beyond the scope of this tutorial, however it does alleviate the tuning issues mentioned previously in addition to improved efficiency. The only major related consideration for simple uses of Stan is the max_treedepth
tuning parameter, which can improve acceptance probability and efficiency per iteration if increased, however it results in a substantial increase in the computation time.
Stan is a powerful and flexible sampling package, however it is suited more to some situations than others. Stan will be useful to you if:
Some situations where Stan may NOT be helpful:
You can find a more complete overview of Hamiltonian Monte Carlo in this ArXiv paper by Radford Neal, or for a an indepth overview of the geometrical concepts and theory try Michael Bentancourt’s ArXiv paper.
For linear and generalised linear models, you can use the computational power of Stan without ever leaving the comfort of traditional R formula using the brms
R package.
The Stan website contains a huge amount of information, including examples, installation notes for the R implementation of Stan as well as others, documentation including the reference manual, and a forum for questions and debugging.
In our experience the developers at Stan have been very friendly, so if you find yourself stuck and can’t resolve the issue via the forum or reference manual, you can contact the developers to help you out.
If you want to learn more about Bayesian statistical inference, a good book to try is Bayesian Data Analysis by Andrew Gelman. If you are QUT staff or HDR, then you can find the most recent edition online via the library.
The basketball dataset was provided by Wade Hobbs from Australian Institute of Sport (AIS).
Aminath Shausan, Edgar Santos Fernandez, Daniel Kennedy (2019), Bayesian Research and Applications Group, QUT. https://danwkenn.github.io/brag-stan-guide/
The following Stan model should allow us to estimate the model parameters and forecast temperature for proceeding years:
data {
int<lower=0> N;
int<lower=2> N_new;
vector[N] y;
}
parameters {
real gamma;
real alpha;
real beta;
real<lower=0> sigma;
}
model {
for (n in 2:N)
y[n] ~ normal(alpha + beta * y[n-1] + gamma * n, sigma);
}
generated quantities{
real y_new[N_new];
y_new[1] = normal_rng(alpha + beta * y[N] + gamma * (N + 1), sigma);
for (n in 2:N_new){
y_new[n] = normal_rng(alpha + beta * y_new[n-1] + gamma * (n + N), sigma);
}
}
We now compile the model and sample for 5000 iterations on 4 parallel chains:
ts_ar1.stanfit <- readRDS(file = "data/pre-sampled-ts_ar1.RDS")
We can now use the fabulous ggmcmc
package to plot our MCMC chains and make sure that they have converged:
library(ggmcmc)
ggs_traceplot(ggs(ts_ar1.stanfit),family = c("gamma|alpha|beta|sigma"))
ggs_autocorrelation(ggs(ts_ar1.stanfit),family = c("gamma|alpha|beta|sigma"))
ggs_density(ggs(ts_ar1.stanfit),family = c("gamma|alpha|beta|sigma"))
We can get numerical summaries for your parameters as well:
print(ts_ar1.stanfit,pars = c("gamma","alpha","beta","sigma"))
## Inference for Stan model: ts_ar1.
## 4 chains, each with iter=5000; warmup=1000; thin=1;
## post-warmup draws per chain=4000, total post-warmup draws=16000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## gamma 0.03 0.00 0.01 0.01 0.03 0.03 0.04 0.05 7175 1
## alpha 44.64 0.09 6.83 31.13 40.07 44.56 49.30 57.86 5353 1
## beta 0.11 0.00 0.14 -0.16 0.02 0.11 0.20 0.38 5311 1
## sigma 1.14 0.00 0.11 0.95 1.06 1.13 1.21 1.38 6644 1
##
## Samples were drawn using NUTS(diag_e) at Tue Sep 24 15:41:42 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
We therefore find there is compelling evidence for a linear trend, where temperatures rise by 0.03(95% CI: 0.01,0.05)\(\circ\)F per year. There is also very little evidence of auto-correlation, once the linear trend is taken into account, as the credible interval for \(\beta\) covers 0 (-0.17,0.37).
We can extract and plot our forecasted temperatures using the As.mcmc.list()
function. There may be a simpler way of converting the MCMC samples to a data-frame, however using the do.call()
and lapply()
functions are used here:
predictions <- As.mcmc.list(ts_ar1.stanfit,pars = "y_new")
predictions <-
do.call(
rbind.data.frame,
lapply(1:4, function(x){return(cbind(predictions[[x]],chain = x,iter = 1:nrow(predictions[[x]])))}))
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
predictions <- predictions %>% melt(id = c("chain","iter"))
# Convert the year to a numeric variable:
levels(predictions$variable) <- 1971 + 1:length(levels(predictions$variable))
predictions$variable <- as.numeric(as.character(predictions$variable))
# Calculate the mean, upper and lower quartiles, and bounds for a 90% Prediction Interval.
prediction_summary <-
predictions %>% group_by(variable) %>%
summarise(mean = mean(value),
upper_q = quantile(value,0.75),
lower_q = quantile(value,0.25),
upper_ci = quantile(value,0.95),
lower_ci = quantile(value,0.05))
# Plot the real data:
plot <- ggplot() + geom_line(data = data.frame(y = as.vector(nhtemp), x = 1912:1971),aes(x=x,y=y))
# Plot the mean prediction
plot <- plot + geom_line(data = prediction_summary, mapping = aes(x = variable, y = mean))
# Plot the upper and lower quartiles as a ribbon:
plot <- plot + geom_ribbon(
data = prediction_summary,
mapping = aes(x = variable, ymin = lower_q, ymax = upper_q),alpha = 0.3, fill = "red")
# Plot the 90% Prediction interval as a lighter ribbon.
plot <- plot + geom_ribbon(
data = prediction_summary,
mapping = aes(x = variable, ymin = lower_ci, ymax = upper_ci),alpha = 0.2, fill = "red")
plot <- plot + labs(y = expression(Temperature ~(degree~"F")),x = "Year")
plot