FW 891
6 September 2023
see also CmdStan user’s guide
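The run output and summary below show what a CmdStanR call produces. The slide does not include the code that generated it; a plausible minimal sketch (an assumption, not the slide's actual code) is the Bernoulli example that ships with CmdStan, whose theta summary matches the output below:

library(cmdstanr)
# compile the example model bundled with CmdStan
file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(file)
# example data: 10 Bernoulli trials with 2 successes
stan_data <- list(N = 10, y = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1))
fit <- mod$sample(data = stan_data, chains = 4, parallel_chains = 4)
fit$summary() # posterior summary table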
Running MCMC with 4 parallel chains...
Chain 1 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 2 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 2 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 3 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 3 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 3 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 4 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 4 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 4 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.5 seconds.
# A tibble: 2 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1 lp__ -7.26 -6.99 0.719 0.329 -8.73 -6.75 1.00 1658. 1861.
2 theta 0.246 0.231 0.118 0.118 0.0811 0.463 1.00 1378. 1236.
Stan is a probabilistic modeling language: https://mc-stan.org/
Freely available
Implements Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS)
The Stan documentation and community are legendary in my opinion, albeit dense at times
.stan files
Coding in Stan is something of a cross between R, WinBUGS/JAGS, and C++
It is a Turing-complete programming language
Stan requires you to be explicit (e.g., you must declare the type and dimensions of every variable)
A .stan file relies on program blocks to read in your data and construct your model
There are many built-in functions you can use
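For instance, a short illustrative fragment (not from the course repo) showing the explicit declarations Stan requires and one of its built-in functions:

data {
  int<lower=0> n; // sizes and types must be declared explicitly
  vector[n] x;
}
transformed data {
  real x_bar = mean(x); // mean() is one of many built-in functions
}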
Why must we confront the misery of a new language?
Let’s build a linear regression model, which can be written a few ways:
\[ y_{i}=\beta_{0}+\beta X_{i}+\epsilon_{i} \quad \text{where} \quad \epsilon_{i} \sim \operatorname{normal}(0, \sigma). \]
which is the same as
\[ y_{i}-\left(\beta_{0}+\beta X_{i}\right) \sim \operatorname{normal}(0, \sigma) \]
and reducing further:
\[ y_{i} \sim \operatorname{normal}\left(\beta_{0}+\beta X_{i}, \sigma\right) . \]
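The lm() output below requires a dataset; the slide's simulation code isn't shown, so here is a minimal R sketch that generates data consistent with this model (n = 84 matches the replicate dimensions used later, while the parameter values and covariate range are assumptions):

set.seed(1)
n <- 84 # the slide data appear to have 84 observations
x <- runif(n, 0, 10) # assumed covariate range
y <- rnorm(n, 2.25 + 0.7 * x, 1) # assumed b0, b1, and sd
data <- data.frame(x = x, y = y)
lm(data$y ~ data$x)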
Call:
lm(formula = data$y ~ data$x)
Coefficients:
(Intercept) data$x
2.2559 0.6979
\[ \color{darkorange}{y_{i}} \sim \operatorname{normal}\left(\color{#8D44AD}{\beta_{0}+\beta X_{i}}, \color{#3697DC}{\sigma}\right). \]
response = deterministic component + random component
\(\text { If } \mu_{i} \in \mathbb{R} \text { and } \sigma \in \mathbb{R}^{+} \text {, then for } y_{i} \in \mathbb{R} \text {, }\)
\[ \operatorname{Normal}(y_{i} \mid \color{#8D44AD}{\mu_{i}}, \sigma)=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{1}{2}\left(\frac{y_{i}-\color{#8D44AD}{\mu_{i}}}{\sigma}\right)^{2}\right) \]
\(\text{where } \color{#8D44AD}{\mu_{i}} = \color{#8D44AD}{\beta_{0}+\beta X_{i}}\)
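As a quick sanity check, this density formula matches R's dnorm(); a minimal sketch with made-up values for \(\mu\), \(\sigma\), and \(y_i\):

# evaluate the normal density by hand and compare with dnorm()
mu <- 2.5; sigma <- 1.3; y <- 3.1
by_hand <- 1 / (sqrt(2 * pi) * sigma) * exp(-0.5 * ((y - mu) / sigma)^2)
all.equal(by_hand, dnorm(y, mean = mu, sd = sigma)) # TRUE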
\[ y_{i} \sim \operatorname{normal}\left(\beta_{0}+\beta X_{i}, \sigma\right) . \]
.stan model
Code to do what we are going through is in the week2/ directory of the GitHub repo: linreg.R and linreg.stan
.stan file
// this is a comment
// program block demonstration
data{
// read in data here -- this section is executed one time per Stan run
}
transformed data {
// transform the data here -- this section is also executed one time per Stan run
}
parameters {
// declare the **estimated** parameters here
}
transformed parameters{
// this section takes parameter estimates and data (or transformed data)
// and transforms them for use later on in model section
}
model{
// this section specifies the prior(s) and likelihood terms,
// and defines a log probability function (i.e., log posterior) of the model
}
generated quantities{
// this section creates derived quantities based on parameters,
// models, data, and (optionally) pseudo-random numbers.
}
As per the comments in the code, each program block has a specific job:
data{ } reads data into the .stan program
transformed data{ } runs calculations on those data (once)
parameters{ } declares the estimated parameters in a Stan program
transformed parameters{ } takes the parameters, data, and transformed data, and calculates quantities you need for your model
model{ } constructs a log probability function
generated quantities{ } is only executed after you have your sampled posterior
linreg.stan file
data {
int<lower=0> n; // number of observations
vector[n] y; // vector of responses
vector[n] x; // covariate x
}
parameters {
real b0;
real b1;
real<lower = 0> sd;
}
model {
// priors
b0 ~ normal(0, 10);
b1 ~ normal(0, 10);
sd ~ normal(0, 10);
// likelihood - one way:
y ~ normal(b0 + b1*x, sd); // (vectorized; drops constant additive terms)
}
linreg.stan file
data {
int<lower=0> n; // number of observations
vector[n] y; // vector of responses
vector[n] x; // covariate x
}
parameters {
real b0;
real b1;
real<lower = 0> sd;
}
model {
// priors
b0 ~ normal(0, 10);
b1 ~ normal(0, 10);
sd ~ normal(0, 10);
// likelihood - loopy way:
for(i in 1:n){
y[i] ~ normal(b0 + b1*x[i], sd);
}
}
linreg.stan file
data {
int<lower=0> n; // number of observations
vector[n] y; // vector of responses
vector[n] x; // covariate x
}
parameters {
real b0;
real b1;
real<lower = 0> sd;
}
model {
// priors
b0 ~ normal(0, 10);
b1 ~ normal(0, 10);
sd ~ normal(0, 10);
// likelihood - yet another way:
target += normal_lpdf(y | b0 + b1*x, sd); // log(normal dens) (constants included)
}
All three versions define the same posterior: the ~ syntax drops normalizing constants from the log density that Stan accumulates (lp__), while target += normal_lpdf() keeps them
see arguments in Kery and Schaub 2012; Gelman et al. 2017; McElreath 2023
You can debug your .stan code using print() statements in Stan
linreg.R
# compile the .stan model
mod <- cmdstan_model("src/linreg.stan")
# create a tagged data list
# names must correspond to the data{ } block in .stan
stan_data <- list(n = nrow(data), y = data$y, x = data$x)
# write a function to set starting values
inits <- function() {
list(
b0 = jitter(0, amount = 0.05),
b1 = jitter(0, amount = 1),
sd = jitter(1, amount = 0.5)
)
}
linreg.R
fit <- mod$sample(
data = stan_data, # tagged stan_data list
init = inits, # `inits()` is the function here
seed = 13, # ensure simulations are reproducible
chains = 4, # multiple chains
iter_warmup = 1000, # how long to warm up the chains
iter_sampling = 1000, # how many samples after warmup
parallel_chains = 4, # run them in parallel?
refresh = 500 # print update every 500 iters
)
See ?sampling for many other options
Running MCMC with 4 parallel chains...
Chain 1 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 2 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 2 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 3 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 3 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 3 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 4 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 4 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 4 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1 finished in 0.1 seconds.
Chain 2 finished in 0.1 seconds.
Chain 3 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 3 finished in 0.1 seconds.
Chain 4 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 4 finished in 0.1 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.1 seconds.
Total execution time: 0.5 seconds.
Diagnostics for Bayesian models can be lumped into two categories: checks that the MCMC algorithm worked (i.e., convergence diagnostics) and checks that the model adequately describes the data (e.g., posterior predictive checks)
Both types of checks are required to ensure the reliability of your inferences in a Bayesian setting
Gelman et al. 2021; McElreath 2023
fit$diagnostic_summary()
$num_divergent
[1] 0 0 0 0
$num_max_treedepth
[1] 0 0 0 0
$ebfmi
[1] 0.9477332 1.0931448 1.0900239 1.0794640
see the Stan runtime warnings guide (https://mc-stan.org/misc/warnings.html) for a description of runtime warnings and issues related to convergence problems
The Hamiltonian-based estimated Bayesian fraction of missing information (E-BFMI) quantifies how hard it is for the sampler to explore the energy level sets at each iteration
Gelman et al. 2021; McElreath 2023; Vehtari et al. 2019
CmdStanFit object
# extract the posterior draws
posterior <- fit$draws(format = "df") # extract draws x variables df
head(posterior)
# A draws_df: 6 iterations, 1 chains, and 4 variables
lp__ b0 b1 sd
1 -40 2.3 0.68 1.16
2 -40 2.1 0.73 0.81
3 -39 1.9 0.74 1.02
4 -37 2.3 0.70 0.89
5 -37 2.2 0.70 0.96
6 -38 2.4 0.68 1.05
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
dim(posterior)
[1] 4000 7
# visualizing divergent transitions
# none here, but this plot shows each iteration as a line connecting
# parameter values, and divergent iterations will show up as red lines
# sometimes this helps you find combinations of parameters that are
# leading to divergent transitions
np <- bayesplot::nuts_params(fit) # extract NUTS sampler diagnostics (divergences)
mcmc_parcoord(posterior,
pars = c("sd", "b0", "b1"),
transform = function(x) {
(x - mean(x)) / sd(x) # mean standardize (easier to compare)
},
np = np
)
Posterior predictive checks (PPCs)
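The idea: for each posterior draw, simulate a replicate dataset from the fitted model and compare it with the observed data. A minimal sketch for a single draw (using the posterior object extracted above; the slide's one-draw demo did not survive extraction):

# simulate one replicate dataset from the first posterior draw
y_rep1 <- rnorm(
  length(data$x),
  posterior$b0[1] + posterior$b1[1] * data$x,
  posterior$sd[1]
)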
# now do it for the whole posterior
# loop through and create a replicate dataset from
# each draw of the posterior
set.seed(1)
b0 <- posterior$b0; b1 <- posterior$b1; sd <- posterior$sd # pull out the draws
y_rep <- matrix(NA, nrow = nrow(posterior), ncol = length(data$y))
for (i in 1:nrow(posterior)) {
  y_rep[i, ] <- rnorm(length(data$x), b0[i] + b1[i] * data$x, sd[i])
}
dim(y_rep)
[1] 4000 84
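As an aside, the bayesplot package can automate comparisons like this; a minimal sketch using the y_rep matrix built above (thinning to 100 replicates is just a readability choice):

library(bayesplot)
# overlay densities of 100 replicate datasets on the observed data
ppc_dens_overlay(y = data$y, yrep = y_rep[1:100, ])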
Alternatively, you can generate the replicates inside Stan's generated quantities{ } section:
mod <- cmdstan_model("src/linreg_ppc.stan")
fit <- mod$sample(
data = stan_data,
init = inits,
seed = 13, # ensure simulations are reproducible
chains = 4, # multiple chains
iter_warmup = 1000, # how long to warm up the chains
iter_sampling = 1000, # how many samples after warmup
parallel_chains = 4, # run them in parallel?
refresh = 0
)
Running MCMC with 4 parallel chains...
Chain 1 finished in 0.2 seconds.
Chain 2 finished in 0.2 seconds.
Chain 3 finished in 0.2 seconds.
Chain 4 finished in 0.2 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.2 seconds.
Total execution time: 0.4 seconds.
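The linreg_ppc.stan file lives in the course repo and is not reproduced on the slide; a minimal sketch of what its generated quantities{ } block plausibly contains, given the y_rep columns in the output below (an assumption, not the actual file):

generated quantities {
  vector[n] y_rep; // one replicate observation per data point, per draw
  for (i in 1:n) {
    y_rep[i] = normal_rng(b0 + b1 * x[i], sd);
  }
}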
y_reps
# A draws_df: 6 iterations, 1 chains, and 88 variables
lp__ b0 b1 sd y_rep[1] y_rep[2] y_rep[3] y_rep[4]
1 -37 2.3 0.71 0.93 3.1 6.5 2.49 11
2 -38 2.3 0.71 1.06 3.7 5.5 2.45 10
3 -40 2.1 0.74 0.87 3.6 4.6 2.56 11
4 -40 1.9 0.73 1.06 4.2 6.1 0.12 12
5 -37 2.4 0.67 0.92 5.7 7.2 2.63 10
6 -37 2.4 0.67 0.98 4.0 6.3 1.63 13
# ... with 80 more variables
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
y_rep “datasets”
# ppcs, another way
y_reps <- y_rep[sample(nrow(y_rep), 9), ] # draw 9 replicate datasets
ind <- sample(9, 1)
y_reps[ind, ] <- data$y # replace a random y_rep with the true y
yrep_df <- y_reps %>%
as.data.frame() %>%
pivot_longer(everything()) %>% # use the long format for plotting
mutate(name = rep(1:9, each = ncol(y_reps)))
# ppcs, another way
yrep_df %>%
ggplot() +
geom_histogram(aes(x = value),
fill = "steelblue",
color = "black", binwidth = 1
) +
facet_wrap(~name, nrow = 3) +
labs(x = "", y = "") +
scale_y_continuous(breaks = NULL) +
ggtitle("Can you spot the real data?") +
theme_qfc() +
theme(text = element_text(size = 20))
Betancourt 2017. A conceptual introduction to Hamiltonian Monte Carlo.
Vehtari et al. 2019. Rank-normalization, folding, and localization: an improved R-hat for assessing convergence of MCMC.
Gabry et al. 2019. Visualization in Bayesian workflow.
Gelman et al. 2021. Bayesian Data Analysis, 3rd edition.
Kery and Schaub 2012. Bayesian Population Analysis using WinBUGS.
McElreath 2023. Statistical Rethinking, 2nd edition.
Calculate the posterior predictive distribution one might expect for y if one went out and collected new data at \(x_{i} = 1.23\)
Plot the median fit +/- 95% credible intervals from the posterior predictive distribution of y vs. \(x_{i}\). Can you think of other ways to visualize fits vs. data and the corresponding uncertainty in these fits?
Repeat the exercises in this presentation by first simulating your own linear regression dataset to prove to yourself that your code is returning reasonable answers.
Assume this dataset represents the relationship between a measure of a stream contaminant (y) and a metric of industrial development (x). Through an extensive structured decision making process it was determined by industrial representatives, resource managers, subject matter experts, and Indigenous Rightsholders that if \(\Pr(\beta_{1} > 0.71) > 0.35\), streamside development should be ceased. What does your analysis suggest? What are the limitations of your analysis, including things that might influence your assessment of \(\Pr(\beta_{1} > 0.71)\)? Take care to separate explanation from advocacy. (For one way to compute such a probability from posterior draws, see the sketch after these exercises.)
Someone in the structured decision making group hates the word probability, and is pointing out that it isn’t even clear what is meant by \(\Pr(\beta_{1} > 0.71)\). Can you think of a metaphor for describing this uncertainty that does not use the word probability and which is easily understandable by folks who may lack technical training?
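Hints for the exercises above, not full solutions: minimal sketches using the posterior draws object from earlier (variable names follow the linreg example; x = 1.23 and the 0.71 threshold come from the exercises):

# posterior predictive draws for a new observation at x = 1.23
y_pred <- rnorm(
  nrow(posterior),
  posterior$b0 + posterior$b1 * 1.23,
  posterior$sd
)
quantile(y_pred, c(0.025, 0.5, 0.975)) # median and 95% interval
# estimate Pr(beta1 > 0.71) as the proportion of draws exceeding 0.71
mean(posterior$b1 > 0.71)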