We can further apply Bayesian statistics to help with estimating the parameters of a linear regression model. Here, I’m going to use Julia’s Turing.jl package to fit the model.
using Distributions
using Plots
using Turing
using LinearAlgebra
using Random
using StatsPlots
using GLM
Random.seed!(0408)
TaskLocalRNG()
I’ll generate some fake data to use as an example.
n = 1000
𝐗 = randn(n, 3)
β = [1.0, 2.0, 3.0]
f(x) = 0.5 .+ x * β
ϵ = rand(Normal(0, 1.0), n)
y = f(𝐗) + ϵ;
Here’s what we’re modeling:
\(Y_i \sim N(\alpha + \beta^T \bf X_i, \sigma^2)\)
In words, \(Y_i\) is normally distributed with an expected value of the output of a typical linear model and a variance of \(\sigma^2\).
Next, we can use Turing's @model macro to define our model. One important part of defining the model is setting priors. In this model, I'm going to use pretty weak priors, and I'll use the same multivariate normal prior for all of the coefficients.
@model function lin_reg(x, y)
    n_feat = size(x, 2)

    #priors
    α ~ Normal(0, 2)
    b ~ MvNormal(zeros(n_feat), 3.0 * I)
    σ ~ Exponential(1)

    #estimate the likelihood
    for i ∈ eachindex(y)
        y[i] ~ Normal(α + x[i, :]' * b, σ)
    end

    #alternatively (note MvNormal takes a covariance, hence σ^2):
    # μ = α .+ x * b
    # return y ~ MvNormal(μ, σ^2 * I)
end
lin_reg (generic function with 2 methods)
In math notation, our model is:
\(Y_i | \alpha, \beta, \sigma \sim N(\alpha + \bf X_i ' \beta, \sigma^2)\)
\(\alpha \sim N(0, 2)\)
\(\beta \sim N(\mu, \Sigma)\)
where \(\mu\) is a length-3 zero-vector and \(\Sigma\) is a diagonal matrix with 3s on the diagonal
\(\sigma \sim Exp(1)\)
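To get a sense of how diffuse these priors are, we could plot their densities. This is just a quick sketch using StatsPlots' recipes for Distributions objects; note that the marginal prior each coefficient gets from MvNormal(zeros(3), 3.0 * I) is Normal(0, √3).

#plot the prior densities
plot(Normal(0, 2), label="prior on α", lw=2)
plot!(Normal(0, sqrt(3)), label="marginal prior on each b[k]", lw=2)
plot!(Exponential(1), label="prior on σ", lw=2)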
Ok, so now we have this model set up, which is what a fairly typical linear regression model might look like. Next, we need to simulate the posterior, which we can do in Turing as follows:
model = lin_reg(𝐗, y)
chn = sample(model, NUTS(), MCMCThreads(), 5_000, 2);
┌ Warning: Only a single thread available: MCMC chains are not sampled in parallel
┌ Info: Found initial step size
└ ϵ = 0.0125
┌ Info: Found initial step size
└ ϵ = 0.003125
And we can check the trace plot and the posterior distributions:
plot(chn)
These plots give some useful diagnostics. The trace plots show how the chains (in blue and yellow) move around the range of posterior-plausible values for each parameter, and the density plots show the posterior distribution of each parameter. Since we know the true values of these parameters, we can see that the simulations are producing posteriors representative of the ground truth.
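If the combined plot gets crowded, we can also plot just a subset of parameters. This is a quick sketch that assumes the group() helper and by-name indexing from MCMCChains (which Turing re-exports) are available:

#trace + density plots for just the b coefficients
plot(group(chn, :b))

#or select specific parameters by name
plot(chn[:, [:α, :σ], :])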
We can also extract summary information from our simulation. We might care about the parameter means (we probably also care about their distributions, but more on that next) and other diagnostics, which we can access via summarize().
summarize(chn)
  parameters      mean     std    mcse    ess_bulk   ess_tail     rhat   ⋯
      Symbol   Float64  Float64  Float64    Float64    Float64  Float64  ⋯

           α    0.4714   0.0321   0.0002  20468.4832  7346.0489   1.0000  ⋯
        b[1]    1.0320   0.0320   0.0002  17505.5390  7545.8046   0.9999  ⋯
        b[2]    2.0087   0.0322   0.0002  17638.6378  8224.9626   1.0000  ⋯
        b[3]    3.0360   0.0319   0.0002  21294.1279  7880.1851   1.0002  ⋯
           σ    1.0221   0.0228   0.0002  21911.5011  7597.6585   1.0000  ⋯
                                                            1 column omitted
In addition to providing parameter means, this function also gives us the rhat values for each parameter. rhat describes the ratio of the variability across all chains to the variability within any individual chain. We want this to be approximately 1, and values greater than about 1.05 might be concerning.
Note that the formula here is:
\(\hat{R} \approx \sqrt{\frac{V_{combined}}{V_{within}}}\)
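To make that concrete, here's a rough by-hand version of that ratio for α. This is just a sketch: the value summarize() reports uses split chains and rank normalization, so it won't match exactly.

using Statistics

α_draws = chn[:α].data                 #5_000 × 2 matrix: iterations × chains
v_within = mean(var(α_draws, dims=1))  #average within-chain variance
v_combined = var(vec(α_draws))         #variance pooling both chains
sqrt(v_combined / v_within)            #rough R̂; should be ≈ 1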
Another thing we might want to do is extract quantiles from our chain. In a Bayesian context, we probably care more about quantiles/distributions than about parameter means alone; otherwise, why bother adopting a Bayesian approach at all?
quantile(chn)
Quantiles
  parameters      2.5%    25.0%    50.0%    75.0%    97.5%
      Symbol   Float64  Float64  Float64  Float64  Float64

           α    0.4080   0.4497   0.4715   0.4932   0.5344
        b[1]    0.9689   1.0104   1.0320   1.0534   1.0941
        b[2]    1.9464   1.9867   2.0087   2.0305   2.0720
        b[3]    2.9739   3.0143   3.0362   3.0575   3.0980
           σ    0.9785   1.0065   1.0219   1.0372   1.0678
By default, quantile() provides the 2.5%, 25%, 50%, 75%, and 97.5% percentiles. If we wanted, for example, an 80% credible interval for each parameter, we could supply the corresponding quantiles ourselves:
quantile(chn; q=[0.1, 0.9])
Quantiles
  parameters    10.0%    90.0%
      Symbol  Float64  Float64

           α   0.4302   0.5125
        b[1]   0.9906   1.0729
        b[2]   1.9673   2.0502
        b[3]   2.9956   3.0773
           σ   0.9932   1.0514
and if we wanted to get the quantiles for each chain separately:
qs = quantile(chn; q=[0.1, 0.9], append_chains=false)

#and let's look at just the first one
qs[1]
Quantiles (Chain 1)
  parameters    10.0%    90.0%
      Symbol  Float64  Float64

           α   0.4303   0.5125
        b[1]   0.9922   1.0733
        b[2]   1.9676   2.0502
        b[3]   2.9955   3.0778
           σ   0.9937   1.0510
So, probably obvious, but another thing we’re probably interested in doing is predicting our outcome. In a Bayesian framework, though, these predictions won’t be a single value – they’ll be distributions. Or, rather, simulated/approximate distributions.
The way this works is that, for each observation of \(y\), we'll get \(j\) predictions, where \(j\) is the number of parameter sets across the chains. Since we have 2 chains, each with 5,000 iterations, we'll get 10,000 predictions for each observation of \(y\).
In notation:
\(Y_i | \alpha^j, \beta^j, \sigma^j \sim N(\alpha^j + \bf X_i ' \beta^j, (\sigma^j)^2)\)
where \(j\) is a given parameter set.
So in our current case, where we have 1,000 observations of \(y\) and 2 length-5,000 chains, we will have \(1,000*2*5,000 = 10,000,000\) predicted values.
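Before handing this off to Turing, here's a rough by-hand sketch of that idea for the first observation: pull the posterior draws out of the chain and simulate one new \(y_1\) per draw (the variable names here are just ad hoc, and the string indexing mirrors the getindex() call used below).

αs = vec(chn[:α].data)                               #10_000 draws of α
σs = vec(chn[:σ].data)                               #10_000 draws of σ
bs = hcat([vec(chn["b[$k]"].data) for k in 1:3]...)  #10_000 × 3 draws of b

#one simulated y₁ per posterior draw
y1_manual = [rand(Normal(αs[j] + 𝐗[1, :]' * bs[j, :], σs[j])) for j in eachindex(αs)];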
In Turing, we can get those like so:
ŷ = Vector{Union{Missing,Float64}}(undef, length(y))
preds = predict(lin_reg(𝐗, ŷ), chn);
Now, if we want to extract predictions for just our first observation (y1) and plot the distribution of predictions, we can do:
y1 = getindex(preds, "y[1]")
density(y1.data)
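We could also mark the observed value on that density to see where it falls within its predictive distribution (a quick sketch using Plots' vline!):

vline!([y[1]], label="observed y[1]")  #adds the observed value to the density plot above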
And we can get the mean prediction for each observation in the data if we want:
mean_preds = summarize(preds)[:, :mean]
1000-element Vector{Float64}:
5.7006268396886695
-1.0506371972638873
-3.1732895813657738
4.5049520173210675
2.6660448198513644
-1.2911138975566685
2.3099148855543086
1.1421499549064436
0.37195589822577046
6.828458164221361
-3.8660133764191005
0.48796232003761836
2.3802240108862707
⋮
-3.75015072568158
0.7680897886182543
-3.628382116045487
-1.0757865278055503
0.8049089018586542
4.32923063569182
1.8837023355657638
1.0491657806017156
-7.241138183105443
-6.315218508659162
-7.767339811205812
0.7141600320377279
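One quick way to eyeball these mean predictions (just a sketch) is to plot them against the observed values; points hugging the diagonal indicate a good fit.

scatter(y, mean_preds, alpha=0.3, label="", xlabel="observed y", ylabel="mean predicted y")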
Since we simulated our data, we know what our true parameter values are, so we can just look at the parameter estimates from the Bayesian framework and see that they're very close to the true values. But it could also be instructive to compare the parameter estimates, and the predicted values, to those we get from OLS.
So let’s first get our mean parameter estimates from the Bayesian framework:
bayes_coefs = summarize(chn)[:, :mean]
5-element Vector{Float64}:
0.47143324916301155
1.0319528048099806
2.008675224648713
3.036046413382201
1.0221377365466682
Then we can get our OLS coefficients using Julia's linear solver notation (we could also use the GLM package; a quick sketch of that alternative follows the output below).
ols_coefs = hcat(ones(length(y)), 𝐗) \ y
4-element Vector{Float64}:
0.4714517742708525
1.0324877848634773
2.0094769140167044
3.03674656413514
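Since GLM is already loaded, roughly the same coefficients should come out of its matrix method for lm(), adding the intercept column by hand (just a sketch of the alternative mentioned above):

ols_fit = lm(hcat(ones(length(y)), 𝐗), y)
coef(ols_fit)  #intercept first, matching ols_coefs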
And we can see that they're basically the same with the isapprox() function.
isapprox(ols_coefs, bayes_coefs[1:4], atol=0.01)
true
Finally, let’s say we wanted to compare the predictions of the Bayesian model to those of the OLS model. We can calculate the mean squared error for each.
We don't really need to do this, since if the parameters are essentially the same, we know the model predictions will also be essentially the same, but it's easy enough to do anyway.
ŷ_ols = hcat(ones(length(y)), 𝐗) * ols_coefs;
ŷ_bayes = summarize(preds)[:, :mean];
function mse(y, ŷ)
    return sum((y .- ŷ) .^ 2) / length(y)
end
mse (generic function with 1 method)
mse_ols = round(mse(y, ŷ_ols), digits=2);
mse_bayes = round(mse(y, ŷ_bayes), digits=2);
print("OLS loss: $mse_ols \nBayes loss: $mse_bayes")
OLS loss: 1.04
Bayes loss: 1.04
So there we go – the whole process of fitting a Bayesian linear regression!