5  Linear Regression

We can further apply Bayesian statistics to help with estimating the parameters of a linear regression model. Here, I’m going to use Julia’s Turing.jl package to fit the model.

using Distributions
using Plots
using Turing
using LinearAlgebra
using Random
using StatsPlots
using GLM

Random.seed!(0408)
TaskLocalRNG()

5.1 Generate Fake Data

I’ll generate some fake data to use as an example.

n = 1000

𝐗 = randn(n, 3)

β = [1.0, 2.0, 3.0]

f(x) = 0.5 .+ x * β

ϵ = rand(Normal(0, 1.0), n)

y = f(𝐗) + ϵ;

Here’s what we’re modeling:

\(Y_i \sim N(\alpha + \beta^T \bf X_i, \sigma^2)\)

In words, \(Y_i\) is normally distributed with a mean given by the output of a typical linear model and a variance of \(\sigma^2\).

5.2 Define Model

Next, we can use Turing’s @model macro to define our model. One important part of defining our model is setting priors.

In this model, I’m going to use pretty weak priors, and I’m going to use the same prior for each of the coefficients in \(\beta\).

@model function lin_reg(x, y)
    n_feat = size(x, 2)

    #priors
    α ~ Normal(0, 2)
    b ~ MvNormal(zeros(n_feat), 3.0 * I)
    σ ~ Exponential(1)

    #estimate the likelihood
    for i in eachindex(y)
        y[i] ~ Normal(α + x[i, :]' * b, σ)
    end

    #alternatively:
    # μ = α .+ x * b
    # return y ~ MvNormal(μ, σ^2 * I)
end
lin_reg (generic function with 2 methods)

In math notation, our model is:

\(Y_i | \alpha, \beta, \sigma \sim N(\alpha + \bf X_i ' \beta, \sigma^2)\)

\(\alpha \sim N(0, 2)\)

\(\beta \sim N(\mu, \Sigma)\)

where \(\mu\) is a length-3 zero-vector and \(\Sigma\) is a diagonal matrix with 3s on the diagonal

\(\sigma \sim Exp(1)\)
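To get a sense of how weak these priors actually are, here’s a quick sketch of my own (not part of the fitting workflow) that draws a single parameter set from the priors and simulates the outcomes it would imply:

#draw one parameter set from the priors
α_prior = rand(Normal(0, 2))
b_prior = rand(MvNormal(zeros(3), 3.0 * I))
σ_prior = rand(Exponential(1))

#outcomes implied by that draw; repeating this many times shows the range of data the priors consider plausible
y_prior = α_prior .+ 𝐗 * b_prior .+ rand(Normal(0, σ_prior), n);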

5.3 Simulate the Posterior

Ok, so now we have this model set up, which is what a fairly typical Bayesian linear regression might look like, and we need to simulate the posterior. We can do that in Turing as follows:

model = lin_reg(𝐗, y)

chn = sample(model, NUTS(), MCMCThreads(), 5_000, 2);
┌ Info: Found initial step size
└   ϵ = 0.0125
┌ Info: Found initial step size
└   ϵ = 0.003125

And we can check the trace plot and the posterior distributions:

plot(chn)

These plots give some useful diagnostics. The trace plots show how the chains (in blue and yellow) move around the range of posterior-plausible values for each parameter, and the density plots show the posterior distributions for each parameter. Since we know the true values of these parameters, we can see that the simulations are producing posteriors representative of the ground truth.
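If we want any of these diagnostics on their own, MCMCChains also ships standalone plot recipes (assuming the usual recipe names, available since we’ve loaded StatsPlots), for example:

#trace plots only
traceplot(chn)

#autocorrelation of the draws within each chain
autocorplot(chn)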

5.4 Extract Information from Posterior

We can also extract summary information from our simulation. We might care about the parameter means (we probably also care about their distributions, but more on that next) and other diagnostics, which we can access via summarize().

summarize(chn)
  parameters      mean       std      mcse     ess_bulk    ess_tail      rhat
      Symbol   Float64   Float64   Float64      Float64     Float64   Float64

           α    0.4714    0.0321    0.0002   20468.4832   7346.0489    1.0000
        b[1]    1.0320    0.0320    0.0002   17505.5390   7545.8046    0.9999
        b[2]    2.0087    0.0322    0.0002   17638.6378   8224.9626    1.0000
        b[3]    3.0360    0.0319    0.0002   21294.1279   7880.1851    1.0002
           σ    1.0221    0.0228    0.0002   21911.5011   7597.6585    1.0000
                                                                1 column omitted

In addition to providing parameter means, this function also gives us the rhat value for each parameter. rhat describes the ratio of the variability across all chains combined to the variability within an individual chain. We want this to be approximately 1, and values greater than about 1.05 are generally considered concerning.

Note that the formula here is:

\(\hat{R} \approx \sqrt{\frac{V_{\text{combined}}}{V_{\text{within}}}}\)
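To make that formula concrete, here’s a rough manual version for \(\alpha\) (a simplified sketch of my own; the rhat reported above uses split chains and rank normalization, so it won’t match exactly):

using Statistics

α_draws = Array(chn[:α])                #5,000 × 2 matrix: iterations × chains
v_within = mean(var(α_draws, dims=1))   #average within-chain variance
v_combined = var(vec(α_draws))          #variance of all draws pooled together
rhat_rough = sqrt(v_combined / v_within)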

Another thing we might want to do is extract quantiles from our chain. In a Bayesian context, we probably care more about quantiles/distributions than about parameter means alone; otherwise, why bother adopting a Bayesian approach at all?

quantile(chn)
Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol   Float64   Float64   Float64   Float64   Float64 
           α    0.4080    0.4497    0.4715    0.4932    0.5344
        b[1]    0.9689    1.0104    1.0320    1.0534    1.0941
        b[2]    1.9464    1.9867    2.0087    2.0305    2.0720
        b[3]    2.9739    3.0143    3.0362    3.0575    3.0980
           σ    0.9785    1.0065    1.0219    1.0372    1.0678

By default, quantile() provides the 2.5th, 25th, 50th, 75th, and 97.5th percentiles. If we wanted, for example, an 80% credible interval for each parameter, we could supply the corresponding quantiles:

quantile(chn; q=[0.1, 0.9])
Quantiles
  parameters     10.0%     90.0% 
      Symbol   Float64   Float64 
           α    0.4302    0.5125
        b[1]    0.9906    1.0729
        b[2]    1.9673    2.0502
        b[3]    2.9956    3.0773
           σ    0.9932    1.0514

and if we wanted to get the quantiles for each chain separately:

qs = quantile(chn; q=[0.1, 0.9], append_chains=false)

#and let's look at just the first one
qs[1]
Quantiles (Chain 1)
  parameters     10.0%     90.0% 
      Symbol   Float64   Float64 
           α    0.4303    0.5125
        b[1]    0.9922    1.0733
        b[2]    1.9676    2.0502
        b[3]    2.9955    3.0778
           σ    0.9937    1.0510

5.5 Predicting Y

Another thing we’re probably interested in doing is predicting our outcome. In a Bayesian framework, though, these predictions won’t be single values – they’ll be distributions. Or, rather, simulated/approximate distributions.

The way this works is that, for each observation of \(y\), we’ll get one prediction per sampled parameter set, indexed by \(j\) below. Since we have 2 chains, each with 5,000 iterations, we’ll get 10,000 predictions for each observation of \(y\).

In notation:

\(Y_i | \alpha^j, \beta^j, \sigma^j \sim N(\alpha^j + \bf X_i ' \beta^j, (\sigma^j)^2)\)

where \(j\) is a given parameter set.

So in our current case, where we have 1,000 observations of \(y\) and 2 length-5,000 chains, we will have \(1{,}000 \times 2 \times 5{,}000 = 10{,}000{,}000\) predicted values.

In Turing, we can get those like so:

ŷ = Vector{Union{Missing,Float64}}(undef, length(y))

preds = predict(lin_reg(𝐗, ŷ), chn);
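As a quick sanity check on the dimensions described above, we can inspect the prediction object; if things worked as expected, preds should hold 5,000 iterations of 1,000 predicted quantities across 2 chains:

size(preds)  #expecting (5000, 1000, 2): iterations × predicted observations × chains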

Now, if we want to extract predictions for just our first observation (y1) and plot the distribution of predictions, we can do:

y1 = getindex(preds, "y[1]")

density(y1.data)
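Beyond plotting, we can also summarize that posterior predictive distribution numerically, for example with a 95% interval (a quick sketch using the vector method of quantile):

#95% posterior predictive interval for the first observation
quantile(vec(y1.data), [0.025, 0.975])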

And we can get the mean prediction for each observation in the data if we want:

mean_preds = summarize(preds)[:, :mean]
1000-element Vector{Float64}:
  5.7006268396886695
 -1.0506371972638873
 -3.1732895813657738
  4.5049520173210675
  2.6660448198513644
 -1.2911138975566685
  2.3099148855543086
  1.1421499549064436
  0.37195589822577046
  6.828458164221361
 -3.8660133764191005
  0.48796232003761836
  2.3802240108862707
  ⋮
 -3.75015072568158
  0.7680897886182543
 -3.628382116045487
 -1.0757865278055503
  0.8049089018586542
  4.32923063569182
  1.8837023355657638
  1.0491657806017156
 -7.241138183105443
 -6.315218508659162
 -7.767339811205812
  0.7141600320377279

5.6 Compare to OLS

Since we simulated our data, we know what our true parameter values are, so we can simply look at the parameter estimates from the Bayesian model and see that they’re very close to the true values. But it could also be instructive to compare the parameter estimates – and the predicted values – to those we get from OLS.

So let’s first get our mean parameter estimates from the Bayesian framework:

bayes_coefs = summarize(chn)[:, :mean]
5-element Vector{Float64}:
 0.47143324916301155
 1.0319528048099806
 2.008675224648713
 3.036046413382201
 1.0221377365466682

Then we can get our OLS coefficients using Julia’s backslash (left-division) operator; we could also use the GLM package, as shown a bit further down.

ols_coefs = hcat(ones(length(y)), 𝐗) \ y
4-element Vector{Float64}:
 0.4714517742708525
 1.0324877848634773
 2.0094769140167044
 3.03674656413514
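And since GLM is already loaded, here’s roughly what the GLM route mentioned above might look like; coef() should give essentially the same values as the backslash solution:

ols_fit = lm(hcat(ones(length(y)), 𝐗), y)

coef(ols_fit)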

And we can confirm that they’re essentially the same with the isapprox() function.

isapprox(ols_coefs, bayes_coefs[1:4], atol=0.01)
true

Finally, let’s say we wanted to compare the predictions of the Bayesian model to those of the OLS model. We can calculate the mean squared error for each.

We don’t really need to do this, since if the parameters are essentially the same, we know the model predictions will also be essentially the same, but it’s easy enough to do anyway.

ŷ_ols = hcat(ones(length(y)), 𝐗) * ols_coefs;

ŷ_bayes = summarize(preds)[:, :mean];

function mse(y, ŷ)
    return sum((y .- ŷ) .^ 2) / length(y)
end
mse (generic function with 1 method)
mse_ols = round(mse(y, ŷ_ols), digits=2);
mse_bayes = round(mse(y, ŷ_bayes), digits=2);

print("OLS loss: $mse_ols \nBayes loss: $mse_bayes")
OLS loss: 1.04 
Bayes loss: 1.04

So there we go – the whole process of fitting a Bayesian linear regression!