In this chapter, I’m going to walk through fitting a Bayesian logistic regression. This is analogous to the previous chapter, where I fit a linear regression, except this time the model is a logistic regression.
First, I’ll load some libraries and read in some data.
```julia
using Distributions
using Plots
using Turing
using Random
using DataFrames
using LazyArrays
using CSV

Random.seed!(0408)

# read in data
url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/wells.csv"
wells = CSV.read(download(url), DataFrame)

describe(wells)
```
```
5×7 DataFrame
 Row │ variable  mean       min    median   max      nmissing  eltype
     │ Symbol    Float64    Real   Float64  Real     Int64     DataType
─────┼─────────────────────────────────────────────────────────────────
   1 │ switch     0.575166  0       1.0       1         0      Int64
   2 │ arsenic    1.65693   0.51    1.3       9.65      0      Float64
   3 │ dist      48.3319    0.387  36.7615  339.531     0      Float64
   4 │ assoc      0.422848  0       0.0       1         0      Int64
   5 │ educ       4.82848   0       5.0      17         0      Int64
```
The outcome variable in this dataset is switch – a binary indicator of whether a household switched wells after being informed that the wells they’d been using previously were contaminated.
So a first step is to separate the outcome from the predictors.
Note – it’s best practice to explore the data and do some feature engineering (e.g. z-scoring numeric predictors). But I’m not going to do that here because I don’t really care about the quality of the model in this contrived example.
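For reference, z-scoring a numeric predictor only takes a line or two. Here’s a quick sketch of what that could look like (my own, not actually run on the data below):

```julia
using Statistics

# sketch: z-score a numeric column (not applied in this example)
zscore_col(x) = (x .- mean(x)) ./ std(x)

# e.g. wells.arsenic = zscore_col(wells.arsenic)
# e.g. wells.dist    = zscore_col(wells.dist)
```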
```julia
X = Matrix(select(wells, Not(:switch)))
y = wells.switch
```
In words: the probability of switching wells is equal to the logistic transformation of the linear predictor \(\alpha + \beta^T \mathbf{X}_i\), which is just the linear regression model passed through the logistic function.
And recall that the logistic function is:
\(\text{Logistic}(x) = \frac{1}{1 + e^{-x}}\)
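Just to make that concrete, it’s a one-liner in Julia (logistic_fn is a throwaway name I’m using here; StatsFuns.jl also exports a ready-made logistic):

```julia
# throwaway implementation of the logistic (inverse-logit) function
logistic_fn(x) = 1 / (1 + exp(-x))

logistic_fn(0.0)    # 0.5
logistic_fn(2.0)    # ≈ 0.88
logistic_fn(-2.0)   # ≈ 0.12
```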
6.1 Define the model
Just as we did in the linear regression model, we define the logistic regression using Turing’s @model macro. Once again, this uses pretty weak priors, and every coefficient gets the same prior.
```julia
@model function log_reg(X, y)
    n_feat = size(X, 2)

    # priors
    α ~ Normal(0, 3)
    β ~ filldist(TDist(3), n_feat)

    # likelihood
    return y ~ arraydist(LazyArray(@~ BernoulliLogit.(α .+ X * β)))
end
```
log_reg (generic function with 2 methods)
In notation:
\(y \sim \text{Bernoulli}(p)\)
\(p = \text{Logistic}(\alpha + \mathbf{X} \cdot \beta)\)
\(\alpha \sim N(0, 3)\)
\(\beta_j \sim t(0, 1, 3)\) for \(j \in \{1, \ldots, n_{\text{feat}}\}\)
Also – the model specification above comes from Jose Storopoli’s code, and there are a few little wrinkles I didn’t quite understand at first:
filldist() is basically a loop that repeats the same distribution multiple times. So, for example, it will create n_feat identical distributions (t-distributions with 3 degrees of freedom);
arraydist() is similar to filldist(), but it is a wrapper for an array of distributions that aren’t necessarily the same.
As far as I can tell, LazyArray() and @~ provide ways to specify lazy computations, which make the operations faster and more efficient.
BernoulliLogit() is a cool way to combine the Logistic() and Bernoulli() functions (there’s a quick check of this right after the list).
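To convince myself of those last couple of points, here’s a small check (my own sketch, not from the original post): BernoulliLogit(x) should behave like Bernoulli(Logistic(x)), and filldist() should bundle repeated distributions into a single multivariate distribution.

```julia
using Distributions, Turing

# same throwaway logistic helper as above
logistic_fn(x) = 1 / (1 + exp(-x))

x = 0.7
d1 = BernoulliLogit(x)           # parameterized on the logit scale
d2 = Bernoulli(logistic_fn(x))   # parameterized on the probability scale
logpdf(d1, 1) ≈ logpdf(d2, 1)    # true

# filldist() stacks n copies of one distribution into a single
# multivariate distribution, so one ~ statement covers all of β
β_prior = filldist(TDist(3), 4)
length(β_prior)                  # 4
```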
6.2 Simulate the Posterior
Then we can simulate the posterior just like we did before:
```julia
m = log_reg(X, y)

chn = sample(m, NUTS(), MCMCThreads(), 1_000, 3);

summarize(chn)
```
```
┌ Warning: Only a single thread available: MCMC chains are not sampled in parallel
└ @ AbstractMCMC C:\Users\eric_ekholm\.julia\packages\AbstractMCMC\Es490\src\sample.jl:307
┌ Info: Found initial step size
└ ϵ = 0.003125
┌ Info: Found initial step size
└ ϵ = 0.003125
┌ Info: Found initial step size
└ ϵ = 0.0125
Sampling (1 threads): 100%|█████████████████████████████| Time: 0:00:09
```
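As a quick reminder to myself, a few standard MCMCChains calls for poking at the fitted chain object (not part of the original code above):

```julia
# assumes `chn` from the sampling call above
mean(chn)       # posterior mean of every parameter
quantile(chn)   # posterior quantiles
chn[:α]         # draws for the intercept, one column per chain
```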
Ok, so just to prove that the arraydist(...) term is doing what I think it’s doing in the model above, I’m going to define a second model that’s hopefully equivalent to the first!
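Something like the sketch below is what I have in mind: drop arraydist() and LazyArray() and just loop over the observations, giving each one its own BernoulliLogit likelihood. (The name log_reg_loop is mine, and this is an assumed equivalent, not code from the original post.)

```julia
# sketch of an equivalent model without arraydist()/LazyArray():
# slower, but (to me) easier to read
@model function log_reg_loop(X, y)
    n_feat = size(X, 2)

    # same priors as before
    α ~ Normal(0, 3)
    β ~ filldist(TDist(3), n_feat)

    # likelihood, one observation at a time
    for i in eachindex(y)
        y[i] ~ BernoulliLogit(α + X[i, :]' * β)
    end
end
```

Sampling from this should recover essentially the same posterior as the arraydist() version, just more slowly.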
The process for extracting information from the chains, predicting y, etc. is going to be the same as it was for linear regression (since it uses the same API), so I’m not going to go through all of that here; instead, I’ll just refer myself back to the linear_regression page.