7  Bayesian Ordinal Regression

And we’re moving onward here into ordinal regression. As with the previous chapter, I’m using Turing.jl to fit models, and I’m drawing many of the ideas from Jose Storopoli’s Bayesian statistics notes. Some of the ideas and notation also come from McElreath’s Statistical Rethinking.

7.1 What is Ordinal Regression?

Probably obviously, ordinal regression is a linear model where y is ordinal. In a typical linear regression, y is interval data – the values are continuous, and the distances between adjacent values are equal. So the difference between 1 and 2 is the same as the difference between 2 and 3.

When y is ordinal, we don’t assume that y is interval data. We know that 2 is “more” than 1, but we can’t know for certain that the distance between 1 and 2 is the same as the distance between 2 and 3. Likert scales and other survey measures are ordinal, so we can’t use a regular linear regression for them.

To fit our model, we will typically use a cumulative link function. This helps ensure that the predictions from our model will move through categories sequentially. The cumulative link function we typically (always?) want is the cumulative distribution function (CDF):

\[P(Y \le y) = \sum^{y}_{i=y_{min}}P(Y=i)\]

which gives us the probability that y is equal to a given value or any lesser value. We then use these cumulative probabilities to calculate the log-cumulative-odds:

\[logit(x) = ln(\frac{x}{1-x})\]

The logit function is the inverse of the logistic function. Recall that the logistic function maps any real number in (-Inf, Inf) to the range (0, 1), so the logit function maps any number in (0, 1) to (-Inf, Inf), and the resulting value is on the log-odds scale. Since we’re converting cumulative probabilities, the outputs of this transformation will be log-cumulative-odds.
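
To make that inverse relationship concrete, here’s a tiny sketch (logistic and logit are just local helper functions I’m defining for illustration):

logistic(x) = 1 / (1 + exp(-x))   # maps (-Inf, Inf) to (0, 1)
logit(p) = log(p / (1 - p))       # maps (0, 1) to (-Inf, Inf), i.e. log-odds

logit(logistic(1.7))    # ≈ 1.7
logistic(logit(0.25))   # ≈ 0.25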

We use these log-cumulative-odds to estimate intercepts for all of the possible values of our ordinal Y variable. In other words, the intercepts serve as the cut-points separating the possible Y values.

Imagine we have an ordinal Y with range 1-4. We would estimate 3 cutpoints (intercepts), and each intercept implies a log-cumulative-odds (and, correspondingly, a cumulative probability) of scoring k (where k is 1, 2, or 3) or less. We don’t need to estimate an intercept for the largest value (4), because we know the cumulative probability will be 1 (since all values will be 4 or less).
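
To make that concrete, here’s a quick sketch using made-up counts for a 1-4 ordinal variable (the counts are invented purely for illustration):

counts = [10, 20, 40, 30]             # hypothetical counts for Y = 1, 2, 3, 4
pr_k = counts ./ sum(counts)          # P(Y = k)
cum_pr_k = cumsum(pr_k)               # P(Y <= k) = 0.1, 0.3, 0.7, 1.0

# we only need cutpoints for Y = 1, 2, 3; P(Y <= 4) = 1 would map to +Inf
logit(p) = log(p / (1 - p))
cutpoints = logit.(cum_pr_k[1:end-1])   # ≈ [-2.20, -0.85, 0.85]

Those three numbers are exactly the kind of intercepts (on the log-cumulative-odds scale) that the model below will estimate.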

7.2 Working through an Example

Now let’s work through an example to illustrate some of the above. We’ll also extend the discussion of the intercepts above to cases with predictor variables.

First we can load some packages and read in some data. I’ll use some wine quality data I found on Kaggle.

using Turing
using Distributions
using LazyArrays
using DataFrames
using CSV
using Random: seed!
using LinearAlgebra
using Bijectors
using CategoricalArrays
using Chain
using Plots

seed!(0408)

wine = CSV.read("data/wine.csv", DataFrame)
1143×13 DataFrame
1118 rows omitted
Columns, in order: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality, Id
Column types: Float64 for everything except quality and Id, which are Int64
Each row below starts with its row number, followed by the values in the column order above:
1 7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 0
2 7.8 0.88 0.0 2.6 0.098 25.0 67.0 0.9968 3.2 0.68 9.8 5 1
3 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.997 3.26 0.65 9.8 5 2
4 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.998 3.16 0.58 9.8 6 3
5 7.4 0.7 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 4
6 7.4 0.66 0.0 1.8 0.075 13.0 40.0 0.9978 3.51 0.56 9.4 5 5
7 7.9 0.6 0.06 1.6 0.069 15.0 59.0 0.9964 3.3 0.46 9.4 5 6
8 7.3 0.65 0.0 1.2 0.065 15.0 21.0 0.9946 3.39 0.47 10.0 7 7
9 7.8 0.58 0.02 2.0 0.073 9.0 18.0 0.9968 3.36 0.57 9.5 7 8
10 6.7 0.58 0.08 1.8 0.097 15.0 65.0 0.9959 3.28 0.54 9.2 5 10
11 5.6 0.615 0.0 1.6 0.089 16.0 59.0 0.9943 3.58 0.52 9.9 5 12
12 7.8 0.61 0.29 1.6 0.114 9.0 29.0 0.9974 3.26 1.56 9.1 5 13
13 8.5 0.28 0.56 1.8 0.092 35.0 103.0 0.9969 3.3 0.75 10.5 7 16
1132 6.1 0.715 0.1 2.6 0.053 13.0 27.0 0.99362 3.57 0.5 11.9 5 1582
1133 6.2 0.46 0.29 2.1 0.074 32.0 98.0 0.99578 3.33 0.62 9.8 5 1583
1134 6.7 0.32 0.44 2.4 0.061 24.0 34.0 0.99484 3.29 0.8 11.6 7 1584
1135 7.5 0.31 0.41 2.4 0.065 34.0 60.0 0.99492 3.34 0.85 11.4 6 1586
1136 5.8 0.61 0.11 1.8 0.066 18.0 28.0 0.99483 3.55 0.66 10.9 6 1587
1137 6.3 0.55 0.15 1.8 0.077 26.0 35.0 0.99314 3.32 0.82 11.6 6 1590
1138 5.4 0.74 0.09 1.7 0.089 16.0 26.0 0.99402 3.67 0.56 11.6 6 1591
1139 6.3 0.51 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6 1592
1140 6.8 0.62 0.08 1.9 0.068 28.0 38.0 0.99651 3.42 0.82 9.5 6 1593
1141 6.2 0.6 0.08 2.0 0.09 32.0 44.0 0.9949 3.45 0.58 10.5 5 1594
1142 5.9 0.55 0.1 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 6 1595
1143 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 5 1597

Let’s just use a couple of predictors – alcohol and sulphates. And we’ll use quality – an ordinal variable – as our outcome.

X = DataFrames.select(wine, [:sulphates, :alcohol])
y = wine[:, :quality]
1143-element Vector{Int64}:
 5
 5
 5
 6
 5
 5
 5
 7
 7
 5
 5
 5
 7
 ⋮
 5
 5
 7
 6
 6
 6
 6
 6
 6
 5
 6
 5

And we’ll z-score both of the predictors, since they’re numeric:

function my_z(x::Vector{Float64})
    μ = mean(skipmissing(x))
    σ = std(skipmissing(x))

    z = (x .- μ) ./ σ
    return z
end

Xz = Matrix(mapcols(x -> my_z(x), X))
1143×2 Matrix{Float64}:
 -0.573407   -0.96296
  0.130824   -0.593341
 -0.0452338  -0.593341
 -0.456035   -0.593341
 -0.573407   -0.96296
 -0.573407   -0.96296
 -1.16027    -0.96296
 -1.10158    -0.408532
 -0.514721   -0.870556
 -0.690779   -1.14777
 -0.80815    -0.500937
  5.29518    -1.24017
  0.541625    0.0534918
  ⋮          
 -0.925522    1.34716
 -0.221291   -0.593341
  0.835055    1.06994
  1.12848     0.885134
  0.0134521   0.423111
  0.952426    1.06994
 -0.573407    1.06994
  0.541625    0.515515
  0.952426   -0.870556
 -0.456035    0.0534918
  0.600311    0.700325
  0.306882   -0.223722

7.3 Exploring the intercept parameters

Before estimating the model, let’s take a few minutes and think about the intercept parameters. We’ll first look at the histogram of values of y (wine quality):

histogram(y)

And then we can get the cumulative probability that y is at value k or lower:

y_unique = sort(unique(y))
el_counts = [count(==(i), y) for i in y_unique]
el_dict = sort(Dict(zip(y_unique, el_counts)))

pr_k = values(el_dict) ./ length(y)

cum_pr_k = cumsum(pr_k)
6-element Vector{Float64}:
 0.005249343832020997
 0.03412073490813648
 0.45669291338582674
 0.8608923884514436
 0.9860017497812773
 1.0

So this tells us that 0.5% of the y values are 3 (which is the lowest rating in the data), 3.4% are 4 or lower, about 46% are 5 or lower, etc.

And if we wanted to convert these into log cumulative odds, which is the metric we’re using in our model:

my_logit(x) = log(x / (1 - x))

log_cum_odds = my_logit.(cum_pr_k)
6-element Vector{Float64}:
 -5.244389024522481
 -3.343133580707394
 -0.17366349405084022
  1.8227216948320217
  4.254725791799992
 Inf

7.4 Define the model

Next we define a model just like we have previously in Turing.

@model function ordreg(X, y)
    #model setup
    n_feat = size(X, 2)
    k = maximum(y)

    #priors
    cutpoints ~ transformed(filldist(TDist(3) * 5, k - 1), Bijectors.OrderedBijector())
    β ~ filldist(TDist(3) * 2.5, n_feat)

    #likelihood
    return y ~ arraydist([OrderedLogistic(X[i, :]' * β, cutpoints) for i ∈ eachindex(y)])
end
ordreg (generic function with 2 methods)

Ok so let’s walk through this:

  • n_feat is the number of columns in our predictor matrix, and k is the number of unique response options in y

  • filldist(TDist(3) * 5, k-1) creates a k-1 length vector of prior T distributions (with df = 3, \(\mu\) = 0, \(\sigma\) = 5) for our cutpoints.

  • the cutpoints are just our intercepts, but there will be k-1 of them, and they’re in a log-cumulative-odds scale. One wrinkle here is that we’re estimating 7 cutpoints for this wine data, so we’ll estimate cutpoints for wine quality of 1 and 2 as well, even though we don’t see these values in our data (because they’re hypothetically plausible). It might also be plausible to score higher than 8 – I’m not sure – but we don’t estimate higher potential scores here.

  • from what I can tell, the transformed(..., OrderedBijector()) term ensures that our cutpoints are ordered appropriately

  • the \(\beta\) term specifies the priors for our betas, just like in previous models

  • the final term returns an array of OrderedLogistic() distributions (via arraydist()), where each is parameterized by the linear predictor (X[i, :]' * \(\beta\)) and by the cutpoints – recall that we’ve already specified priors for both the betas and the cutpoints. There’s a short sketch right after this list showing how the linear predictor and cutpoints map to category probabilities.
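
To get some intuition for what OrderedLogistic() does with its two arguments, here’s a hand-rolled sketch of the idea: cumulative probabilities come from applying the logistic function to (cutpoint − \(\eta\)), where \(\eta\) is the linear predictor, and differencing adjacent cumulative probabilities gives per-category probabilities. Treat this as my sketch of the mechanics (the ordered_probs name is mine), not a reproduction of Turing’s exact implementation.

logistic(x) = 1 / (1 + exp(-x))

function ordered_probs(η, cutpoints)
    # cumulative probabilities P(Y <= k), appending P(Y <= K) = 1 for the top category
    cum = [logistic.(cutpoints .- η); 1.0]
    # difference adjacent cumulative probabilities to get P(Y = k)
    return [cum[1]; diff(cum)]
end

ordered_probs(0.0, [-2.0, 0.0, 2.0])   # 4 category probabilities that sum to 1
ordered_probs(1.5, [-2.0, 0.0, 2.0])   # a larger η shifts mass toward higher categories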

Mathematically, we can annotate this as:

\(Y_i \sim \text{Ordered-logit}(\phi_i, \kappa)\)

\(\phi_i = X_i \beta\)

\(\kappa_k \sim t(0, 5, 3)\) for \(k \in \{1, ..., K-1\}\)

\(\beta_j \sim t(0, 2.5, 3)\) for \(j \in \{1, ..., J\}\)

where \(K\) is the number of response categories, \(J\) is the number of predictors, and the \(t\) distributions are written as \(t(\mu, \sigma, \nu)\).

or something like that.

7.5 Simulate the Posterior

Then we can simulate the posterior just like we did before:

model = ordreg(Xz, y)

chn = sample(model, NUTS(), MCMCThreads(), 1_000, 2)
┌ Warning: Only a single thread available: MCMC chains are not sampled in parallel
└ @ AbstractMCMC C:\Users\eric_ekholm\.julia\packages\AbstractMCMC\YrmkI\src\sample.jl:310
┌ Info: Found initial step size
└   ϵ = 0.025
┌ Info: Found initial step size
└   ϵ = 0.025
Chains MCMC chain (1000×21×2 Array{Float64, 3}):
Iterations        = 501:1:1500
Number of chains  = 2
Samples per chain = 1000
Wall duration     = 174.79 seconds
Compute duration  = 172.52 seconds
parameters        = cutpoints[1], cutpoints[2], cutpoints[3], cutpoints[4], cutpoints[5], cutpoints[6], cutpoints[7], β[1], β[2]
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
Summary Statistics
    parameters      mean       std      mcse    ess_bulk    ess_tail      rhat
        Symbol   Float64   Float64   Float64     Float64     Float64   Float64 ⋯
  cutpoints[1]   -9.4595    1.8775    0.0969    477.4761    292.7390    1.0131 ⋯
  cutpoints[2]   -9.3161    1.8731    0.0976    478.2030    278.0813    1.0120 ⋯
  cutpoints[3]   -5.7834    0.4066    0.0145    766.8356   1011.2567    1.0039 ⋯
  cutpoints[4]   -3.9047    0.1685    0.0062    749.0932    978.6188    1.0020 ⋯
  cutpoints[5]   -0.3312    0.0695    0.0022   1031.5055    713.5094    1.0016 ⋯
  cutpoints[6]    2.3111    0.1028    0.0034    949.5978   1346.8880    1.0003 ⋯
  cutpoints[7]    5.0780    0.2640    0.0082   1023.6417   1100.8530    0.9995 ⋯
          β[1]    0.5180    0.0643    0.0024    751.8439    811.9974    1.0036 ⋯
          β[2]    1.0831    0.0672    0.0026    698.6998    896.6511    1.0003 ⋯
                                                                1 column omitted
Quantiles
    parameters       2.5%      25.0%     50.0%     75.0%     97.5% 
        Symbol    Float64    Float64   Float64   Float64   Float64 
  cutpoints[1]   -14.0672   -10.4862   -9.1034   -8.0498   -6.8191
  cutpoints[2]   -14.0617   -10.3529   -8.9000   -7.9473   -6.7072
  cutpoints[3]    -6.6404    -6.0596   -5.7710   -5.4798   -5.0651
  cutpoints[4]    -4.2549    -4.0120   -3.8963   -3.7869   -3.5997
  cutpoints[5]    -0.4702    -0.3764   -0.3304   -0.2830   -0.1981
  cutpoints[6]     2.1132     2.2421    2.3078    2.3792    2.5150
  cutpoints[7]     4.5942     4.9052    5.0632    5.2450    5.6346
          β[1]     0.3938     0.4742    0.5161    0.5617    0.6416
          β[2]     0.9597     1.0331    1.0798    1.1291    1.2196

We can summarize the chain here:

summarize(chn)
    parameters      mean       std      mcse    ess_bulk    ess_tail      rhat
        Symbol   Float64   Float64   Float64     Float64     Float64   Float64 ⋯
  cutpoints[1]   -9.4595    1.8775    0.0969    477.4761    292.7390    1.0131 ⋯
  cutpoints[2]   -9.3161    1.8731    0.0976    478.2030    278.0813    1.0120 ⋯
  cutpoints[3]   -5.7834    0.4066    0.0145    766.8356   1011.2567    1.0039 ⋯
  cutpoints[4]   -3.9047    0.1685    0.0062    749.0932    978.6188    1.0020 ⋯
  cutpoints[5]   -0.3312    0.0695    0.0022   1031.5055    713.5094    1.0016 ⋯
  cutpoints[6]    2.3111    0.1028    0.0034    949.5978   1346.8880    1.0003 ⋯
  cutpoints[7]    5.0780    0.2640    0.0082   1023.6417   1100.8530    0.9995 ⋯
          β[1]    0.5180    0.0643    0.0024    751.8439    811.9974    1.0036 ⋯
          β[2]    1.0831    0.0672    0.0026    698.6998    896.6511    1.0003 ⋯
                                                                1 column omitted

And this tells us that something feels off – rhat values near 1 imply convergence, and as a rule of thumb we want them between roughly 0.99 and 1.01, but the rhats for the first two cutpoints sit above that (and their effective sample sizes are noticeably lower than the others’). Those two cutpoints correspond to quality scores of 1 and 2, which never appear in the data, so the likelihood gives the sampler very little to work with there. Let’s try again, but we’ll rescale the quality values so that they range from 1-6 instead of from 3-8, which means we only estimate cutpoints for values we actually observe.

y_new = y .- 2

m2 = ordreg(Xz, y_new)
chn2 = sample(m2, NUTS(), MCMCThreads(), 1_000, 2)
┌ Warning: Only a single thread available: MCMC chains are not sampled in parallel
└ @ AbstractMCMC C:\Users\eric_ekholm\.julia\packages\AbstractMCMC\YrmkI\src\sample.jl:310
┌ Info: Found initial step size
└   ϵ = 0.05
┌ Info: Found initial step size
└   ϵ = 0.025
Chains MCMC chain (1000×19×2 Array{Float64, 3}):
Iterations        = 501:1:1500
Number of chains  = 2
Samples per chain = 1000
Wall duration     = 32.55 seconds
Compute duration  = 32.34 seconds
parameters        = cutpoints[1], cutpoints[2], cutpoints[3], cutpoints[4], cutpoints[5], β[1], β[2]
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
Summary Statistics
    parameters      mean       std      mcse    ess_bulk    ess_tail      rhat
        Symbol   Float64   Float64   Float64     Float64     Float64   Float64 ⋯
  cutpoints[1]   -5.7939    0.4019    0.0125   1065.7917    990.5757    1.0004 ⋯
  cutpoints[2]   -3.9101    0.1730    0.0039   1989.0495   1369.6629    1.0005 ⋯
  cutpoints[3]   -0.3341    0.0675    0.0015   2116.7119   1585.7946    0.9998 ⋯
  cutpoints[4]    2.3152    0.1038    0.0024   1862.9767   1748.1974    1.0016 ⋯
  cutpoints[5]    5.0784    0.2710    0.0065   1730.5648   1316.4367    1.0018 ⋯
          β[1]    0.5162    0.0625    0.0014   1919.9784   1282.6219    1.0004 ⋯
          β[2]    1.0862    0.0673    0.0015   1995.3847   1626.5862    1.0001 ⋯
                                                                1 column omitted
Quantiles
    parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
        Symbol   Float64   Float64   Float64   Float64   Float64 
  cutpoints[1]   -6.6238   -6.0497   -5.7650   -5.5075   -5.0831
  cutpoints[2]   -4.2571   -4.0201   -3.9116   -3.7947   -3.5734
  cutpoints[3]   -0.4714   -0.3804   -0.3352   -0.2884   -0.2014
  cutpoints[4]    2.1177    2.2447    2.3128    2.3844    2.5267
  cutpoints[5]    4.5720    4.8970    5.0697    5.2531    5.6201
          β[1]    0.3941    0.4729    0.5159    0.5607    0.6366
          β[2]    0.9579    1.0376    1.0861    1.1311    1.2195

summarize(chn2)
    parameters      mean       std      mcse    ess_bulk    ess_tail      rhat
        Symbol   Float64   Float64   Float64     Float64     Float64   Float64 ⋯
  cutpoints[1]   -5.7939    0.4019    0.0125   1065.7917    990.5757    1.0004 ⋯
  cutpoints[2]   -3.9101    0.1730    0.0039   1989.0495   1369.6629    1.0005 ⋯
  cutpoints[3]   -0.3341    0.0675    0.0015   2116.7119   1585.7946    0.9998 ⋯
  cutpoints[4]    2.3152    0.1038    0.0024   1862.9767   1748.1974    1.0016 ⋯
  cutpoints[5]    5.0784    0.2710    0.0065   1730.5648   1316.4367    1.0018 ⋯
          β[1]    0.5162    0.0625    0.0014   1919.9784   1282.6219    1.0004 ⋯
          β[2]    1.0862    0.0673    0.0015   1995.3847   1626.5862    1.0001 ⋯
                                                                1 column omitted

Ok, the rhats look a lot better here, so let’s move forward with this version. The summaries produced by summarize() will be on the log-(cumulative-)odds scale, and we might want the probabilities instead:

function sigmoid(x::Float64)
    return exp(x) / (1 + exp(x))
end

@chain quantile(chn2) begin
    DataFrame
    select(_, :parameters, names(_, r"%") .=> ByRow(sigmoid); renamecols=false)
end
7×6 DataFrame
Row parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
1 cutpoints[1] 0.0013266 0.00235307 0.00312563 0.00403995 0.00616241
2 cutpoints[2] 0.0139661 0.0176348 0.0196155 0.021994 0.0272957
3 cutpoints[3] 0.384282 0.406028 0.416978 0.428402 0.449813
4 cutpoints[4] 0.892614 0.904196 0.909934 0.915629 0.92599
5 cutpoints[5] 0.989768 0.992586 0.993755 0.994796 0.996389
6 β[1] 0.597268 0.616079 0.626189 0.636614 0.653976
7 β[2] 0.722696 0.738381 0.74765 0.75604 0.771968

So, if I understand the model correctly, it’s really difficult to interpret our betas without doing some additional simulation. But I think we can interpret the cutpoints as the probability that a wine will have a quality of k or less given a value of 0 for our predictor variables – which, recall, is the average, since we z-scored. So there’s a ~91% probability that a wine with average sulphates and alcohol content will score a 4 or lower on our rescaled 1-6 scale (i.e., a 6 or lower on the original 3-8 quality scale).
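
As a quick check on that reading, we can push the posterior-mean cutpoints through the logistic function ourselves. This assumes the parameterization \(P(Y \le k) = \text{logistic}(\kappa_k - \eta)\), with \(\eta = 0\) for a wine at the predictor means – which is my understanding of how OrderedLogistic() is set up (cuts_mean and baseline_cum_pr are just names I’m making up here):

cuts_mean = [mean(chn2, k) for k in MCMCChains.namesingroup(chn2, :cutpoints)]

# baseline cumulative probabilities P(quality <= k) when η = 0
baseline_cum_pr = sigmoid.(cuts_mean)
# the 4th entry should land near the ~0.91 we read off the quantile table above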

7.6 Making Predictions

As far as I can tell, Turing doesn’t yet have functionality to generate predictions for an ordinal regression (but maybe I’m wrong?), but we can do some predictions by hand. To do so, we’ll set up a little test set of 2 rows – one where both of our predictors are equal to 0 and one where they’re both equal to 1. This will allow us to see the effect of increasing alcohol and sulphates (together) on wine quality (I’m using “effect” loosely here, since this definitely isn’t a causal model).

Basically what we want to do here is extract the mean values for our beta coefficients and our cutpoints, then estimate \(\eta\) by multiplying each row of our new data by the beta coefficients. \(\eta\) and the cutpoints then parameterize an OrderedLogistic() distribution, and we draw (in this case) 1,000 samples from this distribution.

betas = [mean(chn2, k) for k in MCMCChains.namesingroup(chn2, :β)]
cuts = [mean(chn2, k) for k in MCMCChains.namesingroup(chn2, :cutpoints)]

X_tst = [0.0 0.0; 1.0 1.0]

n_pred = 1_000

y0 = rand(OrderedLogistic(X_tst[1, :]' * betas, cuts), n_pred)

y1 = rand(OrderedLogistic(X_tst[end, :]' * betas, cuts), n_pred)

plot(histogram(y0, title="y0", label=""), histogram(y1, title="y1", label=""), ylims=[0, 600])

When we plot the samples, we can see that increased alcohol and sulphate values (together) tend to be associated with higher wine quality.
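
As a cross-check that doesn’t rely on sampling noise, I believe we can also ask the two OrderedLogistic() distributions for their category probabilities directly via pdf(); the snippet below is a sketch of that idea, reusing betas, cuts, and X_tst from the chunk above:

d0 = OrderedLogistic(X_tst[1, :]' * betas, cuts)
d1 = OrderedLogistic(X_tst[end, :]' * betas, cuts)

# exact predicted probabilities for each (rescaled) quality level 1-6
probs0 = [pdf(d0, k) for k in 1:6]
probs1 = [pdf(d1, k) for k in 1:6]

Comparing probs0 and probs1 should tell the same story as the histograms: the probability mass shifts toward higher quality scores when both predictors are 1 standard deviation above the mean.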