Basically, Bayesian statistics involves taking some prior beliefs and combining them with data to produce an updated (posterior) set of beliefs. In statistics-world, these beliefs are encoded as distributions (or combinations of distributions).
1.1 Probability
Bayesian and frequentist frameworks tend to differ on what “probability” means. In a Bayesian framework, probability refers to the plausibility of an event. And this tends to be how most people use the term “probability” in informal settings. If we say that the Chiefs have a 90% probability of winning a game vs the Broncos, we’re probably using “probability” in the Bayesian sense.
Frequentists, on the other hand, use probability to mean the relative frequency of a given event if it were repeated a lot of times. So in the above example, if this exact Chiefs team played this exact Broncos team in the same conditions 1,000 times, we’d expect them to win 900 of those games.
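To make the frequentist reading concrete, here's a quick simulation sketch (the 90% win probability and 1,000 repetitions come from the example above; the seed is arbitrary):

using Random, Statistics

Random.seed!(2024)            # arbitrary seed, just for reproducibility
wins = rand(1_000) .< 0.9     # simulate 1,000 games, each won with probability 0.9
mean(wins)                    # long-run relative frequency of wins, roughly 0.9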
1.2 Testing Hypotheses
In a Bayesian framework, we ask this question: In light of the observed data, what’s the chance that the hypothesis is correct?
In a frequentist framework, we ask this question: If in fact the hypothesis is incorrect, what’s the chance I’d have observed data this extreme, or even more extreme?
Unconditional probability: \(P(Y)\) – the probability of Y (e.g. the probability that an email is spam)
Conditional probability: \(P(Y|X)\) – the probability of Y given X (e.g. the probability that an email is spam given that there’s an exclamation mark in the subject line).
In some cases, \(P(Y|X) > P(Y)\), for example \(P(orchestra | practice) > P(orchestra)\), but in other cases, \(P(Y|X) < P(Y)\), for example \(P(flu | wash hands) < P(flu)\)
Ordering is also important. Typically \(P(Y|X) \neq P(X|Y)\).
Independence: two events are independent if \(P(Y|X) = P(Y)\).
Joint probability: \(P(Y \cap X)\) – the probability of Y and X. Assuming X is a binary variable, the total probability of observing Y is: \(P(Y) = P(Y \cap X) + P(Y \cap X^c)\), where \(X^c\) refers to “not X”
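To make these rules concrete, here's a small sketch in Julia; the email counts below are made up purely for illustration:

# hypothetical counts for 1,000 emails: Y = spam, X = "!" in the subject line
n_spam_excl  = 150    # spam, with "!"
n_spam_plain = 50     # spam, no "!"
n_ham_excl   = 50     # not spam, with "!"
n_ham_plain  = 750    # not spam, no "!"
n = n_spam_excl + n_spam_plain + n_ham_excl + n_ham_plain

p_spam            = (n_spam_excl + n_spam_plain) / n           # P(Y) = 0.20
p_spam_and_excl   = n_spam_excl / n                            # P(Y ∩ X) = 0.15
p_spam_given_excl = n_spam_excl / (n_spam_excl + n_ham_excl)   # P(Y|X) = 0.75 > P(Y)

# total probability: P(Y) = P(Y ∩ X) + P(Y ∩ Xᶜ)
p_spam ≈ p_spam_and_excl + n_spam_plain / n                    # true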
1.3.1 Probability vs Likelihood
When B is known, the conditional probability function \(P(\cdot|B)\) allows us to compare the probabilities of an unknown event, A or \(A^c\), occurring with B:
\(P(A|B)\) vs \(P(A^c|B)\)
When A is known, the likelihood function \(L(\cdot|A) = P(A|\cdot)\) allows us to evaluate the relative compatibility of data A with events B or \(B^c\):
\(L(B|A)\) vs \(L(B^c|A)\).
For example, when \(Y = y\) is known, we can use a likelihood function \(L(\cdot |y) = f(y|\cdot)\) to compare the relative likelihood of observing data y under possible values of \(\pi\) (in a binomial distribution), e.g. \(L(\pi_1 | y)\) vs \(L(\pi_2 | y)\).
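A quick sketch with Distributions.jl shows this comparison (the data y = 3 out of n = 10 trials and the two candidate values of \(\pi\) are made up for illustration):

using Distributions

n, y = 10, 3                             # suppose we observe y = 3 successes in n = 10 trials
likelihood(p) = pdf(Binomial(n, p), y)   # L(π | y) = f(y | π), evaluated at π = p

likelihood(0.3)   # ≈ 0.27, so y = 3 is fairly compatible with π = 0.3
likelihood(0.8)   # ≈ 0.0008, far less compatible with π = 0.8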
1.3.2 Calculating Joint Probability
The conditional probability of A given B is \(P(A|B) = \frac{P(A \cap B)}{P(B)}\). Rearranging gives the joint probability: \(P(A \cap B) = P(A|B)\,P(B)\).
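For a quick made-up example: if \(P(B) = 0.2\) and \(P(A|B) = 0.75\), then \(P(A \cap B) = P(A|B)\,P(B) = 0.75 \times 0.2 = 0.15\).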
1.4 Bayes’ Rule
For events A and B, the posterior probability of B given A is:

\(P(B|A) = \frac{P(B)\,L(B|A)}{P(A)} = \frac{P(B)\,P(A|B)}{P(A)}\)
using RDatasets
using DataFrames
using Statistics
using Chain

default = dataset("ISLR", "Default")

# we'll just use Default and Student for this
d = default[:, [:Default, :Student]]
d.Default = d.Default .== "Yes"
d.Student = d.Student .== "Yes"
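Continuing from the block above (a sketch; the intermediate variable names are mine), the pieces of Bayes' rule can be computed directly from d:

# prior P(default), plus the likelihoods P(student | default) and P(student | not default)
p_default = mean(d.Default)
p_student_given_default   = mean(d.Student[d.Default])
p_student_given_nodefault = mean(d.Student[.!d.Default])

# normalizing constant via total probability: P(student)
p_student = p_student_given_default * p_default +
            p_student_given_nodefault * (1 - p_default)

# posterior via Bayes' rule: P(default | student)
p_default_given_student = p_student_given_default * p_default / p_student

# sanity check: matches the direct empirical estimate
mean(d.Default[d.Student])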