Basically, Bayesian statistics involves taking some prior beliefs and combining them with data to produce an updated (posterior) set of beliefs. In statistics-world, these beliefs are encoded as distributions (or combinations of distributions).
1.1 Probability
Bayesian and frequentist frameworks tend to differ on what “probability” means. In a Bayesian framework, probability refers to the plausibility of an event. And this tends to be how most people use the term “probability” in informal settings. If we say that the Chiefs have a 90% probability of winning a game vs the Broncos, we’re probably using “probability” in the Bayesian sense.
Frequentists, on the other hand, use probability to mean the relative frequency of a given event if it were repeated a lot of times. So in the above example, if this exact Chiefs team played this exact Broncos team in the same conditions 1,000 times, we’d expect them to win 900 of those games.
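To make the frequentist reading concrete, here's a quick simulation sketch (the 90% win probability and 1,000 repetitions come from the example above; the seed is arbitrary):

using Random, Statistics

Random.seed!(2024)            # arbitrary seed, just for reproducibility
wins = rand(1_000) .< 0.9     # simulate 1,000 games, each won with probability 0.9
mean(wins)                    # long-run relative frequency of wins, roughly 0.9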
1.2 Testing Hypotheses
In a Bayesian framework, we ask this question: In light of the observed data, what’s the chance that the hypothesis is correct?
In a frequentist framework, we ask this question: If in fact the hypothesis is incorrect, what’s the chance I’d have observed data this extreme, or even more extreme?
Unconditional probability: \(P(Y)\) – the probability of Y (e.g. the probability that an email is spam)
Conditional probability: \(P(Y|X)\) – the probability of Y given X (e.g. the probability that an email is spam given that there’s an exclamation mark in the subject line).
In some cases, \(P(Y|X) > P(Y)\), for example \(P(orchestra | practice) > P(orchestra)\), but in other cases, \(P(Y|X) < P(Y)\), for example \(P(flu | wash hands) < P(flu)\)
Ordering is also important. Typically \(P(Y|X) \neq P(X|Y)\).
Independence: two events are independent if \(P(Y|X) = P(Y)\).
Joint probability: \(P(Y \cap X)\) – the probability of Y and X. Assuming X is a binary variable, the total probability of observing Y is: \(P(Y) = P(Y \cap X) + P(Y \cap X^c)\), where \(X^c\) refers to “not X”
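To make these rules concrete, here's a small sketch in Julia; the email counts below are made up purely for illustration:

# hypothetical counts for 1,000 emails: Y = spam, X = "!" in the subject line
n_spam_excl  = 150    # spam, with "!"
n_spam_plain = 50     # spam, no "!"
n_ham_excl   = 50     # not spam, with "!"
n_ham_plain  = 750    # not spam, no "!"
n = n_spam_excl + n_spam_plain + n_ham_excl + n_ham_plain

p_spam            = (n_spam_excl + n_spam_plain) / n           # P(Y) = 0.20
p_spam_and_excl   = n_spam_excl / n                            # P(Y ∩ X) = 0.15
p_spam_given_excl = n_spam_excl / (n_spam_excl + n_ham_excl)   # P(Y|X) = 0.75 > P(Y)

# total probability: P(Y) = P(Y ∩ X) + P(Y ∩ Xᶜ)
p_spam ≈ p_spam_and_excl + n_spam_plain / n                    # true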
1.3.1 Probability vs Likelihood
When B is known, the conditional probability function \(P(\cdot|B)\) allows us to compare the probabilities of an unknown event, A or \(A^c\), occurring with B:
\(P(A|B)\) vs \(P(A^c|B)\)
When A is known, the likelihood function \(L(\cdot|A) = P(A|\cdot)\) allows us to evaluate the relative compatibility of data A with events B or \(B^c\):
\(L(B|A)\) vs \(L(B^c|A)\).
For example, when \(Y = y\) is known, we can use a likelihood function \(L(\cdot |y) = f(y|\cdot)\) to compare the relative likelihood of observing data y under possible values of \(\pi\) (in a binomial distribution), e.g. \(L(\pi_1 | y)\) vs \(L(\pi_2 | y)\).
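A quick sketch with Distributions.jl shows this comparison (the data y = 3 out of n = 10 trials and the two candidate values of \(\pi\) are made up for illustration):

using Distributions

n, y = 10, 3                             # suppose we observe y = 3 successes in n = 10 trials
likelihood(p) = pdf(Binomial(n, p), y)   # L(π | y) = f(y | π), evaluated at π = p

likelihood(0.3)   # ≈ 0.27, so y = 3 is fairly compatible with π = 0.3
likelihood(0.8)   # ≈ 0.0008, far less compatible with π = 0.8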
1.3.2 Calculating Joint Probability
The conditional probability of A given B is \(P(A|B) = \frac{P(A \cap B)}{P(B)}\). Rearranging gives the joint probability: \(P(A \cap B) = P(A|B)\,P(B)\).
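For a quick made-up example: if \(P(B) = 0.2\) and \(P(A|B) = 0.75\), then \(P(A \cap B) = P(A|B)\,P(B) = 0.75 \times 0.2 = 0.15\).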
1.4 Bayes’ Rule
For events A and B, the posterior probability of B given A is:

\(P(B|A) = \frac{P(B)\,L(B|A)}{P(A)} = \frac{P(B)\,P(A|B)}{P(A)}\)
using RDatasets
using DataFrames
using Statistics
using Chain

default = dataset("ISLR", "Default")

# we'll just use Default and Student for this
d = default[:, [:Default, :Student]]
d.Default = d.Default .== "Yes"
d.Student = d.Student .== "Yes"
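Continuing from the block above (a sketch; the intermediate variable names are mine), the pieces of Bayes' rule can be computed directly from d:

# prior P(default), plus the likelihoods P(student | default) and P(student | not default)
p_default = mean(d.Default)
p_student_given_default   = mean(d.Student[d.Default])
p_student_given_nodefault = mean(d.Student[.!d.Default])

# normalizing constant via total probability: P(student)
p_student = p_student_given_default * p_default +
            p_student_given_nodefault * (1 - p_default)

# posterior via Bayes' rule: P(default | student)
p_default_given_student = p_student_given_default * p_default / p_student

# sanity check: matches the direct empirical estimate
mean(d.Default[d.Student])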