# Difference between revisions of "SMHS BayesianInference"

## Scientific Methods for Health Sciences - Bayesian Inference

### Overview

Bayes’ theorem goes beyond connecting conditional and compound probability theory. It provides a method of inference where Bayes’ rule is used to update the probability estimate for a hypothesis as additional evidence is acquired. It is widely used in varieties of studies including medicine, law, science and engineering. In this section, we present a general introduction to the area of Bayesian inference and focus on its applications.

### Motivation

We are already familiar with the concepts in frequentist statistics, which focuses on drawing conclusions and making statistical inference from sample data by the emphasis on the frequency or proportion of the data. However, what if we want to make inference in a different way where we make inference after taking into consideration of the observed data. That is we are going to introduce the concept of posterior distribution and inference we can make based on this. So, how does Bayesian inference work and how is it different from the traditional frequentist statistics?

### Theory

Review our earlier discussion of Probability Theory.

#### Conditional Probability

The conditional probability of event $A$ occurring given that event $B$ occurs is $P(A│B)=(P(A\cap B))/(P(B))$. If $A$ and $B$ are independent then knowing $B$, or $B^c$, gives no information on the probability of $A$, i.e., $P(A│B)=P(A)$.

#### Multiplication rule

For any two events, $A$ and $B$, $P(A\cap B)=P(A│B)P(B)$. In general, for $n$ events $A_1, ..., A_n$: $P(A_1 \cap A_2 \cap A_3 \cap … \cap A_n ) =$ $P(A_1 )P(A_1│A_2 )P(A_3│A_1\cap A_2 ) … P(A_n│A_1\cap A_1\cap A_2\cap A_3\cap … \cap A_{n-1})$.

#### Law of total probability

If ${A_1,…,A_n}$ forms a partition the sample space $S$, then for every event $B$, the probability of $B$ to occur can be split into: $$P(B)=P(B│A_1 )P(A_1 )+P(B│A_2 )P(A_2 )+⋯ +P(B│A_n )P(A_n).$$

#### Inverting the order of conditioning

As event intersection is commutative, $P(A \cap B) = P(A | B) \times P(B) = P(B | A) \times P(A)$.

#### Bayesian Rule

Combining the properties above, for any partition the sample space $S$, ${A_1,…,A_n}$, and any two events $A$ and $B$ (i.e., subsets of $S$) we have: $$P(A | B) = {P(B | A) P(A) \over P(B)} = {P(B | A) P(A) \over P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_n)P(A_n)}.$$

#### Interpretation

To interpret of some of the probability concepts above we can think of $P(A)$ as a prior distribution of event $A$, which is the initial degree of belief in $A$ (before seeing any data/evidence). $P(A|B)$ is referred to as the posterior distribution of event $A$ given (the evidence) $B$ representing the degree change of our initial belief having observed $B$.

The Bayesian’s rule tells us that the posterior odds equal prior odds times Bayes factor. When arbitrarily many events $A$ are of interest, not just two, the rule can be rephrased as posterior proportional to prior times likelihood, which can be expressed as $P(A|B) ∝ P(A)P(B|A)$, where the proportionality symbol (∝) means that the left hand side is proportional to the right hand side as $A$ varies given $B$.

Given events $A_1,A_2$ and $B$, with Bayes’ rule we have that the conditional odds of $A_1:A_2$ given $B$ are equal to the marginal odds of $A_1$ to $A_2$ (i.e., $A_1:A_2$) multiplied by the Bayes factor or likelihood ratio $Λ$. That is, $O(A_1:A_2│B)=Λ(A_1: A_2│B) \times O(A_1:A_2)$, where $Λ(A_1: A_2│B)=\frac{P(B│A_1)}{P(B│A_2)}$. The odds $O(A_1: A_2 )=\frac{P(A_1 )}{P(A_2)}$, and conditional odds $O(A_1: A_2│B)=\frac{P(A_1│B)}{P(A_2│B)}$ are known as prior odds and posterior odds respectively.

For the special case, where $A_1=A^c≡A'$, this translated into using two competing hypotheses concerning the cause of $B$. If we have $O(A)=O(A:A^c)$, the odds on A is by definition the odds for and against $A$. Bayes’ rule can then be written as $O(A│B)=O(A)Λ(A│B)$, that is the posterior odds on $A$ equals the prior odds on $A$ times the likelihood ratio for $A$ given $B$ (posterior odds equal prior odds times likelihood ratio).

#### Statistical inference

The goal of statistical inference is to learn about the general population by taking a subset from that population. The steps in classical statistical parametric inference are:

1. collect a random sample from the population;
2. determine a probability model for the data $y_1,...,y_n$ obtained (e.g. we are collecting data on the height of US women: each women’s height can be thought as a random variable with a normal distribution). This is called the sampling model;
3. the sampling model will depend on a set of parameters, $θ$: $p(y_i |θ)$, for $i=1,...,n$;
4. using the sampling model, construct the likelihood function $L(θ; y )$;
5. determine the maximum likelihood estimate (MLE) of θ by maximizing the likelihood function $L(θ;y)$.

In the classical or frequentist approach to statistical inference, the parameters of a sampling model are assumed to be constant but unknown.

The Bayesian inference paradigm allows us to incorporate subjective prior beliefs regarding the parameter(s), $θ$ of the sampling model in our learning procedure. In Bayesian statistics, before we observe the data, we have some belief regarding possible values, or range of values for the set of parameters $θ$. This knowledge is expressed in the form of a prior distribution. Since we are assigning a distribution to $θ$, we are now assuming that θ is a random variable (univariate if $θ$ is just a single parameter, multivariate if $θ$ consists of more than one parameter).

Since $θ$ is random , there is a sample space associated with $θ$. This parameter space is typically denoted by $Θ$. The parameter space $\{\theta \in \Theta \}$ is the set of all possible values for the parameter.

Once data $y_1,...,y_n$ have been collected, we specify a sampling model for the data $p(y_i│θ)$, $i=1,…,n$, which represents the probability of observing $y_i$ if we knew $θ$. In light of the data observed, using Bayes’ theorem we update our belief about $θ$ and drive the posterior distribution $p(θ│y_1,...,y_n)=p(θ|y)$, where $y=\{y_1,...,y_n\}$.

• Interpretation: Having a distribution for the unknown parameter(s) $θ$ makes it easier to understand what a point estimate and a confidence interval mean. For example: suppose the parameter of interest here is the population mean $θ$. We have collected data and we build a 95% confidence interval for $θ$. This is given by: $\hat{y} \pm 1.96 \frac{s}{\sqrt{n}}$. After the sample is collected and the interval is created, it either contains $θ$ or it does not. So, 95% is not the probability that $θ$ falls in the interval. For a frequentist, 95% is not a coverage probability, it is just a tag that informs how the interval performs over the long haul. By contrast, Bayesian confidence intervals, called credible sets, convey the posterior probability that the parameter falls in the interval.
• Likelihood Principle: Bayesian inference obeys the likelihood principle. The likelihood principle states that if two sampling models yield proportional likelihoods for $θ$, then inference about $θ$ should be identical under the two models. Frequentist inference does not always obey the likelihood principle! For example: suppose we observe the clinical outcomes in experimental treatment applied on 12 patients and at the end of the trial we observe 3 patient cured and 9 still sick. We want to test the hypothesis that the probability of successful cure is better than 50/50, i.e., $H_o$∶ $θ=1/2$ vs. $H_a$∶ $θ>1/2$. Suppose that we set up two sampling models for these data:
• Model 1: Y, number of successful treatments, as a Binomial(n=12,θ);
• Model 2: Y, number of successfully treated patients to be observed before terminating the clinical trial (in our example, getting 3 patients cured), which is a Negative Binomial(r=3,1-θ).
The two likelihoods of the 2 models are proportional but they lead to two different p-values.
• Bayesian theorem provides a rational method for learning. Suppose you collect data $y_1,...,y_n$ and compute your posterior distribution $p(θ|y_1,...,y_n)$. If we later collect an additional observation $y_{n+1}$, then, the posterior distribution $p(θ|y_1,...,y_n)$ could be used as a prior for the current Bayesian data analysis.
• Bayesian inference does not require large sample theory. Bayesian methods do not require asymptotics for valid inference. Small sample Bayesian inference proceeds in the same way as if one had a large sample.
• Finally, Bayesian inference often includes frequentist inference as a special case. One can often obtain frequentist answers by choosing a uniform prior for the parameters. In this cases, frequentist answers can be obtained from the posterior distribution, where the MLE is then the posterior mode.
• The choice of the prior distribution is a delicate issue and the main criticism to Bayesian inference. Two people analyzing the same data but using different priors could obtain different answers. When carry out a Bayesian analysis it is important to examine the prior sensitivity, that is, determine how sensitive are the results to the choice of the prior distribution.

### Bayes theorem for density functions

• Marginal density $p(y)=\int_{θ∈Θ}{p(y,θ)dθ}=\int_{θ∈Θ}{p(y|θ)p(θ)dθ}$, where $Θ$ is the parameter space. Substituting the expression for $p(y)$, Bayes’ theorem for density function can be written as $p(θ|y)=\frac{p(y|θ)p(θ)}{p(y)} = \frac{p(y|θ)p(θ)}{\int_{θ∈Θ}{p(y|θ)p(θ)dθ}}$. Given that $p(θ|y)$ is a function of $θ$, the denominator is a constant and thus, the posterior density of $θ$ is often expressed up to a proportionality constant as $p(θ|y)∝p(y|θ)p(θ)$, where the sign "∝" means “proportional to”.
The proportionality constant in the posterior density $p(θ|y)$ can be determined using the fact that the posterior density $p(θ|y)$ is a PDF. Suppose that we have observed $Y=y$, and we want to evaluate and compare the odds of two possible numerical value for $θ$, let probability (density) of $θ_a$ relative to $θ_b$ is:

$$\frac{p(θ_a│y)}{p(θ_b│y)} = \frac{\frac{p(y|θ_a )p(θ_a)}{p(y)}}{\frac{p(y|θ_b )p(θ_b)}{p(y)}}= \frac{p(y|θ_a )p(θ_a )}{p(y|θ_b )p(θ_b )},$$

where $\frac{p(θ_a )}{p(θ_b )}$ is the prior odds and $\frac{p(y|θ_a )}{p(y|θ_b )}$ is the likelihood ratio. In comparing the posterior probability of $θ_a$ and $θ_b$, it is not necessary to compute $p(y)$.

#### Prior Distributions

How to choose a prior distribution $p(θ)$? There are various type of prior distributions:

• Non-informative priors: a prior distributions $p(θ)$ is non-informative if it has minimal impact on the posterior distribution $p(θ|y)$ of $θ$. In general, a non-informative prior is one which is dominated by the likelihood, that is, it is a prior that does not change very much over the region in which the likelihood is appreciable and does not assume large values outside that range. A prior that has these properties is said to be locally uniform. Non-informative priors are also called reference priors, vague priors or flat priors.
• Examples:
• If $θ ∈ Θ = [0,1]$, then $p(θ) = Uniform(0,1)$ is a non-informative prior and $p(θ) = 1$ for $θ ∈ Θ = [0,1]$.
• If $θ ∈ Θ = R$, then if $p(θ) = Normal(μ_0,σ_0^2)$ with $σ_0^2 \longrightarrow \infty$, we get a non-informative prior. That is, we can pick $σ_0^2$ large enough so that we obtain a non-informative prior.
• Improper priors: a prior distribution $p(θ)$ on $θ ∈ Θ$ is said to be improper if $\int_{θ ∈ Θ}{p(θ)dθ}=∞$. Improper priors are often used in Bayesian inference because they yield non-informative priors. For example: if $θ ∈ Θ=R$ and $p(θ)∝1$, then $p(θ)$ is an improper prior since clearly for $\int_{θ ∈ Θ}{p(θ)dθ}=\int_{-\infty}^{+\infty}{dθ}$, which is undefined. An improper prior $p(θ)$ can result in an improper posterior distribution $p(θ|y)$. One cannot make inference with improper posterior distributions! An improper prior p(θ) may still lead to a proper posterior distribution. If $p(θ)$ is an improper prior, then the posterior distribution $p(θ|y)$ has always a particular form. In fact, if $p(θ)$ is improper than $p(θ) = c$ for some $c > 0$ and $p(θ│y )=\frac{p(y |θ)•p(θ)}{p(y)} =\frac{p(y |θ)•p(θ)}{\int_{θ∈Θ}{p(y│θ)cdθ}}=\frac{p(y |θ)•p(θ)}{\int_{θ∈Θ}{p(y│θ)cdθ}} = \frac{1}{K} p(y |θ)$, where $K=\int_{θ∈Θ}{p(y│θ)dθ}$. That is, the posterior distribution, $p(θ|y)$, is just the renormalized likelihood.
• One-parameter models: Binomial model as a start in performing Bayesian inference for binary variables. Consider data from a randomized controlled clinical trial evaluating a drug to reperfuse blocked blood vessels in patients with ischemic stroke. We indicate with $Y_i$ whether the $i^{th}$ patient treated with the drug had a favorable outcome (no disability at 90 days). Then the binary outcome variable is defined by:

$$Y_i = \begin{cases} 1, & \text{if patient i has a favorable outcome} \\ 0, & \text{otherwise} \end{cases} .$$

### Demonstration

Suppose that we have data from $n = 129$ patients. Then, since the size $N$ of the general population is much greater than the size $n$ of the sample, $Y_1,...,Y_n$ can be considered to be exchangeable random variables. Using the De Finetti’s theorem, we can set up the following model for our data: a prior distribution $p(θ)$ for a parameter $θ$, representing the proportion of 1’s in the population ($\sum_{i=1}^N{\frac{Y_i}{N}}$, given $θ$, the $Y_1,...,Y_n$ are i.i.d.’s with $p(y│θ)=θ^y (1-θ)^(1-y)$, the sampling model adopted leads to $p(y_1,...,y_n│θ)=p(y_1,...,y_129│θ)$=$\cup_{i=1}^{129}{θ^{y_i}(1-θ)^{1-y_i}}=$ $θ^{\sum_{i=1}^{129}{y_i}}(1-θ)^{129-\sum_{i=1}^{129}{y_i}}$, now we need to choose a prior distribution for $θ$.

The parameter $θ$ is a random variable with parameter space $Θ=[0,1]$. A first choice would be to use a locally uniform prior, that is a prior distribution $p(θ)$ that assigns the same probability to all intervals $(a,b) \subseteq [0,1]$ of the same length. This leads to $p(θ) \sim Uniform([0,1])$, i.e., $p(θ) = 1$ for $θ \in [0,1]$. Then, we obtain the following posterior distribution: $p(θ│y_1,...,y_{129})=\frac{p(y_1,...,y_{129}│θ)p(θ)}{p(y_1,...,y_{129})}=$ $p(y_1,...,y_{129}│θ)\times \frac{1}{p(y_1,...,y_{129})} ∝ p(y_1,...,y_{129}│θ)$ since $p(y_1,...,y_{129})$ does not depend on $θ$. Therefore, the posterior distribution $p(y_1,...,y_{129}│θ)$ of $θ$ has the same shape as $p(y_1,...,y_{129}│θ)$ but they are on a different scale.

If of the n=129 patients receiving the drug 118 had favorable outcomes and 11 did not, then we have $\sum_{i=1}^{129}{y_i}=118$, so $$p(y_1,...,y_{129}│θ)=θ^{\sum_{i=1}^{129}{y_i(1-θ)^{129-\sum_{i=1}^{129}{y_i}}}}=θ^{118} (1-θ)^{11}.$$

Definition has that a random variable $Y$ with sample space $S=[0,1]$ is said to have a Beta distribution (a,b) if its PDF is given by $p(y│a,b)=\frac{Γ(a+b)}{Γ(a)Γ(b)} y^{a-1} (1-y)^{b-1}$, with $a,b>0$. For such a random variable, we have the following properties: $E[Y]=\frac{a}{a+b}$, $Mode[Y]=\frac{a-1}{(a-1)+(b-1)}$, if $a>1$ and $b>1$, $Var[Y]=\frac{ab}{(a+b+1) (a+b)^2} =\frac{E[Y](1-E[Y])}{a+b+1}$.

To determine the scale for the posterior distribution $p(θ│y_1,...,y_n)$ we use the fact that this is a conditional density and since it is a probability density function, it needs to integrate to 1, that is: $1=\int_{θ\in Θ}{p(θ│y_1,...,y_{129})dθ}=\int_0^1{\frac{θ^{118}(1-θ)^{11}}{p(y_1,...,y_{129})} dθ}$.

The following result from calculus is useful: $\int_0^1{θ^{a-1} (1-θ)^{b-1} dθ}=\frac{Γ(a)Γ(b)}{Γ(a+b)}$, where for any $x>0$, the gamma function, $Γ(x)$, can be looked up in a table or computed using SOCR or R using command gamma(). Thus, we have $1=\int_0^1{\frac{θ^{118} (1-θ)^{11}}{p(y_1,...,y_{129})}dθ}=$ $\frac{1}{p(y_1,...,y_{129})} int_0^1{θ^{118} (1-θ)^{11} dθ}=\frac{1}{p(y_1,...,y_{129})} \frac{Γ(119)Γ(12)}{Γ(131)}$. Hence, we have that $p(θ│y_1,...,y_{129})=\frac{Γ(131)}{Γ(119)Γ(12)} θ^{118} (1-θ)^{11}$. That is, $p(θ│y_1,...,y_{129})$ has a Beta distribution with $a=119$ and $b=12$ parameters. Therefore $E[θ│y_1,...,y_{129}]=0.908$, $Mode[θ│y_1,...,y_{129}]=0.915$, and $Var[θ│y_1,...,y_{129}]=6.25E-4$.

• Conclusion: In Bayesian inference: posterior information ≥ Prior information ≥ 0, where the second "≥" is just "=" only if the prior is non-informative.

#### Binomial model

Suppose the data $Y_1,...,Y_n$ represent exchangeable binary random variables (i.e., a sequence such that future samples behave like earlier samples, meaning formally that any order (of a finite number of samples) is equally likely). For our choice of prior and sampling model, we have obtained $p(θ│y_1,...,y_n) \sim Beta(1+\sum_{i=1}^n{y_i}, 1+n-\sum_{i=1}^n{y_i})$. Now, let’s suppose we want to compare the posterior odds of two values $θ_a$ and $θ_b$ of $θ$ in light of an instance of the data $y_1,...,y_n$.

The odds ratio is: $$\frac{p(θ_a│y_1,...,y_n )}{p(θ_b│y_1,...,y_n)}=\frac{\frac{Γ(n+2)}{Γ(1+\sum_{i=1}^n{y_i}) Γ(1+n-\sum_{i=1}^n{y_i})} {θ_a}^{\sum_{i=1}^n{y_i}} (1-θ_a )^{n-\sum_{i=1}^n{y_i}}} {\frac{Γ(n+2)}{Γ(1+\sum_{i=1}^n{y_i}) Γ(1+n-\sum_{i=1}^n{y_i})} {θ_b}^{\sum_{i=1}^n{y_i}} (1-θ_b)^{n-\sum_{i=1}^n{y_i}}}.$$

$$\frac{p(θ_a│y_1,...,y_n )}{p(θ_b│y_1,...,y_n)}=\frac{{θ_a}^{\sum_{i=1}^n{y_i}} (1-θ_a )^{n-\sum_{i=1}^n{y_i}}} { {θ_b}^{\sum_{i=1}^n{y_i}} (1-θ_b)^{n-\sum_{i=1}^n{y_i}}}= \left( \frac{θ_a}{θ_b}\right)^{\sum_{i=1}^n{y_i}} \left( \frac{1-θ_a}{1-θ_b}\right)^{n-\sum_{i=1}^n{y_i}}.$$

Hence, the posterior odds depend on the data only through $\sum_{i=1}^n{y_i}$. In general, this is also true for the posterior probability, since $p(θ│y_1,...,y_n)=Beta(1+\sum_{i=1}^n{y_i} ,1+n-\sum_{i=1}^n{y_i}$. In other words, $\sum_{i=1}^n{y_i}$ is a sufficient statistic.