Difference between revisions of "SMHS BayesianInference"
m (→Demonstration) |
|||
| (9 intermediate revisions by the same user not shown) | |||
| Line 11: | Line 11: | ||
====Conditional Probability==== | ====Conditional Probability==== | ||
| − | The conditional probability of event | + | The conditional probability of event <math>A</math> occurring given that event <math>B</math> occurs is <math> P(A|B) = \frac{P(A \cap B)}{P(B)} </math>. If <math>A</math> and <math>B</math> are independent then knowing <math>B</math>, or <math>B^c</math>, gives no information on the probability of <math>A</math>, i.e., <math> P(A|B)=P(A) </math>. |
====Multiplication rule==== | ====Multiplication rule==== | ||
| − | For any two events, | + | For any two events, <math>A</math> and <math>B</math>, <math> P(A\cap B)=P(A|B)P(B) </math>. In general, for <math>n</math> events <math>A_1, ..., A_n</math>: <math> P(A_1 \cap A_2 \cap A_3 \cap \cdots \cap A_n ) = P(A_1) P(A_2|A_1) P(A_3|A_1\cap A_2) \cdots P(A_n|A_1\cap A_2\cap A_3\cap \cdots \cap A_{n-1})</math>. |
====Law of total probability==== | ====Law of total probability==== | ||
| − | If | + | If <math> \{A_1,\ldots,A_n\} </math> forms a partition of the sample space <math>S</math>, then for every event <math>B</math>, the probability of <math>B</math> to occur can be split into: |
| − | + | <math>P(B)=P(B|A_1)P(A_1)+P(B|A_2)P(A_2)+\cdots+P(B|A_n)P(A_n).</math> | |
====Inverting the order of conditioning==== | ====Inverting the order of conditioning==== | ||
| − | As event intersection is commutative, | + | As event intersection is commutative, <math> P(A \cap B) = P(A | B) \times P(B) = P(B | A) \times P(A) </math>. |
====Bayesian Rule==== | ====Bayesian Rule==== | ||
| − | Combining the properties above, for any partition the sample space | + | Combining the properties above, for any partition the sample space <math>S</math>, <math> \{A_1,\ldots,A_n\}</math>, and any two events <math>A</math> and <math>B</math> (i.e., subsets of <math>S</math>) we have: |
| − | + | <math>P(A | B) = \frac{P(B | A) P(A)}{P(B)} = \frac{P(B | A) P(A)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_n)P(A_n)}.</math> | |
====Interpretation==== | ====Interpretation==== | ||
| − | To interpret of some of the probability concepts above we can think of | + | To interpret of some of the probability concepts above we can think of <math>P(A)</math> as a ''prior distribution'' of event <math>A</math>, which is the initial degree of belief in <math>A</math> (before seeing any data/evidence). <math>P(A|B)</math> is referred to as the ''posterior distribution'' of event <math>A</math> given (the evidence) <math>B</math> representing the degree change of our initial belief having observed <math>B</math>. |
| − | The Bayesian’s rule tells us that the posterior odds equal prior odds times Bayes factor. When arbitrarily many events | + | The Bayesian’s rule tells us that the posterior odds equal prior odds times Bayes factor. When arbitrarily many events <math>A</math> are of interest, not just two, the rule can be rephrased as posterior proportional to prior times likelihood, which can be expressed as <math>P(A|B) \propto P(A)P(B|A)</math>, where the proportionality symbol (<math>\propto</math>) means that the left hand side is proportional to the right hand side as <math>A</math> varies given <math>B</math>. |
| − | Given events | + | Given events <math>A_1,A_2</math> and <math>B</math>, with Bayes’ rule we have that the conditional odds of <math>A_1:A_2</math> given <math>B</math> are equal to the marginal odds of <math>A_1</math> to <math>A_2</math> (i.e., <math>A_1:A_2</math>) multiplied by the Bayes factor or likelihood ratio <math>\Lambda</math>. That is, <math>O(A_1:A_2|B)=\Lambda(A_1: A_2|B) \times O(A_1:A_2)</math>, where <math>\Lambda(A_1: A_2|B)=\frac{P(B|A_1)}{P(B|A_2)}</math>. The odds <math>O(A_1: A_2 )=\frac{P(A_1 )}{P(A_2)}</math>, and conditional odds <math>O(A_1: A_2|B)=\frac{P(A_1|B)}{P(A_2|B)}</math> are known as prior odds and posterior odds respectively. |
| − | For the special case, where | + | For the special case, where <math>A_1=A^c\equiv A'</math>, this translates into using two competing hypotheses concerning the cause of <math>B</math>. If we have <math>O(A)=O(A:A^c)</math>, the odds on A is by definition the odds for and against <math>A</math>. Bayes’ rule can then be written as <math>O(A|B)=O(A)\Lambda(A|B)</math>, that is the posterior odds on <math>A</math> equals the prior odds on <math>A</math> times the likelihood ratio for <math>A</math> given <math>B</math> (posterior odds equal prior odds times likelihood ratio). |
====Statistical inference==== | ====Statistical inference==== | ||
| − | The goal of statistical inference is to learn about the general population by taking a subset from that population. The steps in classical statistical parametric inference are: | + | The goal of statistical inference is to learn about the general population by taking a subset from that population. The steps in classical statistical parametric inference are: |
| − | # collect a random sample from the population; | + | # collect a random sample from the population; |
| − | # determine a probability model for the data | + | # determine a probability model for the data <math>y_1,...,y_n</math> obtained (e.g., we are collecting data on the height of US women: each woman’s height can be thought of as a random variable with a normal distribution). This is called the sampling model; |
| − | # the sampling model will depend on a set of parameters, | + | # the sampling model will depend on a set of parameters, <math>\theta</math>: <math>p(y_i | \theta)</math>, for <math>i=1,...,n</math>; |
| − | # using the sampling model, construct the likelihood function | + | # using the sampling model, construct the likelihood function <math>L(\theta; y)</math>; |
| − | # determine the [[SMHS_CIs#Maximum_likelihood_estimation_.28MLE.29|maximum likelihood estimate (MLE)]] of | + | # determine the [[SMHS_CIs#Maximum_likelihood_estimation_.28MLE.29|maximum likelihood estimate (MLE)]] of <math>\theta</math> by maximizing the likelihood function <math>L(\theta; y)</math>. |
| − | In the classical or frequentist approach to statistical inference, the parameters of a sampling model are assumed to be constant but unknown. | + | In the classical or frequentist approach to statistical inference, the parameters of a sampling model are assumed to be constant but unknown. |
| − | The Bayesian inference paradigm allows us to incorporate subjective prior beliefs regarding the parameter(s), | + | The Bayesian inference paradigm allows us to incorporate subjective prior beliefs regarding the parameter(s), <math>\theta</math> of the sampling model in our learning procedure. In Bayesian statistics, before we observe the data, we have some belief regarding possible values, or range of values for the set of parameters <math>\theta</math>. This knowledge is expressed in the form of a prior distribution. Since we are assigning a distribution to <math>\theta</math>, we are now assuming that <math>\theta</math> is a random variable (univariate if <math>\theta</math> is just a single parameter, multivariate if <math>\theta</math> consists of more than one parameter). |
| − | Since | + | Since <math>\theta</math> is random, there is a sample space associated with <math>\theta</math>. This parameter space is typically denoted by <math>\Theta</math>. The parameter space <math>\{\theta \in \Theta \}</math> is the set of all possible values for the parameter. |
| − | Once data | + | Once data <math>y_1,...,y_n</math> have been collected, we specify a sampling model for the data <math>p(y_i|\theta)</math>, <math>i=1,\ldots,n</math>, which represents the probability of observing <math>y_i</math> if we knew <math>\theta</math>. In light of the data observed, using Bayes’ theorem we update our belief about <math>\theta</math> and derive the posterior distribution <math>p(\theta|y_1,...,y_n)=p(\theta|y)</math>, where <math>y=\{y_1,...,y_n\}</math>. |
| − | ====Advantages of Bayesian | + | ==== Advantages of Bayesian Methods ==== |
| − | |||
| − | + | Bayesian methods offer several key advantages over frequentist approaches in statistical inference. Below, we outline these benefits, highlighting how they address limitations in frequentist methods. | |
| − | |||
| − | |||
| − | : | + | * '''Interpretation''': Bayesian inference provides a posterior distribution for the unknown parameter(s) <math>\theta</math>, which allows for direct probabilistic statements about <math>\theta</math>. For instance, suppose the parameter of interest is the population mean <math>\theta</math>. After collecting data, a Bayesian 95% credible interval directly indicates that there is a 95% posterior probability that <math>\theta</math> lies within the interval. In contrast, a frequentist 95% confidence interval, such as <math>\hat{y} \pm 1.96 \frac{s}{\sqrt{n}}</math>, does not represent the probability that <math>\theta</math> is in the interval after the data are observed—it either contains <math>\theta</math> or it does not. Instead, the 95% refers to the long-run coverage probability across repeated hypothetical samples. |
| − | * Bayesian | + | * '''Likelihood Principle''': Bayesian inference adheres to the likelihood principle, which asserts that if two different sampling models produce proportional likelihood functions for <math>\theta</math> given the observed data, then inferences about <math>\theta</math> should be identical under both models. Frequentist methods, however, can violate this principle, leading to different conclusions from the same data depending on the sampling design. For example, consider a clinical trial where we observe 9 patients cured and 3 still sick. We wish to test the null hypothesis <math>H_0: \theta = 1/2</math> against the alternative <math>H_a: \theta > 1/2</math>, where <math>\theta</math> is the probability of a successful cure (i.e., testing if the cure rate is better than 50/50). |
| − | |||
| − | |||
| − | |||
| − | |||
| − | * The | + | ** Model 1: Fixed number of trials (Binomial). |
| + | Let <math>Y_1</math> be the number of successful treatments out of 12 patients, modeled as [[SMHS_ProbabilityDistributions#Binomial_distribution|Binomial(n=12, \theta)]]. The observed data is <math>Y_1 = 9</math>. | ||
| + | |||
| + | ** Model 2: Stop when a fixed number of failures occur (Negative Binomial). | ||
| + | Let <math>Y_2</math> be the number of successful treatments observed before reaching 3 failures (sick patients), modeled as a [[SMHS_ProbabilityDistributions#Negative_binomial_distribution|Negative Binomial(r=3, \theta)]] (number of successes before r failures, with success probability <math>\theta</math>). The observed data is <math>Y_2 = 9</math>. | ||
| + | |||
| + | The likelihood functions under both models are proportional (both involve <math>\theta^9 (1-\theta)^3</math> up to a constant), yet the frequentist p-values differ: approximately 0.073 for the binomial model and 0.033 for the negative binomial model. At a 5% significance level, the negative binomial model would reject <math>H_0</math>, while the binomial model would not, despite the data being the same. | ||
| + | |||
| + | * '''Sequential Learning and Updating''': Bayes' theorem provides a coherent framework for updating beliefs with new data. If you have a posterior distribution <math>p(\theta \mid y_1, \dots, y_n)</math> from initial data <math>y_1, \dots, y_n</math>, it can serve as the prior for incorporating a new observation <math>y_{n+1}</math>, yielding an updated posterior <math>p(\theta \mid y_1, \cdots, y_{n+1})</math>. This enables rational, incremental learning without starting from scratch each time. | ||
| + | |||
| + | * '''No Reliance on Large Samples''': Unlike many frequentist methods that depend on asymptotic approximations (e.g., normal approximations for valid inference), Bayesian inference is exact for any sample size. The posterior distribution is derived directly from the likelihood and prior, making it suitable for small-sample problems where frequentist methods may fail. | ||
| + | |||
| + | * '''Incorporation of Prior Information''': Bayesian methods allow the integration of prior knowledge or expert opinion into the analysis through the prior distribution, which can be particularly useful in fields with existing domain expertise, such as medicine or engineering. | ||
| + | |||
| + | * '''Flexibility in Modeling''': Bayesian approaches handle complex, hierarchical models and missing data more naturally via techniques like Markov Chain Monte Carlo (MCMC) simulation, without needing ad-hoc adjustments common in frequentist statistics. | ||
| + | |||
| + | * '''Frequentist Methods as a Special Case''': Bayesian inference can reproduce frequentist results under certain conditions, such as using non-informative (e.g., uniform) priors. In these cases, the posterior mode often coincides with the maximum likelihood estimate (MLE), and credible intervals may align with confidence intervals. | ||
| + | |||
| + | ==== Criticisms and Considerations ==== | ||
| + | |||
| + | While Bayesian methods have many strengths, they are not without challenges: | ||
| + | |||
| + | * '''Prior Sensitivity''': The choice of prior distribution can influence results, especially with limited data. Different analysts using different priors on the same data might reach different conclusions. To mitigate this, it is essential to perform sensitivity analyses, examining how results change under alternative priors. Objective or non-informative priors (e.g., Jeffreys priors) can help reduce subjectivity in some cases. | ||
| + | |||
| + | To further enhance this section, consider adding cross-references to related wiki pages (e.g., on priors, MCMC, or specific distributions) and including illustrative figures or simulations for the examples. | ||
===Bayes theorem for density functions=== | ===Bayes theorem for density functions=== | ||
| − | * '''Marginal density''' | + | * '''Marginal density''' <math>p(y)=\int_{\theta \in \Theta}{p(y,\theta)d\theta}=\int_{\theta \in \Theta}{p(y|\theta)p(\theta)d\theta}</math>, where <math>\Theta</math> is the parameter space. Substituting the expression for <math>p(y)</math>, Bayes’ theorem for density function can be written as <math>p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} = \frac{p(y|\theta)p(\theta)}{\int_{\theta \in \Theta}{p(y|\theta)p(\theta)d\theta}}</math>. Given that <math>p(\theta|y)</math> is a function of <math>\theta</math>, the denominator is a constant and thus, the posterior density of <math>\theta</math> is often expressed up to a proportionality constant as <math>p(\theta|y) \propto p(y|\theta)p(\theta)</math>, where the sign <math>\propto</math> means “proportional to”. |
| + | |||
| + | : The proportionality constant in the posterior density <math>p(\theta|y)</math> can be determined using the fact that the posterior density <math>p(\theta|y)</math> is a [[SMHS_ProbabilityDistributions#Theory|PDF]]. Suppose that we have observed <math>Y=y</math>, and we want to evaluate and compare the odds of two possible numerical values for <math>\theta</math>, then the probability (density) of <math>\theta_a</math> relative to <math>\theta_b</math> is: | ||
| + | <math>\frac{p(\theta_a|y)}{p(\theta_b|y)} = \frac{\frac{p(y|\theta_a)p(\theta_a)}{p(y)}}{\frac{p(y|\theta_b)p(\theta_b)}{p(y)}}= \frac{p(y|\theta_a)p(\theta_a)}{p(y|\theta_b)p(\theta_b)},</math> | ||
| − | : | + | : where <math>\frac{p(\theta_a)}{p(\theta_b)}</math> is the prior odds and <math>\frac{p(y|\theta_a)}{p(y|\theta_b)}</math> is the likelihood ratio. In comparing the posterior probability of <math>\theta_a</math> and <math>\theta_b</math>, it is not necessary to compute <math>p(y)</math>. |
| − | |||
| − | |||
| − | |||
====Prior Distributions==== | ====Prior Distributions==== | ||
| − | How to choose a prior distribution | + | How to choose a prior distribution <math>p(\theta)</math>? There are various types of prior distributions: |
| − | * Non-informative priors: a prior | + | * Non-informative priors: a prior distribution <math>p(\theta)</math> is non-informative if it has minimal impact on the posterior distribution <math>p(\theta|y)</math> of <math>\theta</math>. In general, a non-informative prior is one which is dominated by the likelihood, that is, it is a prior that does not change very much over the region in which the likelihood is appreciable and does not assume large values outside that range. A prior that has these properties is said to be locally uniform. Non-informative priors are also called reference priors, vague priors or flat priors. |
* Examples: | * Examples: | ||
| − | ** If | + | ** If <math>\theta \in \Theta = [0,1]</math>, then <math>p(\theta) = \text{Uniform}(0,1)</math> is a non-informative prior and <math>p(\theta) = 1</math> for <math>\theta \in \Theta = [0,1]</math>. |
| − | ** If | + | ** If <math>\theta \in \Theta = \mathbb{R}</math>, then if <math>p(\theta) = \text{Normal}(\mu_0,\sigma_0^2)</math> with <math>\sigma_0^2 \longrightarrow \infty</math>, we get a non-informative prior. That is, we can pick <math>\sigma_0^2</math> large enough so that we obtain a non-informative prior. |
| − | * Improper priors: a prior distribution | + | * Improper priors: a prior distribution <math>p(\theta)</math> on <math>\theta \in \Theta</math> is said to be improper if <math>\int_{\theta \in \Theta}{p(\theta)d\theta}=\infty</math>. Improper priors are often used in Bayesian inference because they yield non-informative priors. For example: if <math>\theta \in \Theta=\mathbb{R}</math> and <math>p(\theta) \propto 1</math>, then <math>p(\theta)</math> is an improper prior since clearly <math>\int_{\theta \in \Theta}{p(\theta)d\theta}=\int_{-\infty}^{+\infty}{d\theta}</math> is undefined. An improper prior <math>p(\theta)</math> can result in an improper posterior distribution <math>p(\theta|y)</math>. One cannot make inference with improper posterior distributions! An improper prior <math>p(\theta)</math> may still lead to a proper posterior distribution. If <math>p(\theta)</math> is an improper prior, then the posterior distribution <math>p(\theta|y)</math> has always a particular form. In fact, if <math>p(\theta)</math> is improper then <math>p(\theta) = c</math> for some <math>c > 0</math> and <math>p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} =\frac{p(y|\theta)p(\theta)}{\int_{\theta\in\Theta}{p(y|\theta)cd\theta}}=\frac{p(y|\theta)p(\theta)}{\int_{\theta\in\Theta}{p(y|\theta)cd\theta}} = \frac{1}{K} p(y|\theta)</math>, where <math>K=\int_{\theta\in\Theta}{p(y|\theta)d\theta}</math>. That is, the posterior distribution, <math>p(\theta|y)</math>, is just the renormalized likelihood. |
| − | + | ====One-parameter models: Binomial model==== | |
| − | + | Consider data from a randomized controlled clinical trial evaluating a drug to reperfuse blocked blood vessels in patients with ischemic stroke. We indicate with <math>Y_i</math> whether the <math>i^{th}</math> patient treated with the drug had a favorable outcome (no disability at 90 days). Then the binary outcome variable is defined by: | |
| + | <math> | ||
Y_i = | Y_i = | ||
\begin{cases} | \begin{cases} | ||
| Line 96: | Line 114: | ||
0, & \text{otherwise} | 0, & \text{otherwise} | ||
\end{cases} | \end{cases} | ||
| − | . | + | .</math> |
===Demonstration=== | ===Demonstration=== | ||
| − | Suppose that we have data from | + | Suppose that we have data from <math>n = 129</math> patients. Then, since the size <math>N</math> of the general population is much greater than the size <math>n</math> of the sample, <math>Y_1,...,Y_n</math> can be considered to be exchangeable random variables. Using the [http://en.wikipedia.org/wiki/De_Finetti's_theorem De Finetti’s theorem], we can set up the following model for our data: a prior distribution <math>p(\theta)</math> for a parameter <math>\theta</math>, representing the proportion of 1’s in the population <math>\left( \sum_{i=1}^N{\frac{Y_i}{N}}\right)</math>, given <math>\theta</math>, the <math>Y_1,...,Y_n</math> are i.i.d.’s with <math>p(y|\theta)=\theta^y (1-\theta)^{1-y}</math>, the sampling model adopted leads to <math>p(y_1,...,y_n|\theta)=p(y_1,...,y_{129}|\theta)= \prod_{i=1}^{129}{\theta^{y_i}(1-\theta)^{1-y_i}}= \theta^{\sum_{i=1}^{129}{y_i}}(1-\theta)^{129-\sum_{i=1}^{129}{y_i}}</math>. Now we need to choose a prior distribution for <math>\theta</math>. |
| + | |||
| + | The parameter <math>\theta</math> is a random variable with parameter space <math>\Theta=[0,1]</math>. A first choice would be to use a locally uniform prior, that is a prior distribution <math>p(\theta)</math> that assigns the same probability to all intervals <math>(a,b) \subseteq [0,1]</math> of the same length. This leads to <math>p(\theta) \sim \text{Uniform}([0,1])</math>, i.e., <math>p(\theta) = 1</math> for <math>\theta \in [0,1]</math>. Then, we obtain the following posterior distribution: <math>p(\theta|y_1,...,y_{129})=\frac{p(y_1,...,y_{129}|\theta)p(\theta)}{p(y_1,...,y_{129})}= p(y_1,...,y_{129}|\theta)\times \frac{1}{p(y_1,...,y_{129})} \propto p(y_1,...,y_{129}|\theta)</math> since <math>p(y_1,...,y_{129})</math> does not depend on <math>\theta</math>. Therefore, the posterior distribution <math>p(\theta|y_1,...,y_{129})</math> has the same shape as <math>p(y_1,...,y_{129}|\theta)</math> but they are on a different scale. | ||
| − | + | If of the <math>n=129</math> patients receiving the drug 118 had favorable outcomes and 11 did not, then we have <math>\sum_{i=1}^{129}{y_i}=118</math>, so | |
| − | + | <math>p(y_1,...,y_{129}|\theta)=\theta^{\sum_{i=1}^{129}{y_i}}(1-\theta)^{129-\sum_{i=1}^{129}{y_i}}=\theta^{118} (1-\theta)^{11}.</math> | |
| − | If of the n=129 patients receiving the drug 118 had favorable outcomes and 11 did not, then we have | ||
| − | + | Recall that a random variable <math>Y</math> with sample space <math>S=[0,1]</math> is said to have a [[AP_Statistics_Curriculum_2007_Beta|Beta distribution (a,b)]] if its PDF is given by <math>p(y|a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} y^{a-1} (1-y)^{b-1}</math>, with <math>a,b>0</math>. For such a random variable, we have the following properties: <math>E[Y]=\frac{a}{a+b}</math>, <math>Mode[Y]=\frac{a-1}{(a-1)+(b-1)}</math>, if <math>a>1</math> and <math>b>1</math>, <math>Var[Y]=\frac{ab}{(a+b+1) (a+b)^2} =\frac{E[Y](1-E[Y])}{a+b+1}</math>. | |
| − | To determine the scale for the posterior distribution | + | To determine the scale for the posterior distribution <math>p(\theta|y_1,...,y_n)</math> we use the fact that this is a conditional density and since it is a probability density function, it needs to integrate to 1, that is: |
| + | <math>1=\int_{\theta\in \Theta}{p(\theta|y_1,...,y_{129})d\theta}=\int_0^1{\frac{\theta^{118}(1-\theta)^{11}}{p(y_1,...,y_{129})} d\theta}.</math> | ||
| − | The following result from calculus is useful: | + | The following result from calculus is useful: |
| − | int_0^1{ | + | <math>\int_0^1{\theta^{a-1} (1-\theta)^{b-1} d\theta}=\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}</math>, |
| + | where for any <math>x>0</math>, the gamma function, <math>\Gamma(x)</math>, can be looked up in a table or computed using SOCR or R using command ''gamma()''. Thus, we have | ||
| + | <math>1=\int_0^1{\frac{\theta^{118} (1-\theta)^{11}}{p(y_1,...,y_{129})}d\theta}= \frac{1}{p(y_1,...,y_{129})} \int_0^1{\theta^{118} (1-\theta)^{11} d\theta}=\frac{1}{p(y_1,...,y_{129})} \frac{\Gamma(119)\Gamma(12)}{\Gamma(131)}.</math> | ||
| + | Hence, we have that | ||
| + | <math>p(\theta|y_1,...,y_{129})=\frac{\Gamma(131)}{\Gamma(119)\Gamma(12)} \theta^{118} (1-\theta)^{11}.</math> | ||
| + | That is, <math>p(\theta|y_1,...,y_{129})</math> has a [[AP_Statistics_Curriculum_2007_Beta|Beta distribution]] with <math>a=119</math> and <math>b=12</math> parameters. Therefore <math>E[\theta|y_1,...,y_{129}]=0.908</math>, <math>Mode[\theta|y_1,...,y_{129}]=0.915</math>, and <math>Var[\theta|y_1,...,y_{129}]=6.25\times 10^{-4}</math>. | ||
* Conclusion: In Bayesian inference: posterior information ≥ Prior information ≥ 0, where the second "≥" is just "=" only if the prior is non-informative. | * Conclusion: In Bayesian inference: posterior information ≥ Prior information ≥ 0, where the second "≥" is just "=" only if the prior is non-informative. | ||
====Binomial model==== | ====Binomial model==== | ||
| − | Suppose the data | + | Suppose the data <math>Y_1,...,Y_n</math> represent exchangeable binary random variables (i.e., a sequence such that future samples behave like earlier samples, meaning formally that any order (of a finite number of samples) is equally likely). For our choice of prior and sampling model, we have obtained <math>p(\theta|y_1,...,y_n) \sim \text{Beta}(1+\sum_{i=1}^n{y_i}, 1+n-\sum_{i=1}^n{y_i})</math>. Now, let’s suppose we want to compare the posterior odds of two values <math>\theta_a</math> and <math>\theta_b</math> of <math>\theta</math> in light of an instance of the data <math>y_1,...,y_n</math>. |
The odds ratio is: | The odds ratio is: | ||
| − | + | <math>\frac{p(\theta_a|y_1,...,y_n)}{p(\theta_b|y_1,...,y_n)}=\frac{\frac{\Gamma(n+2)}{\Gamma(1+\sum_{i=1}^n{y_i}) \Gamma(1+n-\sum_{i=1}^n{y_i})} {\theta_a}^{\sum_{i=1}^n{y_i}} (1-\theta_a)^{n-\sum_{i=1}^n{y_i}}} | |
| − | {\frac{ | + | {\frac{\Gamma(n+2)}{\Gamma(1+\sum_{i=1}^n{y_i}) \Gamma(1+n-\sum_{i=1}^n{y_i})} {\theta_b}^{\sum_{i=1}^n{y_i}} (1-\theta_b)^{n-\sum_{i=1}^n{y_i}}}.</math> |
| − | + | Simplifying, we get: | |
| − | { { | + | <math>\frac{p(\theta_a|y_1,...,y_n)}{p(\theta_b|y_1,...,y_n)}=\frac{{\theta_a}^{\sum_{i=1}^n{y_i}} (1-\theta_a)^{n-\sum_{i=1}^n{y_i}}} |
| − | \left( \frac{ | + | { {\theta_b}^{\sum_{i=1}^n{y_i}} (1-\theta_b)^{n-\sum_{i=1}^n{y_i}}}= |
| − | \left( \frac{1- | + | \left( \frac{\theta_a}{\theta_b}\right)^{\sum_{i=1}^n{y_i}} |
| + | \left( \frac{1-\theta_a}{1-\theta_b}\right)^{n-\sum_{i=1}^n{y_i}}.</math> | ||
| − | Hence, the posterior odds depend on the data only through | + | Hence, the posterior odds depend on the data only through <math>\sum_{i=1}^n{y_i}</math>. In general, this is also true for the posterior probability, since <math>p(\theta|y_1,...,y_n)=\text{Beta}(1+\sum_{i=1}^n{y_i} ,1+n-\sum_{i=1}^n{y_i})</math>. In other words, <math>\sum_{i=1}^n{y_i}</math> is a [http://en.wikipedia.org/wiki/Sufficient_statistic sufficient statistic]. |
===Sufficient Statistics=== | ===Sufficient Statistics=== | ||
| − | If | + | If <math>Y_1,...,Y_n</math> are random variables, a statistic <math>T(Y_1,...,Y_n)</math> is a ''sufficient'' statistic if it captures all the information about <math>\theta</math> that is contained in the sample. If <math>Y_1,...,Y_n</math> are binary random variables that are conditionally independently and identically distributed with <math>p(y|\theta)</math> a [[AP_Statistics_Curriculum_2007_Distrib_Binomial#Bernoulli_process|Bernoulli distribution]] with parameter <math>\theta</math>, then <math>\sum_{i=1}^n{y_i}</math> is a [[AP_Statistics_Curriculum_2007_Distrib_Binomial#Binomial_Random_Variables|Binomial random variable]] and its density is <math>p(y|\theta)={n \choose y} \theta^y (1-\theta)^{n-y}</math>, for <math>y\in \{0,1,\ldots,n\}</math>. |
| − | What would be the posterior distribution of | + | What would be the posterior distribution of <math>\theta</math> if instead of working directly with the random variables <math>Y_1,...,Y_n</math>, we work with the sufficient statistic <math>Y=\sum_{i=1}^n{y_i}</math>? |
| − | Here, | + | Here, <math>\theta\in \Theta=[0,1]</math> with prior distribution <math>p(\theta)</math>. Let’s use again a uniform prior <math>p(\theta)=1</math> for <math>\theta</math>, then |
| − | where | + | <math>p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} =\frac{{n\choose y} \theta^y (1-\theta)^{n-y}\times 1}{p(y)} \propto \theta^y (1-\theta)^{n-y}</math>, for <math>\theta\in [0,1]</math>. |
| + | We can determine again the proportionality constant by using the fact that <math>p(\theta|y)</math> is a density and thus it needs to integrate to 1. We can see that the part of <math>p(\theta|y)</math> that depends on <math>\theta</math> has the same functional form as the kernel of a Beta distribution: | ||
| + | <math>\text{Beta}(x;a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1-x)^{b-1},</math> | ||
| + | where <math>x\in [0,1]</math>, and <math>a</math> and <math>b</math> are positive. Since the kernel of <math>p(\theta|y)</math> matches that of a <math>\text{Beta}(a,b)</math> with <math>a=1+y</math> and <math>b=1+n-y</math>, the constant needed in <math>p(\theta|y)</math> is <math>\frac{\Gamma(n+2)}{\Gamma(1+y)\Gamma(1+n-y)}</math>. | ||
| − | Summarizing, we have the distribution of | + | Summarizing, we have the distribution of <math>\theta</math> is <math>p(\theta)=1</math>, <math>\theta\in [0,1]</math> and <math>Y \sim p(y|\theta)=\text{Binomial}(n, \theta)</math>. This implies that <math>p(\theta|y)=\text{Beta}(1+y,1+n-y)</math>. |
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | + | If <math>\theta \sim p(\theta)=\text{Beta}(a,b)</math>, <math>Y|\theta \sim p(y|\theta)=\text{Binomial}(n, \theta)</math>, and we want to report a point estimate for <math>\theta</math> based on the data, we can then report its posterior mean. Since <math>p(\theta|y)=\text{Beta}(\tilde{a}=a+y,\tilde{b}=b+n-y)</math>, using the formulas for the [[AP_Statistics_Curriculum_2007_Beta|mean of a Beta(a,b) random variable]], we have that the '''posterior mean''' is: | |
| + | <math>E(\theta|y)=\frac{\tilde{a}}{\tilde{a}+\tilde{b}}=</math> | ||
| + | <math>\frac{a+y}{(a+y)+(b+n-y)}=\frac{a+y}{a+b+n}=\frac{a+b}{a+b+n}\times | ||
| + | \frac{a}{a+b}+\frac{n}{a+b+n}\frac{y}{n}=</math> | ||
| + | <math>\frac{a+b}{a+b+n}\times | ||
| + | \text{ prior mean} +\frac{n}{n+a+b}\times \text{sample average}.</math> | ||
| + | The posterior mean is a weighted average of the prior mean and the sample average. The weight of the prior mean, <math>\frac{a+b}{a+b+n}</math> is a ratio of the prior sample size to the sum of the prior sample size and the sample size of the data. If <math>n \gg a+b</math> (prior sample size), then the weight of the prior mean becomes rather small and the majority of the information about <math>\theta</math> comes from the data and not from the prior. In this case: <math>E[\theta|y]\approx \frac{y}{n}=\hat{y}</math>. The mode of the posterior distribution <math>p(\theta|y)=\text{Beta}(a+y,b+n-y)</math> is given by <math>Mode(\theta|y)=\frac{a+y-1}{a+b+n-2}</math>. Also here note that if the sample size n is very large, the posterior mode is approximately equal to <math>\frac{y}{n}</math>, the sample average. | ||
| − | The posterior distribution | + | The posterior distribution <math>p(\theta|y)</math> can also be summarized by providing intervals for <math>\theta</math> of the form <math>[l(y),u(y)]</math> that have a high posterior probability of containing <math>\theta</math>. For example, <math>P(\theta\in [l(y),u(y)]|y)=0.95</math> has a 95% Bayesian coverage or is a 95% credible interval for <math>\theta</math>. To compute the 95% CI for <math>\theta</math> by taking the 2.5<sup>th</sup> percentile and 97.5<sup>th</sup> percentile of <math>\text{Beta}(a+y,b+n-y)</math>. For the above example with <math>n=129</math> and <math>y=118</math>, we have: |
<center> | <center> | ||
| − | {| class="wikitable" style="text-align:center;"border="1" | + | {| class="wikitable" style="text-align:center;" border="1" |
|- | |- | ||
| − | ! Prior parameters | + | ! Prior parameters <math>(a,b)</math> || 95% Credible Interval for <math>\theta</math> |
|- | |- | ||
| (1,1) || [0.845, 0.951] | | (1,1) || [0.845, 0.951] | ||
| Line 171: | Line 199: | ||
* [http://www.sciencemag.org/content/294/5550/2310.short This article] discusses about Bayesian inference of Phylogeny and its impact on evolutionary biology. As a discipline, phylogenetics is becoming transformed by a flood of molecular data. These data allow broad questions to be asked about the history of life, but also present difficult statistical and computational problems. Bayesian inference of phylogeny brings a new perspective to a number of outstanding issues in evolutionary biology, including the analysis of large phylogenetic trees and complex evolutionary models and the detection of the footprint of natural selection in DNA sequences. | * [http://www.sciencemag.org/content/294/5550/2310.short This article] discusses about Bayesian inference of Phylogeny and its impact on evolutionary biology. As a discipline, phylogenetics is becoming transformed by a flood of molecular data. These data allow broad questions to be asked about the history of life, but also present difficult statistical and computational problems. Bayesian inference of phylogeny brings a new perspective to a number of outstanding issues in evolutionary biology, including the analysis of large phylogenetic trees and complex evolutionary models and the detection of the footprint of natural selection in DNA sequences. | ||
| − | ===Software === | + | ===Software=== |
| − | + | <pre> | |
| − | + | ## 3D and 2D plots of the posterior probability that theta > some value w given the data in the | |
| − | + | ## Beta-Binomial model | |
| − | + | ## The function that makes 3D plot is the persp function | |
| − | + | ## For help and documentation, type ?persp at the prompt in R | |
| − | + | ## The function persp takes as arguments: a x vector that gives the values on the x axis (this has | |
| − | + | ## to be in increasing order), a y vector with values on the y axis (also in increasing order), | |
| − | + | ## and a z matrix of dimension equal to the length of the x vector times the length of the y vector. | |
| − | + | ## The z matrix has at the (i,j) entry the value of the variable to be displayed on the third dimension | |
| − | + | ## corresponding to the i-th value of the x-vector and the j-th value of the y-vector, that is z[i,j] is | |
| + | ## the value of the z variable at x[i] and y[j]. | ||
a <- seq(0,10,by=0.1) | a <- seq(0,10,by=0.1) | ||
| Line 214: | Line 243: | ||
persp(a,b,post.prob.theta,xlab="a",ylab="b",zlab="p(theta > 0.8 | y)",main="P(theta > 0.8 | y)", theta=30,phi=20) | persp(a,b,post.prob.theta,xlab="a",ylab="b",zlab="p(theta > 0.8 | y)",main="P(theta > 0.8 | y)", theta=30,phi=20) | ||
| − | |||
| − | === Problems=== | + | image(post.prob.theta) |
| + | </pre> | ||
| + | |||
| + | ===Problems=== | ||
====Sample survey==== | ====Sample survey==== | ||
| − | Suppose we are going to sample 100 individuals from a country (of size much larger than 100) and ask each sampled person whether they support policy Z or not. Let | + | Suppose we are going to sample 100 individuals from a country (of size much larger than 100) and ask each sampled person whether they support policy Z or not. Let <math>Y_i=1</math> if person <math>i</math> in the sample supports the policy and 0 otherwise. |
| − | * Assume | + | * Assume <math>Y_1,\ldots,Y_{100}</math> are conditional on <math>\theta</math>, IID binary random variables with expectation <math>\theta</math>. Write down the joint distribution of <math>P(Y_1=y_1,Y_2=y_2,\ldots,Y_{100}=y_{100}|\theta)</math> in a compact form. Also write down the form of <math>P(\sum{Y_i=y}|\theta)</math>. |
| − | * Suppose you believed that | + | * Suppose you believed that <math>\theta\in \{0.0,0.1,\ldots,0.9,1.0\}</math>. Given that the results of the survey were <math>\sum_{i=1}^{100}{Y_i} =57</math>, compute <math>P(\sum_{i=1}^{100}{Y_i} =57|\theta)</math> for each of these 11 values of <math>\theta</math> and plot these probabilities as a function of <math>\theta</math>. |
| − | * Now suppose you originally had no prior information to believe one of these | + | * Now suppose you originally had no prior information to believe one of these <math>\theta</math>-values over another, and so <math>P(\theta = 0.0) = P(\theta = 0.1) = \cdots = P(\theta=0.9)=P(\theta= 1.0)</math>. Use Bayes’ rule to compute <math>p(\theta| \sum_{i=1}^n{Y_i} = 57)</math> for each <math>\theta</math>-value. Make a plot of this posterior distribution as a function of <math>\theta</math>. |
| − | * Next, suppose you allow | + | * Next, suppose you allow <math>\theta</math> to be any value in the interval <math>[0,1]</math>. Using the uniform prior density for <math>\theta</math>, so that <math>p(\theta) = 1</math>, plot the posterior density <math>p(\theta)\times P(\sum_{i=1}^n{Y_i} = 57|\theta)</math> as a function of <math>\theta</math>. |
| − | * As discussed above, the posterior distribution of | + | * As discussed above, the posterior distribution of <math>\theta</math> is <math>\text{Beta}(1+57,1+100-57)</math>. Plot the posterior density as a function of <math>\theta</math>. Discuss the relationships among all of the plots you have made for this exercise. |
===References=== | ===References=== | ||
* [http://en.wikipedia.org/wiki/Bayesian_inference Wikipedia Bayesian Inference] | * [http://en.wikipedia.org/wiki/Bayesian_inference Wikipedia Bayesian Inference] | ||
| − | + | * [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0019178 Christou N, Dinov, ID (2011) Confidence Interval Based Parameter Estimation—A New SOCR Applet and Activity. PLoS ONE 6(5): e19178. doi:10.1371/journal.pone.0019178]. | |
<hr> | <hr> | ||
| − | * SOCR Home page: | + | * SOCR Home page: https://socr.umich.edu |
| − | {{translate|pageName= | + | {{translate|pageName=https://wiki.socr.umich.edu/index.php?title=SMHS_BayesianInference}} |
Latest revision as of 09:16, 10 February 2026
Contents
Scientific Methods for Health Sciences - Bayesian Inference
Overview
Bayes’ theorem goes beyond connecting conditional and compound probability theory. It provides a method of inference where Bayes’ rule is used to update the probability estimate for a hypothesis as additional evidence is acquired. It is widely used in varieties of studies including medicine, law, science and engineering. In this section, we present a general introduction to the area of Bayesian inference and focus on its applications.
Motivation
We are already familiar with the concepts in frequentist statistics, which focuses on drawing conclusions and making statistical inference from sample data by the emphasis on the frequency or proportion of the data. However, what if we want to make inference in a different way where we make inference after taking into consideration of the observed data. That is we are going to introduce the concept of posterior distribution and inference we can make based on this. So, how does Bayesian inference work and how is it different from the traditional frequentist statistics?
Theory
Review our earlier discussion of Probability Theory.
Conditional Probability
The conditional probability of event \(A\) occurring given that event \(B\) occurs is \( P(A|B) = \frac{P(A \cap B)}{P(B)} \). If \(A\) and \(B\) are independent then knowing \(B\), or \(B^c\), gives no information on the probability of \(A\), i.e., \( P(A|B)=P(A) \).
Multiplication rule
For any two events, \(A\) and \(B\), \( P(A\cap B)=P(A|B)P(B) \). In general, for \(n\) events \(A_1, ..., A_n\)\[ P(A_1 \cap A_2 \cap A_3 \cap \cdots \cap A_n ) = P(A_1) P(A_2|A_1) P(A_3|A_1\cap A_2) \cdots P(A_n|A_1\cap A_2\cap A_3\cap \cdots \cap A_{n-1})\].
Law of total probability
If \( \{A_1,\ldots,A_n\} \) forms a partition of the sample space \(S\), then for every event \(B\), the probability of \(B\) to occur can be split into\[P(B)=P(B|A_1)P(A_1)+P(B|A_2)P(A_2)+\cdots+P(B|A_n)P(A_n).\]
Inverting the order of conditioning
As event intersection is commutative, \( P(A \cap B) = P(A | B) \times P(B) = P(B | A) \times P(A) \).
Bayesian Rule
Combining the properties above, for any partition the sample space \(S\), \( \{A_1,\ldots,A_n\}\), and any two events \(A\) and \(B\) (i.e., subsets of \(S\)) we have\[P(A | B) = \frac{P(B | A) P(A)}{P(B)} = \frac{P(B | A) P(A)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_n)P(A_n)}.\]
Interpretation
To interpret of some of the probability concepts above we can think of \(P(A)\) as a prior distribution of event \(A\), which is the initial degree of belief in \(A\) (before seeing any data/evidence). \(P(A|B)\) is referred to as the posterior distribution of event \(A\) given (the evidence) \(B\) representing the degree change of our initial belief having observed \(B\).
The Bayesian’s rule tells us that the posterior odds equal prior odds times Bayes factor. When arbitrarily many events \(A\) are of interest, not just two, the rule can be rephrased as posterior proportional to prior times likelihood, which can be expressed as \(P(A|B) \propto P(A)P(B|A)\), where the proportionality symbol (\(\propto\)) means that the left hand side is proportional to the right hand side as \(A\) varies given \(B\).
Given events \(A_1,A_2\) and \(B\), with Bayes’ rule we have that the conditional odds of \(A_1:A_2\) given \(B\) are equal to the marginal odds of \(A_1\) to \(A_2\) (i.e., \(A_1:A_2\)) multiplied by the Bayes factor or likelihood ratio \(\Lambda\). That is, \(O(A_1:A_2|B)=\Lambda(A_1: A_2|B) \times O(A_1:A_2)\), where \(\Lambda(A_1: A_2|B)=\frac{P(B|A_1)}{P(B|A_2)}\). The odds \(O(A_1: A_2 )=\frac{P(A_1 )}{P(A_2)}\), and conditional odds \(O(A_1: A_2|B)=\frac{P(A_1|B)}{P(A_2|B)}\) are known as prior odds and posterior odds respectively.
For the special case, where \(A_1=A^c\equiv A'\), this translates into using two competing hypotheses concerning the cause of \(B\). If we have \(O(A)=O(A:A^c)\), the odds on A is by definition the odds for and against \(A\). Bayes’ rule can then be written as \(O(A|B)=O(A)\Lambda(A|B)\), that is the posterior odds on \(A\) equals the prior odds on \(A\) times the likelihood ratio for \(A\) given \(B\) (posterior odds equal prior odds times likelihood ratio).
Statistical inference
The goal of statistical inference is to learn about the general population by taking a subset from that population. The steps in classical statistical parametric inference are:
- collect a random sample from the population;
- determine a probability model for the data \(y_1,...,y_n\) obtained (e.g., we are collecting data on the height of US women: each woman’s height can be thought of as a random variable with a normal distribution). This is called the sampling model;
- the sampling model will depend on a set of parameters, \(\theta\)\[p(y_i | \theta)\], for \(i=1,...,n\);
- using the sampling model, construct the likelihood function \(L(\theta; y)\);
- determine the maximum likelihood estimate (MLE) of \(\theta\) by maximizing the likelihood function \(L(\theta; y)\).
In the classical or frequentist approach to statistical inference, the parameters of a sampling model are assumed to be constant but unknown.
The Bayesian inference paradigm allows us to incorporate subjective prior beliefs regarding the parameter(s), \(\theta\) of the sampling model in our learning procedure. In Bayesian statistics, before we observe the data, we have some belief regarding possible values, or range of values for the set of parameters \(\theta\). This knowledge is expressed in the form of a prior distribution. Since we are assigning a distribution to \(\theta\), we are now assuming that \(\theta\) is a random variable (univariate if \(\theta\) is just a single parameter, multivariate if \(\theta\) consists of more than one parameter).
Since \(\theta\) is random, there is a sample space associated with \(\theta\). This parameter space is typically denoted by \(\Theta\). The parameter space \(\{\theta \in \Theta \}\) is the set of all possible values for the parameter.
Once data \(y_1,...,y_n\) have been collected, we specify a sampling model for the data \(p(y_i|\theta)\), \(i=1,\ldots,n\), which represents the probability of observing \(y_i\) if we knew \(\theta\). In light of the data observed, using Bayes’ theorem we update our belief about \(\theta\) and derive the posterior distribution \(p(\theta|y_1,...,y_n)=p(\theta|y)\), where \(y=\{y_1,...,y_n\}\).
Advantages of Bayesian Methods
Bayesian methods offer several key advantages over frequentist approaches in statistical inference. Below, we outline these benefits, highlighting how they address limitations in frequentist methods.
- Interpretation: Bayesian inference provides a posterior distribution for the unknown parameter(s) \(\theta\), which allows for direct probabilistic statements about \(\theta\). For instance, suppose the parameter of interest is the population mean \(\theta\). After collecting data, a Bayesian 95% credible interval directly indicates that there is a 95% posterior probability that \(\theta\) lies within the interval. In contrast, a frequentist 95% confidence interval, such as \(\hat{y} \pm 1.96 \frac{s}{\sqrt{n}}\), does not represent the probability that \(\theta\) is in the interval after the data are observed—it either contains \(\theta\) or it does not. Instead, the 95% refers to the long-run coverage probability across repeated hypothetical samples.
- Likelihood Principle: Bayesian inference adheres to the likelihood principle, which asserts that if two different sampling models produce proportional likelihood functions for \(\theta\) given the observed data, then inferences about \(\theta\) should be identical under both models. Frequentist methods, however, can violate this principle, leading to different conclusions from the same data depending on the sampling design. For example, consider a clinical trial where we observe 9 patients cured and 3 still sick. We wish to test the null hypothesis \(H_0: \theta = 1/2\) against the alternative \(H_a: \theta > 1/2\), where \(\theta\) is the probability of a successful cure (i.e., testing if the cure rate is better than 50/50).
** Model 1: Fixed number of trials (Binomial).
Let \(Y_1\) be the number of successful treatments out of 12 patients, modeled as Binomial(n=12, \theta). The observed data is \(Y_1 = 9\).
** Model 2: Stop when a fixed number of failures occur (Negative Binomial).
Let \(Y_2\) be the number of successful treatments observed before reaching 3 failures (sick patients), modeled as a Negative Binomial(r=3, \theta) (number of successes before r failures, with success probability \(\theta\)). The observed data is \(Y_2 = 9\).
The likelihood functions under both models are proportional (both involve \(\theta^9 (1-\theta)^3\) up to a constant), yet the frequentist p-values differ: approximately 0.073 for the binomial model and 0.033 for the negative binomial model. At a 5% significance level, the negative binomial model would reject \(H_0\), while the binomial model would not, despite the data being the same.
- Sequential Learning and Updating: Bayes' theorem provides a coherent framework for updating beliefs with new data. If you have a posterior distribution \(p(\theta \mid y_1, \dots, y_n)\) from initial data \(y_1, \dots, y_n\), it can serve as the prior for incorporating a new observation \(y_{n+1}\), yielding an updated posterior \(p(\theta \mid y_1, \cdots, y_{n+1})\). This enables rational, incremental learning without starting from scratch each time.
- No Reliance on Large Samples: Unlike many frequentist methods that depend on asymptotic approximations (e.g., normal approximations for valid inference), Bayesian inference is exact for any sample size. The posterior distribution is derived directly from the likelihood and prior, making it suitable for small-sample problems where frequentist methods may fail.
- Incorporation of Prior Information: Bayesian methods allow the integration of prior knowledge or expert opinion into the analysis through the prior distribution, which can be particularly useful in fields with existing domain expertise, such as medicine or engineering.
- Flexibility in Modeling: Bayesian approaches handle complex, hierarchical models and missing data more naturally via techniques like Markov Chain Monte Carlo (MCMC) simulation, without needing ad-hoc adjustments common in frequentist statistics.
- Frequentist Methods as a Special Case: Bayesian inference can reproduce frequentist results under certain conditions, such as using non-informative (e.g., uniform) priors. In these cases, the posterior mode often coincides with the maximum likelihood estimate (MLE), and credible intervals may align with confidence intervals.
Criticisms and Considerations
While Bayesian methods have many strengths, they are not without challenges:
- Prior Sensitivity: The choice of prior distribution can influence results, especially with limited data. Different analysts using different priors on the same data might reach different conclusions. To mitigate this, it is essential to perform sensitivity analyses, examining how results change under alternative priors. Objective or non-informative priors (e.g., Jeffreys priors) can help reduce subjectivity in some cases.
To further enhance this section, consider adding cross-references to related wiki pages (e.g., on priors, MCMC, or specific distributions) and including illustrative figures or simulations for the examples.
Bayes theorem for density functions
- Marginal density \(p(y)=\int_{\theta \in \Theta}{p(y,\theta)d\theta}=\int_{\theta \in \Theta}{p(y|\theta)p(\theta)d\theta}\), where \(\Theta\) is the parameter space. Substituting the expression for \(p(y)\), Bayes’ theorem for density function can be written as \(p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} = \frac{p(y|\theta)p(\theta)}{\int_{\theta \in \Theta}{p(y|\theta)p(\theta)d\theta}}\). Given that \(p(\theta|y)\) is a function of \(\theta\), the denominator is a constant and thus, the posterior density of \(\theta\) is often expressed up to a proportionality constant as \(p(\theta|y) \propto p(y|\theta)p(\theta)\), where the sign \(\propto\) means “proportional to”.
- The proportionality constant in the posterior density \(p(\theta|y)\) can be determined using the fact that the posterior density \(p(\theta|y)\) is a PDF. Suppose that we have observed \(Y=y\), and we want to evaluate and compare the odds of two possible numerical values for \(\theta\), then the probability (density) of \(\theta_a\) relative to \(\theta_b\) is\[\frac{p(\theta_a|y)}{p(\theta_b|y)} = \frac{\frac{p(y|\theta_a)p(\theta_a)}{p(y)}}{\frac{p(y|\theta_b)p(\theta_b)}{p(y)}}= \frac{p(y|\theta_a)p(\theta_a)}{p(y|\theta_b)p(\theta_b)},\]
- where \(\frac{p(\theta_a)}{p(\theta_b)}\) is the prior odds and \(\frac{p(y|\theta_a)}{p(y|\theta_b)}\) is the likelihood ratio. In comparing the posterior probability of \(\theta_a\) and \(\theta_b\), it is not necessary to compute \(p(y)\).
Prior Distributions
How to choose a prior distribution \(p(\theta)\)? There are various types of prior distributions:
- Non-informative priors: a prior distribution \(p(\theta)\) is non-informative if it has minimal impact on the posterior distribution \(p(\theta|y)\) of \(\theta\). In general, a non-informative prior is one which is dominated by the likelihood, that is, it is a prior that does not change very much over the region in which the likelihood is appreciable and does not assume large values outside that range. A prior that has these properties is said to be locally uniform. Non-informative priors are also called reference priors, vague priors or flat priors.
- Examples:
- If \(\theta \in \Theta = [0,1]\), then \(p(\theta) = \text{Uniform}(0,1)\) is a non-informative prior and \(p(\theta) = 1\) for \(\theta \in \Theta = [0,1]\).
- If \(\theta \in \Theta = \mathbb{R}\), then if \(p(\theta) = \text{Normal}(\mu_0,\sigma_0^2)\) with \(\sigma_0^2 \longrightarrow \infty\), we get a non-informative prior. That is, we can pick \(\sigma_0^2\) large enough so that we obtain a non-informative prior.
- Improper priors: a prior distribution \(p(\theta)\) on \(\theta \in \Theta\) is said to be improper if \(\int_{\theta \in \Theta}{p(\theta)d\theta}=\infty\). Improper priors are often used in Bayesian inference because they yield non-informative priors. For example: if \(\theta \in \Theta=\mathbb{R}\) and \(p(\theta) \propto 1\), then \(p(\theta)\) is an improper prior since clearly \(\int_{\theta \in \Theta}{p(\theta)d\theta}=\int_{-\infty}^{+\infty}{d\theta}\) is undefined. An improper prior \(p(\theta)\) can result in an improper posterior distribution \(p(\theta|y)\). One cannot make inference with improper posterior distributions! An improper prior \(p(\theta)\) may still lead to a proper posterior distribution. If \(p(\theta)\) is an improper prior, then the posterior distribution \(p(\theta|y)\) has always a particular form. In fact, if \(p(\theta)\) is improper then \(p(\theta) = c\) for some \(c > 0\) and \(p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} =\frac{p(y|\theta)p(\theta)}{\int_{\theta\in\Theta}{p(y|\theta)cd\theta}}=\frac{p(y|\theta)p(\theta)}{\int_{\theta\in\Theta}{p(y|\theta)cd\theta}} = \frac{1}{K} p(y|\theta)\), where \(K=\int_{\theta\in\Theta}{p(y|\theta)d\theta}\). That is, the posterior distribution, \(p(\theta|y)\), is just the renormalized likelihood.
One-parameter models: Binomial model
Consider data from a randomized controlled clinical trial evaluating a drug to reperfuse blocked blood vessels in patients with ischemic stroke. We indicate with \(Y_i\) whether the \(i^{th}\) patient treated with the drug had a favorable outcome (no disability at 90 days). Then the binary outcome variable is defined by\[ Y_i = \begin{cases} 1, & \text{if patient i has a favorable outcome} \\ 0, & \text{otherwise} \end{cases} .\]
Demonstration
Suppose that we have data from \(n = 129\) patients. Then, since the size \(N\) of the general population is much greater than the size \(n\) of the sample, \(Y_1,...,Y_n\) can be considered to be exchangeable random variables. Using the De Finetti’s theorem, we can set up the following model for our data: a prior distribution \(p(\theta)\) for a parameter \(\theta\), representing the proportion of 1’s in the population \(\left( \sum_{i=1}^N{\frac{Y_i}{N}}\right)\), given \(\theta\), the \(Y_1,...,Y_n\) are i.i.d.’s with \(p(y|\theta)=\theta^y (1-\theta)^{1-y}\), the sampling model adopted leads to \(p(y_1,...,y_n|\theta)=p(y_1,...,y_{129}|\theta)= \prod_{i=1}^{129}{\theta^{y_i}(1-\theta)^{1-y_i}}= \theta^{\sum_{i=1}^{129}{y_i}}(1-\theta)^{129-\sum_{i=1}^{129}{y_i}}\). Now we need to choose a prior distribution for \(\theta\).
The parameter \(\theta\) is a random variable with parameter space \(\Theta=[0,1]\). A first choice would be to use a locally uniform prior, that is a prior distribution \(p(\theta)\) that assigns the same probability to all intervals \((a,b) \subseteq [0,1]\) of the same length. This leads to \(p(\theta) \sim \text{Uniform}([0,1])\), i.e., \(p(\theta) = 1\) for \(\theta \in [0,1]\). Then, we obtain the following posterior distribution\[p(\theta|y_1,...,y_{129})=\frac{p(y_1,...,y_{129}|\theta)p(\theta)}{p(y_1,...,y_{129})}= p(y_1,...,y_{129}|\theta)\times \frac{1}{p(y_1,...,y_{129})} \propto p(y_1,...,y_{129}|\theta)\] since \(p(y_1,...,y_{129})\) does not depend on \(\theta\). Therefore, the posterior distribution \(p(\theta|y_1,...,y_{129})\) has the same shape as \(p(y_1,...,y_{129}|\theta)\) but they are on a different scale.
If of the \(n=129\) patients receiving the drug 118 had favorable outcomes and 11 did not, then we have \(\sum_{i=1}^{129}{y_i}=118\), so \(p(y_1,...,y_{129}|\theta)=\theta^{\sum_{i=1}^{129}{y_i}}(1-\theta)^{129-\sum_{i=1}^{129}{y_i}}=\theta^{118} (1-\theta)^{11}.\)
Recall that a random variable \(Y\) with sample space \(S=[0,1]\) is said to have a Beta distribution (a,b) if its PDF is given by \(p(y|a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} y^{a-1} (1-y)^{b-1}\), with \(a,b>0\). For such a random variable, we have the following properties\[E[Y]=\frac{a}{a+b}\], \(Mode[Y]=\frac{a-1}{(a-1)+(b-1)}\), if \(a>1\) and \(b>1\), \(Var[Y]=\frac{ab}{(a+b+1) (a+b)^2} =\frac{E[Y](1-E[Y])}{a+b+1}\).
To determine the scale for the posterior distribution \(p(\theta|y_1,...,y_n)\) we use the fact that this is a conditional density and since it is a probability density function, it needs to integrate to 1, that is\[1=\int_{\theta\in \Theta}{p(\theta|y_1,...,y_{129})d\theta}=\int_0^1{\frac{\theta^{118}(1-\theta)^{11}}{p(y_1,...,y_{129})} d\theta}.\]
The following result from calculus is useful\[\int_0^1{\theta^{a-1} (1-\theta)^{b-1} d\theta}=\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\], where for any \(x>0\), the gamma function, \(\Gamma(x)\), can be looked up in a table or computed using SOCR or R using command gamma(). Thus, we have \(1=\int_0^1{\frac{\theta^{118} (1-\theta)^{11}}{p(y_1,...,y_{129})}d\theta}= \frac{1}{p(y_1,...,y_{129})} \int_0^1{\theta^{118} (1-\theta)^{11} d\theta}=\frac{1}{p(y_1,...,y_{129})} \frac{\Gamma(119)\Gamma(12)}{\Gamma(131)}.\) Hence, we have that \(p(\theta|y_1,...,y_{129})=\frac{\Gamma(131)}{\Gamma(119)\Gamma(12)} \theta^{118} (1-\theta)^{11}.\) That is, \(p(\theta|y_1,...,y_{129})\) has a Beta distribution with \(a=119\) and \(b=12\) parameters. Therefore \(E[\theta|y_1,...,y_{129}]=0.908\), \(Mode[\theta|y_1,...,y_{129}]=0.915\), and \(Var[\theta|y_1,...,y_{129}]=6.25\times 10^{-4}\).
- Conclusion: In Bayesian inference: posterior information ≥ Prior information ≥ 0, where the second "≥" is just "=" only if the prior is non-informative.
Binomial model
Suppose the data \(Y_1,...,Y_n\) represent exchangeable binary random variables (i.e., a sequence such that future samples behave like earlier samples, meaning formally that any order (of a finite number of samples) is equally likely). For our choice of prior and sampling model, we have obtained \(p(\theta|y_1,...,y_n) \sim \text{Beta}(1+\sum_{i=1}^n{y_i}, 1+n-\sum_{i=1}^n{y_i})\). Now, let’s suppose we want to compare the posterior odds of two values \(\theta_a\) and \(\theta_b\) of \(\theta\) in light of an instance of the data \(y_1,...,y_n\).
The odds ratio is\[\frac{p(\theta_a|y_1,...,y_n)}{p(\theta_b|y_1,...,y_n)}=\frac{\frac{\Gamma(n+2)}{\Gamma(1+\sum_{i=1}^n{y_i}) \Gamma(1+n-\sum_{i=1}^n{y_i})} {\theta_a}^{\sum_{i=1}^n{y_i}} (1-\theta_a)^{n-\sum_{i=1}^n{y_i}}} {\frac{\Gamma(n+2)}{\Gamma(1+\sum_{i=1}^n{y_i}) \Gamma(1+n-\sum_{i=1}^n{y_i})} {\theta_b}^{\sum_{i=1}^n{y_i}} (1-\theta_b)^{n-\sum_{i=1}^n{y_i}}}.\]
Simplifying, we get\[\frac{p(\theta_a|y_1,...,y_n)}{p(\theta_b|y_1,...,y_n)}=\frac{{\theta_a}^{\sum_{i=1}^n{y_i}} (1-\theta_a)^{n-\sum_{i=1}^n{y_i}}} { {\theta_b}^{\sum_{i=1}^n{y_i}} (1-\theta_b)^{n-\sum_{i=1}^n{y_i}}}= \left( \frac{\theta_a}{\theta_b}\right)^{\sum_{i=1}^n{y_i}} \left( \frac{1-\theta_a}{1-\theta_b}\right)^{n-\sum_{i=1}^n{y_i}}.\]
Hence, the posterior odds depend on the data only through \(\sum_{i=1}^n{y_i}\). In general, this is also true for the posterior probability, since \(p(\theta|y_1,...,y_n)=\text{Beta}(1+\sum_{i=1}^n{y_i} ,1+n-\sum_{i=1}^n{y_i})\). In other words, \(\sum_{i=1}^n{y_i}\) is a sufficient statistic.
Sufficient Statistics
If \(Y_1,...,Y_n\) are random variables, a statistic \(T(Y_1,...,Y_n)\) is a sufficient statistic if it captures all the information about \(\theta\) that is contained in the sample. If \(Y_1,...,Y_n\) are binary random variables that are conditionally independently and identically distributed with \(p(y|\theta)\) a Bernoulli distribution with parameter \(\theta\), then \(\sum_{i=1}^n{y_i}\) is a Binomial random variable and its density is \(p(y|\theta)={n \choose y} \theta^y (1-\theta)^{n-y}\), for \(y\in \{0,1,\ldots,n\}\).
What would be the posterior distribution of \(\theta\) if instead of working directly with the random variables \(Y_1,...,Y_n\), we work with the sufficient statistic \(Y=\sum_{i=1}^n{y_i}\)?
Here, \(\theta\in \Theta=[0,1]\) with prior distribution \(p(\theta)\). Let’s use again a uniform prior \(p(\theta)=1\) for \(\theta\), then \(p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} =\frac{{n\choose y} \theta^y (1-\theta)^{n-y}\times 1}{p(y)} \propto \theta^y (1-\theta)^{n-y}\), for \(\theta\in [0,1]\). We can determine again the proportionality constant by using the fact that \(p(\theta|y)\) is a density and thus it needs to integrate to 1. We can see that the part of \(p(\theta|y)\) that depends on \(\theta\) has the same functional form as the kernel of a Beta distribution\[\text{Beta}(x;a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1-x)^{b-1},\] where \(x\in [0,1]\), and \(a\) and \(b\) are positive. Since the kernel of \(p(\theta|y)\) matches that of a \(\text{Beta}(a,b)\) with \(a=1+y\) and \(b=1+n-y\), the constant needed in \(p(\theta|y)\) is \(\frac{\Gamma(n+2)}{\Gamma(1+y)\Gamma(1+n-y)}\).
Summarizing, we have the distribution of \(\theta\) is \(p(\theta)=1\), \(\theta\in [0,1]\) and \(Y \sim p(y|\theta)=\text{Binomial}(n, \theta)\). This implies that \(p(\theta|y)=\text{Beta}(1+y,1+n-y)\).
If \(\theta \sim p(\theta)=\text{Beta}(a,b)\), \(Y|\theta \sim p(y|\theta)=\text{Binomial}(n, \theta)\), and we want to report a point estimate for \(\theta\) based on the data, we can then report its posterior mean. Since \(p(\theta|y)=\text{Beta}(\tilde{a}=a+y,\tilde{b}=b+n-y)\), using the formulas for the mean of a Beta(a,b) random variable, we have that the posterior mean is\[E(\theta|y)=\frac{\tilde{a}}{\tilde{a}+\tilde{b}}=\] \(\frac{a+y}{(a+y)+(b+n-y)}=\frac{a+y}{a+b+n}=\frac{a+b}{a+b+n}\times \frac{a}{a+b}+\frac{n}{a+b+n}\frac{y}{n}=\) \(\frac{a+b}{a+b+n}\times \text{ prior mean} +\frac{n}{n+a+b}\times \text{sample average}.\)
The posterior mean is a weighted average of the prior mean and the sample average. The weight of the prior mean, \(\frac{a+b}{a+b+n}\) is a ratio of the prior sample size to the sum of the prior sample size and the sample size of the data. If \(n \gg a+b\) (prior sample size), then the weight of the prior mean becomes rather small and the majority of the information about \(\theta\) comes from the data and not from the prior. In this case\[E[\theta|y]\approx \frac{y}{n}=\hat{y}\]. The mode of the posterior distribution \(p(\theta|y)=\text{Beta}(a+y,b+n-y)\) is given by \(Mode(\theta|y)=\frac{a+y-1}{a+b+n-2}\). Also here note that if the sample size n is very large, the posterior mode is approximately equal to \(\frac{y}{n}\), the sample average.
The posterior distribution \(p(\theta|y)\) can also be summarized by providing intervals for \(\theta\) of the form \([l(y),u(y)]\) that have a high posterior probability of containing \(\theta\). For example, \(P(\theta\in [l(y),u(y)]|y)=0.95\) has a 95% Bayesian coverage or is a 95% credible interval for \(\theta\). To compute the 95% CI for \(\theta\) by taking the 2.5th percentile and 97.5th percentile of \(\text{Beta}(a+y,b+n-y)\). For the above example with \(n=129\) and \(y=118\), we have:
| Prior parameters \((a,b)\) | 95% Credible Interval for \(\theta\) |
|---|---|
| (1,1) | [0.845, 0.951] |
| (2,1) | [0.855, 0.952] |
| (3,1) | [0.856, 0.952] |
| (1,2) | [0.845, 0.946] |
| (2,2) | [0.847, 0.947] |
Applications
- This article presents a generalization of Bayesian inference. Procedures of statistical inference are described which generalize Bayesian inference in specific ways. Probability is used in such a way that in general only bounds may be placed on the probabilities of given events, and probability systems of this kind are suggested both for sample information and for prior information. These systems are then combined using a specified rule. Illustrations are given for inferences about trinomial probabilities, and for inferences about a monotone sequence of binomial pi. Finally, some comments are made on the general class of models which produce upper and lower probabilities, and on the specific models which underlie the suggested inference procedures.
- This article discusses about Bayesian inference of Phylogeny and its impact on evolutionary biology. As a discipline, phylogenetics is becoming transformed by a flood of molecular data. These data allow broad questions to be asked about the history of life, but also present difficult statistical and computational problems. Bayesian inference of phylogeny brings a new perspective to a number of outstanding issues in evolutionary biology, including the analysis of large phylogenetic trees and complex evolutionary models and the detection of the footprint of natural selection in DNA sequences.
Software
## 3D and 2D plots of the posterior probability that theta > some value w given the data in the
## Beta-Binomial model
## The function that makes 3D plot is the persp function
## For help and documentation, type ?persp at the prompt in R
## The function persp takes as arguments: a x vector that gives the values on the x axis (this has
## to be in increasing order), a y vector with values on the y axis (also in increasing order),
## and a z matrix of dimension equal to the length of the x vector times the length of the y vector.
## The z matrix has at the (i,j) entry the value of the variable to be displayed on the third dimension
## corresponding to the i-th value of the x-vector and the j-th value of the y-vector, that is z[i,j] is
## the value of the z variable at x[i] and y[j].
a <- seq(0,10,by=0.1)
b <- seq(0,10,by=0.1)
w <- 0.8
n <- 129
y <- 118
## Plotting the posterior distribution
for(i in 1:length(a)){
for(j in 1:length(b)){
plot(seq(0,1,by=0.01),dbeta(seq(0,1,by=0.01),a[i]+y,b[j]+(n-y)),type="l",col="black",lty=2,lwd=1,
xlim=c(0,1),xlab="Theta",ylab="P(theta | y)")
title(main=paste("Posterior distribution a=",a[i]," b=",b[j],sep=""))
abline(v=0.8,lty=3,col="red")
}
}
post.prob.theta <- matrix(0,length(a),length(b))
for(i in 1:length(a)){
for(j in 1:length(b)){
post.a <- a[i]+y
post.b <- b[j]+(n-y)
post.prob.theta[i,j] <- 1-pbeta(w,post.a,post.b)
}
}
persp(a,b,post.prob.theta,xlab="a",ylab="b",zlab="p(theta > w | y)")
# To rotate the figure, you can specify the value of the angles theta and phi by which to tilt the figure
persp(a,b,post.prob.theta,xlab="a",ylab="b",zlab="p(theta > 0.8 | y)",main="P(theta > 0.8 | y)", theta=30,phi=20)
image(post.prob.theta)
Problems
Sample survey
Suppose we are going to sample 100 individuals from a country (of size much larger than 100) and ask each sampled person whether they support policy Z or not. Let \(Y_i=1\) if person \(i\) in the sample supports the policy and 0 otherwise.
- Assume \(Y_1,\ldots,Y_{100}\) are conditional on \(\theta\), IID binary random variables with expectation \(\theta\). Write down the joint distribution of \(P(Y_1=y_1,Y_2=y_2,\ldots,Y_{100}=y_{100}|\theta)\) in a compact form. Also write down the form of \(P(\sum{Y_i=y}|\theta)\).
- Suppose you believed that \(\theta\in \{0.0,0.1,\ldots,0.9,1.0\}\). Given that the results of the survey were \(\sum_{i=1}^{100}{Y_i} =57\), compute \(P(\sum_{i=1}^{100}{Y_i} =57|\theta)\) for each of these 11 values of \(\theta\) and plot these probabilities as a function of \(\theta\).
- Now suppose you originally had no prior information to believe one of these \(\theta\)-values over another, and so \(P(\theta = 0.0) = P(\theta = 0.1) = \cdots = P(\theta=0.9)=P(\theta= 1.0)\). Use Bayes’ rule to compute \(p(\theta| \sum_{i=1}^n{Y_i} = 57)\) for each \(\theta\)-value. Make a plot of this posterior distribution as a function of \(\theta\).
- Next, suppose you allow \(\theta\) to be any value in the interval \([0,1]\). Using the uniform prior density for \(\theta\), so that \(p(\theta) = 1\), plot the posterior density \(p(\theta)\times P(\sum_{i=1}^n{Y_i} = 57|\theta)\) as a function of \(\theta\).
- As discussed above, the posterior distribution of \(\theta\) is \(\text{Beta}(1+57,1+100-57)\). Plot the posterior density as a function of \(\theta\). Discuss the relationships among all of the plots you have made for this exercise.
References
- Wikipedia Bayesian Inference
- Christou N, Dinov, ID (2011) Confidence Interval Based Parameter Estimation—A New SOCR Applet and Activity. PLoS ONE 6(5): e19178. doi:10.1371/journal.pone.0019178.
- SOCR Home page: https://socr.umich.edu
Translate this page: