Difference between revisions of "SMHS BayesianInference"

From SOCR
Jump to: navigation, search
m (Demonstration)
 
(9 intermediate revisions by the same user not shown)
Line 11: Line 11:
  
 
====Conditional Probability====
 
====Conditional Probability====
The conditional probability of event $A$ occurring given that event $B$ occurs is $ P(A│B)=(P(A\cap B))/(P(B)) $. If $A$ and $B$ are independent then knowing $B$, or $B^c$, gives no information on the probability of $A$, i.e., $ P(A│B)=P(A) $.
+
The conditional probability of event <math>A</math> occurring given that event <math>B</math> occurs is <math> P(A|B) = \frac{P(A \cap B)}{P(B)} </math>. If <math>A</math> and <math>B</math> are independent then knowing <math>B</math>, or <math>B^c</math>, gives no information on the probability of <math>A</math>, i.e., <math> P(A|B)=P(A) </math>.
  
 
====Multiplication rule====
 
====Multiplication rule====
For any two events, $A$ and $B$, $ P(A\cap B)=P(A│B)P(B) $. In general, for $n$ events $A_1, ..., A_n$: $ P(A_1 \cap A_2 \cap A_3 \cap \cap A_n ) =$ $ P(A_1 )P(A_1│A_2 )P(A_3│A_1\cap A_2 ) P(A_n│A_1\cap A_1\cap A_2\cap A_3\cap \cap A_{n-1}) $.
+
For any two events, <math>A</math> and <math>B</math>, <math> P(A\cap B)=P(A|B)P(B) </math>. In general, for <math>n</math> events <math>A_1, ..., A_n</math>: <math> P(A_1 \cap A_2 \cap A_3 \cap \cdots \cap A_n ) = P(A_1) P(A_2|A_1) P(A_3|A_1\cap A_2) \cdots P(A_n|A_1\cap A_2\cap A_3\cap \cdots \cap A_{n-1})</math>.
  
 
====Law of total probability====
 
====Law of total probability====
If $ {A_1,,A_n} $ forms a partition the sample space $S$, then for every event $B$, the probability of $B$ to occur can be split into:
+
If <math> \{A_1,\ldots,A_n\} </math> forms a partition of the sample space <math>S</math>, then for every event <math>B</math>, the probability of <math>B</math> to occur can be split into:
$$P(B)=P(B│A_1 )P(A_1 )+P(B│A_2 )P(A_2 )++P(B│A_n )P(A_n).$$
+
<math>P(B)=P(B|A_1)P(A_1)+P(B|A_2)P(A_2)+\cdots+P(B|A_n)P(A_n).</math>
  
 
====Inverting the order of conditioning====
 
====Inverting the order of conditioning====
As event intersection is commutative, $ P(A \cap B)  = P(A | B) \times  P(B) =  P(B | A) \times P(A) $.
+
As event intersection is commutative, <math> P(A \cap B)  = P(A | B) \times  P(B) =  P(B | A) \times P(A) </math>.
  
 
====Bayesian Rule====
 
====Bayesian Rule====
Combining the properties above, for any partition the sample space $S$, $ {A_1,,A_n}$, and any two events $A$ and $B$ (i.e., subsets of $S$) we have:  
+
Combining the properties above, for any partition the sample space <math>S</math>, <math> \{A_1,\ldots,A_n\}</math>, and any two events <math>A</math> and <math>B</math> (i.e., subsets of <math>S</math>) we have:
$$ P(A | B) = {P(B | A) P(A) \over P(B)} = {P(B | A) P(A) \over P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_n)P(A_n)}. $$
+
<math>P(A | B) = \frac{P(B | A) P(A)}{P(B)} = \frac{P(B | A) P(A)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_n)P(A_n)}.</math>
  
 
====Interpretation====
 
====Interpretation====
To interpret of some of the probability concepts above we can think of $P(A)$ as a ''prior distribution'' of event $A$, which is the initial degree of belief in $A$ (before seeing any data/evidence). $P(A|B)$ is referred to as the ''posterior distribution'' of event $A$ given (the evidence) $B$ representing the degree change of our initial belief having observed $B$.  
+
To interpret of some of the probability concepts above we can think of <math>P(A)</math> as a ''prior distribution'' of event <math>A</math>, which is the initial degree of belief in <math>A</math> (before seeing any data/evidence). <math>P(A|B)</math> is referred to as the ''posterior distribution'' of event <math>A</math> given (the evidence) <math>B</math> representing the degree change of our initial belief having observed <math>B</math>.
  
The Bayesian’s rule tells us that the posterior odds equal prior odds times Bayes factor. When arbitrarily many events $A$ are of interest, not just two, the rule can be rephrased as posterior proportional to prior times likelihood, which can be expressed as $P(A|B) P(A)P(B|A)$, where the proportionality symbol () means that the left hand side is proportional to the right hand side as $A$ varies given $B$.  
+
The Bayesian’s rule tells us that the posterior odds equal prior odds times Bayes factor. When arbitrarily many events <math>A</math> are of interest, not just two, the rule can be rephrased as posterior proportional to prior times likelihood, which can be expressed as <math>P(A|B) \propto P(A)P(B|A)</math>, where the proportionality symbol (<math>\propto</math>) means that the left hand side is proportional to the right hand side as <math>A</math> varies given <math>B</math>.
  
Given events $A_1,A_2$ and $B$, with Bayes’ rule we have that the conditional odds of $A_1:A_2$ given $B$ are equal to the marginal odds of $A_1$ to $A_2$ (i.e., $A_1:A_2$) multiplied by the Bayes factor or likelihood ratio $Λ$. That is, $O(A_1:A_2│B)=Λ(A_1: A_2│B) \times O(A_1:A_2)$, where (A_1: A_2│B)=\frac{P(B│A_1)}{P(B│A_2)}$.  The odds $O(A_1: A_2 )=\frac{P(A_1 )}{P(A_2)}$, and conditional odds $O(A_1: A_2│B)=\frac{P(A_1│B)}{P(A_2│B)}$ are known as prior odds and posterior odds respectively.  
+
Given events <math>A_1,A_2</math> and <math>B</math>, with Bayes’ rule we have that the conditional odds of <math>A_1:A_2</math> given <math>B</math> are equal to the marginal odds of <math>A_1</math> to <math>A_2</math> (i.e., <math>A_1:A_2</math>) multiplied by the Bayes factor or likelihood ratio <math>\Lambda</math>. That is, <math>O(A_1:A_2|B)=\Lambda(A_1: A_2|B) \times O(A_1:A_2)</math>, where <math>\Lambda(A_1: A_2|B)=\frac{P(B|A_1)}{P(B|A_2)}</math>.  The odds <math>O(A_1: A_2 )=\frac{P(A_1 )}{P(A_2)}</math>, and conditional odds <math>O(A_1: A_2|B)=\frac{P(A_1|B)}{P(A_2|B)}</math> are known as prior odds and posterior odds respectively.
  
For the special case, where $A_1=A^c≡A'$, this translated into using two competing hypotheses concerning the cause of $B$. If we have $O(A)=O(A:A^c)$, the odds on A is by definition the odds for and against $A$. Bayes’ rule can then be written as $O(A│B)=O(A)Λ(A│B)$, that is the posterior odds on $A$ equals the prior odds on $A$ times the likelihood ratio for $A$ given $B$ (posterior odds equal prior odds times likelihood ratio).
+
For the special case, where <math>A_1=A^c\equiv A'</math>, this translates into using two competing hypotheses concerning the cause of <math>B</math>. If we have <math>O(A)=O(A:A^c)</math>, the odds on A is by definition the odds for and against <math>A</math>. Bayes’ rule can then be written as <math>O(A|B)=O(A)\Lambda(A|B)</math>, that is the posterior odds on <math>A</math> equals the prior odds on <math>A</math> times the likelihood ratio for <math>A</math> given <math>B</math> (posterior odds equal prior odds times likelihood ratio).
  
 
====Statistical inference====
 
====Statistical inference====
The goal of statistical inference is to learn about the general population by taking a subset from that population. The steps in classical statistical parametric inference are:  
+
The goal of statistical inference is to learn about the general population by taking a subset from that population. The steps in classical statistical parametric inference are:
# collect a random sample from the population;  
+
# collect a random sample from the population;
# determine a probability model for the data $y_1,...,y_n$ obtained (e.g. we are collecting data on the height of US women: each women’s height can be thought as a random variable with a normal distribution). This is called the sampling model;  
+
# determine a probability model for the data <math>y_1,...,y_n</math> obtained (e.g., we are collecting data on the height of US women: each woman’s height can be thought of as a random variable with a normal distribution). This is called the sampling model;
# the sampling model will depend on a set of parameters, $θ$: $p(y_i |θ)$, for $i=1,...,n$;
+
# the sampling model will depend on a set of parameters, <math>\theta</math>: <math>p(y_i | \theta)</math>, for <math>i=1,...,n</math>;
# using the sampling model, construct the likelihood function $L(θ; y )$;
+
# using the sampling model, construct the likelihood function <math>L(\theta; y)</math>;
# determine the [[SMHS_CIs#Maximum_likelihood_estimation_.28MLE.29|maximum likelihood estimate (MLE)]] of θ by maximizing the likelihood function $L(θ;y)$.
+
# determine the [[SMHS_CIs#Maximum_likelihood_estimation_.28MLE.29|maximum likelihood estimate (MLE)]] of <math>\theta</math> by maximizing the likelihood function <math>L(\theta; y)</math>.
  
In the classical or frequentist approach to statistical inference, the parameters of a sampling model are assumed to be constant but unknown.  
+
In the classical or frequentist approach to statistical inference, the parameters of a sampling model are assumed to be constant but unknown.
  
The Bayesian inference paradigm allows us to incorporate subjective prior beliefs regarding the parameter(s), $θ$ of the sampling model in our learning procedure. In Bayesian statistics, before we observe the data, we have some belief regarding possible values, or range of values for the set of parameters $θ$. This knowledge is expressed in the form of a prior distribution. Since we are assigning a distribution to $θ$, we are now assuming that θ is a random variable (univariate if $θ$ is just a single parameter, multivariate if $θ$ consists of more than one parameter).
+
The Bayesian inference paradigm allows us to incorporate subjective prior beliefs regarding the parameter(s), <math>\theta</math> of the sampling model in our learning procedure. In Bayesian statistics, before we observe the data, we have some belief regarding possible values, or range of values for the set of parameters <math>\theta</math>. This knowledge is expressed in the form of a prior distribution. Since we are assigning a distribution to <math>\theta</math>, we are now assuming that <math>\theta</math> is a random variable (univariate if <math>\theta</math> is just a single parameter, multivariate if <math>\theta</math> consists of more than one parameter).
  
Since $θ$ is random , there is a sample space associated with $θ$. This parameter space is typically denoted by $Θ$. The parameter space $\{\theta \in \Theta \}$ is the set of all possible values for the parameter.  
+
Since <math>\theta</math> is random, there is a sample space associated with <math>\theta</math>. This parameter space is typically denoted by <math>\Theta</math>. The parameter space <math>\{\theta \in \Theta \}</math> is the set of all possible values for the parameter.
  
Once data $y_1,...,y_n$ have been collected, we specify a sampling model for the data $p(y_i│θ)$, $i=1,,n$, which represents the probability of observing $y_i$ if we knew $θ$. In light of the data observed, using Bayes’ theorem we update our belief about $θ$ and drive the posterior distribution $p(θ│y_1,...,y_n)=p(θ|y)$, where $y=\{y_1,...,y_n\}$.
+
Once data <math>y_1,...,y_n</math> have been collected, we specify a sampling model for the data <math>p(y_i|\theta)</math>, <math>i=1,\ldots,n</math>, which represents the probability of observing <math>y_i</math> if we knew <math>\theta</math>. In light of the data observed, using Bayes’ theorem we update our belief about <math>\theta</math> and derive the posterior distribution <math>p(\theta|y_1,...,y_n)=p(\theta|y)</math>, where <math>y=\{y_1,...,y_n\}</math>.
  
====Advantages of Bayesian methods====
+
==== Advantages of Bayesian Methods ====
* ''Interpretation'': Having a distribution for the unknown parameter(s) $θ$ makes it easier to understand what a point estimate and a confidence interval mean. For example: suppose the parameter of interest here is the population mean $θ$. We have collected data and we build a 95% confidence interval for $θ$. This is given by: $\hat{y} \pm 1.96  \frac{s}{\sqrt{n}}$. After the sample is collected and the interval is created, it either contains $θ$ or it does not. So, 95% is not the probability that $θ$ falls in the interval. For a frequentist, 95% is not a coverage probability, it is just a tag that informs how the interval performs over the long haul. By contrast, Bayesian confidence intervals, called credible sets, convey the posterior probability that the parameter falls in the interval. 
 
  
* ''Likelihood Principle'': Bayesian inference obeys the likelihood principle. The likelihood principle states that if two sampling models yield proportional likelihoods for $θ$, then inference about $θ$ should be identical under the two models. Frequentist inference does not always obey the likelihood principle! For example: suppose we observe the clinical outcomes in experimental treatment applied on 12 patients and at the end of the trial we observe 3 patient cured and 9 still sick. We want to test the hypothesis that the probability of successful cure is better than 50/50, i.e., $H_o$∶ $θ=1/2$ vs. $H_a$∶ $θ>1/2$. Suppose that we set up two sampling models for these data:
+
Bayesian methods offer several key advantages over frequentist approaches in statistical inference. Below, we outline these benefits, highlighting how they address limitations in frequentist methods.
** Model 1: Y, number of successful treatments, as a [[SMHS_ProbabilityDistributions#Binomial_distribution|Binomial(n=12,θ)]];
 
** Model 2: Y, number of successfully treated patients to be observed before terminating the clinical trial (in our example, getting 3 patients cured), which is a [[SMHS_ProbabilityDistributions#Negative_binomial_distribution| Negative Binomial(r=3,1-θ)]].  
 
  
: The two likelihoods of the 2 models are proportional but they lead to two different p-values.  
+
* '''Interpretation''': Bayesian inference provides a posterior distribution for the unknown parameter(s) <math>\theta</math>, which allows for direct probabilistic statements about <math>\theta</math>. For instance, suppose the parameter of interest is the population mean <math>\theta</math>. After collecting data, a Bayesian 95% credible interval directly indicates that there is a 95% posterior probability that <math>\theta</math> lies within the interval. In contrast, a frequentist 95% confidence interval, such as <math>\hat{y} \pm 1.96 \frac{s}{\sqrt{n}}</math>, does not represent the probability that <math>\theta</math> is in the interval after the data are observed—it either contains <math>\theta</math> or it does not. Instead, the 95% refers to the long-run coverage probability across repeated hypothetical samples.
  
* Bayesian theorem provides a rational method for learning. Suppose you collect data $y_1,...,y_n$ and compute your posterior distribution $p(θ|y_1,...,y_n)$. If we later collect an additional observation $y_{n+1}$, then, the posterior distribution $p(θ|y_1,...,y_n)$ could be used as a prior for the current Bayesian data analysis.
+
* '''Likelihood Principle''': Bayesian inference adheres to the likelihood principle, which asserts that if two different sampling models produce proportional likelihood functions for <math>\theta</math> given the observed data, then inferences about <math>\theta</math> should be identical under both models. Frequentist methods, however, can violate this principle, leading to different conclusions from the same data depending on the sampling design. For example, consider a clinical trial where we observe 9 patients cured and 3 still sick. We wish to test the null hypothesis <math>H_0: \theta = 1/2</math> against the alternative <math>H_a: \theta > 1/2</math>, where <math>\theta</math> is the probability of a successful cure (i.e., testing if the cure rate is better than 50/50).
 
* Bayesian inference does not require large sample theory. Bayesian methods do not require asymptotics for valid inference. Small sample Bayesian inference proceeds in the same way as if one had a large sample.
 
 
* Finally, Bayesian inference often includes frequentist inference as a special case. One can often obtain frequentist answers by choosing a uniform prior for the parameters. In this cases, frequentist answers can be obtained from the posterior distribution, where the MLE is then the posterior mode.
 
  
* The choice of the prior distribution is a delicate issue and the main criticism to Bayesian inference. Two people analyzing the same data but using different priors could obtain different answers. When carry out a Bayesian analysis it is important to examine the prior sensitivity, that is, determine how sensitive are the results to the choice of the prior distribution.
+
  ** Model 1: Fixed number of trials (Binomial).
 +
Let <math>Y_1</math> be the number of successful treatments out of 12 patients, modeled as [[SMHS_ProbabilityDistributions#Binomial_distribution|Binomial(n=12, \theta)]]. The observed data is <math>Y_1 = 9</math>.
 +
 
 +
  ** Model 2: Stop when a fixed number of failures occur (Negative Binomial).
 +
Let <math>Y_2</math> be the number of successful treatments observed before reaching 3 failures (sick patients), modeled as a [[SMHS_ProbabilityDistributions#Negative_binomial_distribution|Negative Binomial(r=3, \theta)]] (number of successes before r failures, with success probability <math>\theta</math>). The observed data is <math>Y_2 = 9</math>.
 +
 
 +
The likelihood functions under both models are proportional (both involve <math>\theta^9 (1-\theta)^3</math> up to a constant), yet the frequentist p-values differ: approximately 0.073 for the binomial model and 0.033 for the negative binomial model. At a 5% significance level, the negative binomial model would reject <math>H_0</math>, while the binomial model would not, despite the data being the same.
 +
 
 +
* '''Sequential Learning and Updating''': Bayes' theorem provides a coherent framework for updating beliefs with new data. If you have a posterior distribution <math>p(\theta \mid y_1, \dots, y_n)</math> from initial data <math>y_1, \dots, y_n</math>, it can serve as the prior for incorporating a new observation <math>y_{n+1}</math>, yielding an updated posterior <math>p(\theta \mid y_1, \cdots, y_{n+1})</math>. This enables rational, incremental learning without starting from scratch each time.
 +
 
 +
* '''No Reliance on Large Samples''': Unlike many frequentist methods that depend on asymptotic approximations (e.g., normal approximations for valid inference), Bayesian inference is exact for any sample size. The posterior distribution is derived directly from the likelihood and prior, making it suitable for small-sample problems where frequentist methods may fail.
 +
 
 +
* '''Incorporation of Prior Information''': Bayesian methods allow the integration of prior knowledge or expert opinion into the analysis through the prior distribution, which can be particularly useful in fields with existing domain expertise, such as medicine or engineering.
 +
 
 +
* '''Flexibility in Modeling''': Bayesian approaches handle complex, hierarchical models and missing data more naturally via techniques like Markov Chain Monte Carlo (MCMC) simulation, without needing ad-hoc adjustments common in frequentist statistics.
 +
 
 +
* '''Frequentist Methods as a Special Case''': Bayesian inference can reproduce frequentist results under certain conditions, such as using non-informative (e.g., uniform) priors. In these cases, the posterior mode often coincides with the maximum likelihood estimate (MLE), and credible intervals may align with confidence intervals.
 +
 
 +
==== Criticisms and Considerations ====
 +
 
 +
While Bayesian methods have many strengths, they are not without challenges:
 +
 
 +
* '''Prior Sensitivity''': The choice of prior distribution can influence results, especially with limited data. Different analysts using different priors on the same data might reach different conclusions. To mitigate this, it is essential to perform sensitivity analyses, examining how results change under alternative priors. Objective or non-informative priors (e.g., Jeffreys priors) can help reduce subjectivity in some cases.
 +
 
 +
To further enhance this section, consider adding cross-references to related wiki pages (e.g., on priors, MCMC, or specific distributions) and including illustrative figures or simulations for the examples.
  
 
===Bayes theorem for density functions===
 
===Bayes theorem for density functions===
  
* '''Marginal density''' $p(y)=\int_{θ∈Θ}{p(y,θ)}=\int_{θ∈Θ}{p(y|θ)p(θ)}$, where $Θ$ is the parameter space. Substituting the expression for $p(y)$, Bayes’ theorem for density function can be written as $p(θ|y)=\frac{p(y|θ)p(θ)}{p(y)} = \frac{p(y|θ)p(θ)}{\int_{θ∈Θ}{p(y|θ)p(θ)}}$. Given that $p(θ|y)$ is a function of $θ$, the denominator is a constant and thus, the posterior density of $θ$ is often expressed up to a proportionality constant as $p(θ|y)∝p(y|θ)p(θ)$, where the sign "∝" means “proportional to”.
+
* '''Marginal density''' <math>p(y)=\int_{\theta \in \Theta}{p(y,\theta)d\theta}=\int_{\theta \in \Theta}{p(y|\theta)p(\theta)d\theta}</math>, where <math>\Theta</math> is the parameter space. Substituting the expression for <math>p(y)</math>, Bayes’ theorem for density function can be written as <math>p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} = \frac{p(y|\theta)p(\theta)}{\int_{\theta \in \Theta}{p(y|\theta)p(\theta)d\theta}}</math>. Given that <math>p(\theta|y)</math> is a function of <math>\theta</math>, the denominator is a constant and thus, the posterior density of <math>\theta</math> is often expressed up to a proportionality constant as <math>p(\theta|y) \propto p(y|\theta)p(\theta)</math>, where the sign <math>\propto</math> means “proportional to”.
 +
 
 +
: The proportionality constant in the posterior density <math>p(\theta|y)</math> can be determined using the fact that the posterior density <math>p(\theta|y)</math> is a [[SMHS_ProbabilityDistributions#Theory|PDF]]. Suppose that we have observed <math>Y=y</math>, and we want to evaluate and compare the odds of two possible numerical values for <math>\theta</math>, then the probability (density) of <math>\theta_a</math> relative to <math>\theta_b</math> is:
 +
<math>\frac{p(\theta_a|y)}{p(\theta_b|y)} = \frac{\frac{p(y|\theta_a)p(\theta_a)}{p(y)}}{\frac{p(y|\theta_b)p(\theta_b)}{p(y)}}= \frac{p(y|\theta_a)p(\theta_a)}{p(y|\theta_b)p(\theta_b)},</math>
  
: The proportionality constant in the posterior density $p(θ|y)$ can be determined using the fact that the posterior density $p(θ|y)$ is a [[SMHS_ProbabilityDistributions#Theory|PDF]]. Suppose that we have observed $Y=y$, and we want to  evaluate and compare the odds of two possible numerical value for $θ$, let probability (density) of $θ_a$ relative to $θ_b$ is:
+
: where <math>\frac{p(\theta_a)}{p(\theta_b)}</math> is the prior odds and <math>\frac{p(y|\theta_a)}{p(y|\theta_b)}</math> is the likelihood ratio. In comparing the posterior probability of <math>\theta_a</math> and <math>\theta_b</math>, it is not necessary to compute <math>p(y)</math>.
$$\frac{p(θ_a│y)}{p(θ_b│y)} = \frac{\frac{p(y|θ_a )p(θ_a)}{p(y)}}{\frac{p(y|θ_b )p(θ_b)}{p(y)}}= \frac{p(y|θ_a )p(θ_a )}{p(y|θ_b )p(θ_b )},$$
 
  
: where $\frac{p(θ_a )}{p(θ_b )}$ is the prior odds and $\frac{p(y|θ_a )}{p(y|θ_b )}$ is the likelihood ratio. In comparing the posterior probability of $θ_a$ and $θ_b$, it is not necessary to compute $p(y)$.
 
 
 
====Prior Distributions====
 
====Prior Distributions====
How to choose a prior distribution $p(θ)$? There are various type of prior distributions:  
+
How to choose a prior distribution <math>p(\theta)</math>? There are various types of prior distributions:
  
* Non-informative priors: a prior distributions $p(θ)$ is non-informative if it has minimal impact on the posterior distribution $p(θ|y)$ of $θ$. In general, a non-informative prior is one which is dominated by the likelihood, that is, it is a prior that does not change very much over the region in which the likelihood is appreciable and does not assume large values outside that range. A prior that has these properties is said to be locally uniform. Non-informative priors are also called reference priors, vague priors or flat priors.
+
* Non-informative priors: a prior distribution <math>p(\theta)</math> is non-informative if it has minimal impact on the posterior distribution <math>p(\theta|y)</math> of <math>\theta</math>. In general, a non-informative prior is one which is dominated by the likelihood, that is, it is a prior that does not change very much over the region in which the likelihood is appreciable and does not assume large values outside that range. A prior that has these properties is said to be locally uniform. Non-informative priors are also called reference priors, vague priors or flat priors.
  
 
* Examples:
 
* Examples:
** If $θ ∈ Θ = [0,1]$, then $p(θ) = Uniform(0,1)$ is a non-informative prior and $p(θ) = 1$ for $θ ∈ Θ = [0,1]$.
+
** If <math>\theta \in \Theta = [0,1]</math>, then <math>p(\theta) = \text{Uniform}(0,1)</math> is a non-informative prior and <math>p(\theta) = 1</math> for <math>\theta \in \Theta = [0,1]</math>.
** If $θ ∈ Θ = R$, then if $p(θ) = Normal(μ_0,σ_0^2)$ with $σ_0^2 \longrightarrow \infty$, we get a non-informative prior. That is, we can pick $σ_0^2$ large enough so that we obtain a non-informative prior.  
+
** If <math>\theta \in \Theta = \mathbb{R}</math>, then if <math>p(\theta) = \text{Normal}(\mu_0,\sigma_0^2)</math> with <math>\sigma_0^2 \longrightarrow \infty</math>, we get a non-informative prior. That is, we can pick <math>\sigma_0^2</math> large enough so that we obtain a non-informative prior.
  
* Improper priors: a prior distribution $p(θ)$ on $θ ∈ Θ$ is said to be improper if $\int_{θ ∈ Θ}{p(θ)}=∞$. Improper priors are often used in Bayesian inference because they yield non-informative priors. For example: if $θ ∈ Θ=R$ and $p(θ)∝1$, then $p(θ)$ is an improper prior since clearly for $\int_{θ ∈ Θ}{p(θ)}=\int_{-\infty}^{+\infty}{}$, which is undefined. An improper prior $p(θ)$ can result in an improper posterior distribution $p(θ|y)$. One cannot make inference with improper posterior distributions! An improper prior p(θ) may still lead to a proper posterior distribution. If $p(θ)$ is an improper prior, then the posterior distribution $p(θ|y)$ has always a particular form. In fact, if $p(θ)$ is improper than $p(θ) = c$ for some $c > 0$ and $p(θ│y )=\frac{p(y |θ)•p(θ)}{p(y)} =\frac{p(y |θ)•p(θ)}{\int_{θ∈Θ}{p(y│θ)cdθ}}=\frac{p(y |θ)•p(θ)}{\int_{θ∈Θ}{p(y│θ)cdθ}} = \frac{1}{K} p(y |θ)$, where $K=\int_{θ∈Θ}{p(y│θ)}$. That is, the posterior distribution, $p(θ|y)$, is just the renormalized likelihood.  
+
* Improper priors: a prior distribution <math>p(\theta)</math> on <math>\theta \in \Theta</math> is said to be improper if <math>\int_{\theta \in \Theta}{p(\theta)d\theta}=\infty</math>. Improper priors are often used in Bayesian inference because they yield non-informative priors. For example: if <math>\theta \in \Theta=\mathbb{R}</math> and <math>p(\theta) \propto 1</math>, then <math>p(\theta)</math> is an improper prior since clearly <math>\int_{\theta \in \Theta}{p(\theta)d\theta}=\int_{-\infty}^{+\infty}{d\theta}</math> is undefined. An improper prior <math>p(\theta)</math> can result in an improper posterior distribution <math>p(\theta|y)</math>. One cannot make inference with improper posterior distributions! An improper prior <math>p(\theta)</math> may still lead to a proper posterior distribution. If <math>p(\theta)</math> is an improper prior, then the posterior distribution <math>p(\theta|y)</math> has always a particular form. In fact, if <math>p(\theta)</math> is improper then <math>p(\theta) = c</math> for some <math>c > 0</math> and <math>p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} =\frac{p(y|\theta)p(\theta)}{\int_{\theta\in\Theta}{p(y|\theta)cd\theta}}=\frac{p(y|\theta)p(\theta)}{\int_{\theta\in\Theta}{p(y|\theta)cd\theta}} = \frac{1}{K} p(y|\theta)</math>, where <math>K=\int_{\theta\in\Theta}{p(y|\theta)d\theta}</math>. That is, the posterior distribution, <math>p(\theta|y)</math>, is just the renormalized likelihood.
  
* One-parameter models: Binomial model as a start in performing Bayesian inference for binary variables. Consider data from a randomized controlled clinical trial evaluating a drug to reperfuse blocked blood vessels in patients with ischemic stroke. We indicate with $Y_i$ whether the $i^{th}$ patient treated with the drug had a favorable outcome (no disability at 90 days). Then the binary outcome variable is defined by:
+
====One-parameter models: Binomial model====
$$
+
Consider data from a randomized controlled clinical trial evaluating a drug to reperfuse blocked blood vessels in patients with ischemic stroke. We indicate with <math>Y_i</math> whether the <math>i^{th}</math> patient treated with the drug had a favorable outcome (no disability at 90 days). Then the binary outcome variable is defined by:
 +
<math>
 
Y_i =
 
Y_i =
 
\begin{cases}
 
\begin{cases}
Line 96: Line 114:
 
0, & \text{otherwise}
 
0, & \text{otherwise}
 
\end{cases}
 
\end{cases}
.$$
+
.</math>
  
 
===Demonstration===
 
===Demonstration===
Suppose that we have data from $n = 129$ patients. Then, since the size $N$ of the general population is much greater than the size $n$ of the sample, $Y_1,...,Y_n$ can be considered to be exchangeable random variables. Using the [http://en.wikipedia.org/wiki/De_Finetti's_theorem De Finetti’s theorem], we can set up the following model for our data: a prior distribution $p(θ)$ for a parameter $θ$, representing the proportion of 1’s in the population $\left( \sum_{i=1}^N{\frac{Y_i}{N}}\right)$, given $θ$, the $Y_1,...,Y_n$ are i.i.d.’s with $p(y│θ)=θ^y (1-θ)^{1-y}$, the sampling model adopted leads to $p(y_1,...,y_n│θ)=p(y_1,...,y_{129}│θ)$=$\cup_{i=1}^{129}{θ^{y_i}(1-θ)^{1-y_i}}=$ $θ^{\sum_{i=1}^{129}{y_i}}(1-θ)^{129-\sum_{i=1}^{129}{y_i}}$, now we need to choose a prior distribution for $θ$.  
+
Suppose that we have data from <math>n = 129</math> patients. Then, since the size <math>N</math> of the general population is much greater than the size <math>n</math> of the sample, <math>Y_1,...,Y_n</math> can be considered to be exchangeable random variables. Using the [http://en.wikipedia.org/wiki/De_Finetti's_theorem De Finetti’s theorem], we can set up the following model for our data: a prior distribution <math>p(\theta)</math> for a parameter <math>\theta</math>, representing the proportion of 1’s in the population <math>\left( \sum_{i=1}^N{\frac{Y_i}{N}}\right)</math>, given <math>\theta</math>, the <math>Y_1,...,Y_n</math> are i.i.d.’s with <math>p(y|\theta)=\theta^y (1-\theta)^{1-y}</math>, the sampling model adopted leads to <math>p(y_1,...,y_n|\theta)=p(y_1,...,y_{129}|\theta)= \prod_{i=1}^{129}{\theta^{y_i}(1-\theta)^{1-y_i}}= \theta^{\sum_{i=1}^{129}{y_i}}(1-\theta)^{129-\sum_{i=1}^{129}{y_i}}</math>. Now we need to choose a prior distribution for <math>\theta</math>.
 +
 
 +
The parameter <math>\theta</math> is a random variable with parameter space <math>\Theta=[0,1]</math>. A first choice would be to use a locally uniform prior, that is a prior distribution <math>p(\theta)</math> that assigns the same probability to all intervals <math>(a,b) \subseteq [0,1]</math> of the same length. This leads to <math>p(\theta) \sim \text{Uniform}([0,1])</math>, i.e., <math>p(\theta) = 1</math> for <math>\theta \in [0,1]</math>. Then, we obtain the following posterior distribution: <math>p(\theta|y_1,...,y_{129})=\frac{p(y_1,...,y_{129}|\theta)p(\theta)}{p(y_1,...,y_{129})}= p(y_1,...,y_{129}|\theta)\times \frac{1}{p(y_1,...,y_{129})} \propto p(y_1,...,y_{129}|\theta)</math> since <math>p(y_1,...,y_{129})</math> does not depend on <math>\theta</math>. Therefore, the posterior distribution <math>p(\theta|y_1,...,y_{129})</math> has the same shape as <math>p(y_1,...,y_{129}|\theta)</math> but they are on a different scale.
  
The parameter $θ$ is a random variable with parameter space $Θ=[0,1]$. A first choice would be to use a locally uniform prior, that is a prior distribution $p(θ)$ that assigns the same probability to all intervals $(a,b) \subseteq [0,1]$ of the same length. This leads to $p(θ) \sim Uniform([0,1])$, i.e., $p(θ) = 1$ for $θ \in [0,1]$. Then, we obtain the following posterior distribution: $p(θ│y_1,...,y_{129})=\frac{p(y_1,...,y_{129}│θ)p(θ)}{p(y_1,...,y_{129})}=$ $p(y_1,...,y_{129}│θ)\times \frac{1}{p(y_1,...,y_{129})} ∝ p(y_1,...,y_{129}│θ)$ since $p(y_1,...,y_{129})$ does not depend on $θ$. Therefore, the posterior distribution $p(y_1,...,y_{129}│θ)$ of $θ$ has the same shape as $p(y_1,...,y_{129}│θ)$ but they are on a different scale.
+
If of the <math>n=129</math> patients receiving the drug 118 had favorable outcomes and 11 did not, then we have <math>\sum_{i=1}^{129}{y_i}=118</math>, so
+
<math>p(y_1,...,y_{129}|\theta)=\theta^{\sum_{i=1}^{129}{y_i}}(1-\theta)^{129-\sum_{i=1}^{129}{y_i}}=\theta^{118} (1-\theta)^{11}.</math>
If of the n=129 patients receiving the drug 118 had favorable outcomes and 11 did not, then we have $\sum_{i=1}^{129}{y_i}=118$, so $$p(y_1,...,y_{129}│θ)=θ^{\sum_{i=1}^{129}{y_i(1-θ)^{129-\sum_{i=1}^{129}{y_i}}}}=θ^{118} (1-θ)^{11}.$$
 
  
Definition has that a random variable $Y$ with sample space $S=[0,1]$ is said to have a [[AP_Statistics_Curriculum_2007_Beta|Beta distribution (a,b)]] if its PDF is given by $p(y│a,b)=\frac{Γ(a+b)}{Γ(a)Γ(b)}  y^{a-1} (1-y)^{b-1}$, with $a,b>0$. For such a random variable, we have the following properties: $E[Y]=\frac{a}{a+b}$, $Mode[Y]=\frac{a-1}{(a-1)+(b-1)}$, if $a>1$ and $b>1$, $Var[Y]=\frac{ab}{(a+b+1) (a+b)^2} =\frac{E[Y](1-E[Y])}{a+b+1}$.
+
Recall that a random variable <math>Y</math> with sample space <math>S=[0,1]</math> is said to have a [[AP_Statistics_Curriculum_2007_Beta|Beta distribution (a,b)]] if its PDF is given by <math>p(y|a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}  y^{a-1} (1-y)^{b-1}</math>, with <math>a,b>0</math>. For such a random variable, we have the following properties: <math>E[Y]=\frac{a}{a+b}</math>, <math>Mode[Y]=\frac{a-1}{(a-1)+(b-1)}</math>, if <math>a>1</math> and <math>b>1</math>, <math>Var[Y]=\frac{ab}{(a+b+1) (a+b)^2} =\frac{E[Y](1-E[Y])}{a+b+1}</math>.
  
To determine the scale for the posterior distribution $p(θ│y_1,...,y_n)$ we use the fact that this is a conditional density and since it is a probability density function, it needs to integrate to 1, that is: $1=\int_{θ\in Θ}{p(θ│y_1,...,y_{129})}=\int_0^1{\frac{θ^{118}(1-θ)^{11}}{p(y_1,...,y_{129})} }$.
+
To determine the scale for the posterior distribution <math>p(\theta|y_1,...,y_n)</math> we use the fact that this is a conditional density and since it is a probability density function, it needs to integrate to 1, that is:
 +
<math>1=\int_{\theta\in \Theta}{p(\theta|y_1,...,y_{129})d\theta}=\int_0^1{\frac{\theta^{118}(1-\theta)^{11}}{p(y_1,...,y_{129})} d\theta}.</math>
  
The following result from calculus is useful: $\int_0^1{θ^{a-1} (1-θ)^{b-1} }=\frac{Γ(a)Γ(b)}{Γ(a+b)}$, where for any $x>0$, the gamma function, (x)$, can be looked up in a table or computed using SOCR or R using command ''gamma()''. Thus, we have $1=\int_0^1{\frac{θ^{118} (1-θ)^{11}}{p(y_1,...,y_{129})}}=$ $\frac{1}{p(y_1,...,y_{129})}  
+
The following result from calculus is useful:
int_0^1{θ^{118} (1-θ)^{11} }=\frac{1}{p(y_1,...,y_{129})}  \frac{Γ(119)Γ(12)}{Γ(131)}$. Hence, we have that $p(θ│y_1,...,y_{129})=\frac{Γ(131)}{Γ(119)Γ(12)} θ^{118} (1-θ)^{11}$. That is, $p(θ│y_1,...,y_{129})$ has a [[AP_Statistics_Curriculum_2007_Beta|Beta distribution]] with $a=119$ and $b=12$ parameters. Therefore $E[θ│y_1,...,y_{129}]=0.908$, $Mode[θ│y_1,...,y_{129}]=0.915$, and $Var[θ│y_1,...,y_{129}]=6.25E-4$.
+
<math>\int_0^1{\theta^{a-1} (1-\theta)^{b-1} d\theta}=\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}</math>,
 +
where for any <math>x>0</math>, the gamma function, <math>\Gamma(x)</math>, can be looked up in a table or computed using SOCR or R using command ''gamma()''. Thus, we have
 +
<math>1=\int_0^1{\frac{\theta^{118} (1-\theta)^{11}}{p(y_1,...,y_{129})}d\theta}= \frac{1}{p(y_1,...,y_{129})} \int_0^1{\theta^{118} (1-\theta)^{11} d\theta}=\frac{1}{p(y_1,...,y_{129})}  \frac{\Gamma(119)\Gamma(12)}{\Gamma(131)}.</math>
 +
Hence, we have that
 +
<math>p(\theta|y_1,...,y_{129})=\frac{\Gamma(131)}{\Gamma(119)\Gamma(12)} \theta^{118} (1-\theta)^{11}.</math>
 +
That is, <math>p(\theta|y_1,...,y_{129})</math> has a [[AP_Statistics_Curriculum_2007_Beta|Beta distribution]] with <math>a=119</math> and <math>b=12</math> parameters. Therefore <math>E[\theta|y_1,...,y_{129}]=0.908</math>, <math>Mode[\theta|y_1,...,y_{129}]=0.915</math>, and <math>Var[\theta|y_1,...,y_{129}]=6.25\times 10^{-4}</math>.
  
 
* Conclusion: In Bayesian inference: posterior information ≥ Prior information ≥ 0, where the second "≥" is just "=" only if the prior is non-informative.
 
* Conclusion: In Bayesian inference: posterior information ≥ Prior information ≥ 0, where the second "≥" is just "=" only if the prior is non-informative.
  
 
====Binomial model====
 
====Binomial model====
Suppose the data $Y_1,...,Y_n$ represent exchangeable binary random variables (i.e., a sequence such that future samples behave like earlier samples, meaning formally that any order (of a finite number of samples) is equally likely). For our choice of prior and sampling model, we have obtained $p(θ│y_1,...,y_n) \sim Beta(1+\sum_{i=1}^n{y_i}, 1+n-\sum_{i=1}^n{y_i})$. Now, let’s suppose we want to compare the posterior odds of two values $θ_a$ and $θ_b$ of $θ$ in light of an instance of the data $y_1,...,y_n$.  
+
Suppose the data <math>Y_1,...,Y_n</math> represent exchangeable binary random variables (i.e., a sequence such that future samples behave like earlier samples, meaning formally that any order (of a finite number of samples) is equally likely). For our choice of prior and sampling model, we have obtained <math>p(\theta|y_1,...,y_n) \sim \text{Beta}(1+\sum_{i=1}^n{y_i}, 1+n-\sum_{i=1}^n{y_i})</math>. Now, let’s suppose we want to compare the posterior odds of two values <math>\theta_a</math> and <math>\theta_b</math> of <math>\theta</math> in light of an instance of the data <math>y_1,...,y_n</math>.
  
 
The odds ratio is:
 
The odds ratio is:
$$ \frac{p(θ_a│y_1,...,y_n )}{p(θ_b│y_1,...,y_n)}=\frac{\frac{Γ(n+2)}{Γ(1+\sum_{i=1}^n{y_i}) Γ(1+n-\sum_{i=1}^n{y_i})} {θ_a}^{\sum_{i=1}^n{y_i}} (1-θ_a )^{n-\sum_{i=1}^n{y_i}}}
+
<math>\frac{p(\theta_a|y_1,...,y_n)}{p(\theta_b|y_1,...,y_n)}=\frac{\frac{\Gamma(n+2)}{\Gamma(1+\sum_{i=1}^n{y_i}) \Gamma(1+n-\sum_{i=1}^n{y_i})} {\theta_a}^{\sum_{i=1}^n{y_i}} (1-\theta_a)^{n-\sum_{i=1}^n{y_i}}}
{\frac{Γ(n+2)}{Γ(1+\sum_{i=1}^n{y_i}) Γ(1+n-\sum_{i=1}^n{y_i})} {θ_b}^{\sum_{i=1}^n{y_i}} (1-θ_b)^{n-\sum_{i=1}^n{y_i}}}.$$
+
{\frac{\Gamma(n+2)}{\Gamma(1+\sum_{i=1}^n{y_i}) \Gamma(1+n-\sum_{i=1}^n{y_i})} {\theta_b}^{\sum_{i=1}^n{y_i}} (1-\theta_b)^{n-\sum_{i=1}^n{y_i}}}.</math>
  
$$ \frac{p(θ_a│y_1,...,y_n )}{p(θ_b│y_1,...,y_n)}=\frac{{θ_a}^{\sum_{i=1}^n{y_i}} (1-θ_a )^{n-\sum_{i=1}^n{y_i}}}
+
Simplifying, we get:
{ {θ_b}^{\sum_{i=1}^n{y_i}} (1-θ_b)^{n-\sum_{i=1}^n{y_i}}}=
+
<math>\frac{p(\theta_a|y_1,...,y_n)}{p(\theta_b|y_1,...,y_n)}=\frac{{\theta_a}^{\sum_{i=1}^n{y_i}} (1-\theta_a)^{n-\sum_{i=1}^n{y_i}}}
\left( \frac{θ_a}{θ_b}\right)^{\sum_{i=1}^n{y_i}}
+
{ {\theta_b}^{\sum_{i=1}^n{y_i}} (1-\theta_b)^{n-\sum_{i=1}^n{y_i}}}=
\left( \frac{1-θ_a}{1-θ_b}\right)^{n-\sum_{i=1}^n{y_i}}.$$
+
\left( \frac{\theta_a}{\theta_b}\right)^{\sum_{i=1}^n{y_i}}
 +
\left( \frac{1-\theta_a}{1-\theta_b}\right)^{n-\sum_{i=1}^n{y_i}}.</math>
  
Hence, the posterior odds depend on the data only through $\sum_{i=1}^n{y_i}$. In general, this is also true for the posterior probability, since $p(θ│y_1,...,y_n)=Beta(1+\sum_{i=1}^n{y_i} ,1+n-\sum_{i=1}^n{y_i}$. In other words, $\sum_{i=1}^n{y_i}$ is a [http://en.wikipedia.org/wiki/Sufficient_statistic sufficient statistic].
+
Hence, the posterior odds depend on the data only through <math>\sum_{i=1}^n{y_i}</math>. In general, this is also true for the posterior probability, since <math>p(\theta|y_1,...,y_n)=\text{Beta}(1+\sum_{i=1}^n{y_i} ,1+n-\sum_{i=1}^n{y_i})</math>. In other words, <math>\sum_{i=1}^n{y_i}</math> is a [http://en.wikipedia.org/wiki/Sufficient_statistic sufficient statistic].
  
 
===Sufficient Statistics===
 
===Sufficient Statistics===
If $Y_1,...,Y_n$ are random variables, a statistic $T(Y_1,...,Y_n)$ is a ''sufficient'' statistic if it captures all the information about $θ$ that is contained in the sample. If $Y_1,...,Y_nare binary random variables that are conditionally independently and identically distributed with $p(y|θ)$ a [[AP_Statistics_Curriculum_2007_Distrib_Binomial#Bernoulli_process|Bernoulli distribution]] with parameter $θ$, then $\sum_{i=1}^n{y_i}$ is a [[AP_Statistics_Curriculum_2007_Distrib_Binomial#Binomial_Random_Variables|Binomial random variable]] and its density is $p(y│θ)={n \choose y} θ^y (1-θ)^{n-y}$, for $y\in \{0,1,,n\}$.
+
If <math>Y_1,...,Y_n</math> are random variables, a statistic <math>T(Y_1,...,Y_n)</math> is a ''sufficient'' statistic if it captures all the information about <math>\theta</math> that is contained in the sample. If <math>Y_1,...,Y_n</math> are binary random variables that are conditionally independently and identically distributed with <math>p(y|\theta)</math> a [[AP_Statistics_Curriculum_2007_Distrib_Binomial#Bernoulli_process|Bernoulli distribution]] with parameter <math>\theta</math>, then <math>\sum_{i=1}^n{y_i}</math> is a [[AP_Statistics_Curriculum_2007_Distrib_Binomial#Binomial_Random_Variables|Binomial random variable]] and its density is <math>p(y|\theta)={n \choose y} \theta^y (1-\theta)^{n-y}</math>, for <math>y\in \{0,1,\ldots,n\}</math>.
  
What would be the posterior distribution of $θ$ if instead of working directly with the random variables $Y_1,...,Y_n$, we work with the sufficient statistic $Y=\sum_{i=1}^n{y_i}$?  
+
What would be the posterior distribution of <math>\theta</math> if instead of working directly with the random variables <math>Y_1,...,Y_n</math>, we work with the sufficient statistic <math>Y=\sum_{i=1}^n{y_i}</math>?
  
Here, \in Θ=[0,1]$ with prior distribution $p(θ)$. Let’s use again a uniform prior $p(θ)=1$ for $θ$, then $p(θ│y)=\frac{p(y│θ)p(θ)}{p(y)} =\frac{{n\choose y} θ^y (1-θ)^{n-y}*1}{p(y)} ∝ θ^y (1-θ)^{n-y}$, for \in [0,1]$. We can determine again the proportionality constant by using the fact that $p(θ│y)$ is a density and thus it needs to integrate 1. We can see that the part of $p(θ│y)$ that depends on $θ$ has the same functional form as the kernel of a Beta distribution: $$Beta(x;a,b)=\frac{Γ(a+b)}{Γ(a)Γ(b)}  x^{a-1} (1-x)^{b-1},$$
+
Here, <math>\theta\in \Theta=[0,1]</math> with prior distribution <math>p(\theta)</math>. Let’s use again a uniform prior <math>p(\theta)=1</math> for <math>\theta</math>, then
where $x\in [0,1]$, and $a$ and $b$ are positive. Since the kernel of $p(θ│y)$ matches that of a $Beta(a,b)$ with $a=1+y$ and $b=1+n-y$, the constant needed in $p(θ│y)$ is $\frac{Γ(n+2)}{Γ(a+1)Γ(1+n-y)}$.  
+
<math>p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} =\frac{{n\choose y} \theta^y (1-\theta)^{n-y}\times 1}{p(y)} \propto \theta^y (1-\theta)^{n-y}</math>, for <math>\theta\in [0,1]</math>.
 +
We can determine again the proportionality constant by using the fact that <math>p(\theta|y)</math> is a density and thus it needs to integrate to 1. We can see that the part of <math>p(\theta|y)</math> that depends on <math>\theta</math> has the same functional form as the kernel of a Beta distribution:
 +
<math>\text{Beta}(x;a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}  x^{a-1} (1-x)^{b-1},</math>
 +
where <math>x\in [0,1]</math>, and <math>a</math> and <math>b</math> are positive. Since the kernel of <math>p(\theta|y)</math> matches that of a <math>\text{Beta}(a,b)</math> with <math>a=1+y</math> and <math>b=1+n-y</math>, the constant needed in <math>p(\theta|y)</math> is <math>\frac{\Gamma(n+2)}{\Gamma(1+y)\Gamma(1+n-y)}</math>.
  
Summarizing, we have the distribution of $θ$ is $p(θ)=1$, \in [0,1]$ and $Y~p(y│θ)=Binomial (θ)$. Which implies that $p(θ│y)=Beta(1+y,1+n-y)$.
+
Summarizing, we have the distribution of <math>\theta</math> is <math>p(\theta)=1</math>, <math>\theta\in [0,1]</math> and <math>Y \sim p(y|\theta)=\text{Binomial}(n, \theta)</math>. This implies that <math>p(\theta|y)=\text{Beta}(1+y,1+n-y)</math>.
 
If $θ \sim p(θ)=Beta(a,b)$, $Y|θ \sim p(y│θ)=Binomial(θ)$, and we want to report a point estimate for $θ$ based on the data, we can then report its posterior mean. Since $p(θ│y)=Beta(\tilde{a}=a+y,\tilde{b}=b+n-y)$, using the formulas for the [[AP_Statistics_Curriculum_2007_Beta|mean of a Beta(a,b) random variable]], we have that the '''posterior mean''' is:
 
$$E(θ│y)=\frac{\tilde{a}}{\tilde{a}+\tilde{b}}=$$
 
$$\frac{a+y}{(a+y)+(b+n-y)}=\frac{a+y}{a+b+n}=\frac{a+b}{a+b+n}\times
 
\frac{a}{a+b}+\frac{n}{a+b+n}\frac{y}{n}=$$
 
$$\frac{a+b}{a+b+n}\times
 
\text{ prior mean} +\frac{n}{n+a+b}\times \text{sample average}.$$ 
 
  
The posterior mean is a weighted average of the prior mean and the sample average. The weight of the prior mean, $\frac{a+b}{a+b+n}$ is a ratio of the prior sample size to the sum of the prior sample size and the sample size of the data. If $n≫a+b$=prior sample size, then the weight of the prior mean becomes rather small and the majority of the information about $θ$ comes form the data and not from the prior. In this case: $E[θ│y]\approx \frac{y}{n}=\hat{y}$. The mode of the posterior distribution $p(θ│y)=Beta(a+y,b+n-y)$ is given by $Mode(θ│y)=\frac{a+y-1}{a+b+n-2}$. Also here note that if the sample size n is very large, the posterior model is approximately equal to y$\frac{y}{n}$, the sample average.  
+
If <math>\theta \sim p(\theta)=\text{Beta}(a,b)</math>, <math>Y|\theta \sim p(y|\theta)=\text{Binomial}(n, \theta)</math>, and we want to report a point estimate for <math>\theta</math> based on the data, we can then report its posterior mean. Since <math>p(\theta|y)=\text{Beta}(\tilde{a}=a+y,\tilde{b}=b+n-y)</math>, using the formulas for the [[AP_Statistics_Curriculum_2007_Beta|mean of a Beta(a,b) random variable]], we have that the '''posterior mean''' is:
 +
<math>E(\theta|y)=\frac{\tilde{a}}{\tilde{a}+\tilde{b}}=</math>
 +
<math>\frac{a+y}{(a+y)+(b+n-y)}=\frac{a+y}{a+b+n}=\frac{a+b}{a+b+n}\times
 +
\frac{a}{a+b}+\frac{n}{a+b+n}\frac{y}{n}=</math>
 +
<math>\frac{a+b}{a+b+n}\times
 +
\text{ prior mean} +\frac{n}{n+a+b}\times \text{sample average}.</math>
  
 +
The posterior mean is a weighted average of the prior mean and the sample average. The weight of the prior mean, <math>\frac{a+b}{a+b+n}</math> is a ratio of the prior sample size to the sum of the prior sample size and the sample size of the data. If <math>n \gg a+b</math> (prior sample size), then the weight of the prior mean becomes rather small and the majority of the information about <math>\theta</math> comes from the data and not from the prior. In this case: <math>E[\theta|y]\approx \frac{y}{n}=\hat{y}</math>. The mode of the posterior distribution <math>p(\theta|y)=\text{Beta}(a+y,b+n-y)</math> is given by <math>Mode(\theta|y)=\frac{a+y-1}{a+b+n-2}</math>. Also here note that if the sample size n is very large, the posterior mode is approximately equal to <math>\frac{y}{n}</math>, the sample average.
  
The posterior distribution $p(θ|y)$ can also be summarized by providing intervals for $θ$ of the form $[l(y),u(y)]$ that have a high posterior probability of containing $θ$. For example, $P(θ\in [l(y),u(y)]|y)=0.95$ has a 95% Bayesian coverage or is a 95% credible interval for $θ$. To compute the 95% CI for $θ$ by taking the 2.5<sup>th</sup> percentile and 97.5<sup>th</sup> percentile of $Beta(a+y,b+n-y)$. For the above example with $n=129$ and $y=118$, we have:
+
The posterior distribution <math>p(\theta|y)</math> can also be summarized by providing intervals for <math>\theta</math> of the form <math>[l(y),u(y)]</math> that have a high posterior probability of containing <math>\theta</math>. For example, <math>P(\theta\in [l(y),u(y)]|y)=0.95</math> has a 95% Bayesian coverage or is a 95% credible interval for <math>\theta</math>. To compute the 95% CI for <math>\theta</math> by taking the 2.5<sup>th</sup> percentile and 97.5<sup>th</sup> percentile of <math>\text{Beta}(a+y,b+n-y)</math>. For the above example with <math>n=129</math> and <math>y=118</math>, we have:
 
<center>
 
<center>
{| class="wikitable" style="text-align:center;"border="1"
+
{| class="wikitable" style="text-align:center;" border="1"
 
|-
 
|-
! Prior parameters $(a,b)$ || 95% Credible Interval for θ
+
! Prior parameters <math>(a,b)</math> || 95% Credible Interval for <math>\theta</math>
 
|-
 
|-
 
| (1,1) || [0.845, 0.951]
 
| (1,1) || [0.845, 0.951]
Line 171: Line 199:
 
* [http://www.sciencemag.org/content/294/5550/2310.short This article] discusses about Bayesian inference of Phylogeny and its impact on evolutionary biology. As a discipline, phylogenetics is becoming transformed by a flood of molecular data. These data allow broad questions to be asked about the history of life, but also present difficult statistical and computational problems. Bayesian inference of phylogeny brings a new perspective to a number of outstanding issues in evolutionary biology, including the analysis of large phylogenetic trees and complex evolutionary models and the detection of the footprint of natural selection in DNA sequences.
 
* [http://www.sciencemag.org/content/294/5550/2310.short This article] discusses about Bayesian inference of Phylogeny and its impact on evolutionary biology. As a discipline, phylogenetics is becoming transformed by a flood of molecular data. These data allow broad questions to be asked about the history of life, but also present difficult statistical and computational problems. Bayesian inference of phylogeny brings a new perspective to a number of outstanding issues in evolutionary biology, including the analysis of large phylogenetic trees and complex evolutionary models and the detection of the footprint of natural selection in DNA sequences.
  
===Software ===
+
===Software===
  
## 3D and 2D plots of the posterior probability that theta > some value w given the data in the  
+
<pre>
## Beta-Binomial model
+
## 3D and 2D plots of the posterior probability that theta > some value w given the data in the  
## The function that makes 3D plot is the persp function
+
## Beta-Binomial model
## For help and documentation, type ?persp at the prompt in R
+
## The function that makes 3D plot is the persp function
## The function persp takes as arguments: a x vector that gives the values on the x axis (this has
+
## For help and documentation, type ?persp at the prompt in R
## to be in increasing order), a y vector with values on the y axis (also in increasing order),
+
## The function persp takes as arguments: a x vector that gives the values on the x axis (this has
## and a z matrix of dimension equal to the length of the x vector times the length of the y vector.
+
## to be in increasing order), a y vector with values on the y axis (also in increasing order),
## The z matrix has at the (i,j) entry the value of the variable to be displayed on the third dimension
+
## and a z matrix of dimension equal to the length of the x vector times the length of the y vector.
## corresponding to the i-th value of the x-vector and the j-th value of the y-vector, that is z[i,j] is  
+
## The z matrix has at the (i,j) entry the value of the variable to be displayed on the third dimension
## the value of the z variable at x[i] and y[j].
+
## corresponding to the i-th value of the x-vector and the j-th value of the y-vector, that is z[i,j] is  
 +
## the value of the z variable at x[i] and y[j].
  
 
  a <- seq(0,10,by=0.1)
 
  a <- seq(0,10,by=0.1)
Line 214: Line 243:
 
   
 
   
 
  persp(a,b,post.prob.theta,xlab="a",ylab="b",zlab="p(theta > 0.8 | y)",main="P(theta > 0.8 | y)", theta=30,phi=20)
 
  persp(a,b,post.prob.theta,xlab="a",ylab="b",zlab="p(theta > 0.8 | y)",main="P(theta > 0.8 | y)", theta=30,phi=20)
 
  
=== Problems===
+
image(post.prob.theta)
 +
</pre>
 +
 
 +
===Problems===
 
====Sample survey====
 
====Sample survey====
Suppose we are going to sample 100 individuals from a country (of size much larger than 100) and ask each sampled person whether they support policy Z or not. Let $Y_i=1$ if person $I$ in the sample supports the polity and 0 otherwise.  
+
Suppose we are going to sample 100 individuals from a country (of size much larger than 100) and ask each sampled person whether they support policy Z or not. Let <math>Y_i=1</math> if person <math>i</math> in the sample supports the policy and 0 otherwise.
  
* Assume $Y_1,,Y_{100}$ are conditional on $θ$, IID binary random variables with expectation $θ$. Write down the joint distribution of $P⁡(Y_1=y_1,Y_2=y_2,,Y_{100}=y_{100}│θ)$ in a compact form. Also write down the form of $P⁡(\sum{Y_i=y}|θ)$.
+
* Assume <math>Y_1,\ldots,Y_{100}</math> are conditional on <math>\theta</math>, IID binary random variables with expectation <math>\theta</math>. Write down the joint distribution of <math>P(Y_1=y_1,Y_2=y_2,\ldots,Y_{100}=y_{100}|\theta)</math> in a compact form. Also write down the form of <math>P(\sum{Y_i=y}|\theta)</math>.
  
* Suppose you believed that \in {0.0,0.1,,0.9,1.0}$. Given that the results of the survey were $\sum_{i=1}^{100}{Y_i} =57$, compute $P⁡(\sum_{i=1}^{100}{Y_i} =57│θ)$ for each of these 11 values of $θ$ and plot these probabilities as a function of $θ$.
+
* Suppose you believed that <math>\theta\in \{0.0,0.1,\ldots,0.9,1.0\}</math>. Given that the results of the survey were <math>\sum_{i=1}^{100}{Y_i} =57</math>, compute <math>P(\sum_{i=1}^{100}{Y_i} =57|\theta)</math> for each of these 11 values of <math>\theta</math> and plot these probabilities as a function of <math>\theta</math>.
  
* Now suppose you originally had no prior information to believe one of these $θ$-values over another, and so $P(θ = 0.0) = P(θ = 0.1) = ... = P(θ=0.9)=P(θ= 1.0)$. Use Bayes’ rule to compute $p(θ| \sum_{i=1}^n{Y_i} = 57)$ for each $θ$-value. Make a plot of this posterior distribution as a function of $θ$.
+
* Now suppose you originally had no prior information to believe one of these <math>\theta</math>-values over another, and so <math>P(\theta = 0.0) = P(\theta = 0.1) = \cdots = P(\theta=0.9)=P(\theta= 1.0)</math>. Use Bayes’ rule to compute <math>p(\theta| \sum_{i=1}^n{Y_i} = 57)</math> for each <math>\theta</math>-value. Make a plot of this posterior distribution as a function of <math>\theta</math>.
  
* Next, suppose you allow $θ$ to be any value in the interval $[0,1]$. Using the uniform prior density for $θ$, so that $p(θ) = 1$, plot the posterior density $p(θ)\times P(\sum_{i=1}^n{Y_i} = 57|θ)$ as a function of $θ$.  
+
* Next, suppose you allow <math>\theta</math> to be any value in the interval <math>[0,1]</math>. Using the uniform prior density for <math>\theta</math>, so that <math>p(\theta) = 1</math>, plot the posterior density <math>p(\theta)\times P(\sum_{i=1}^n{Y_i} = 57|\theta)</math> as a function of <math>\theta</math>.
  
* As discussed above, the posterior distribution of $θ$ is $Beta(1+ 57,1 + 100 57)$. Plot the posterior density as a function of $θ$. Discuss the relationships among all of the plots you have made for this exercise.
+
* As discussed above, the posterior distribution of <math>\theta</math> is <math>\text{Beta}(1+57,1+100-57)</math>. Plot the posterior density as a function of <math>\theta</math>. Discuss the relationships among all of the plots you have made for this exercise.
  
 
===References===
 
===References===
 
* [http://en.wikipedia.org/wiki/Bayesian_inference  Wikipedia Bayesian Inference]
 
* [http://en.wikipedia.org/wiki/Bayesian_inference  Wikipedia Bayesian Inference]
 
+
* [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0019178 Christou N, Dinov, ID (2011) Confidence Interval Based Parameter Estimation—A New SOCR Applet and Activity. PLoS ONE 6(5): e19178. doi:10.1371/journal.pone.0019178].
  
 
<hr>
 
<hr>
* SOCR Home page: http://www.socr.umich.edu
+
* SOCR Home page: https://socr.umich.edu
  
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_BayesianInference}}
+
{{translate|pageName=https://wiki.socr.umich.edu/index.php?title=SMHS_BayesianInference}}

Latest revision as of 09:16, 10 February 2026

Scientific Methods for Health Sciences - Bayesian Inference

Overview

Bayes’ theorem goes beyond connecting conditional and compound probability theory. It provides a method of inference where Bayes’ rule is used to update the probability estimate for a hypothesis as additional evidence is acquired. It is widely used in varieties of studies including medicine, law, science and engineering. In this section, we present a general introduction to the area of Bayesian inference and focus on its applications.

Motivation

We are already familiar with the concepts in frequentist statistics, which focuses on drawing conclusions and making statistical inference from sample data by the emphasis on the frequency or proportion of the data. However, what if we want to make inference in a different way where we make inference after taking into consideration of the observed data. That is we are going to introduce the concept of posterior distribution and inference we can make based on this. So, how does Bayesian inference work and how is it different from the traditional frequentist statistics?

Theory

Review our earlier discussion of Probability Theory.

Conditional Probability

The conditional probability of event \(A\) occurring given that event \(B\) occurs is \( P(A|B) = \frac{P(A \cap B)}{P(B)} \). If \(A\) and \(B\) are independent then knowing \(B\), or \(B^c\), gives no information on the probability of \(A\), i.e., \( P(A|B)=P(A) \).

Multiplication rule

For any two events, \(A\) and \(B\), \( P(A\cap B)=P(A|B)P(B) \). In general, for \(n\) events \(A_1, ..., A_n\)\[ P(A_1 \cap A_2 \cap A_3 \cap \cdots \cap A_n ) = P(A_1) P(A_2|A_1) P(A_3|A_1\cap A_2) \cdots P(A_n|A_1\cap A_2\cap A_3\cap \cdots \cap A_{n-1})\].

Law of total probability

If \( \{A_1,\ldots,A_n\} \) forms a partition of the sample space \(S\), then for every event \(B\), the probability of \(B\) to occur can be split into\[P(B)=P(B|A_1)P(A_1)+P(B|A_2)P(A_2)+\cdots+P(B|A_n)P(A_n).\]

Inverting the order of conditioning

As event intersection is commutative, \( P(A \cap B) = P(A | B) \times P(B) = P(B | A) \times P(A) \).

Bayesian Rule

Combining the properties above, for any partition the sample space \(S\), \( \{A_1,\ldots,A_n\}\), and any two events \(A\) and \(B\) (i.e., subsets of \(S\)) we have\[P(A | B) = \frac{P(B | A) P(A)}{P(B)} = \frac{P(B | A) P(A)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_n)P(A_n)}.\]

Interpretation

To interpret of some of the probability concepts above we can think of \(P(A)\) as a prior distribution of event \(A\), which is the initial degree of belief in \(A\) (before seeing any data/evidence). \(P(A|B)\) is referred to as the posterior distribution of event \(A\) given (the evidence) \(B\) representing the degree change of our initial belief having observed \(B\).

The Bayesian’s rule tells us that the posterior odds equal prior odds times Bayes factor. When arbitrarily many events \(A\) are of interest, not just two, the rule can be rephrased as posterior proportional to prior times likelihood, which can be expressed as \(P(A|B) \propto P(A)P(B|A)\), where the proportionality symbol (\(\propto\)) means that the left hand side is proportional to the right hand side as \(A\) varies given \(B\).

Given events \(A_1,A_2\) and \(B\), with Bayes’ rule we have that the conditional odds of \(A_1:A_2\) given \(B\) are equal to the marginal odds of \(A_1\) to \(A_2\) (i.e., \(A_1:A_2\)) multiplied by the Bayes factor or likelihood ratio \(\Lambda\). That is, \(O(A_1:A_2|B)=\Lambda(A_1: A_2|B) \times O(A_1:A_2)\), where \(\Lambda(A_1: A_2|B)=\frac{P(B|A_1)}{P(B|A_2)}\). The odds \(O(A_1: A_2 )=\frac{P(A_1 )}{P(A_2)}\), and conditional odds \(O(A_1: A_2|B)=\frac{P(A_1|B)}{P(A_2|B)}\) are known as prior odds and posterior odds respectively.

For the special case, where \(A_1=A^c\equiv A'\), this translates into using two competing hypotheses concerning the cause of \(B\). If we have \(O(A)=O(A:A^c)\), the odds on A is by definition the odds for and against \(A\). Bayes’ rule can then be written as \(O(A|B)=O(A)\Lambda(A|B)\), that is the posterior odds on \(A\) equals the prior odds on \(A\) times the likelihood ratio for \(A\) given \(B\) (posterior odds equal prior odds times likelihood ratio).

Statistical inference

The goal of statistical inference is to learn about the general population by taking a subset from that population. The steps in classical statistical parametric inference are:

  1. collect a random sample from the population;
  2. determine a probability model for the data \(y_1,...,y_n\) obtained (e.g., we are collecting data on the height of US women: each woman’s height can be thought of as a random variable with a normal distribution). This is called the sampling model;
  3. the sampling model will depend on a set of parameters, \(\theta\)\[p(y_i | \theta)\], for \(i=1,...,n\);
  4. using the sampling model, construct the likelihood function \(L(\theta; y)\);
  5. determine the maximum likelihood estimate (MLE) of \(\theta\) by maximizing the likelihood function \(L(\theta; y)\).

In the classical or frequentist approach to statistical inference, the parameters of a sampling model are assumed to be constant but unknown.

The Bayesian inference paradigm allows us to incorporate subjective prior beliefs regarding the parameter(s), \(\theta\) of the sampling model in our learning procedure. In Bayesian statistics, before we observe the data, we have some belief regarding possible values, or range of values for the set of parameters \(\theta\). This knowledge is expressed in the form of a prior distribution. Since we are assigning a distribution to \(\theta\), we are now assuming that \(\theta\) is a random variable (univariate if \(\theta\) is just a single parameter, multivariate if \(\theta\) consists of more than one parameter).

Since \(\theta\) is random, there is a sample space associated with \(\theta\). This parameter space is typically denoted by \(\Theta\). The parameter space \(\{\theta \in \Theta \}\) is the set of all possible values for the parameter.

Once data \(y_1,...,y_n\) have been collected, we specify a sampling model for the data \(p(y_i|\theta)\), \(i=1,\ldots,n\), which represents the probability of observing \(y_i\) if we knew \(\theta\). In light of the data observed, using Bayes’ theorem we update our belief about \(\theta\) and derive the posterior distribution \(p(\theta|y_1,...,y_n)=p(\theta|y)\), where \(y=\{y_1,...,y_n\}\).

Advantages of Bayesian Methods

Bayesian methods offer several key advantages over frequentist approaches in statistical inference. Below, we outline these benefits, highlighting how they address limitations in frequentist methods.

  • Interpretation: Bayesian inference provides a posterior distribution for the unknown parameter(s) \(\theta\), which allows for direct probabilistic statements about \(\theta\). For instance, suppose the parameter of interest is the population mean \(\theta\). After collecting data, a Bayesian 95% credible interval directly indicates that there is a 95% posterior probability that \(\theta\) lies within the interval. In contrast, a frequentist 95% confidence interval, such as \(\hat{y} \pm 1.96 \frac{s}{\sqrt{n}}\), does not represent the probability that \(\theta\) is in the interval after the data are observed—it either contains \(\theta\) or it does not. Instead, the 95% refers to the long-run coverage probability across repeated hypothetical samples.
  • Likelihood Principle: Bayesian inference adheres to the likelihood principle, which asserts that if two different sampling models produce proportional likelihood functions for \(\theta\) given the observed data, then inferences about \(\theta\) should be identical under both models. Frequentist methods, however, can violate this principle, leading to different conclusions from the same data depending on the sampling design. For example, consider a clinical trial where we observe 9 patients cured and 3 still sick. We wish to test the null hypothesis \(H_0: \theta = 1/2\) against the alternative \(H_a: \theta > 1/2\), where \(\theta\) is the probability of a successful cure (i.e., testing if the cure rate is better than 50/50).
 ** Model 1: Fixed number of trials (Binomial).

Let \(Y_1\) be the number of successful treatments out of 12 patients, modeled as Binomial(n=12, \theta). The observed data is \(Y_1 = 9\).

 ** Model 2: Stop when a fixed number of failures occur (Negative Binomial).

Let \(Y_2\) be the number of successful treatments observed before reaching 3 failures (sick patients), modeled as a Negative Binomial(r=3, \theta) (number of successes before r failures, with success probability \(\theta\)). The observed data is \(Y_2 = 9\).

The likelihood functions under both models are proportional (both involve \(\theta^9 (1-\theta)^3\) up to a constant), yet the frequentist p-values differ: approximately 0.073 for the binomial model and 0.033 for the negative binomial model. At a 5% significance level, the negative binomial model would reject \(H_0\), while the binomial model would not, despite the data being the same.

  • Sequential Learning and Updating: Bayes' theorem provides a coherent framework for updating beliefs with new data. If you have a posterior distribution \(p(\theta \mid y_1, \dots, y_n)\) from initial data \(y_1, \dots, y_n\), it can serve as the prior for incorporating a new observation \(y_{n+1}\), yielding an updated posterior \(p(\theta \mid y_1, \cdots, y_{n+1})\). This enables rational, incremental learning without starting from scratch each time.
  • No Reliance on Large Samples: Unlike many frequentist methods that depend on asymptotic approximations (e.g., normal approximations for valid inference), Bayesian inference is exact for any sample size. The posterior distribution is derived directly from the likelihood and prior, making it suitable for small-sample problems where frequentist methods may fail.
  • Incorporation of Prior Information: Bayesian methods allow the integration of prior knowledge or expert opinion into the analysis through the prior distribution, which can be particularly useful in fields with existing domain expertise, such as medicine or engineering.
  • Flexibility in Modeling: Bayesian approaches handle complex, hierarchical models and missing data more naturally via techniques like Markov Chain Monte Carlo (MCMC) simulation, without needing ad-hoc adjustments common in frequentist statistics.
  • Frequentist Methods as a Special Case: Bayesian inference can reproduce frequentist results under certain conditions, such as using non-informative (e.g., uniform) priors. In these cases, the posterior mode often coincides with the maximum likelihood estimate (MLE), and credible intervals may align with confidence intervals.

Criticisms and Considerations

While Bayesian methods have many strengths, they are not without challenges:

  • Prior Sensitivity: The choice of prior distribution can influence results, especially with limited data. Different analysts using different priors on the same data might reach different conclusions. To mitigate this, it is essential to perform sensitivity analyses, examining how results change under alternative priors. Objective or non-informative priors (e.g., Jeffreys priors) can help reduce subjectivity in some cases.

To further enhance this section, consider adding cross-references to related wiki pages (e.g., on priors, MCMC, or specific distributions) and including illustrative figures or simulations for the examples.

Bayes theorem for density functions

  • Marginal density \(p(y)=\int_{\theta \in \Theta}{p(y,\theta)d\theta}=\int_{\theta \in \Theta}{p(y|\theta)p(\theta)d\theta}\), where \(\Theta\) is the parameter space. Substituting the expression for \(p(y)\), Bayes’ theorem for density function can be written as \(p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} = \frac{p(y|\theta)p(\theta)}{\int_{\theta \in \Theta}{p(y|\theta)p(\theta)d\theta}}\). Given that \(p(\theta|y)\) is a function of \(\theta\), the denominator is a constant and thus, the posterior density of \(\theta\) is often expressed up to a proportionality constant as \(p(\theta|y) \propto p(y|\theta)p(\theta)\), where the sign \(\propto\) means “proportional to”.
The proportionality constant in the posterior density \(p(\theta|y)\) can be determined using the fact that the posterior density \(p(\theta|y)\) is a PDF. Suppose that we have observed \(Y=y\), and we want to evaluate and compare the odds of two possible numerical values for \(\theta\), then the probability (density) of \(\theta_a\) relative to \(\theta_b\) is\[\frac{p(\theta_a|y)}{p(\theta_b|y)} = \frac{\frac{p(y|\theta_a)p(\theta_a)}{p(y)}}{\frac{p(y|\theta_b)p(\theta_b)}{p(y)}}= \frac{p(y|\theta_a)p(\theta_a)}{p(y|\theta_b)p(\theta_b)},\]
where \(\frac{p(\theta_a)}{p(\theta_b)}\) is the prior odds and \(\frac{p(y|\theta_a)}{p(y|\theta_b)}\) is the likelihood ratio. In comparing the posterior probability of \(\theta_a\) and \(\theta_b\), it is not necessary to compute \(p(y)\).

Prior Distributions

How to choose a prior distribution \(p(\theta)\)? There are various types of prior distributions:

  • Non-informative priors: a prior distribution \(p(\theta)\) is non-informative if it has minimal impact on the posterior distribution \(p(\theta|y)\) of \(\theta\). In general, a non-informative prior is one which is dominated by the likelihood, that is, it is a prior that does not change very much over the region in which the likelihood is appreciable and does not assume large values outside that range. A prior that has these properties is said to be locally uniform. Non-informative priors are also called reference priors, vague priors or flat priors.
  • Examples:
    • If \(\theta \in \Theta = [0,1]\), then \(p(\theta) = \text{Uniform}(0,1)\) is a non-informative prior and \(p(\theta) = 1\) for \(\theta \in \Theta = [0,1]\).
    • If \(\theta \in \Theta = \mathbb{R}\), then if \(p(\theta) = \text{Normal}(\mu_0,\sigma_0^2)\) with \(\sigma_0^2 \longrightarrow \infty\), we get a non-informative prior. That is, we can pick \(\sigma_0^2\) large enough so that we obtain a non-informative prior.
  • Improper priors: a prior distribution \(p(\theta)\) on \(\theta \in \Theta\) is said to be improper if \(\int_{\theta \in \Theta}{p(\theta)d\theta}=\infty\). Improper priors are often used in Bayesian inference because they yield non-informative priors. For example: if \(\theta \in \Theta=\mathbb{R}\) and \(p(\theta) \propto 1\), then \(p(\theta)\) is an improper prior since clearly \(\int_{\theta \in \Theta}{p(\theta)d\theta}=\int_{-\infty}^{+\infty}{d\theta}\) is undefined. An improper prior \(p(\theta)\) can result in an improper posterior distribution \(p(\theta|y)\). One cannot make inference with improper posterior distributions! An improper prior \(p(\theta)\) may still lead to a proper posterior distribution. If \(p(\theta)\) is an improper prior, then the posterior distribution \(p(\theta|y)\) has always a particular form. In fact, if \(p(\theta)\) is improper then \(p(\theta) = c\) for some \(c > 0\) and \(p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} =\frac{p(y|\theta)p(\theta)}{\int_{\theta\in\Theta}{p(y|\theta)cd\theta}}=\frac{p(y|\theta)p(\theta)}{\int_{\theta\in\Theta}{p(y|\theta)cd\theta}} = \frac{1}{K} p(y|\theta)\), where \(K=\int_{\theta\in\Theta}{p(y|\theta)d\theta}\). That is, the posterior distribution, \(p(\theta|y)\), is just the renormalized likelihood.

One-parameter models: Binomial model

Consider data from a randomized controlled clinical trial evaluating a drug to reperfuse blocked blood vessels in patients with ischemic stroke. We indicate with \(Y_i\) whether the \(i^{th}\) patient treated with the drug had a favorable outcome (no disability at 90 days). Then the binary outcome variable is defined by\[ Y_i = \begin{cases} 1, & \text{if patient i has a favorable outcome} \\ 0, & \text{otherwise} \end{cases} .\]

Demonstration

Suppose that we have data from \(n = 129\) patients. Then, since the size \(N\) of the general population is much greater than the size \(n\) of the sample, \(Y_1,...,Y_n\) can be considered to be exchangeable random variables. Using the De Finetti’s theorem, we can set up the following model for our data: a prior distribution \(p(\theta)\) for a parameter \(\theta\), representing the proportion of 1’s in the population \(\left( \sum_{i=1}^N{\frac{Y_i}{N}}\right)\), given \(\theta\), the \(Y_1,...,Y_n\) are i.i.d.’s with \(p(y|\theta)=\theta^y (1-\theta)^{1-y}\), the sampling model adopted leads to \(p(y_1,...,y_n|\theta)=p(y_1,...,y_{129}|\theta)= \prod_{i=1}^{129}{\theta^{y_i}(1-\theta)^{1-y_i}}= \theta^{\sum_{i=1}^{129}{y_i}}(1-\theta)^{129-\sum_{i=1}^{129}{y_i}}\). Now we need to choose a prior distribution for \(\theta\).

The parameter \(\theta\) is a random variable with parameter space \(\Theta=[0,1]\). A first choice would be to use a locally uniform prior, that is a prior distribution \(p(\theta)\) that assigns the same probability to all intervals \((a,b) \subseteq [0,1]\) of the same length. This leads to \(p(\theta) \sim \text{Uniform}([0,1])\), i.e., \(p(\theta) = 1\) for \(\theta \in [0,1]\). Then, we obtain the following posterior distribution\[p(\theta|y_1,...,y_{129})=\frac{p(y_1,...,y_{129}|\theta)p(\theta)}{p(y_1,...,y_{129})}= p(y_1,...,y_{129}|\theta)\times \frac{1}{p(y_1,...,y_{129})} \propto p(y_1,...,y_{129}|\theta)\] since \(p(y_1,...,y_{129})\) does not depend on \(\theta\). Therefore, the posterior distribution \(p(\theta|y_1,...,y_{129})\) has the same shape as \(p(y_1,...,y_{129}|\theta)\) but they are on a different scale.

If of the \(n=129\) patients receiving the drug 118 had favorable outcomes and 11 did not, then we have \(\sum_{i=1}^{129}{y_i}=118\), so \(p(y_1,...,y_{129}|\theta)=\theta^{\sum_{i=1}^{129}{y_i}}(1-\theta)^{129-\sum_{i=1}^{129}{y_i}}=\theta^{118} (1-\theta)^{11}.\)

Recall that a random variable \(Y\) with sample space \(S=[0,1]\) is said to have a Beta distribution (a,b) if its PDF is given by \(p(y|a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} y^{a-1} (1-y)^{b-1}\), with \(a,b>0\). For such a random variable, we have the following properties\[E[Y]=\frac{a}{a+b}\], \(Mode[Y]=\frac{a-1}{(a-1)+(b-1)}\), if \(a>1\) and \(b>1\), \(Var[Y]=\frac{ab}{(a+b+1) (a+b)^2} =\frac{E[Y](1-E[Y])}{a+b+1}\).

To determine the scale for the posterior distribution \(p(\theta|y_1,...,y_n)\) we use the fact that this is a conditional density and since it is a probability density function, it needs to integrate to 1, that is\[1=\int_{\theta\in \Theta}{p(\theta|y_1,...,y_{129})d\theta}=\int_0^1{\frac{\theta^{118}(1-\theta)^{11}}{p(y_1,...,y_{129})} d\theta}.\]

The following result from calculus is useful\[\int_0^1{\theta^{a-1} (1-\theta)^{b-1} d\theta}=\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\], where for any \(x>0\), the gamma function, \(\Gamma(x)\), can be looked up in a table or computed using SOCR or R using command gamma(). Thus, we have \(1=\int_0^1{\frac{\theta^{118} (1-\theta)^{11}}{p(y_1,...,y_{129})}d\theta}= \frac{1}{p(y_1,...,y_{129})} \int_0^1{\theta^{118} (1-\theta)^{11} d\theta}=\frac{1}{p(y_1,...,y_{129})} \frac{\Gamma(119)\Gamma(12)}{\Gamma(131)}.\) Hence, we have that \(p(\theta|y_1,...,y_{129})=\frac{\Gamma(131)}{\Gamma(119)\Gamma(12)} \theta^{118} (1-\theta)^{11}.\) That is, \(p(\theta|y_1,...,y_{129})\) has a Beta distribution with \(a=119\) and \(b=12\) parameters. Therefore \(E[\theta|y_1,...,y_{129}]=0.908\), \(Mode[\theta|y_1,...,y_{129}]=0.915\), and \(Var[\theta|y_1,...,y_{129}]=6.25\times 10^{-4}\).

  • Conclusion: In Bayesian inference: posterior information ≥ Prior information ≥ 0, where the second "≥" is just "=" only if the prior is non-informative.

Binomial model

Suppose the data \(Y_1,...,Y_n\) represent exchangeable binary random variables (i.e., a sequence such that future samples behave like earlier samples, meaning formally that any order (of a finite number of samples) is equally likely). For our choice of prior and sampling model, we have obtained \(p(\theta|y_1,...,y_n) \sim \text{Beta}(1+\sum_{i=1}^n{y_i}, 1+n-\sum_{i=1}^n{y_i})\). Now, let’s suppose we want to compare the posterior odds of two values \(\theta_a\) and \(\theta_b\) of \(\theta\) in light of an instance of the data \(y_1,...,y_n\).

The odds ratio is\[\frac{p(\theta_a|y_1,...,y_n)}{p(\theta_b|y_1,...,y_n)}=\frac{\frac{\Gamma(n+2)}{\Gamma(1+\sum_{i=1}^n{y_i}) \Gamma(1+n-\sum_{i=1}^n{y_i})} {\theta_a}^{\sum_{i=1}^n{y_i}} (1-\theta_a)^{n-\sum_{i=1}^n{y_i}}} {\frac{\Gamma(n+2)}{\Gamma(1+\sum_{i=1}^n{y_i}) \Gamma(1+n-\sum_{i=1}^n{y_i})} {\theta_b}^{\sum_{i=1}^n{y_i}} (1-\theta_b)^{n-\sum_{i=1}^n{y_i}}}.\]

Simplifying, we get\[\frac{p(\theta_a|y_1,...,y_n)}{p(\theta_b|y_1,...,y_n)}=\frac{{\theta_a}^{\sum_{i=1}^n{y_i}} (1-\theta_a)^{n-\sum_{i=1}^n{y_i}}} { {\theta_b}^{\sum_{i=1}^n{y_i}} (1-\theta_b)^{n-\sum_{i=1}^n{y_i}}}= \left( \frac{\theta_a}{\theta_b}\right)^{\sum_{i=1}^n{y_i}} \left( \frac{1-\theta_a}{1-\theta_b}\right)^{n-\sum_{i=1}^n{y_i}}.\]

Hence, the posterior odds depend on the data only through \(\sum_{i=1}^n{y_i}\). In general, this is also true for the posterior probability, since \(p(\theta|y_1,...,y_n)=\text{Beta}(1+\sum_{i=1}^n{y_i} ,1+n-\sum_{i=1}^n{y_i})\). In other words, \(\sum_{i=1}^n{y_i}\) is a sufficient statistic.

Sufficient Statistics

If \(Y_1,...,Y_n\) are random variables, a statistic \(T(Y_1,...,Y_n)\) is a sufficient statistic if it captures all the information about \(\theta\) that is contained in the sample. If \(Y_1,...,Y_n\) are binary random variables that are conditionally independently and identically distributed with \(p(y|\theta)\) a Bernoulli distribution with parameter \(\theta\), then \(\sum_{i=1}^n{y_i}\) is a Binomial random variable and its density is \(p(y|\theta)={n \choose y} \theta^y (1-\theta)^{n-y}\), for \(y\in \{0,1,\ldots,n\}\).

What would be the posterior distribution of \(\theta\) if instead of working directly with the random variables \(Y_1,...,Y_n\), we work with the sufficient statistic \(Y=\sum_{i=1}^n{y_i}\)?

Here, \(\theta\in \Theta=[0,1]\) with prior distribution \(p(\theta)\). Let’s use again a uniform prior \(p(\theta)=1\) for \(\theta\), then \(p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)} =\frac{{n\choose y} \theta^y (1-\theta)^{n-y}\times 1}{p(y)} \propto \theta^y (1-\theta)^{n-y}\), for \(\theta\in [0,1]\). We can determine again the proportionality constant by using the fact that \(p(\theta|y)\) is a density and thus it needs to integrate to 1. We can see that the part of \(p(\theta|y)\) that depends on \(\theta\) has the same functional form as the kernel of a Beta distribution\[\text{Beta}(x;a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1-x)^{b-1},\] where \(x\in [0,1]\), and \(a\) and \(b\) are positive. Since the kernel of \(p(\theta|y)\) matches that of a \(\text{Beta}(a,b)\) with \(a=1+y\) and \(b=1+n-y\), the constant needed in \(p(\theta|y)\) is \(\frac{\Gamma(n+2)}{\Gamma(1+y)\Gamma(1+n-y)}\).

Summarizing, we have the distribution of \(\theta\) is \(p(\theta)=1\), \(\theta\in [0,1]\) and \(Y \sim p(y|\theta)=\text{Binomial}(n, \theta)\). This implies that \(p(\theta|y)=\text{Beta}(1+y,1+n-y)\).

If \(\theta \sim p(\theta)=\text{Beta}(a,b)\), \(Y|\theta \sim p(y|\theta)=\text{Binomial}(n, \theta)\), and we want to report a point estimate for \(\theta\) based on the data, we can then report its posterior mean. Since \(p(\theta|y)=\text{Beta}(\tilde{a}=a+y,\tilde{b}=b+n-y)\), using the formulas for the mean of a Beta(a,b) random variable, we have that the posterior mean is\[E(\theta|y)=\frac{\tilde{a}}{\tilde{a}+\tilde{b}}=\] \(\frac{a+y}{(a+y)+(b+n-y)}=\frac{a+y}{a+b+n}=\frac{a+b}{a+b+n}\times \frac{a}{a+b}+\frac{n}{a+b+n}\frac{y}{n}=\) \(\frac{a+b}{a+b+n}\times \text{ prior mean} +\frac{n}{n+a+b}\times \text{sample average}.\)

The posterior mean is a weighted average of the prior mean and the sample average. The weight of the prior mean, \(\frac{a+b}{a+b+n}\) is a ratio of the prior sample size to the sum of the prior sample size and the sample size of the data. If \(n \gg a+b\) (prior sample size), then the weight of the prior mean becomes rather small and the majority of the information about \(\theta\) comes from the data and not from the prior. In this case\[E[\theta|y]\approx \frac{y}{n}=\hat{y}\]. The mode of the posterior distribution \(p(\theta|y)=\text{Beta}(a+y,b+n-y)\) is given by \(Mode(\theta|y)=\frac{a+y-1}{a+b+n-2}\). Also here note that if the sample size n is very large, the posterior mode is approximately equal to \(\frac{y}{n}\), the sample average.

The posterior distribution \(p(\theta|y)\) can also be summarized by providing intervals for \(\theta\) of the form \([l(y),u(y)]\) that have a high posterior probability of containing \(\theta\). For example, \(P(\theta\in [l(y),u(y)]|y)=0.95\) has a 95% Bayesian coverage or is a 95% credible interval for \(\theta\). To compute the 95% CI for \(\theta\) by taking the 2.5th percentile and 97.5th percentile of \(\text{Beta}(a+y,b+n-y)\). For the above example with \(n=129\) and \(y=118\), we have:

Prior parameters \((a,b)\) 95% Credible Interval for \(\theta\)
(1,1) [0.845, 0.951]
(2,1) [0.855, 0.952]
(3,1) [0.856, 0.952]
(1,2) [0.845, 0.946]
(2,2) [0.847, 0.947]

Applications

  • This article presents a generalization of Bayesian inference. Procedures of statistical inference are described which generalize Bayesian inference in specific ways. Probability is used in such a way that in general only bounds may be placed on the probabilities of given events, and probability systems of this kind are suggested both for sample information and for prior information. These systems are then combined using a specified rule. Illustrations are given for inferences about trinomial probabilities, and for inferences about a monotone sequence of binomial pi. Finally, some comments are made on the general class of models which produce upper and lower probabilities, and on the specific models which underlie the suggested inference procedures.
  • This article discusses about Bayesian inference of Phylogeny and its impact on evolutionary biology. As a discipline, phylogenetics is becoming transformed by a flood of molecular data. These data allow broad questions to be asked about the history of life, but also present difficult statistical and computational problems. Bayesian inference of phylogeny brings a new perspective to a number of outstanding issues in evolutionary biology, including the analysis of large phylogenetic trees and complex evolutionary models and the detection of the footprint of natural selection in DNA sequences.

Software

## 3D and 2D plots of the posterior probability that theta > some value w given the data in the 
## Beta-Binomial model
## The function that makes 3D plot is the persp function
## For help and documentation, type ?persp at the prompt in R
## The function persp takes as arguments: a x vector that gives the values on the x axis (this has
## to be in increasing order), a y vector with values on the y axis (also in increasing order),
## and a z matrix of dimension equal to the length of the x vector times the length of the y vector.
## The z matrix has at the (i,j) entry the value of the variable to be displayed on the third dimension
## corresponding to the i-th value of the x-vector and the j-th value of the y-vector, that is z[i,j] is 
## the value of the z variable at x[i] and y[j].

 a <- seq(0,10,by=0.1)
 b <- seq(0,10,by=0.1)
 w <- 0.8
 n <- 129
 y <- 118
 ## Plotting the posterior distribution
 for(i in 1:length(a)){
 	for(j in 1:length(b)){
 		plot(seq(0,1,by=0.01),dbeta(seq(0,1,by=0.01),a[i]+y,b[j]+(n-y)),type="l",col="black",lty=2,lwd=1,
 			 xlim=c(0,1),xlab="Theta",ylab="P(theta | y)")
 		title(main=paste("Posterior distribution a=",a[i]," b=",b[j],sep=""))
 		abline(v=0.8,lty=3,col="red")
 		
 	}
 }

 post.prob.theta <- matrix(0,length(a),length(b))
 
 for(i in 1:length(a)){
 	for(j in 1:length(b)){
 		post.a <- a[i]+y
 		post.b <- b[j]+(n-y)
 		
 		post.prob.theta[i,j] <- 1-pbeta(w,post.a,post.b)
 	}
 }
 persp(a,b,post.prob.theta,xlab="a",ylab="b",zlab="p(theta > w | y)")
 # To rotate the figure, you can specify the value of the angles theta and phi by which to tilt the figure
 
 persp(a,b,post.prob.theta,xlab="a",ylab="b",zlab="p(theta > 0.8 | y)",main="P(theta > 0.8 | y)", theta=30,phi=20)

image(post.prob.theta)

Problems

Sample survey

Suppose we are going to sample 100 individuals from a country (of size much larger than 100) and ask each sampled person whether they support policy Z or not. Let \(Y_i=1\) if person \(i\) in the sample supports the policy and 0 otherwise.

  • Assume \(Y_1,\ldots,Y_{100}\) are conditional on \(\theta\), IID binary random variables with expectation \(\theta\). Write down the joint distribution of \(P(Y_1=y_1,Y_2=y_2,\ldots,Y_{100}=y_{100}|\theta)\) in a compact form. Also write down the form of \(P(\sum{Y_i=y}|\theta)\).
  • Suppose you believed that \(\theta\in \{0.0,0.1,\ldots,0.9,1.0\}\). Given that the results of the survey were \(\sum_{i=1}^{100}{Y_i} =57\), compute \(P(\sum_{i=1}^{100}{Y_i} =57|\theta)\) for each of these 11 values of \(\theta\) and plot these probabilities as a function of \(\theta\).
  • Now suppose you originally had no prior information to believe one of these \(\theta\)-values over another, and so \(P(\theta = 0.0) = P(\theta = 0.1) = \cdots = P(\theta=0.9)=P(\theta= 1.0)\). Use Bayes’ rule to compute \(p(\theta| \sum_{i=1}^n{Y_i} = 57)\) for each \(\theta\)-value. Make a plot of this posterior distribution as a function of \(\theta\).
  • Next, suppose you allow \(\theta\) to be any value in the interval \([0,1]\). Using the uniform prior density for \(\theta\), so that \(p(\theta) = 1\), plot the posterior density \(p(\theta)\times P(\sum_{i=1}^n{Y_i} = 57|\theta)\) as a function of \(\theta\).
  • As discussed above, the posterior distribution of \(\theta\) is \(\text{Beta}(1+57,1+100-57)\). Plot the posterior density as a function of \(\theta\). Discuss the relationships among all of the plots you have made for this exercise.

References




Translate this page:

(default)
Uk flag.gif

Deutsch
De flag.gif

Español
Es flag.gif

Français
Fr flag.gif

Italiano
It flag.gif

Português
Pt flag.gif

日本語
Jp flag.gif

България
Bg flag.gif

الامارات العربية المتحدة
Ae flag.gif

Suomi
Fi flag.gif

इस भाषा में
In flag.gif

Norge
No flag.png

한국어
Kr flag.gif

中文
Cn flag.gif

繁体中文
Cn flag.gif

Русский
Ru flag.gif

Nederlands
Nl flag.gif

Ελληνικά
Gr flag.gif

Hrvatska
Hr flag.gif

Česká republika
Cz flag.gif

Danmark
Dk flag.gif

Polska
Pl flag.png

România
Ro flag.png

Sverige
Se flag.gif