Difference between revisions of "SMHS Epidemiology"
Revision as of 17:00, 24 January 2026
Contents
- 1 Scientific Methods for Health Sciences - Epidemiology
- 1.1 Overview
- 1.2 Motivation
- 1.3 Theory: The Human Genome and Mutation
- 1.4 Population Genetics
- 1.5 Pedigree Analysis and Inheritance
- 1.6 Linkage Analysis
- 1.7 Linkage Disequilibrium (LD) and Association
- 1.8 Gene-Environment Interactions
- 1.9 Core Epidemiological Measures
- 1.10 Applications and Software
- 1.11 Problems
- 1.12 References
Scientific Methods for Health Sciences - Epidemiology
Overview
Epidemiology is the scientific discipline that investigates the distribution, determinants, and control of health-related states or events (including diseases) in specified populations. It applies this knowledge to control health problems and improve public health outcomes. Historically, epidemiology originated from the study of infectious disease outbreaks, such as John Snow's investigation of the 1854 cholera epidemic in London, which linked contaminated water sources to disease spread. In modern times, the field has broadened to include non-infectious conditions like chronic diseases (e.g., cancer, diabetes), environmental exposures (e.g., air pollution, toxins), behavioral factors (e.g., smoking, diet), and genetic influences. This broader scope reflects the understanding that health outcomes arise from complex interactions between genetic predispositions, environmental factors, and social determinants.
A core framework in epidemiology is the "epidemiologic triad" of agent, host, and environment, but contemporary approaches emphasize the "person, place, and time" dimensions:
- Person: Characteristics of individuals, such as age, sex, genetics, socioeconomic status, and behaviors that influence susceptibility or exposure.
- Place: Geographic variations, including urban vs. rural settings, climate, or access to healthcare, which can reveal environmental or social risk factors.
- Time: Temporal patterns, such as seasonal trends, secular changes over years, or sudden outbreaks, helping identify emerging threats or intervention effects.
This section delves into genetic epidemiology, bridging molecular biology with population-level analysis to identify risk factors and outcomes. It explores how genetic variations contribute to disease patterns and how computational tools can quantify these relationships.
Motivation
This module equips learners with foundational knowledge in genetic epidemiology, enabling them to integrate genetic data into public health practice. By the end of this module, learners should be able to:
- Understand Genetic Foundations: Describe the structure and key features of the human genome, explain the types and distributions of mutations, and apply principles of Mendelian inheritance, including segregation (the two alleles at a locus separate into different gametes), independent assortment (alleles at unlinked loci are transmitted independently), and linkage (genes on the same chromosome tending to be inherited together unless separated by recombination).
- Analyze Population Dynamics: Utilize quantitative genetic concepts to examine the interplay between genetic variation and phenotypic (disease) variation in populations, including calculations of allele and genotype frequencies and testing for Hardy-Weinberg equilibrium to detect deviations indicative of evolutionary pressures.
- Evaluate Associations: Identify common gene-disease relationships (e.g., monogenic vs. polygenic traits), interpret results from Genome-Wide Association Studies (GWAS) to pinpoint susceptibility loci, and recognize gene-environment interactions (e.g., how smoking exacerbates genetic risks for lung cancer).
- Apply Computational Methods: Conduct basic genetic association analyses using statistical software like R, interpret epidemiological measures such as Number Needed to Treat (NNT), Odds Ratio (OR), and Relative Risk (RR), and understand their implications for clinical and public health decision-making.
- Additional Skills: Critically evaluate study designs (e.g., cohort vs. case-control), account for confounders like population stratification in genetic studies, and discuss ethical considerations in genetic epidemiology, such as privacy in genomic data.
These objectives align with real-world applications, such as designing targeted interventions (e.g., pharmacogenomics) or predicting disease outbreaks through genomic surveillance.
Theory: The Human Genome and Mutation
The human genome comprises approximately 3 billion base pairs of DNA, encoding around 20,000–25,000 genes that orchestrate cellular functions, development, and responses to the environment. Mutations—changes in this DNA sequence—can arise spontaneously or from external factors (e.g., radiation, chemicals) and may lead to diseases if they disrupt gene function. Understanding genomic structure and mutations is crucial for identifying genetic risk factors in epidemiological studies.
Chromosomal Structure
Chromosomes are thread-like structures in the cell nucleus that package and organize DNA for efficient replication and segregation during cell division. Each human cell (except gametes) contains 46 chromosomes.
- Banding: Cytogenetic staining techniques (e.g., Giemsa staining) produce visible bands on chromosomes. Dark bands (G-bands) are AT-rich, heterochromatic regions with fewer genes, while light bands (R-bands) are GC-rich, euchromatic, and gene-dense. These patterns, spanning millions of nucleotides, aid in identifying chromosomal abnormalities in karyotyping.
- Karyotype: The complete set of chromosomes arranged by size and shape. A typical human karyotype includes 22 pairs of autosomes (numbered 1–22) and one pair of sex chromosomes (XX in females, XY in males). Abnormal karyotypes, such as trisomy 21 (Down syndrome), can be detected via techniques like fluorescence in situ hybridization (FISH).
- Functional Elements:
* Centromeres: Central constrictions composed of repetitive alpha-satellite DNA sequences. They serve as attachment points for spindle fibers during mitosis and meiosis, ensuring proper chromosome segregation. Centromeric dysfunction can lead to aneuploidy (abnormal chromosome numbers).
* Telomeres: Protective caps at chromosome ends, consisting of repetitive TTAGGG sequences (in humans) bound by shelterin proteins. Telomeres shorten with each somatic cell division due to the "end-replication problem," contributing to cellular aging (senescence). In germ cells and stem cells, the enzyme telomerase maintains telomere length. Shortened telomeres are linked to diseases like dyskeratosis congenita and increased cancer risk.
* Chromatin: The complex of DNA and proteins (histones) that forms chromosomes. Epigenetic modifications (e.g., histone methylation) regulate chromatin states. It exists in two states:
** Euchromatin: Loosely packed, transcriptionally active, and rich in genes and regulatory elements.
** Heterochromatin: Tightly packed, transcriptionally silent, often containing repetitive DNA (e.g., satellites, transposons) near centromeres and telomeres.
Mutations and Abnormalities
Mutations are heritable changes in the DNA sequence, occurring at a rate of about 10⁻⁸ per nucleotide per generation. They can be somatic (acquired in body cells, e.g., leading to cancer) or germline (in gametes, passed to offspring). Mutations drive genetic diversity but can cause disease if they affect critical genes.
- Structural Abnormalities (Large-scale changes visible under a microscope, affecting thousands to millions of base pairs):
* Deletion: Removal of a chromosomal segment, leading to loss of genes (e.g., cri-du-chat syndrome from 5p deletion).
* Duplication: Extra copy of a segment, potentially causing gene dosage imbalances (e.g., Charcot-Marie-Tooth disease from PMP22 duplication).
* Translocation: Exchange of segments between non-homologous chromosomes. Balanced translocations may be asymptomatic but increase risks in offspring; unbalanced ones cause disorders (e.g., chronic myeloid leukemia from t(9;22) "Philadelphia chromosome").
* Inversion: Reversal of a segment within a chromosome, which can disrupt genes or lead to abnormal recombination (e.g., hemophilia A inversions).
* Other Types: Isochromosomes (duplicated arms, e.g., i(Xq) in Turner syndrome) or ring chromosomes (circular formations from deletions at both ends).
- Point Mutations (Small-scale changes at the nucleotide level, detected via sequencing):
* Nucleotide Substitution: Replacement of one base with another.
** Silent: No amino acid change (due to codon degeneracy).
** Missense: Amino acid change (e.g., sickle cell anemia from GAG to GTG in beta-globin gene).
** Nonsense: Introduces premature stop codon, truncating the protein (e.g., cystic fibrosis).
* Indels: Insertions or deletions of nucleotides. Small indels can cause frameshifts, altering the reading frame and producing dysfunctional proteins (e.g., Tay-Sachs disease).
* Splice Site Variation: Mutations in intronic regions affecting mRNA splicing, leading to exon skipping or inclusion of introns (e.g., beta-thalassemia).
- Epidemiological Relevance: Mutations' population distribution informs disease prevalence. Rare mutations cause Mendelian disorders (e.g., Huntington's), while common variants (SNPs) contribute to complex traits via polygenic risk scores. Tools like next-generation sequencing (NGS) enable large-scale mutation detection in cohort studies.
Population Genetics
Population genetics examines how genetic variation is maintained, distributed, and evolves in groups, providing the mathematical foundation for genetic epidemiology. It helps predict disease risks based on allele frequencies and detect deviations signaling selection or population structure.
Allele and Genotype Frequencies
The gene pool is the total collection of alleles in a population at a given time. Frequencies are key metrics for assessing genetic diversity and disease susceptibility.
- Allele Frequency: Proportion of a specific allele at a locus. For a biallelic locus, frequencies sum to 1.
\(\text{Allele Frequency (p for A)} = \frac{2 \times \text{Number of AA} + \text{Number of Aa}}{2 \times \text{Total individuals}}.\)
- Example: In a population of 100 people with genotypes: 40 AA, 50 Aa, 10 aa. p (A) = (2*40 + 50) / 200 = 0.65; q (a) = 0.35.
- Genotype Frequency: Proportion of individuals with a specific genotype (e.g., P(AA) = number of AA / total individuals).
- Applications: Used in GWAS to compare frequencies between cases and controls; deviations can indicate associations.
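The allele-frequency formula can be checked directly on the worked example (40 AA, 50 Aa, 10 aa). The following is a minimal base R sketch; the variable names are illustrative and not part of any package.

```r
# Genotype counts from the worked example: 40 AA, 50 Aa, 10 aa
geno_counts <- c(AA = 40, Aa = 50, aa = 10)
n <- sum(geno_counts)

# Genotype frequencies: share of individuals carrying each genotype
geno_freq <- geno_counts / n

# Allele frequencies: every individual carries two alleles,
# so AA contributes two copies of A and Aa contributes one
p <- (2 * geno_counts[["AA"]] + geno_counts[["Aa"]]) / (2 * n)
q <- 1 - p

print(geno_freq)
cat("p(A) =", p, " q(a) =", q, "\n")
```

Using `[[ ]]` to extract the counts drops the element names, which keeps later named vectors built from `p` and `q` cleanly labeled.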
Hardy-Weinberg Equilibrium (HWE)
HWE is a null model assuming no evolutionary forces, predicting genotype frequencies from allele frequencies in a stable population. Assumptions: Large population, random mating, no mutation, no migration, no selection. For a biallelic locus with alleles A (frequency p) and a (frequency q = 1 - p):
\(P(AA) = p^2\) (homozygous dominant)
\(P(Aa) = 2pq\) (heterozygous)
\(P(aa) = q^2\) (homozygous recessive)
- Deviations from HWE: Indicate forces like inbreeding (excess homozygotes), assortative mating, or genotyping errors. In epidemiology, HWE testing filters SNPs in controls to ensure data quality.
- Testing for HWE: Use Chi-squared goodness-of-fit test.
- Formula: \(\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}\), with df = 1 for biallelic loci.
- Critical value: >3.84 for p<0.05 (reject HWE).
- Example: Observed: 400 AA, 500 Aa, 100 aa (total 1000). p=0.65, q=0.35. Expected: 422.5 AA, 455 Aa, 122.5 aa. χ² = 1.20 + 4.45 + 4.13 ≈ 9.78 (p ≈ 0.002, deviation from HWE).
R Implementation for HWE: This code uses base R for a manual chi-squared calculation, with input validation, a warning when expected counts are too small for the chi-squared approximation, and clearly labeled output. For exact tests and more advanced use, consider the HardyWeinberg package.
# Hardy-Weinberg Equilibrium Test
# Inputs: Observed genotype counts as a named vector (AA, Aa, aa)
obs <- c(AA = 400, Aa = 500, aa = 100) # Example observed counts
# Input validation
if (any(obs < 0) || length(obs) != 3) stop("Invalid genotype counts")
total <- sum(obs)
if (total == 0) stop("No individuals in population")
# Calculate allele frequencies
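# Use [[ ]] so p carries no element name; otherwise the expected counts below
# would be named "AA.AA" etc. and expected["AA"] would print as NA
p <- (2 * obs[["AA"]] + obs[["Aa"]]) / (2 * total) # Frequency of A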
q <- 1 - p # Frequency of a
# Expected counts under HWE
expected <- c(AA = p^2 * total, Aa = 2 * p * q * total, aa = q^2 * total)
# Chi-squared test (ensure expected >5 for validity)
if (any(expected < 5)) warning("Expected counts <5; consider exact test")
chi2 <- sum((obs - expected)^2 / expected)
df <- 1 # Degrees of freedom for biallelic
p_val_chi <- pchisq(chi2, df = df, lower.tail = FALSE)
# Optional: Fisher's exact test for small samples (using contingency table simulation)
# But for simplicity, we use chi-squared here
# Output results
cat("Allele Frequencies: p(A) =", round(p, 3), "q(a) =", round(q, 3), "\n")
cat("Expected Counts: AA =", round(expected["AA"], 1), "Aa =", round(expected["Aa"], 1), "aa =", round(expected["aa"], 1), "\n")
cat("Chi-squared =", round(chi2, 4), "df =", df, "p-value =", format.pval(p_val_chi), "\n")
if (p_val_chi < 0.05) {
cat("Reject HWE: Possible deviation due to selection, inbreeding, or error.\n")
} else {
cat("Fail to reject HWE: Population appears in equilibrium.\n")
}
Pedigree Analysis and Inheritance
Pedigrees are diagrammatic representations of family histories that illustrate the inheritance patterns of traits or diseases across generations. They are essential tools in genetic epidemiology for identifying modes of inheritance, estimating risks, and guiding genetic counseling. Symbols in pedigrees typically include squares for males, circles for females, filled shapes for affected individuals, and slashes for deceased persons. Arrows may indicate the proband (the individual through whom the family is ascertained). Pedigrees help visualize vertical (parent-to-child) or horizontal (sibling) transmission patterns, skips in generations, and sex-specific effects.
Modes of Inheritance
Genetic traits and diseases follow specific inheritance patterns based on the chromosomal location and dominance of alleles. Understanding these modes aids in predicting disease risks and designing targeted epidemiological studies. Below, we detail common modes, including characteristics, pedigree patterns, examples, and epidemiological implications.
- Autosomal Dominant (AD):
* Definition: A single copy of the dominant allele (D) is sufficient to express the phenotype. Affected individuals are heterozygous (Dd) or homozygous (DD), while dd are unaffected.
* Pedigree Characteristics: Vertical transmission (affected individuals in every generation); no skipped generations if fully penetrant. Affects males and females equally; 50% risk to offspring of an affected parent.
* Example: Huntington's disease (CAG repeat expansion in HTT gene on chromosome 4). Late-onset, neurodegenerative disorder with high penetrance.
* Epidemiological Notes: Often rare; founder effects in isolated populations increase prevalence. Used in linkage studies to map genes.
* Limitations: Incomplete penetrance or variable expressivity (different symptom severity) can mimic other patterns.
- Autosomal Recessive (AR):
* Definition: Requires two copies of the recessive allele (dd) for expression. Heterozygotes (Dd) are asymptomatic carriers.
* Pedigree Characteristics: Horizontal transmission (affected siblings from unaffected parents); skips generations. Consanguinity (e.g., cousin marriages) increases risk; equal male-female incidence.
* Example: Cystic fibrosis (mutations in CFTR gene on chromosome 7). Affects lung and digestive functions; carrier frequency ~1/25 in Caucasians.
* Epidemiological Notes: Higher prevalence in populations with high carrier rates (e.g., due to heterozygote advantage, like sickle cell anemia in malaria-endemic areas). Screening programs target carriers.
* Limitations: Phenocopies (environmental mimics) can complicate diagnosis.
- X-Linked Recessive (XR):
* Definition: Mutation on the X chromosome; males (X^c Y) express with one copy due to hemizygosity, while females (X^C X^c) are carriers (often unaffected due to X-inactivation).
* Pedigree Characteristics: No male-to-male transmission; affected males pass the allele to all daughters (carriers) but no sons. Mother-to-son transmission; more males affected.
* Example: Hemophilia A (mutations in F8 gene on Xq28). Bleeding disorder; historical prevalence in European royalty.
* Epidemiological Notes: Skewed sex ratios in affected individuals; carrier testing in families. Lyonization (random X-inactivation) can cause mild symptoms in females.
* Limitations: De novo mutations can appear without family history.
- Additional Modes:
* X-Linked Dominant (XD): Rare; affects females more (e.g., Rett syndrome, MECP2 gene). Lethal in males or mild; female-to-offspring transmission.
* Y-Linked: Male-only transmission (e.g., azoospermia factors on Y chromosome). Rare, as Y has few genes.
* Mitochondrial: Maternal inheritance (mtDNA from mother); affects both sexes but no father-to-child transmission. Variable due to heteroplasmy (mixed mtDNA). Example: Leber's hereditary optic neuropathy.
Probability in Pedigrees
Probabilistic models quantify inheritance risks, incorporating prior probabilities, transmission, and penetrance. This is crucial for Bayesian risk assessment in genetic counseling.
- Key Formula: For a given assignment of genotypes, the likelihood of observing a pedigree under a genetic model is \[P(\text{pedigree}) = \prod_{i=1}^{n} P(\text{genotype}_i) \times P(\text{phenotype}_i | \text{genotype}_i),\] where the product is over all n individuals, \(P(\text{genotype}_i)\) is based on Mendelian laws and parental genotypes (or population frequencies for founders), and \(P(\text{phenotype}_i | \text{genotype}_i)\) accounts for penetrance. When genotypes are unobserved, this product is summed over all genotype assignments compatible with the data.
- Penetrance: Probability of phenotypic expression given a genotype (e.g., 100% for complete, <100% for incomplete). Incomplete penetrance (e.g., BRCA1 mutations in breast cancer) leads to unaffected gene carriers.
- Variable Expressivity: Variation in phenotype severity among those with the same genotype (e.g., neurofibromatosis type 1).
- Example: In an AD pedigree, risk to a child of an affected parent (Dd) is 50% (Dd) × penetrance.
- Applications: Risk prediction software like BRCAPRO uses these models. In epidemiology, such models must also adjust for ascertainment bias (families selected via affected probands).
- Limitations: Assumes accurate family history; ignores modifiers like environment.
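The autosomal dominant example above (transmission probability × penetrance) extends naturally to the Bayesian updating used in genetic counseling. The sketch below is illustrative only: the penetrance of 0.8 is a hypothetical value, and the model assumes that non-carriers (dd) are never affected (no phenocopies).

```r
# Hypothetical inputs, for illustration only
prior_carrier <- 0.5  # Mendelian chance that a child of a Dd parent inherits D
penetrance    <- 0.8  # assumed P(affected | Dd); not taken from real data

# Risk that the child is affected: transmission probability x penetrance
risk_affected <- prior_carrier * penetrance

# Bayesian update: probability the child carries D given the child is unaffected.
# Likelihood of "unaffected" is (1 - penetrance) for carriers and 1 for
# non-carriers (assumes no phenocopies).
lik_carrier    <- 1 - penetrance
lik_noncarrier <- 1
posterior_carrier <- (prior_carrier * lik_carrier) /
  (prior_carrier * lik_carrier + (1 - prior_carrier) * lik_noncarrier)

cat("Risk of being affected:", risk_affected, "\n")
cat("P(carrier | unaffected):", round(posterior_carrier, 3), "\n")
```

With these assumed values, an unaffected child still has a non-trivial chance of carrying the allele, which is how incomplete penetrance produces unaffected carriers in a pedigree.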
Linkage Analysis
Linkage analysis identifies genes by studying co-inheritance of markers and traits in families, leveraging the fact that nearby loci on a chromosome are less likely to separate during meiosis.
- Genetic Linkage: Occurs when genes are close on the same chromosome, violating independent assortment.
- Recombination Fraction (θ): Proportion of gametes with recombination between two loci.
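* \(\theta = 0.5\): No linkage (loci assort independently, >50 cM apart or on different chromosomes).
* \(\theta < 0.5\): Linkage (e.g., θ=0.1 means 10% recombination).
* \(\theta = 0\): Complete linkage (no recombination observed between the loci).
* Units: Genetic distance is measured in centimorgans (cM); for short distances, 1 cM ≈ 1% recombination, which corresponds on average to roughly 1 Mb of DNA.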
- Phases: Coupling (alleles on same chromosome) vs. repulsion (opposite).
- Applications: Parametric (assumes model) for Mendelian diseases; non-parametric for complex traits. Historical in positional cloning (e.g., cystic fibrosis gene).
LOD Score
The LOD (Logarithm of Odds) score statistically tests for linkage by comparing likelihoods.
- Formula: \(Z(\theta) = \log_{10} \frac{L(\text{data} | \theta=\hat{\theta})}{L(\text{data} | \theta=0.5)}\), where \(\hat{\theta}\) is the maximum likelihood estimate of θ, and L is the likelihood function.
- Calculation: Involves summing over possible phases and recombinants in pedigrees.
- Interpretation: Positive Z favors linkage; threshold Z≥3 for significance (false positive rate 5%); Z≤-2 excludes linkage.
| LOD Score | Odds Ratio | Interpretation |
|---|---|---|
| Z ≤ -2 | ≤1:100 | Strong evidence against linkage (exclusion) |
| -2 < Z < 3 | Indeterminate | Suggestive but not conclusive |
| Z ≥ 3 | ≥1000:1 | Strong evidence for linkage (genome-wide significance) |
- Example: In a family with a disease and marker, if data is 1000 times more likely under θ=0.1 than 0.5, Z=3.
- Limitations: Requires large pedigrees; sensitive to model misspecification (e.g., wrong penetrance). Superseded by GWAS for complex diseases.
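For the simplest case of phase-known meioses, the LOD score reduces to a binomial likelihood ratio. The following base R sketch evaluates it on a grid of θ values; the function name `lod_score` and the counts (9 non-recombinants, 1 recombinant) are hypothetical choices for illustration, not data from a real pedigree.

```r
# LOD score for phase-known meioses with n_nonrec non-recombinants
# and n_rec recombinants: L(theta) = (1 - theta)^n_nonrec * theta^n_rec
lod_score <- function(theta, n_nonrec, n_rec) {
  # k * log10(prob), with a zero count contributing nothing (avoids 0 * -Inf)
  log_term <- function(k, prob) {
    if (k == 0) rep(0, length(prob)) else k * log10(prob)
  }
  log_l_theta <- log_term(n_nonrec, 1 - theta) + log_term(n_rec, theta)
  log_l_null  <- (n_nonrec + n_rec) * log10(0.5)  # no linkage: theta = 0.5
  log_l_theta - log_l_null
}

# Hypothetical data: 9 non-recombinants and 1 recombinant
theta_grid <- c(0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5)
z <- lod_score(theta_grid, n_nonrec = 9, n_rec = 1)
print(data.frame(theta = theta_grid, LOD = round(z, 3)))

# The grid maximum approximates the maximum likelihood estimate of theta
theta_hat <- theta_grid[which.max(z)]
cat("Grid maximum at theta =", theta_hat, "\n")
```

Working on the log scale keeps the calculation stable for larger families. Real pedigrees with unknown phase or incomplete penetrance need dedicated linkage software rather than this closed form.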
Linkage Disequilibrium (LD) and Association
LD refers to non-random association of alleles at different loci in a population, often due to shared ancestry rather than physical linkage. It decays over generations via recombination and is key in association studies like GWAS, where markers tag causal variants.
- Causes of LD: Mutation, selection, genetic drift, population admixture, or bottlenecks. High LD in isolated populations (e.g., Ashkenazi Jews).
- Decay: LD decreases with genetic distance and time; measured in haplotype blocks.
- Vs. Linkage: Linkage is family-based (meiotic); LD is population-based (historical).
- Applications: Imputation in GWAS; fine-mapping; inferring evolutionary history.
Measures of LD
Common metrics quantify deviation from independence. Assume two biallelic loci: A/a (frequencies p_A, 1-p_A) and B/b (p_B, 1-p_B); haplotype frequency p_AB.
- D (Disequilibrium Coefficient):
* Formula: \(D_{AB} = p_{AB} - p_A p_B\).
* Interpretation: D>0: Excess AB haplotypes; D<0: Deficit. D=0: Equilibrium.
* Limitations: Sensitive to allele frequencies; not comparable across loci.
- D' (Normalized D):
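* Formula: \[D' = \begin{cases} \dfrac{D}{\min\big(p_A(1-p_B),\ (1-p_A)p_B\big)} & \text{if } D > 0 \\ \dfrac{D}{\max\big(-p_A p_B,\ -(1-p_A)(1-p_B)\big)} & \text{if } D < 0 \end{cases}\]
* Interpretation: Because the denominator has the same sign as D, this normalization gives values from 0 to 1 (some texts keep the sign of D, giving a range of -1 to 1). |D'|=1: Complete LD (no recombination evidence); |D'|<1: Partial.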
* Use: Detects historical recombination.
- r² (Correlation Coefficient):
* Formula: \(r^2 = \frac{D^2}{p_A (1-p_A) p_B (1-p_B)}\).
* Interpretation: 0 to 1; r²=1: Perfect correlation (one SNP predicts another); r²>0.8: Strong proxy. Relates to power in association studies: testing a tag SNP instead of the causal variant scales the effective sample size by r², so the required sample size grows by a factor of 1/r².
* Use: Preferred in GWAS for tag SNP selection; less biased by rare alleles.
Additional Considerations
- Haplotypes: Combinations of alleles (e.g., estimated via EM algorithm in PHASE software).
- Visualization: LD plots (triangular heatmaps) using tools like Haploview.
- Best Practices: Account for population structure (e.g., via principal components) to avoid spurious LD. In epidemiology, LD informs polygenic risk scores.
- Limitations: LD varies by ancestry (e.g., higher in Europeans than Africans); tagging fails for rare variants.
R Implementation for LD: This code computes D, D', and r² from example haplotype counts. It uses base R for simplicity; for real data, consider packages like genetics or snpStats.
# Linkage Disequilibrium Measures
# Input: Haplotype counts (e.g., from population data)
# Assume biallelic loci A/a, B/b
count_AB <- 120 # AB haplotypes
count_Ab <- 80 # Ab
count_aB <- 60 # aB
count_ab <- 140 # ab
total <- sum(c(count_AB, count_Ab, count_aB, count_ab))
# Haplotype frequencies
p_AB <- count_AB / total
p_Ab <- count_Ab / total
p_aB <- count_aB / total
p_ab <- count_ab / total
# Allele frequencies
p_A <- p_AB + p_Ab
p_a <- 1 - p_A
p_B <- p_AB + p_aB
p_b <- 1 - p_B
# D
D <- p_AB - p_A * p_B
# D' (normalized)
D_max_pos <- min(p_A * p_b, p_a * p_B)
D_max_neg <- -min(p_A * p_B, p_a * p_b)
D_prime <- ifelse(D >= 0, D / D_max_pos, D / D_max_neg)
# r-squared
r_sq <- D^2 / (p_A * p_a * p_B * p_b)
# Output
cat("Allele Frequencies: p_A =", round(p_A, 3), "p_B =", round(p_B, 3), "\n")
cat("D =", round(D, 4), "\n")
cat("D' =", round(D_prime, 4), "\n")
cat("r-squared =", round(r_sq, 4), "\n")
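# Compare with a tolerance: exact equality is unreliable for floating-point values
if (isTRUE(all.equal(abs(D_prime), 1))) cat("Complete LD detected.\n") else cat("Partial or no LD.\n")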
Genome-Wide Association Studies (GWAS)
GWAS tests for correlation between genetic markers (SNPs) and phenotypes across the entire genome in unrelated individuals.
- Statistical Model: Typically uses logistic regression for case-control studies.
\[\ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 \cdot \text{SNP} + \text{Covariates}.\]
- Manhattan & QQ Plots: Used to visualize results. Because millions of tests are performed, strict significance thresholds (e.g., \(p < 5 \times 10^{-8}\)) are required to avoid false positives.
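A single-SNP test under the logistic model can be sketched in base R with `glm`. The data here are simulated, and the covariate `age`, the baseline log-odds, and the true per-allele odds ratio of 1.5 are assumptions chosen purely for illustration. The last line shows the Bonferroni reasoning commonly given for the genome-wide threshold, assuming about one million independent tests.

```r
set.seed(1)
n <- 1000

# Simulated data (hypothetical): additive SNP coding and one covariate
snp <- rbinom(n, 2, 0.2)             # 0, 1, or 2 copies of the minor allele
age <- rnorm(n, mean = 50, sd = 10)  # illustrative covariate
log_odds <- -1 + log(1.5) * snp + 0.02 * (age - 50)
y <- rbinom(n, 1, plogis(log_odds))  # 1 = case, 0 = control

# Logistic regression of disease status on genotype, adjusted for the covariate
fit <- glm(y ~ snp + age, family = binomial)
coefs <- summary(fit)$coefficients

or_snp <- exp(coefs["snp", "Estimate"])  # per-allele odds ratio
p_snp  <- coefs["snp", "Pr(>|z|)"]       # Wald test p-value
cat("Per-allele OR:", round(or_snp, 2), " p-value:", format.pval(p_snp), "\n")

# Bonferroni-style threshold: alpha = 0.05 spread over ~1e6 independent tests
gw_threshold <- 0.05 / 1e6
cat("Genome-wide threshold:", gw_threshold, "\n")
```

In a real GWAS this model is fitted once per SNP, usually with principal components added as covariates to control for population stratification.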
Gene-Environment Interactions
Disease risk is often modeled as a combination of genetics (\(G\)), environment (\(E\)), and their interaction (\(G \times E\)):
\[Y = \beta_0 + \beta_1 G + \beta_2 E + \beta_3 (G \times E) + \epsilon.\]
Interaction Models:
- Synergistic: Genotype exacerbates the risk factor (or vice versa).
- Independent: Both factors influence risk but do not interact.
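The interaction model above can be fitted in base R with `lm`, where the formula `G * E` expands to both main effects plus the product term. The data below are simulated with assumed coefficients (β1 = 0.5, β2 = 0.8, β3 = 1.0), so this is a sketch of the technique rather than an analysis of real exposures.

```r
set.seed(2)
n <- 2000

# Simulated binary genotype and exposure indicators (hypothetical frequencies)
G <- rbinom(n, 1, 0.3)  # 1 = carries the risk genotype
E <- rbinom(n, 1, 0.4)  # 1 = exposed

# Outcome generated from the interaction model with assumed coefficients
Y <- 1 + 0.5 * G + 0.8 * E + 1.0 * G * E + rnorm(n)

# G * E expands to G + E + G:E, so beta_3 is the coefficient named "G:E"
fit <- lm(Y ~ G * E)
print(round(coef(fit), 2))

beta3_hat <- coef(fit)[["G:E"]]
cat("Estimated interaction (beta_3):", round(beta3_hat, 2), "\n")
```

An interaction estimate near zero would correspond to the independent case, while a clearly positive estimate corresponds to the synergistic case. For a binary disease outcome, the same formula can be passed to `glm` with `family = binomial`, in which case interaction is assessed on the log-odds scale.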
Core Epidemiological Measures
In addition to genetic metrics, standard epidemiological measures provide essential tools for assessing disease risk, evaluating interventions, and guiding public health decisions. These metrics help quantify associations between exposures and outcomes, estimate treatment effects, and inform policy. Below, we outline key measures, including definitions, formulas, interpretations, and examples. Where relevant, we include R code implementations for practical computation.
Absolute Risk Reduction (ARR)
- Definition: The difference in event rates (incidences) between a control (or unexposed) group and a treatment (or exposed) group. ARR measures the absolute effect of an intervention or exposure on outcome risk.
- Formula: \(ARR = I_{\text{control}} - I_{\text{treatment}}\), where \(I\) represents incidence (proportion of events).
- Interpretation: A positive ARR indicates risk reduction (benefit); a negative ARR indicates increased risk (harm). It is straightforward but does not account for baseline risk.
- Example: If the incidence of heart attacks is 10% in the control group and 7% in the treatment group, ARR = 0.10 - 0.07 = 0.03 (3% absolute reduction).
- When to Use: Prospective studies like randomized controlled trials (RCTs); useful for communicating tangible benefits to patients.
- Limitations: Sensitive to baseline risk; not ideal for comparing across populations with different event rates.
Relative Risk (RR)
- Definition: The ratio of the incidence of an outcome in the exposed group to that in the unexposed group. RR assesses how much an exposure increases or decreases the probability of an event.
- Formula: \(RR = \frac{I_{\text{exposed}}}{I_{\text{unexposed}}}\).
- Interpretation: RR > 1 indicates increased risk due to exposure; RR < 1 indicates protective effect; RR = 1 indicates no association. It is multiplicative and accounts for baseline risk.
- Example: In a cohort study, smokers have a 20% lung cancer incidence, while non-smokers have 2%. RR = 0.20 / 0.02 = 10 (smokers are 10 times more likely to develop lung cancer).
- When to Use: Cohort studies or RCTs; preferred for common outcomes.
- Limitations: Can overestimate associations for rare events; not applicable in case-control studies.
Odds Ratio (OR)
- Definition: The ratio of the odds of an outcome in the exposed group to the odds in the unexposed group. OR approximates RR when the outcome is rare.
- Formula: From a 2x2 contingency table (a = exposed cases, b = exposed non-cases, c = unexposed cases, d = unexposed non-cases): \(OR = \frac{ad}{bc}\).
- Interpretation: OR > 1 suggests positive association; OR < 1 suggests inverse association; OR = 1 suggests no association. It is often used in logistic regression.
- Example: In a case-control study of diabetes and obesity: 80 obese diabetics (a), 20 obese non-diabetics (b), 30 non-obese diabetics (c), 70 non-obese non-diabetics (d). OR = (80*70) / (20*30) = 9.33 (obesity increases odds of diabetes by over 9 times).
- When to Use: Case-control studies or when incidence data is unavailable; common in meta-analyses.
- Limitations: Not directly interpretable as risk for common outcomes; can differ from RR if events are frequent.
Number Needed to Treat (NNT) or Harm (NNH)
- Definition: The average number of patients who need to be treated (or exposed) to prevent (or cause) one additional outcome compared to the control. NNT is based on ARR and translates statistical effects into clinical relevance.
- Formula: \(NNT = \frac{1}{|ARR|}\) (use absolute value for magnitude; sign of ARR determines benefit vs. harm).
- Interpretation: Lower NNT indicates greater treatment efficacy. If ARR > 0, it's NNT (benefit); if ARR < 0, it's NNH (harm). Infinite NNT means no effect.
- Example (Benefit): ARR = 0.03 (as above), NNT = 1 / 0.03 ≈ 33.3 (conventionally rounded up: treat about 34 patients to prevent one heart attack).
- Example (Harm): If treatment increases incidence from 50% to 80%, ARR = 0.50 - 0.80 = -0.30, NNH = 1 / 0.30 ≈ 3.3 (treat 3 patients to cause one additional bad outcome).
- When to Use: RCTs or systematic reviews; helps in shared decision-making and cost-benefit analysis.
- Limitations: Assumes constant ARR; sensitive to time frame and baseline risk. Confidence intervals should be reported for real-world application.
Additional Considerations
- Confidence Intervals (CI): Always compute 95% CIs for these measures to assess precision (e.g., using bootstrap methods or formulas in R packages like epitools or epiR).
- Attributable Risk (AR): Extends RR; AR = \(I_{\text{exposed}} - I_{\text{unexposed}}\) (absolute risk due to exposure).
- Population Attributable Risk (PAR): PAR = \(I_{\text{population}} - I_{\text{unexposed}}\) (excess incidence in the whole population attributable to exposure; dividing by \(I_{\text{population}}\) gives the proportion of cases attributable to exposure).
- Best Practices: Adjust for confounders using multivariable models; interpret in context (e.g., RR may seem large for rare events but have small absolute impact).
- Software Tools: R (with packages like epiR, Epi, or survival for advanced metrics like Hazard Ratios) or Python (with scipy or lifelines) are commonly used.
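As a minimal sketch of the confidence-interval point above, the standard large-sample (Wald) intervals for RR and OR can be computed on the log scale in base R. The 2x2 counts are hypothetical cohort data, and the variable names are illustrative. Packages such as epiR offer the same intervals along with alternatives better suited to small counts.

```r
# Hypothetical cohort counts (illustrative only)
exp_cases   <- 20; exp_noncases   <- 80  # exposed group
unexp_cases <- 2;  unexp_noncases <- 98  # unexposed group

n_exp   <- exp_cases + exp_noncases
n_unexp <- unexp_cases + unexp_noncases

# Point estimates
rr <- (exp_cases / n_exp) / (unexp_cases / n_unexp)
or <- (exp_cases * unexp_noncases) / (exp_noncases * unexp_cases)

# Standard errors on the log scale (large-sample approximations)
se_log_rr <- sqrt(1 / exp_cases - 1 / n_exp + 1 / unexp_cases - 1 / n_unexp)
se_log_or <- sqrt(1 / exp_cases + 1 / exp_noncases +
                  1 / unexp_cases + 1 / unexp_noncases)

# 95% Wald intervals: exponentiate log(estimate) +/- z * SE
z_crit <- qnorm(0.975)
ci_rr <- exp(log(rr) + c(-1, 1) * z_crit * se_log_rr)
ci_or <- exp(log(or) + c(-1, 1) * z_crit * se_log_or)

cat("RR =", round(rr, 2), " 95% CI:", round(ci_rr, 2), "\n")
cat("OR =", round(or, 2), " 95% CI:", round(ci_or, 2), "\n")
```

The intervals are wide here because only two unexposed cases were observed. With cells this small, exact or bootstrap methods are usually preferable to the Wald approximation.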
R Implementation for Key Measures: This code snippet computes ARR, RR, OR, and NNT/NNH from example data in base R. It covers both benefit and harm scenarios, as well as the no-effect case (ARR = 0).
# Base R only; packages such as epiR can report the same measures with confidence intervals
# Example data: 2x2 table for OR/RR (cohort study assumption)
# Rows: Exposed (1) vs Unexposed (0); Columns: Cases vs Non-cases
a <- 20 # Exposed cases
b <- 80 # Exposed non-cases
c <- 2 # Unexposed cases
d <- 98 # Unexposed non-cases
# Incidences
I_exposed <- a / (a + b)
I_unexposed <- c / (c + d)
# Absolute Risk Reduction (assuming unexposed = control, exposed = treatment)
ARR <- I_unexposed - I_exposed # Positive if treatment reduces risk
# Relative Risk
RR <- I_exposed / I_unexposed
# Odds Ratio
OR <- (a * d) / (b * c)
# Number Needed to Treat/Harm
NNT <- ifelse(ARR != 0, 1 / abs(ARR), Inf)
type <- ifelse(ARR > 0, "NNT (Benefit)", ifelse(ARR < 0, "NNH (Harm)", "No Effect"))
# Output
cat("Incidence Exposed:", round(I_exposed, 3), "\n")
cat("Incidence Unexposed:", round(I_unexposed, 3), "\n")
cat("ARR:", round(ARR, 3), "\n")
cat("RR:", round(RR, 3), "\n")
cat("OR:", round(OR, 3), "\n")
cat(type, ":", ifelse(is.finite(NNT), round(NNT, 1), "Infinite"), "\n")
# Harm example (swap for treatment increasing risk)
I_control <- 0.50
I_treatment <- 0.80
ARR_harm <- I_control - I_treatment
NNT_harm <- 1 / abs(ARR_harm)
cat("\nHarm Example - ARR:", round(ARR_harm, 3), "\n")
cat("NNH:", round(NNT_harm, 1), "\n")
Applications and Software
Modern epidemiology relies heavily on computational tools.
- Key R Packages: epiR (Epi measures), genetics (HWE, LD), survival (time-to-event), qqman (GWAS visualization).
- Online Tools: SOCR Distribution Tables.
Problems
Problem 1: Linkage Mapping
Scenario: Analyze the pedigree below under a Dominant Inheritance model. We need to estimate the recombination fraction \(\theta\).
1. Calculate LOD Scores: Consider a specific phase arrangement with 4 non-recombinants and 1 recombinant (if the phase were unknown, the likelihood would be averaged over the two possible phases). The likelihood is \(L(\theta) = (1-\theta)^4 \theta\), with \(L(0.5) = 0.5^5\), and the LOD score is \(Z(\theta) = \log_{10}\frac{L(\theta)}{L(0.5)}\). For example, \(Z(0.1) = \log_{10}\frac{0.9^4 \times 0.1}{0.5^5} = \log_{10}(2.0995) \approx 0.322\).
2. Result Table:
| \(\theta\) | \(Z(\theta)\) |
|---|---|
| 0.0 | \(-\infty\) |
| 0.10 | 0.322 |
| 0.20 | 0.419 (Max LOD) |
| 0.50 | 0.0 |
The maximum LOD score occurs at \(\hat{\theta} = 1/5 = 0.20\), the observed proportion of recombinants.
Problem 2: NNT Calculation
Scenario: A trial shows 800/1000 events in Treatment Group A and 600/1200 events in Control Group B.
1. \(p_A = 800/1000 = 0.80\), \(p_B = 600/1200 = 0.50\).
2. \(NNT = \frac{1}{p_B - p_A} = \frac{1}{0.5 - 0.8} = -3.33\).
Interpretation: Since the value is negative, this represents a Number Needed to Harm (NNH) of 3.3. For every ~3-4 patients treated, one additional adverse event occurs compared to the control.
Problem 3: GWAS Power Analysis (R)
Scenario: Simulate a study with 500 cases/500 controls, Minor Allele Frequency (MAF) = 0.2, OR = 1.5.
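# Power by simulation. Note: each replicate draws a cohort of n_cases + n_controls
# individuals and assigns disease status from a logistic model, so the realized
# case/control split is random rather than fixed at n_cases/n_controls.
simulate_gwas_power <- function(n_cases, n_controls, maf, OR, alpha = 0.05, n_sims = 100) {
  significant <- numeric(n_sims)
  n_total <- n_cases + n_controls
  beta <- log(OR) # Per-allele log odds ratio
  for (i in seq_len(n_sims)) {
    geno <- rbinom(n_total, 2, maf) # Additive genotypes (0/1/2) under HWE
    # Logistic model simulation: baseline log-odds of -2 at the mean genotype
    log_odds <- -2 + beta * (geno - mean(geno))
    prob <- plogis(log_odds)
    status <- rbinom(n_total, 1, prob)
    # Test the SNP effect with logistic regression (Wald p-value)
    model <- glm(status ~ geno, family = binomial)
    p_val <- summary(model)$coefficients[2, 4]
    significant[i] <- as.numeric(p_val < alpha)
  }
  mean(significant) # Proportion of replicates with p < alpha
}
# Run the scenario from the problem statement
set.seed(123) # For reproducibility
power_est <- simulate_gwas_power(500, 500, maf = 0.2, OR = 1.5)
cat("Estimated power at alpha = 0.05:", power_est, "\n")
# For genome-wide significance, rerun with alpha = 5e-8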
References
- Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. 3rd ed. Lippincott, 2008.
- Clayton D, Hills M. Statistical Models in Epidemiology. Oxford, 2013.
- Ziegler A, König IR. A Statistical Approach to Genetic Epidemiology. Wiley, 2010.
- SOCR Home page: http://www.socr.umich.edu