Difference between revisions of "SMHS Epidemiology"

Latest revision as of 15:49, 17 February 2026

Scientific Methods for Health Sciences - Epidemiology

Overview

Epidemiology is the scientific discipline that investigates the distribution, determinants, and control of health-related states or events (including diseases) in specified populations. It applies this knowledge to control health problems and improve public health outcomes. Historically, epidemiology originated from the study of infectious disease outbreaks, such as John Snow's investigation of the 1854 cholera epidemic in London, which linked contaminated water sources to disease spread. In modern times, the field has broadened to include non-infectious conditions like chronic diseases (e.g., cancer, diabetes), environmental exposures (e.g., air pollution, toxins), behavioral factors (e.g., smoking, diet), and genetic influences. This helps with understanding that health outcomes arise from complex interactions between genetic predispositions, environmental factors, and social determinants.

A core framework in epidemiology is the "epidemiologic triad" of agent, host, and environment, but contemporary approaches emphasize the "person, place, and time" dimensions:

Person: Characteristics of individuals, such as age, sex, genetics, socioeconomic status, and behaviors that influence susceptibility or exposure.
Place: Geographic variations, including urban vs. rural settings, climate, or access to healthcare, which can reveal environmental or social risk factors.
Time: Temporal patterns, such as seasonal trends, secular changes over years, or sudden outbreaks, helping identify emerging threats or intervention effects.

This section delves into genetic epidemiology, bridging molecular biology with population-level analysis to identify risk factors and outcomes. It explores how genetic variations contribute to disease patterns and how computational tools can quantify these relationships.

Motivation

This module equips learners with foundational knowledge in genetic epidemiology, enabling them to integrate genetic data into public health practice. By the end of this module, learners should be able to:

Understand Genetic Foundations: Describe the structure and key features of the human genome, explain the types and distributions of mutations, and apply principles of Mendelian inheritance, including segregation (independent assortment of alleles) and linkage (genes on the same chromosome tending to be inherited together unless separated by recombination).
Analyze Population Dynamics: Utilize quantitative genetic concepts to examine the interplay between genetic variation and phenotypic (disease) variation in populations, including calculations of allele and genotype frequencies and testing for Hardy-Weinberg equilibrium to detect deviations indicative of evolutionary pressures.
Evaluate Associations: Identify common gene-disease relationships (e.g., monogenic vs. polygenic traits), interpret results from Genome-Wide Association Studies (GWAS) to pinpoint susceptibility loci, and recognize gene-environment interactions (e.g., how smoking exacerbates genetic risks for lung cancer).
Apply Computational Methods: Conduct basic genetic association analyses using statistical software like R, interpret epidemiological measures such as Number Needed to Treat (NNT), Odds Ratio (OR), and Relative Risk (RR), and understand their implications for clinical and public health decision-making.
Additional Skills: Critically evaluate study designs (e.g., cohort vs. case-control), account for confounders like population stratification in genetic studies, and discuss ethical considerations in genetic epidemiology, such as privacy in genomic data.

These objectives align with real-world applications, such as designing targeted interventions (e.g., pharmacogenomics) or predicting disease outbreaks through genomic surveillance.

Theory: The Human Genome and Mutation

The human genome comprises approximately 3 billion base pairs of DNA, encoding around 20,000–25,000 genes that orchestrate cellular functions, development, and responses to the environment. Mutations, changes in this DNA sequence, can arise spontaneously or from external factors (e.g., radiation, chemicals) and may lead to diseases if they disrupt gene function. Understanding genomic structure and mutations is crucial for identifying genetic risk factors in epidemiological studies.

Figure 1: Illustration of human chromosomal structure, highlighting key features like centromeres, telomeres, and banding patterns.

Chromosomal Structure

Chromosomes are thread-like structures in the cell nucleus that package and organize DNA for efficient replication and segregation during cell division. Each human cell (except gametes) contains 46 chromosomes.

Banding: Cytogenetic staining techniques (e.g., Giemsa staining) produce visible bands on chromosomes. Dark bands (G-bands) are AT-rich, heterochromatic regions with fewer genes, while light bands (R-bands) are GC-rich, euchromatic, and gene-dense. These patterns, spanning millions of nucleotides, aid in identifying chromosomal abnormalities in karyotyping.
Karyotype: The complete set of chromosomes arranged by size and shape. A typical human karyotype includes 22 pairs of autosomes (numbered 1–22) and one pair of sex chromosomes (XX in females, XY in males). Abnormal karyotypes, such as trisomy 21 (Down syndrome), can be detected via techniques like fluorescence in situ hybridization (FISH).
Functional Elements:

 * Centromeres: Central constrictions composed of repetitive alpha-satellite DNA sequences. They serve as attachment points for spindle fibers during mitosis and meiosis, ensuring proper chromosome segregation. Centromeric dysfunction can lead to aneuploidy (abnormal chromosome numbers).
 * Telomeres: Protective caps at chromosome ends, consisting of repetitive TTAGGG sequences (in humans) bound by shelterin proteins. Telomeres shorten with each somatic cell division due to the "end-replication problem," contributing to cellular aging (senescence). In germ cells and stem cells, telomerase enzyme maintains telomere length. Shortened telomeres are linked to diseases like dyskeratosis congenita and increased cancer risk.
 * Chromatin: The complex of DNA and proteins (histones) that forms chromosomes. It exists in two states:
   * Euchromatin: Loosely packed, transcriptionally active, and rich in genes and regulatory elements.
   * Heterochromatin: Tightly packed, transcriptionally silent, often containing repetitive DNA (e.g., satellites, transposons) near centromeres and telomeres. Epigenetic modifications (e.g., histone methylation) regulate chromatin states.

Mutations and Abnormalities

Mutations are heritable changes in the DNA sequence, occurring at a rate of about 10⁻⁸ per nucleotide per generation. They can be somatic (acquired in body cells, e.g., leading to cancer) or germline (in gametes, passed to offspring). Mutations drive genetic diversity but can cause disease if they affect critical genes.

Figure 2: Schematic of common structural chromosomal abnormalities, including deletion, duplication, translocation, and inversion.

Structural Abnormalities (Large-scale changes visible under a microscope, affecting thousands to millions of base pairs):

 * Deletion: Removal of a chromosomal segment, leading to loss of genes (e.g., cri-du-chat syndrome from 5p deletion).
 * Duplication: Extra copy of a segment, potentially causing gene dosage imbalances (e.g., Charcot-Marie-Tooth disease from PMP22 duplication).
 * Translocation: Exchange of segments between non-homologous chromosomes. Balanced translocations may be asymptomatic but increase risks in offspring; unbalanced ones cause disorders (e.g., chronic myeloid leukemia from t(9;22) "Philadelphia chromosome").
 * Inversion: Reversal of a segment within a chromosome, which can disrupt genes or lead to abnormal recombination (e.g., hemophilia A inversions).
 * Other Types: Isochromosomes (duplicated arms, e.g., i(Xq) in Turner syndrome) or ring chromosomes (circular formations from deletions at both ends).

Point Mutations (Small-scale changes at the nucleotide level, detected via sequencing):

 * Nucleotide Substitution: Replacement of one base with another.
   * Silent: No amino acid change (due to codon degeneracy).
   * Missense: Amino acid change (e.g., sickle cell anemia from GAG to GTG in beta-globin gene).
   * Nonsense: Introduces premature stop codon, truncating the protein (e.g., cystic fibrosis).
 * Indels: Insertions or deletions of nucleotides. Small indels can cause frameshifts, altering the reading frame and producing dysfunctional proteins (e.g., Tay-Sachs disease).
 * Splice Site Variation: Mutations in intronic regions affecting mRNA splicing, leading to exon skipping or inclusion of introns (e.g., beta-thalassemia).

Epidemiological Relevance: Mutations' population distribution informs disease prevalence. Rare mutations cause Mendelian disorders (e.g., Huntington's), while common variants (SNPs) contribute to complex traits via polygenic risk scores. Tools like next-generation sequencing (NGS) enable large-scale mutation detection in cohort studies.

Population Genetics

Population genetics examines how genetic variation is maintained, distributed, and evolves in groups, providing the mathematical foundation for genetic epidemiology. It helps predict disease risks based on allele frequencies and detect deviations signaling selection or population structure.

Allele and Genotype Frequencies

The gene pool is the total collection of alleles in a population at a given time. Frequencies are key metrics for assessing genetic diversity and disease susceptibility.

Allele Frequency: Proportion of a specific allele at a locus. For a biallelic locus, frequencies sum to 1.

$\text{Allele Frequency (p for A)} = \frac{2 \times \text{Number of AA} + \text{Number of Aa}}{2 \times \text{Total individuals}}.$

Example: In a population of 100 people with genotypes: 40 AA, 50 Aa, 10 aa. p (A) = (2*40 + 50) / 200 = 0.65; q (a) = (2*10 + 50) / 200 = 0.35.

Genotype Frequency: Proportion of individuals with a specific genotype (e.g., P(AA) = number of AA / total individuals).
Applications: Used in GWAS to compare frequencies between cases and controls; deviations can indicate associations.

Hardy-Weinberg Equilibrium (HWE)

HWE is a null model assuming no evolutionary forces, predicting genotype frequencies from allele frequencies in a stable population. Assumptions: Large population, random mating, no mutation, no migration, no selection. For a biallelic locus with alleles A (p) and a (q=1-p):

\[P(AA) = p^2\ (homozygous\ dominant)\]

\[P(Aa) = 2pq\ (heterozygous)\]

\[P(aa) = q^2\ (homozygous\ recessive)\]

Deviations from HWE: Indicate forces like inbreeding (excess homozygotes), assortative mating, or genotyping errors. In epidemiology, HWE testing filters SNPs in controls to ensure data quality.

Testing for HWE: Use Chi-squared goodness-of-fit test.

Formula\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i},\]

df=1 for biallelic loci.

Critical value: >3.84 for p<0.05 (reject HWE).

Example:

Observed: 400 AA, 500 Aa, 100 aa (total 1000). p=0.65, q=0.35.

Expected: 422.5 AA, 455 Aa, 122.5 aa.

χ² ≈ 10.5 (p<0.05, deviation).

R Implementation for HWE: This code uses base R for manual calculation but adds input validation, exact test option (for small samples), and clearer output. For advanced use, consider the HardyWeinberg package.

# Hardy-Weinberg Equilibrium Test
# Inputs: Observed genotype counts as a named vector (AA, Aa, aa)
obs <- c(AA = 400, Aa = 500, aa = 100)  # Example observed counts

# Input validation
if (any(obs < 0) || length(obs) != 3) stop("Invalid genotype counts")

total <- sum(obs)
if (total == 0) stop("No individuals in population")

# Calculate allele frequencies
p <- (2 * obs["AA"] + obs["Aa"]) / (2 * total)  # Frequency of A
q <- 1 - p  # Frequency of a

# Expected counts under HWE
expected <- c(AA = p^2 * total, Aa = 2 * p * q * total, aa = q^2 * total)

# Chi-squared test (ensure expected >5 for validity)
if (any(expected < 5)) warning("Expected counts <5; consider exact test")

chi2 <- sum((obs - expected)^2 / expected)
df <- 1  # Degrees of freedom for biallelic
p_val_chi <- pchisq(chi2, df = df, lower.tail = FALSE)

# Optional: Fisher's exact test for small samples (using contingency table simulation)
# But for simplicity, we use chi-squared here

# Output results
cat("Allele Frequencies: p(A) =", round(p, 3), "q(a) =", round(q, 3), "\n")
cat("Expected Counts: AA =", round(expected["AA"], 1), "Aa =", round(expected["Aa"], 1), "aa =", round(expected["aa"], 1), "\n")
cat("Chi-squared =", round(chi2, 4), "df =", df, "p-value =", format.pval(p_val_chi), "\n")
if (p_val_chi < 0.05) {
  cat("Reject HWE: Possible deviation due to selection, inbreeding, or error.\n")
} else {
  cat("Fail to reject HWE: Population appears in equilibrium.\n")
}

Pedigree Analysis and Inheritance

Pedigrees are diagrammatic representations of family histories that illustrate the inheritance patterns of traits or diseases across generations. They are essential tools in genetic epidemiology for identifying modes of inheritance, estimating risks, and guiding genetic counseling. Symbols in pedigrees typically include squares for males, circles for females, filled shapes for affected individuals, and slashes for deceased persons. Arrows may indicate the proband (the individual through whom the family is ascertained). Pedigrees help visualize vertical (parent-to-child) or horizontal (sibling) transmission patterns, skips in generations, and sex-specific effects.

Figure 5: Example pedigree illustrating inheritance patterns, with symbols for affected individuals, carriers, and relationships across generations.

Modes of Inheritance

Genetic traits and diseases follow specific inheritance patterns based on the chromosomal location and dominance of alleles. Understanding these modes aids in predicting disease risks and designing targeted epidemiological studies. Below, we detail common modes, including characteristics, pedigree patterns, examples, and epidemiological implications.

Autosomal Dominant (AD):

 * Definition: A single copy of the dominant allele (D) is sufficient to express the phenotype. Affected individuals are heterozygous (Dd) or homozygous (DD), while dd are unaffected.
 * Pedigree Characteristics: Vertical transmission (affected individuals in every generation); no skipped generations if fully penetrant. Affects males and females equally; 50% risk to offspring of an affected parent.
 * Example: Huntington's disease (CAG repeat expansion in HTT gene on chromosome 4). Late-onset, neurodegenerative disorder with high penetrance.
 * Epidemiological Notes: Often rare; founder effects in isolated populations increase prevalence. Used in linkage studies to map genes.
 * Limitations: Incomplete penetrance or variable expressivity (different symptom severity) can mimic other patterns.

Autosomal Recessive (AR):

 * Definition: Requires two copies of the recessive allele (dd) for expression. Heterozygotes (Dd) are asymptomatic carriers.
 * Pedigree Characteristics: Horizontal transmission (affected siblings from unaffected parents); skips generations. Consanguinity (e.g., cousin marriages) increases risk; equal male-female incidence.
 * Example: Cystic fibrosis (mutations in CFTR gene on chromosome 7). Affects lung and digestive functions; carrier frequency ~1/25 in Caucasians.
 * Epidemiological Notes: Higher prevalence in populations with high carrier rates (e.g., due to heterozygote advantage, like sickle cell anemia in malaria-endemic areas). Screening programs target carriers.
 * Limitations: Phenocopies (environmental mimics) can complicate diagnosis.

X-Linked Recessive (XR):

 * Definition: Mutation on the X chromosome; males (X^c Y) express with one copy due to hemizygosity, while females (X^C X^c) are carriers (often unaffected due to X-inactivation).
 * Pedigree Characteristics: No male-to-male transmission; affected males pass to all daughters (carriers) but no sons. Mother-to-son transmission; more males affected.
 * Example: Hemophilia A (mutations in F8 gene on Xq28). Bleeding disorder; historical prevalence in European royalty.
 * Epidemiological Notes: Skewed sex ratios in affected individuals; carrier testing in families. Lyonization (random X-inactivation) can cause mild symptoms in females.
 * Limitations: De novo mutations can appear without family history.

Additional Modes:

 * X-Linked Dominant (XD): Rare; affects females more (e.g., Rett syndrome, MECP2 gene). Lethal in males or mild; female-to-offspring transmission.
 * Y-Linked: Male-only transmission (e.g., azoospermia factors on Y chromosome). Rare, as Y has few genes.
 * Mitochondrial: Maternal inheritance (mtDNA from mother); affects both sexes but no father-to-child. Variable due to heteroplasmy (mixed mtDNA). Example: Leber's hereditary optic neuropathy.

Probability in Pedigrees

Probabilistic models quantify inheritance risks, incorporating prior probabilities, transmission, and penetrance. This is crucial for Bayesian risk assessment in genetic counseling.

Key Formula: The likelihood of observing a pedigree under a genetic model is\[P(\text{pedigree}) = \prod_{i=1}^{n} P(\text{genotype}_i) \times P(\text{phenotype}_i | \text{genotype}_i),\]

where the product is over all individuals, $P(\text{genotype}_i)$ is based on Mendelian laws and parental genotypes, and $P(\text{phenotype}_i | \text{genotype}_i)$ accounts for penetrance.

Penetrance: Probability of phenotypic expression given a genotype (e.g., 100% for complete, <100% for incomplete). Incomplete penetrance (e.g., BRCA1 mutations in breast cancer) leads to unaffected gene carriers.

Variable Expressivity: Variation in phenotype severity among those with the same genotype (e.g., neurofibromatosis type 1).

Example: In an AD pedigree, risk to a child of an affected parent (Dd) is 50% (Dd) × penetrance.

Applications: Risk prediction software like BRCAPRO uses these models. In epidemiology, adjusts for ascertainment bias (families selected via affected probands).

Limitations: Assumes accurate family history; ignores modifiers like environment.

Linkage Analysis

Linkage analysis identifies genes by studying co-inheritance of markers and traits in families, leveraging the fact that nearby loci on a chromosome are less likely to separate during meiosis.

Genetic Linkage: Occurs when genes are close on the same chromosome, violating independent assortment.
Recombination Fraction (θ): Proportion of gametes with recombination between two loci.

\[\theta = 0.5\]: No linkage (independent, >50 cM apart). \[\theta < 0.5\]: Linkage (e.g., θ=0.1 means 10% recombination). \[\theta = 0\]: Complete linkage (no recombination, syntenic loci).

Units: Measured in centimorgans (cM); 1 cM ≈ 1% recombination ≈ 1 Mb DNA.

Phases: Coupling (alleles on same chromosome) vs. repulsion (opposite).

Applications: Parametric (assumes model) for Mendelian diseases; non-parametric for complex traits. Historical in positional cloning (e.g., cystic fibrosis gene).

LOD Score

The LOD (Logarithm of Odds) score statistically tests for linkage by comparing likelihoods.

Formula\[Z(\theta) = \log_{10} \frac{L(\text{data} | \theta=\hat{\theta})}{L(\text{data} | \theta=0.5)}\], where $\hat{\theta}$ is the maximum likelihood estimate of θ, and L is the likelihood function.
Calculation: Involves summing over possible phases and recombinants in pedigrees.
Interpretation: Positive Z favors linkage; threshold Z≥3 for significance (false positive rate 5%); Z≤-2 excludes linkage.

LOD Score	Odds Ratio	Interpretation
Z ≤ -2	≤1:100	Strong evidence against linkage (exclusion)
-2 < Z < 3	Indeterminate	Suggestive but not conclusive
Z ≥ 3	≥1000:1	Strong evidence for linkage (genome-wide significance)

Example: In a family with a disease and marker, if data is 1000 times more likely under θ=0.1 than 0.5, Z=3.
Limitations: Requires large pedigrees; sensitive to model misspecification (e.g., wrong penetrance). Superseded by GWAS for complex diseases.

Linkage Disequilibrium (LD) and Association

LD refers to non-random association of alleles at different loci in a population, often due to shared ancestry rather than physical linkage. It decays over generations via recombination and is key in association studies like GWAS, where markers tag causal variants.

Causes of LD: Mutation, selection, genetic drift, population admixture, or bottlenecks. High LD in isolated populations (e.g., Ashkenazi Jews).
Decay: LD decreases with genetic distance and time; measured in haplotype blocks.
Vs. Linkage: Linkage is family-based (meiotic); LD is population-based (historical).
Applications: Imputation in GWAS; fine-mapping; inferring evolutionary history.

Measures of LD

Common metrics quantify deviation from independence. Assume two biallelic loci: A/a (frequencies p_A, 1-p_A) and B/b (p_B, 1-p_B); haplotype frequency p_AB.

D (Disequilibrium Coefficient):

Formula\[D_{AB} = p_{AB} - p_A p_B.\]

Interpretation: D>0: Excess AB haplotypes; D<0: Deficit. D=0: Equilibrium.

Limitations: Sensitive to allele frequencies; not comparable across loci.

D' (Normalized D):

Formula: if $D<0$, $D' = \frac{D}{\max(-p_A p_B, -(1-p_A)(1-p_B))}$

if $D>0$, $D' = \frac{D}{\min((1-p_A)p_B, p_A(1-p_B))}$.

Interpretation: Ranges -1 to 1; |D'|=1: Complete LD (no recombination evidence); |D'|<1: Partial.

Use: Detects historical recombination.

r² (Correlation Coefficient):

Formula\[r^2 = \frac{D^2}{p_A (1-p_A) p_B (1-p_B)}\].

Interpretation: 0 to 1; r²=1: Perfect correlation (one SNP predicts another); r²>0.8: Strong proxy. Relates to power in association studies (effective sample size reduced by 1/r²).

Use: Preferred in GWAS for tag SNP selection; less biased by rare alleles.

Additional Considerations

Haplotypes: Combinations of alleles (e.g., estimated via EM algorithm in PHASE software).
Visualization: LD plots (triangular heatmaps) using tools like Haploview.
Best Practices: Account for population structure (e.g., via principal components) to avoid spurious LD. In epidemiology, LD informs polygenic risk scores.
Limitations: LD varies by ancestry (e.g., higher in Europeans than Africans); tagging fails for rare variants.

R Implementation for LD: This code computes D, D', and r² from example haplotype frequencies or counts. It uses base R for simplicity; for real data, consider packages like genetics or snpStats.

# Linkage Disequilibrium Measures
# Input: Haplotype counts (e.g., from population data)
# Assume biallelic loci A/a, B/b
count_AB <- 120  # AB haplotypes
count_Ab <- 80   # Ab
count_aB <- 60   # aB
count_ab <- 140  # ab
total <- sum(c(count_AB, count_Ab, count_aB, count_ab))

# Haplotype frequencies
p_AB <- count_AB / total
p_Ab <- count_Ab / total
p_aB <- count_aB / total
p_ab <- count_ab / total

# Allele frequencies
p_A <- p_AB + p_Ab
p_a <- 1 - p_A
p_B <- p_AB + p_aB
p_b <- 1 - p_B

# D
D <- p_AB - p_A * p_B

# D' (normalized)
D_max_pos <- min(p_A * p_b, p_a * p_B)
D_max_neg <- -min(p_A * p_B, p_a * p_b)
D_prime <- ifelse(D >= 0, D / D_max_pos, D / D_max_neg)

# r-squared
r_sq <- D^2 / (p_A * p_a * p_B * p_b)

# Output
cat("Allele Frequencies: p_A =", round(p_A, 3), "p_B =", round(p_B, 3), "\n")
cat("D =", round(D, 4), "\n")
cat("D' =", round(D_prime, 4), "\n")
cat("r-squared =", round(r_sq, 4), "\n")
if (abs(D_prime) == 1) cat("Complete LD detected.\n") else cat("Partial or no LD.\n")

Genome-Wide Association Studies (GWAS)

GWAS tests for correlation between genetic markers (SNPs) and phenotypes across the entire genome in unrelated individuals.

Statistical Model: Typically uses logistic regression for case-control studies.

\[\ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 \cdot \text{SNP} + \text{Covariates}.\]

Manhattan & QQ Plots: Used to visualize results. Because millions of tests are performed, strict significance thresholds (e.g., $p < 5 \times 10^{-8}$) are required to avoid false positives.

Gene-Environment Interactions

Disease risk is often modeled as a combination of genetics ($G$), environment ($E$), and their interaction ($G \times E$). \[Y = \beta_0 + \beta_1 G + \beta_2 E + \beta_3 (G \times E) + \epsilon.\]

Interaction Models:

Synergistic: Genotype exacerbates the risk factor (or vice versa).
Independent: Both factors influence risk but do not interact.

Model: Genotype exacerbates the effect of the risk factor

Core Epidemiological Measures

In addition to genetic metrics, standard epidemiological measures provide essential tools for assessing disease risk, evaluating interventions, and guiding public health decisions. These metrics help quantify associations between exposures and outcomes, estimate treatment effects, and inform policy. Below, we outline key measures, including definitions, formulas, interpretations, and examples. Where relevant, we include R code implementations for practical computation.

Absolute Risk Reduction (ARR)

Definition: The difference in event rates (incidences) between a control (or unexposed) group and a treatment (or exposed) group. ARR measures the absolute effect of an intervention or exposure on outcome risk.
Formula\[ARR = I_{\text{control}} - I_{\text{treatment}}\], where $I$ represents incidence (proportion of events).
Interpretation: A positive ARR indicates risk reduction (benefit); a negative ARR indicates increased risk (harm). It is straightforward but does not account for baseline risk.
Example: If the incidence of heart attacks is 10% in the control group and 7% in the treatment group, ARR = 0.10 - 0.07 = 0.03 (3% absolute reduction).
When to Use: Prospective studies like randomized controlled trials (RCTs); useful for communicating tangible benefits to patients.
Limitations: Sensitive to baseline risk; not ideal for comparing across populations with different event rates.

Relative Risk (RR)

Definition: The ratio of the incidence of an outcome in the exposed group to that in the unexposed group. RR assesses how much an exposure increases or decreases the probability of an event.
Formula\[RR = \frac{I_{\text{exposed}}}{I_{\text{unexposed}}}\].
Interpretation: RR > 1 indicates increased risk due to exposure; RR < 1 indicates protective effect; RR = 1 indicates no association. It is multiplicative and accounts for baseline risk.
Example: In a cohort study, smokers have a 20% lung cancer incidence, while non-smokers have 2%. RR = 0.20 / 0.02 = 10 (smokers are 10 times more likely to develop lung cancer).
When to Use: Cohort studies or RCTs; preferred for common outcomes.
Limitations: Can overestimate associations for rare events; not applicable in case-control studies.

Odds Ratio (OR)

A 2×2 contingency table with odds ratio formula.

**2 × 2 Contingency Table for Odds Ratio**
	Exposed	Unexposed	Total
Cases (Diseased)	a	c	a + c
Controls (Healthy)	b	d	b + d
Total	a + b	c + d	N
Odds Ratio (OR) Calculation: $$OR = \frac{a / c}{b / d} = \frac{ad}{bc}$$

Formula\[OR = \frac{ad}{bc}\]

Odds Ratio = [Exposed Cases × Unexposed Non-cases] / [Exposed Non-cases × Unexposed Cases].

Definition: The ratio of the odds of an outcome in the exposed group to the odds in the unexposed group. OR approximates RR when the outcome is rare.
Formula: From a 2x2 contingency table (a = exposed cases, b = exposed non-cases, c = unexposed cases, d = unexposed non-cases)\[OR = \frac{ad}{bc}\].
Interpretation: OR > 1 suggests positive association; OR < 1 suggests inverse association; OR = 1 suggests no association. It is often used in logistic regression.
Example: In a case-control study of diabetes and obesity: 80 obese diabetics (a), 20 obese non-diabetics (b), 30 non-obese diabetics (c), 70 non-obese non-diabetics (d). OR = (80*70) / (20*30) = 9.33 (obesity increases odds of diabetes by over 9 times).
When to Use: Case-control studies or when incidence data is unavailable; common in meta-analyses.
Limitations: Not directly interpretable as risk for common outcomes; can differ from RR if events are frequent.

Number Needed to Treat (NNT) or Harm (NNH)

Definition: The average number of patients who need to be treated (or exposed) to prevent (or cause) one additional outcome compared to the control. NNT is based on ARR and translates statistical effects into clinical relevance.
Formula\[NNT = \frac{1}{|ARR|}\] (use absolute value for magnitude; sign of ARR determines benefit vs. harm).
Interpretation: Lower NNT indicates greater treatment efficacy. If ARR > 0, it's NNT (benefit); if ARR < 0, it's NNH (harm). Infinite NNT means no effect.
Example (Benefit): ARR = 0.03 (as above), NNT = 1 / 0.03 ≈ 33.3 (treat 33 patients to prevent one heart attack).
Example (Harm): If treatment increases incidence from 50% to 80%, ARR = 0.50 - 0.80 = -0.30, NNH = 1 / 0.30 ≈ 3.3 (treat 3 patients to cause one additional bad outcome).
When to Use: RCTs or systematic reviews; helps in shared decision-making and cost-benefit analysis.
Limitations: Assumes constant ARR; sensitive to time frame and baseline risk. Confidence intervals should be reported for real-world application.

Additional Considerations

Confidence Intervals (CI): Always compute 95% CIs for these measures to assess precision (e.g., using bootstrap methods or formulas in R packages like epitools or epiR).
Attributable Risk (AR): Extends RR; AR = $I_{\text{exposed}} - I_{\text{unexposed}}$ (absolute risk due to exposure).
Population Attributable Risk (PAR): PAR = $I_{\text{population}} - I_{\text{unexposed}}$ (proportion of cases attributable to exposure in the population).
Best Practices: Adjust for confounders using multivariable models; interpret in context (e.g., RR may seem large for rare events but have small absolute impact).
Software Tools: R (with packages like epiR, Epi, or survival for advanced metrics like Hazard Ratios) or Python (with scipy or lifelines) are commonly used.

R Implementation for Key Measures: This code snippet computes ARR, RR, OR, NNT/NNH from example data. It includes error handling and supports both benefit and harm scenarios.

# Install if needed: install.packages("epiR")  # But assuming it's available or use base R

# Example data: 2x2 table for OR/RR (cohort study assumption)
# Rows: Exposed (1) vs Unexposed (0); Columns: Cases vs Non-cases
a <- 20  # Exposed cases
b <- 80  # Exposed non-cases
c <- 2   # Unexposed cases
d <- 98  # Unexposed non-cases

# Incidences
I_exposed <- a / (a + b)
I_unexposed <- c / (c + d)

# Absolute Risk Reduction (assuming unexposed = control, exposed = treatment)
ARR <- I_unexposed - I_exposed  # Positive if treatment reduces risk

# Relative Risk
RR <- I_exposed / I_unexposed

# Odds Ratio
OR <- (a * d) / (b * c)

# Number Needed to Treat/Harm
NNT <- ifelse(ARR != 0, 1 / abs(ARR), Inf)
type <- ifelse(ARR > 0, "NNT (Benefit)", ifelse(ARR < 0, "NNH (Harm)", "No Effect"))

# Output
cat("Incidence Exposed:", round(I_exposed, 3), "\n")
cat("Incidence Unexposed:", round(I_unexposed, 3), "\n")
cat("ARR:", round(ARR, 3), "\n")
cat("RR:", round(RR, 3), "\n")
cat("OR:", round(OR, 3), "\n")
cat(type, ":", ifelse(is.finite(NNT), round(NNT, 1), "Infinite"), "\n")

# Harm example (swap for treatment increasing risk)
I_control <- 0.50
I_treatment <- 0.80
ARR_harm <- I_control - I_treatment
NNT_harm <- 1 / abs(ARR_harm)
cat("\nHarm Example - ARR:", round(ARR_harm, 3), "\n")
cat("NNH:", round(NNT_harm, 1), "\n")

The following code uses a custom function to handle both benefit (reduction in risk) and harm (increase in risk) scenarios dynamically.

# Function to calculate Epidemiological Metrics
calculate_epi_stats <- function(a, b, c, d) {
# Input validation
if (any(c(a, b, c, d) < 0)) stop("Cell counts must be non-negative")

# Basic Incidences
i_exp <- a / (a + b)
i_unexp <- c / (c + d)

# Absolute Risk Reduction (ARR)
# Defined as Control Risk - Treated Risk
arr <- i_unexp - i_exp

# Relative Risk (RR) and Odds Ratio (OR)
# Adding small constant (0.5) if zero to avoid division by zero errors
rr <- i_exp / i_unexp
or <- (a * d) / (b * c)

# NNT / NNH Logic
nnt_val <- if (arr != 0) 1 / abs(arr) else Inf
nnt_label <- ifelse(arr > 0, "NNT (Benefit)",
ifelse(arr < 0, "NNH (Harm)", "No Effect"))

# Return formatted list
return(list(
Incidence_Exposed = round(i_exp, 4),
Incidence_Unexposed = round(i_unexp, 4),
ARR = round(arr, 4),
RR = round(rr, 2),
OR = round(or, 2),
Measure = nnt_label,
Value = round(nnt_val, 1)
))
}

# --- Example 1: Beneficial Treatment ---
# (e.g., Vaccine trial: Exposed = Vaccinated)

result_benefit <- calculate_epi_stats(a=20, b=80, c=40, d=60)
print(result_benefit)

# --- Example 2: Harmful Exposure ---
# (e.g., Smoking vs Lung Cancer)
result_harm <- calculate_epi_stats(a=50, b=50, c=10, d=90)
print(result_harm)

To understand the output and interpret the results, we use the following logic.

Metric	Interpretation	Calculation
ARR	The actual difference in risk between groups.	$I_{unexposed} - I_{exposed}$
RR	How many times more likely the outcome is in the exposed group.	$I_{exposed} / I_{unexposed}$
OR	The ratio of the odds of the outcome occurring in the exposed group.	$\frac{a \times d}{b \times c}$
NNT	The number of patients needed to treat to prevent one bad outcome.	$1 / ARR$

Note on NNT/NNH: If the ARR is positive (risk decreased), we report Number Needed to Treat (NNT). If the ARR is negative (risk increased), we report Number Needed to Harm (NNH).

Confidence Intervals: In a real-world study, we typically report 95% CIs. Using `prop.test()` in base R to get CIs for risk differences.

Log Transformation: For OR and RR, calculations for CIs are typically done on the scale to ensure symmetry.

Rmd/R Epidemiological simulation of a hypothetical viral epidemic

We'll try to illustrates genomics tracking, prediction, and spatio-temporal quantification of a viral epidemic. For this, we'll build a high-fidelity epidemiological simulation in R, integrating three distinct domains: compartmental modeling (SIR/SEIR), phylodynamics (genomic tracking), and spatial movement.

To keep this self-contained and realistic, we will use a Stochastic Agent-Based Model (ABM) approach. This allows us to track individual "viral sequences" as they mutate and spread across a coordinate system.

The R simulation requires the following libraries:

tidyverse: For data manipulation and plotting.

gganimate: For spatio-temporal visualization.

adegenet or ape: For handling genomic distances.

sf: for projecting this onto real-world maps.

The Core Simulation Engine

This script generates a synthetic population, simulates transmission based on proximity, and introduces mutations (modeled as bit-strings or distance matrices).

library(tidyverse)
library(gganimate)

# --- Setup Parameters ---
set.seed(42)
n_agents <- 200
map_size <- 100
initially_infected <- 3
mutation_rate <- 0.05  # Probability of a genomic shift per transmission
timesteps <- 50

# --- Initialize Population ---
pop <- tibble(
  id = 1:n_agents,
  x = runif(n_agents, 0, map_size),
  y = runif(n_agents, 0, map_size),
  state = "S",
  genome = 0,        # Base viral lineage
  parent_id = NA,
  infect_time = NA
)

# Infect patient zeros
pop$state[1:initially_infected] <- "I"
pop$infect_time[1:initially_infected] <- 0
pop$genome[1:initially_infected] <- sample(1:100, initially_infected)

# --- Run Simulation ---
history <- list()

for (t in 1:timesteps) {
  # 1. Movement (Random Walk)
  pop$x <- pmax(0, pmin(map_size, pop$x + rnorm(n_agents, 0, 2)))
  pop$y <- pmax(0, pmin(map_size, pop$y + rnorm(n_agents, 0, 2)))
  
  # 2. Transmission
  infected <- which(pop$state == "I")
  susceptible <- which(pop$state == "S")
  
  for (i in infected) {
    # Find neighbors in range
    dists <- sqrt((pop$x[susceptible] - pop$x[i])^2 + (pop$y[susceptible] - pop$y[i])^2)
    contacts <- susceptible[dists < 5] # Transmission radius
    
    if (length(contacts) > 0) {
      for (target in contacts) {
        if (runif(1) < 0.3) { # Prob of infection
          pop$state[target] <- "I"
          pop$infect_time[target] <- t
          pop$parent_id[target] <- pop$id[i]
          # Genomic Mutation: Slight drift from parent genome
          pop$genome[target] <- pop$genome[i] + 
            sample(c(-1, 0, 1), 1, prob = c(mutation_rate/2, 1-mutation_rate, mutation_rate/2))
        }
      }
    }
  }
  
  # 3. Recovery
  pop$state[pop$state == "I" & (t - pop$infect_time) > 10] <- "R"
  
  history[[t]] <- pop %>% mutate(time = t)
}

sim_data <- bind_rows(history)

Here is the structure of the simulated dataset.

> str(sim_data )
tibble [10,000 × 8] (S3: tbl_df/tbl/data.frame)
 $ id         : int [1:10000] 1 2 3 4 5 6 7 8 9 10 ...
 $ x          : num [1:10000] 89 92.5 28 83.5 65.4 ...
 $ y          : num [1:10000] 86.5 50.2 85.2 47.1 16.4 ...
 $ state      : chr [1:10000] "I" "I" "I" "S" ...
 $ genome     : num [1:10000] 80 53 24 0 0 0 0 0 0 0 ...
 $ parent_id  : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ infect_time: num [1:10000] 0 0 0 NA NA NA NA NA NA NA ...
 $ time       : int [1:10000] 1 1 1 1 1 1 1 1 1 1 ...

Genomic Tracking & Phylogeny

To visualize the "Genomic Tracking," we treat the genome value as a proxy for a sequence. In a real scenario, this would be a high-dimensional Hamming distance, but here we can visualize the Clade Divergence over time.

Visualizing the Spatio-Temporal Spread: Using gganimate, we can see how the virus clusters geographically and how the "Genomic Signature" (color) evolves.

anim <- ggplot(sim_data, aes(x = x, y = y, color = as.factor(genome), size = state)) +
  geom_point(alpha = 0.7) +
  scale_size_manual(values = c("S" = 1, "I" = 3, "R" = 1.5)) +
  theme_minimal() +
  labs(title = "Viral Spread & Genomic Drift: Time {frame}",
       color = "Genomic Lineage",
       size = "Infection Status") +
  transition_time(time)

# animate(anim) # Uncomment to render

Realistic Epidemiological Quantification

To make this "highly realistic," we should calculate the Effective Reproduction Number ($R_t$) and the Spatio-Temporal Kernel 3. Realistic Epidemiological Quantification

$R_t=\frac{\text{New Infections at time}\ t}{\text{Infectious Individuals at time}\ t−1}.$

Quantifying the Clusters: We can use a K-nearest neighbor (KNN) or DBSCAN algorithm on the sim_data to identify hotspots

where specific genomic lineages are dominating. This mimics real-world Genomic Surveillance reports.

# Example quantification of lineage prevalence
library(dplyr)
library(ggplot2)

lineage_summary_plot <- sim_data %>%
    filter(state == "I") %>%
    # Use group_by to organize the data by time and genome
    group_by(time, genome) %>% 
    summarise(count = n(), .groups = "drop") %>%
    ggplot(aes(x = time, y = count, fill = as.factor(genome))) +
    geom_area() +
    labs(title = "Genomic Lineage Succession", x = "Days", y = "Active Cases")

# Display the plot
lineage_summary_plot

To extend this code into an Rmd notebook, we can include

 1. Parameter Sweeping: Wrap the simulation in a function and run it with different  values.
 2. Network Analysis: Use the `parent_id` column to create a transmission tree using the igraph package.
 3. Kernel Density: Add a geom_density_2d() layer to the map to show spatial intensity.

Let's expand the simulation from a simple proxy to a high-fidelity Phylodynamic Model by simulating a founder DNA sequence and allow it to evolve through point mutations (substitutions) as it spreads across the population.

DNA Sequences & Mutation: Instead of a single number, each agent now carries a character string. When transmission occurs, there is a probability that one or more bases in the sequence will mutate.

library(tidyverse)
library(igraph)
library(ggraph)
library(stringdist) # For Hamming Distance

# --- Configuration ---
seq_length <- 50
bases <- c("A", "C", "G", "T")
mutation_rate <- 0.01 # Probability per base per transmission

# Create a starting sequence (Patient Zero)
founder_seq <- paste(sample(bases, seq_length, replace = TRUE), collapse = "")

# Helper function to mutate DNA
mutate_dna <- function(sequence) {
  seq_vec <- strsplit(sequence, "")[[1]]
  mutation_mask <- runif(seq_length) < mutation_rate
  if(any(mutation_mask)) {
    seq_vec[mutation_mask] <- sample(bases, sum(mutation_mask), replace = TRUE)
  }
  return(paste(seq_vec, collapse = ""))
}

Genetic Distance Matrix: To quantify the "genomic tracking" aspect, we use the Hamming Distance, which counts the number of positions at which the corresponding nucleotides are different.

# Extract all unique sequences found in the simulation
unique_genomes <- unique(pop$genome)

# Calculate Hamming Distance Matrix
dist_matrix <- stringdistmatrix(unique_genomes, unique_genomes, method = "hamming")
rownames(dist_matrix) <- unique_genomes
colnames(dist_matrix) <- unique_genomes

# This matrix allows us to see how far 'Variant B' has drifted from 'Variant A'

Integrated Spatio-Temporal & Transmission Model: We will update the simulation to record the transmission pairs to build the `igraph` tree and add the spatial density layer.

# --- Initialization Fixes ---
# Ensure genome column is a character vector to hold DNA strings
pop$genome <- as.character(pop$genome) 
edges <- data.frame(from = integer(), to = integer())

for (t in 1:timesteps) {
  # 1. Movement
  pop$x <- pmax(0, pmin(map_size, pop$x + rnorm(n_agents, 0, 2)))
  pop$y <- pmax(0, pmin(map_size, pop$y + rnorm(n_agents, 0, 2)))
  
  # 2. Transmission
  infected <- which(pop$state == "I")
  susceptible <- which(pop$state == "S")
  
  for (i in infected) {
    # Check distances against only the susceptible population
    dists <- sqrt((pop$x[susceptible] - pop$x[i])^2 + (pop$y[susceptible] - pop$y[i])^2)
    contacts <- susceptible[dists < 5] 
    
    for (target in contacts) {
      # Important: Check if target is still susceptible (in case they were infected
      # by someone else earlier in this same timestep loop)
      if (pop$state[target] == "S" && runif(1) < 0.3) { 
        
        # Update Status
        pop$state[target] <- "I"
        pop$infect_time[target] <- t
        pop$parent_id[target] <- pop$id[i]
        
        # 1. Expand Genomic Logic: DNA Mutation
        # We replace the numeric addition with our string mutation function
        new_sequence <- mutate_dna(pop$genome[i]) 
        pop$genome[target] <- new_sequence
        
        # 3. Track Transmission for igraph
        edges <- rbind(edges, data.frame(from = pop$id[i], to = pop$id[target]))
      }
    }
  }
  
  # 3. Recovery
  pop$state[pop$state == "I" & (t - pop$infect_time) > 10] <- "R"
  
  history[[t]] <- pop %>% mutate(time = t)
}

sim_data <- bind_rows(history)

# --- Visualization: Spatial Intensity ---
ggplot(sim_data %>% filter(state == "I"), aes(x = x, y = y)) +
  geom_density_2d_filled(
    aes(fill = after_stat(level)),
    alpha = 0.6,
    contour_var = "ndensity"   # optional: normalize for better contrast
  ) +
  geom_point(aes(color = genome), alpha = 0.9) +
  scale_fill_viridis_d() +      # discrete fill for the bands
  theme_minimal() +
  labs(title = "Viral Hotspots and Genomic Clustering") +
  theme(legend.position = "none")

The Transmission Tree (Phylogeny): Using igraph and ggraph, we can visualize the

Who-Infects-Whom network. This is the gold standard for contact tracing and understanding super-spreading events.

# Create graph object
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)

# Plot the tree
ggraph(transmission_graph, layout = 'tree') + 
  geom_edge_diagonal(alpha = 0.2, color = "grey") + 
  geom_node_point(color = "steelblue", size = 2) + 
  theme_void() +
  labs(title = "Reconstructed Transmission Tree")

# An alternative plot_ly() SVG infection graph can be created by
library(plotly)
library(igraph)
library(dplyr)  # For piping and joins

# Extract layout coordinates (tree structure)
layout <- layout_as_tree(transmission_graph)

# Prepare node data with safe column names
node_data <- data.frame(
  node_id = V(transmission_graph)$name,  # Renamed from 'id'
  x = layout[,1],
  y = layout[,2]
)

# Calculate in-degree (number of incoming edges) to identify root nodes
in_degree <- degree(transmission_graph, mode = "in")

# Create nodes dataframe with size based on root status
nodes <- node_data %>%
  mutate(
    size = ifelse(in_degree == 0, 10, 7),  # Root nodes (in-degree=0) are larger
    label = ifelse(in_degree == 0, "Root", "Descendant")
  )

# Prepare edge data with directional arrows
edges <- as_data_frame(transmission_graph, what = "edges") %>%
  left_join(nodes, by = c("from" = "node_id")) %>%
  left_join(nodes, by = c("to" = "node_id"), suffix = c("_from", "_to"))

# 1. Calculate children for each node
children_info <- edges %>%
  group_by(from) %>%
  summarise(children = paste(to, collapse = ", ")) %>%
  rename(node_id = from)

# 2. Calculate parents for each node
parent_info <- edges %>%
  group_by(to) %>%
  summarise(parents = paste(from, collapse = ", ")) %>%
  rename(node_id = to)

# 3. Join this info back to your nodes dataframe
nodes <- nodes %>%
  left_join(children_info, by = "node_id") %>%
  left_join(parent_info, by = "node_id") %>%
  mutate(
    children = ifelse(is.na(children), "None", children),
    parents = ifelse(is.na(parents), "None", parents)
  )

# Create interactive plot
transmission_plot <- plot_ly() %>%
  add_segments(
    data = edges,
    x = ~x_from, y = ~y_from,
    xend = ~x_to, yend = ~y_to,
    line = list(
      color = ~ifelse(x_from < x_to, "#1f77b4", "#ff7f0e"),
      width = 2
    ),
    hoverinfo = "none"
  ) %>%
  add_trace(
    data = nodes,
    x = ~x,
    y = ~y,
    type = "scatter",
    mode = "markers",
    marker = list(
      size = ~size,
      opacity = 0.8,
      line = list(width = 1, color = "black")
    ),
    # Added Parents and Children to the hover text
    text = ~paste0(
      "<b>Node ID:</b> ", node_id, 
      "<br><b>Type:</b> ", label,
      "<br><b>Parent(s):</b> ", parents,
      "<br><b>Children:</b> ", children
    ),
    hoverinfo = "text",
    showlegend = FALSE
  ) %>%
  layout(
    title = "Interactive Viral Transmission Tree (Graph)",
    xaxis = list(showgrid = FALSE, showticklabels = FALSE, zeroline = FALSE),
    yaxis = list(showgrid = FALSE, showticklabels = FALSE, zeroline = FALSE),
    hovermode = "closest"
  )

transmission_plot

Genetic Distance: By calculating (distance between sequence and ), we can identify if a new "cluster" appearing in a specific city is a local evolution or a new introduction from outside.

Spatial Kernel: The geom_density_2d visualizes the spatial transmission kernel, identifying where the force of infection is highest.

Tree Topology: A star-like phylogeny suggests rapid exponential growth, while a ladder-like

phylogeny suggests a stable, endemic transmission chain.

# Proper Genetic Distance Matrix
unique_strains <- unique(sim_data$genome[sim_data$state == "I"])
dist_mat <- stringdistmatrix(unique_strains, unique_strains, method = "hamming")

# Optional: Visualize distance as a heatmap
image(as.matrix(dist_mat), main = "Genetic Distance Heatmap")

# SVG plot using plot_ly()
library(plotly)

# 1. Get the unique viral strains
unique_strains <- unique(sim_data$genome[sim_data$state == "I"])

# 2. Generate clean Short IDs (e.g., "Strain 1", "Strain 2", etc.)
# Or use the first few characters if that's where the ID is hidden
short_ids <- paste0("S-", seq_along(unique_strains)) 

# 3. Calculate the matrix using the full sequences
dist_mat <- stringdistmatrix(unique_strains, unique_strains, method = "hamming")
m <- as.matrix(dist_mat)

# 4. Handle Infinities
m[is.infinite(m)] <- max(m[is.finite(m)], na.rm = TRUE)

# 5. Assign the CLEAN IDs to the matrix
rownames(m) <- short_ids
colnames(m) <- short_ids

# 6. Plot
fig <- plot_ly(
    z = m, 
    x = colnames(m), 
    y = rownames(m), 
    type = "heatmap",
    colorscale = "Viridis",
    # This customdata trick lets you see the FULL sequence when you hover!
    customdata = matrix(unique_strains, nrow=nrow(m), ncol=ncol(m)),
    hovertemplate = "Row: %{y}<br>Col: %{x}<br>Dist: %{z}<extra></extra>"
  ) %>%
  layout(
    xaxis = list(title = "Strain ID", type = "category"),
    yaxis = list(title = "Strain ID", type = "category", autorange = "reversed")
  )

fig

Finally, display the transmission tree graph.

inf_graph <- graph_from_data_frame(edges, directed = TRUE)

ggraph(inf_graph, layout = 'kk') + 
  geom_edge_link(alpha = 0.3, arrow = arrow(length = unit(2, 'mm'))) + 
  geom_node_point(color = "red", size = 1) + 
  theme_void() + 
  labs(title = "Epidemiological Transmission Network")

Applications and Software

Modern epidemiology relies heavily on computational tools.

Key R Packages: epiR (Epi measures), genetics (HWE, LD), survival (time-to-event), qqman (GWAS visualization).
Online Tools: SOCR Distribution Tables.

Problems

Problem 1: Linkage Mapping

Scenario: Analyze the pedigree below under a Dominant Inheritance model. We need to estimate the recombination fraction $\theta$.

Calculate LOD Scores:

Using Maximum Likelihood Estimation (MLE), if the phase is known and the data include 4 non-recombinant and 1 recombinant meioses, the likelihood is: \[L(\theta) = (1-\theta)^4 \theta\]. The LOD score is defined as: \[Z(\theta) = \log_{10}\left(\frac{L(\theta)}{L(0.5)}\right)\], where $L(0.5) = (0.5)^5 = 1/32$.

Corrected Result Table:

$\theta$	$Z(\theta)$
0.0	$-\infty$
0.10	0.322
0.20	0.418 (Max LOD)
0.50	0.0

The maximum LOD score occurs at $\hat{\theta} = 0.20$, consistent with 1 recombinant out of 5 informative meioses. > Note: A LOD > 3 is conventionally required for significant linkage; this example is illustrative.

Problem 2: Number Needed to Treat (NNT) and Harm (NNH)

Scenario: A trial reports 800 events among 1000 patients in Treatment Group A and 600 events among 1200 patients in Control Group B.

Event rates:

\[p_A = 800/1000 = 0.80\], $p_B = 600/1200 = 0.50$.

Absolute Risk Increase (ARI):

\[\text{ARI} = p_A - p_B = 0.30\].

Since the treatment increases risk, we compute the Number Needed to Harm (NNH):

\[\text{NNH} = \frac{1}{\text{ARI}} = \frac{1}{0.30} \approx 3.33\].

Interpretation: For every 3–4 patients treated with intervention A (vs. control), one additional adverse event occurs. The negative NNT is conventionally reported as NNH.

Problem 3: GWAS Power Analysis (R)

Scenario: Simulate a case-control GWAS with 500 cases and 500 controls, Minor Allele Frequency (MAF) = 0.2, and Odds Ratio (OR) = 1.5. Estimate statistical power at $\alpha = 0.05$.

The following R function correctly simulates genotypes separately for cases and controls based on genotype relative risks derived from the OR:

 gwas_power_case_control <- function(n_cases = 200, n_controls = 200, 
                                  maf = 0.2, OR = 1.5, 
                                  alpha = 0.05, n_sims = 100) {
 # Genotype coding: 0, 1, 2 copies of effect allele
 # Relative risks under additive log-odds model: 1, OR, OR^2
 rr <- c(1, OR, OR^2)
 
 # Genotype frequencies in controls (Hardy-Weinberg)
 g_ctrl <- dbinom(0:2, size = 2, prob = maf)
 
 # Genotype frequencies in cases (proportional to g_ctrl * rr)
 g_case_unscaled <- g_ctrl * rr
 g_case <- g_case_unscaled / sum(g_case_unscaled)
 
 significant <- numeric(n_sims)
 
 for (i in 1:n_sims) {
   # Simulate genotypes
   geno_cases <- sample(0:2, n_cases, replace = TRUE, prob = g_case)
   geno_ctrls <- sample(0:2, n_controls, replace = TRUE, prob = g_ctrl)
   
   geno <- c(geno_cases, geno_ctrls)
   status <- c(rep(1, n_cases), rep(0, n_controls))
   
   # Fit logistic regression
   model <- glm(status ~ geno, family = binomial)
   p_val <- summary(model)$coefficients[2, 4]
   significant[i] <- as.numeric(p_val < alpha)
 }
 
 return(mean(significant))  # Estimated power
 }

 # Example usage:
 set.seed(2026)
 power_est <- gwas_power_case_control()
 cat("Estimated power:", round(power_est, 3), "\n")

Problem 4: Hardy-Weinberg Equilibrium (HWE) Testing

Scenario: In a GWAS control group ($n = 1000$), genotype counts for a SNP are: AA = 240, Aa = 120, aa = 40. Test HWE using an exact test.

 # Install if needed: install.packages("HardyWeinberg")
 library(HardyWeinberg)

 # Observed counts: AA, Aa, aa
 obs <- c(240, 120, 40)

 # Exact HWE test
 hwe_test <- HWExact(obs, verbose = FALSE)
 p_val <- hwe_test$pval

 cat("HWE exact test p-value:", format.pval(p_val, digits = 4), "\n")

 # Interpretation: p < 0.001 suggests deviation from HWE (common QC threshold in GWAS)

Problem 5: Assessing Confounding in Observational Data

Scenario: Evaluate whether age and sex confound the association between an exposure and disease outcome.

We'll use the Logistic Distribution function plogis, which provides the standard Density, distribution function, quantile function and random generation for the logistic distribution with parameters location and scale, and rbinom.

  set.seed(123)
  n <- 5000
  # Simulate confounders and exposure

  age <- rnorm(n, mean = 50, sd = 10)
  sex <- rbinom(n, 1, 0.5)
  exposure <- rbinom(n, 1, plogis(-1 + 0.02 * age + 0.3 * sex))

  # Simulate disease influenced by exposure and confounders
  disease <- rbinom(n, 1, plogis(-2 + 0.9 * exposure + 0.03 * age + 0.4 * sex))


  # Crude model
  crude_or <- exp(coef(glm(disease ~ exposure, family = binomial))["exposure"])


  # Adjusted model
  adj_or <- exp(coef(glm(disease ~ exposure + age + sex, family = binomial))["exposure"])

  # Percent change due to confounding
  confounding_pct <- (crude_or - adj_or) / crude_or * 100
  cat("Crude OR:", round(crude_or, 2), "\n")
  cat("Adjusted OR:", round(adj_or, 2), "\n")
  cat("Percent change:", round(confounding_pct, 1), "%\n")

  if (abs(confounding_pct) > 10) {
   cat("→ Substantial confounding detected (change > 10%).\n")
  } else cat("→ No substantial confounding detected (change < 10%).\n")

Problem 6: Diagnostic Test Accuracy via ROC Analysis

Scenario: A continuous biomarker is measured in 100 diseased and 100 non-diseased individuals. Determine the optimal cutoff, sensitivity, and specificity using Youden’s index.

  # Install if needed: install.packages("pROC")
  library(pROC)

  # Simulate biomarker scores
  set.seed(42)
  score_healthy <- rnorm(100, mean = 0, sd = 1)
  score_diseased <- rnorm(100, mean = 1.5, sd = 1)

  score <- c(score_healthy, score_diseased)
  status <- c(rep(0, 100), rep(1, 100))

  # ROC analysis
  roc_curve <- roc(status, score)
  auc_val <- auc(roc_curve)

  # Optimal threshold (Youden's J statistic)
  opt <- coords(roc_curve, "best", ret = c("threshold", "sensitivity", "specificity"))

  cat("AUC:", round(auc_val, 3), "\n")
  cat("Optimal cutoff:", round(opt[["threshold"]], 2), "\n")
  cat("Sensitivity:", round(opt[["sensitivity"]], 3), "\n")
  cat("Specificity:", round(opt[["specificity"]], 3), "\n")

Problem 7

Try to integrate structural fixes, biological realism (fitness selection), and technical optimizations into the above genetics simulation of Phylodynamics and Genomic Surveillance in R. Try to demonstrates how viral mutations, spatial movement, and natural selection interact to shape an outbreak.

The Core Simulation Engine can initialize a population and simulates a Random Walk movement model, by introducing a Fitness Function; mutations at specific genomic loci now increase the transmission probability, simulating the emergence of a Variant of Concern.

library(tidyverse)
library(gganimate)
library(stringdist)
library(igraph)
library(ggraph)
library(plotly)

# --- Setup Parameters ---
set.seed(42)
n_agents <- 200
map_size <- 100
initially_infected <- 3
mutation_rate <- 0.02
timesteps <- 60
seq_length <- 50
bases <- c("A", "C", "G", "T")

# --- Helper Functions ---
founder_seq <- paste(sample(bases, seq_length, replace = TRUE), collapse = "")

mutate_dna <- function(sequence) {
  seq_vec <- strsplit(sequence, "")[[1]]
  mutation_mask <- runif(seq_length) < mutation_rate
  if(any(mutation_mask)) {
    seq_vec[mutation_mask] <- sample(bases, sum(mutation_mask), replace = TRUE)
  }
  return(paste(seq_vec, collapse = ""))
}

calc_fitness <- function(sequence) {
  gc_content <- nchar(gsub("[AT]", "", sequence)) / nchar(sequence)
  return(0.3 + (gc_content * 0.2)) 
}

# --- Initialize Population ---
pop <- tibble(
  id = 1:n_agents,
  x = runif(n_agents, 0, map_size),
  y = runif(n_agents, 0, map_size),
  state = "S",
  genome = founder_seq,
  parent_id = NA_integer_,
  infect_time = NA_real_
)

# Set state to "I" for initial cases
pop$state[1:initially_infected] <- "I"
pop$infect_time[1:initially_infected] <- 0

# --- Run Simulation ---
history <- list()
edges <- data.frame(from = integer(), to = integer())

for (t in 1:timesteps) {
  
  # Assignment operator (<-) to update movement
  # loaded cyclical harmonics
  # pop$x <- pop$x * rep(0.1 + cos(10*pi*t/timesteps), times=n_agents) + rnorm(n_agents, 0, 2)
  # pop$y <- pop$y * rep(0.1 + sin(5*pi*t/timesteps), times=n_agents) + rnorm(n_agents, 0, 2)
  
  # simple stochastic dynamics
  pop$x <- pop$x + rnorm(n_agents, 0, 2)
  pop$y <- pop$y + rnorm(n_agents, 0, 2)

  # Optional: Keep agents within map boundaries
  pop$x <- pmax(0, pmin(map_size, pop$x))
  pop$y <- pmax(0, pmin(map_size, pop$y))

  # 2. Transmission
  infected_idx <- which(pop$state == "I")
  
  for (i in infected_idx) {
    susceptible_idx <- which(pop$state == "S")
    if(length(susceptible_idx) == 0) break
    
    dists <- sqrt((pop$x[susceptible_idx] - pop$x[i])^2 + (pop$y[susceptible_idx] - pop$y[i])^2)
    contacts <- susceptible_idx[dists < 6] 
    
    for (target in contacts) {
      if (pop$state[target] == "S" && runif(1) < calc_fitness(pop$genome[i])) {
        pop$state[target] <- "I"
        pop$infect_time[target] <- t
        pop$parent_id[target] <- pop$id[i]
        pop$genome[target] <- mutate_dna(pop$genome[i])
        edges <- rbind(edges, data.frame(from = pop$id[i], to = pop$id[target]))
      }
    }
  }

  # subsetting and assignment for Recovery
  recovery_idx <- which(pop$state == "I" & (t - pop$infect_time) >= 10)
  if(length(recovery_idx) > 0) {
    pop$state[recovery_idx] <- "R"
  }
  
  history[[t]] <- pop %>% mutate(time = t)
}

sim_data <- bind_rows(history)
# str(sim_data)

2. Spatio-Temporal Visualization: By mapping the genomic strings to colors, we can visualize the "clade emergence" over time.

anim <- ggplot(sim_data, aes(x = x, y = y, color = genome, size = state)) +
geom_point(alpha = 0.6) +
scale_size_manual(values = c("S" = 1, "I" = 3, "R" = 1.5)) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Viral Spread & Selection Pressure: Day {frame}") +
transition_time(time)

# animate(anim)

3. Genomic Tracking & Distance Matrices: To understand how much the virus has drifted, we use the **Hamming Distance**. This calculates the number of base substitutions between any two viral samples.

# Extract final strains
final_strains <- sim_data %>%
filter(state %in% c("I", "R"), !is.na(infect_time)) %>%
distinct(genome) %>%
pull(genome)

dist_mat <- stringdistmatrix(final_strains, final_strains, method = "hamming")
rownames(dist_mat) <- paste0("Strain_", 1:length(final_strains))
colnames(dist_mat) <- paste0("Strain_", 1:length(final_strains))

# Visualize Genetic Distance
plot_ly(z = ~dist_mat, x = colnames(dist_mat), y = rownames(dist_mat), type = "heatmap", colorscale = "Viridis") %>%
layout(title = "Genetic Hamming Distance Matrix")

4. Phylogeny: The Transmission Tree: While a distance matrix shows "how much" change occurred, the Transmission Tree shows "who infected whom." A "star-like" tree indicates rapid expansion, while a "ladder-like" tree indicates a slow, constant chain of transmission.

# Reconstruct the tree using igraph
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)

ggraph(transmission_graph, layout = 'tree') +
geom_edge_diagonal(alpha = 0.3, color = "darkgrey") +
geom_node_point(aes(color = name), size = 3, show.legend = FALSE) +
theme_void() +
labs(title = "Reconstructed Epidemiological Transmission Tree")

5. Quantifying the Molecular Clock: A key concept in phylodynamics is the Molecular Clock, the idea that mutations accumulate at a relatively constant rate over time.

# Calculate divergence from founder
clock_data <- sim_data %>%
filter(!is.na(infect_time)) %>%
mutate(dist_from_founder = stringdist(genome, founder_seq, method = "hamming"))

ggplot(clock_data, aes(x = infect_time, y = dist_from_founder)) +
geom_jitter(alpha = 0.2, color = "steelblue") +
geom_smooth(method = "lm", color = "red") +
theme_minimal() +
labs(title = "The Molecular Clock: Mutation Accumulation",
x = "Time of Infection (Days)",
y = "Hamming Distance from Founder")

Here are a couple of alternative plots ...

library(ggraph)
library(ggplot2)
library(plotly)

# Create the ggraph plot
gg <- ggraph(transmission_graph, layout = 'tree') +
  geom_edge_diagonal(alpha = 0.3, color = "darkgrey") +
  geom_node_point(aes(color = name, 
                      text = paste("ID:", name,
                                   "<br>Click for details")), 
                  size = 5) +
  theme_void() +
  labs(title = "Reconstructed Epidemiological Transmission Tree")

# Convert to interactive plotly
ggplotly(gg, tooltip = "text") %>%
  layout(
    hoverlabel = list(bgcolor = "white", font = list(size = 12)),
    xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
    yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)
  )

and ...

library(plotly)
library(igraph)
library(networkD3)

# Convert igraph to networkD3 format
transmission_d3 <- igraph_to_networkD3(transmission_graph)

# Prepare nodes with tooltips
nodes_d3 <- transmission_d3$nodes
if(exists("pop")) {
  nodes_d3 <- nodes_d3 %>%
    left_join(
      pop %>% 
        mutate(name = as.character(id)) %>% 
        select(name, state, infect_time, genome),
      by = "name"
    )
} else {
  nodes_d3$state <- "Unknown"
  nodes_d3$infect_time <- NA
  nodes_d3$genome <- NA
}

# Create custom tooltips
nodes_d3$tooltip <- sprintf(
  "ID: %s<br>State: %s<br>Infection Time: %s<br>Genome: %s",
  nodes_d3$name,
  nodes_d3$state,
  ifelse(is.na(nodes_d3$infect_time), "N/A", nodes_d3$infect_time),
  ifelse(is.na(nodes_d3$genome), "N/A", substr(nodes_d3$genome, 1, 15))
)

# Create the force-directed network
forceNetwork(
  Links = transmission_d3$links,
  Nodes = nodes_d3,
  Source = 'source',
  Target = 'target',
  NodeID = 'name',
  Group = 'state',
  Nodesize = 5,
  opacity = 0.9,
  zoom = TRUE,
  fontSize = 14,
  linkDistance = 150,
  charge = -100,
  opacityNoHover = 0.5,
  legend = TRUE,
  bounded = TRUE,
  clickAction = 'alert("Node ID: " + d.name + "\\nGroup: " + d.state)'
)

and ...

library(visNetwork)
library(igraph)
library(tidyverse)

# 1. Create the graph object
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)

# 2. Prepare nodes data
# Get vertex names
node_names <- V(transmission_graph)$name

# Merge with pop data for additional information
nodes <- data.frame(
  id = node_names,
  label = node_names,
  title = NA,  # Will be filled with tooltip info
  stringsAsFactors = FALSE
)

# Join with pop data to get metadata for tooltips
if(exists("pop")) {
  nodes <- nodes %>%
    left_join(
      pop %>% 
        mutate(id = as.character(id)) %>% 
        select(id, state, infect_time, genome),
      by = "id"
    ) %>%
    mutate(
      title = paste0(
        "<b>Agent ID:</b> ", id, "<br>",
        "<b>State:</b> ", state, "<br>",
        "<b>Infection Time:</b> ", infect_time, "<br>",
        "<b>Genome:</b> ", ifelse(!is.na(genome), substr(genome, 1, 20), "N/A")
      )
    )
} else {
  # Fallback if pop doesn't exist
  nodes$title <- paste0("<b>Agent ID:</b> ", nodes$id)
}

# 3. Prepare edges data
edges_df <- igraph::as_data_frame(transmission_graph, what = "edges")
edges_vis <- data.frame(
  from = edges_df$from,
  to = edges_df$to,
  arrows = "to",
  color = list(color = "gray", opacity = 0.6),
  width = 1,
  smooth = TRUE
)

# 4. Optional: Add colors based on state or infection time
if("state" %in% colnames(nodes)) {
  # Create color mapping for different states
  unique_states <- unique(nodes$state)
  colors <- viridisLite::viridis(length(unique_states))
  color_map <- setNames(colors, unique_states)
  
  nodes$color <- sapply(nodes$state, function(s) {
    if(is.na(s)) "#808080" else color_map[s]
  })
} else if("infect_time" %in% colnames(nodes)) {
  # Color by infection time
  min_time <- min(nodes$infect_time, na.rm = TRUE)
  max_time <- max(nodes$infect_time, na.rm = TRUE)
  
  # Normalize infection time for color scale
  nodes$value <- nodes$infect_time
  nodes$color <- viridisLite::viridis(100)[
    cut(nodes$infect_time, 100, labels = FALSE)
  ]
} else {
  nodes$color <- "#2C3E50"  # Default color
}

# 5. Create the interactive network
visNetwork(nodes, edges_vis, 
           main = "Interactive Transmission Tree",
           submain = "Hover over nodes for details") %>%
  visNodes(
    shape = "dot",
    size = 15,
    borderWidth = 2,
    borderWidthSelected = 3,
    shadow = list(enabled = TRUE, size = 10)
  ) %>%
  visEdges(
    shadow = TRUE,
    smooth = list(type = "curvedCW", roundness = 0.1)
  ) %>%
  visOptions(
    highlightNearest = list(
      enabled = TRUE,
      degree = 1,
      hover = TRUE,
      algorithm = "hierarchical"
    ),
    nodesIdSelection = list(
      enabled = TRUE,
      values = nodes$id,
      style = 'width: 150px; height: 26px'
    )
  ) %>%
  visInteraction(
    hover = TRUE,
    hoverConnectedEdges = TRUE,
    tooltipDelay = 0,
    navigationButtons = TRUE,
    dragNodes = TRUE,
    dragView = TRUE,
    zoomView = TRUE
  ) %>%
  visPhysics(
    stabilization = TRUE,
    solver = "forceAtlas2Based",
    forceAtlas2Based = list(
      gravitationalConstant = -50,
      centralGravity = 0.01,
      springLength = 100,
      springConstant = 0.08,
      damping = 0.4,
      avoidOverlap = 1
    )
  ) %>%
  visLayout(randomSeed = 123)  # For reproducible layout

As well as ...

# Alternative using plotly's network graph
library(networkD3)

# Convert to D3 format
transmission_d3 <- igraph_to_networkD3(transmission_graph)

# Add node attributes
transmission_d3$nodes <- transmission_d3$nodes %>%
  left_join(
    pop %>% 
      mutate(name = as.character(id)) %>% 
      select(name, state, infect_time, genome),
    by = "name"
  ) %>%
  mutate(
    tooltip = paste0(
      "Agent ID: ", name, 
      "<br>State: ", state,
      "<br>Infection Time: ", infect_time,
      "<br>Genome: ", ifelse(!is.na(genome), substr(genome, 1, 10), "NA"), "..."
    )
  )

# Create force directed network
forceNetwork(
  Links = transmission_d3$links,
  Nodes = transmission_d3$nodes,
  Source = 'source',
  Target = 'target',
  NodeID = 'name',
  Group = 'state',
  Nodesize = 5,
  opacity = 0.8,
  zoom = TRUE,
  fontSize = 12,
  linkDistance = 100,
  charge = -30,
  opacityNoHover = 0.5,
  legend = TRUE
)

Also try to experiment with:

1. Examine Selection Pressure: Modify the calc_fitness() function to give a 50% transmission boost to any sequence containing the pattern GAA. How does this change the speed of the outbreak?

2. Are there Bottleneck Effects? Reduce the initially_infected' to 1 and observe how the genetic diversity of the final population changes compared to starting with 10 infected individuals.

3. Contact Tracing: Use the transmission_graph to identify Super-spreaders, i.e., nodes with high out-degree.

References

DSPA2 Appendix 26: Eidemiology
Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. 3rd ed. Lippincott, 2008.
Clayton D, Hills M. Statistical Models in Epidemiology. Oxford, 2013.
Ziegler A, König IR. A Statistical Approach to Genetic Epidemiology. Wiley, 2010.

SOCR Home page: https://socr.umich.edu

Translate this page:

(default)	Deutsch	Español	Français	Italiano	Português	日本語	България	الامارات العربية المتحدة	Suomi	इस भाषा में	Norge
한국어	中文	繁体中文	Русский	Nederlands	Ελληνικά	Hrvatska	Česká republika	Danmark	Polska	România	Sverige

@@ Line 1: / Line 1: @@
-<hr>
+== [[SMHS|Scientific Methods for Health Sciences]] - Epidemiology ==
-* SOCR Home page: http://www.socr.umich.edu
-==[[SMHS| Scientific Methods for Health Sciences]] - Epidemiology ==
-===Overview:===
+=== Overview ===
+Epidemiology is the scientific discipline that investigates the distribution, determinants, and control of health-related states or events (including diseases) in specified populations. It applies this knowledge to control health problems and improve public health outcomes. Historically, epidemiology originated from the study of infectious disease outbreaks, such as John Snow's investigation of the 1854 cholera epidemic in London, which linked contaminated water sources to disease spread. In modern times, the field has broadened to include non-infectious conditions like chronic diseases (e.g., cancer, diabetes), environmental exposures (e.g., air pollution, toxins), behavioral factors (e.g., smoking, diet), and genetic influences. This helps with understanding that health outcomes arise from complex interactions between genetic predispositions, environmental factors, and social determinants.
-After a general introduction to the filed of Epidemiology, students can have a basic idea of the language of Epidemiology. In this course, we want to identify and describe the population patterns of health-related risk factors and health-related outcomes in terms of persons, place and time. We are interested in exploring current major public health issues and try to identify and evaluate the main determinants of such public health issues (e.g. demographic, genetic, infectious, behavioral, and social). With all the concepts and methodologies of analysis in Epidemiology, application would be the next step. Here we examine and apply analytical approaches to data from different epidemiologic study designs (e.g., cross-sectional, cohort, randomized studies) and to critically appraise epidemiological findings.
+A core framework in epidemiology is the "epidemiologic triad" of agent, host, and environment, but contemporary approaches emphasize the "person, place, and time" dimensions:
+* '''Person''': Characteristics of individuals, such as age, sex, genetics, socioeconomic status, and behaviors that influence susceptibility or exposure.
+* '''Place''': Geographic variations, including urban vs. rural settings, climate, or access to healthcare, which can reveal environmental or social risk factors.
+* '''Time''': Temporal patterns, such as seasonal trends, secular changes over years, or sudden outbreaks, helping identify emerging threats or intervention effects.
-===Motivation:===
+This section delves into genetic epidemiology, bridging molecular biology with population-level analysis to identify risk factors and outcomes. It explores how genetic variations contribute to disease patterns and how computational tools can quantify these relationships.
-Goals of this course:
+=== Motivation ===
+This module equips learners with foundational knowledge in genetic epidemiology, enabling them to integrate genetic data into public health practice. By the end of this module, learners should be able to:
+* '''Understand Genetic Foundations''': Describe the structure and key features of the human genome, explain the types and distributions of mutations, and apply principles of Mendelian inheritance, including segregation (independent assortment of alleles) and linkage (genes on the same chromosome tending to be inherited together unless separated by recombination).
+* '''Analyze Population Dynamics''': Utilize quantitative genetic concepts to examine the interplay between genetic variation and phenotypic (disease) variation in populations, including calculations of allele and genotype frequencies and testing for Hardy-Weinberg equilibrium to detect deviations indicative of evolutionary pressures.
+* '''Evaluate Associations''': Identify common gene-disease relationships (e.g., monogenic vs. polygenic traits), interpret results from Genome-Wide Association Studies (GWAS) to pinpoint susceptibility loci, and recognize gene-environment interactions (e.g., how smoking exacerbates genetic risks for lung cancer).
+* '''Apply Computational Methods''': Conduct basic genetic association analyses using statistical software like R, interpret epidemiological measures such as Number Needed to Treat (NNT), Odds Ratio (OR), and Relative Risk (RR), and understand their implications for clinical and public health decision-making.
+* '''Additional Skills''': Critically evaluate study designs (e.g., cohort vs. case-control), account for confounders like population stratification in genetic studies, and discuss ethical considerations in genetic epidemiology, such as privacy in genomic data.
-*To understand basic features of the human genome and the distribution of mutations among individuals.
+These objectives align with real-world applications, such as designing targeted interventions (e.g., pharmacogenomics) or predicting disease outbreaks through genomic surveillance.
-*To understand the principles of segregation and linkage as they apply to human pedigree analysis and the identification of genetic variations associated with disease.
-*To learn population and quantitative genetic concepts that are necessary in order to study the relationship between genetic variation and disease variation in populations.
-*To learn about prototypical gene-disease relationships that are important to public health.
-*To understand the key issues in genetic testing in populations.
-*To understand the genetic complexity of common chronic disease.
-*To have a basic understanding of the importance and biological basis of epigenetic mechanisms, gene-environment interactions, and gene-gene interactions.
-===Theory===
+=== Theory: The Human Genome and Mutation ===
-====Public Heath Genetics====
+The human genome comprises approximately 3 billion base pairs of DNA, encoding around 20,000–25,000 genes that orchestrate cellular functions, development, and responses to the environment. ''Mutations'', changes in this DNA sequence, can arise spontaneously or from external factors (e.g., radiation, chemicals) and may lead to diseases if they disrupt gene function. Understanding genomic structure and mutations is crucial for identifying genetic risk factors in epidemiological studies.
-Current and potential applications of genome research include:
-*Molecular medicine: improved diagnosis of disease; earlier detection of genetic predispositions to disease; rational drug design; gene therapy and control systems for drugs; pharmacogenomics “custom drugs”.
-*Microbial genomics: new energy sources (biofuels); environmental monitoring to detect pollutants; protection from biological and chemical warfare; safe, efficient toxic waste cleanup.
-*Risk assessment: assess health damage and risks caused by radiation exposure, include low-dose exposures; assess health damage and risks caused by exposure to mutagenic chemicals and cancer causing toxins.
-*Bio-archaeology, anthropology, evolution, and human migration: study evolution through germline mutations in lineages; study migration of different population groups based on X chromosome or Y chromosome; compare breakpoints in the evolution of mutations with ages of populations and historical events.
-*DNA forensics (identification): identify potential suspects whose DNA may match evidence left at crime scenes; exonerate persons wrongly accused of crimes; identify crime and catastrophe victims; establish paternity and other family relationships; determine pedigree for seed or livestock breeds.
-*Agriculture, livestock breeding, and bioprocessing: more nutritious produce; Biopesticides; healthier, more productive, disease-resistant farm animals; new environmental cleanup uses for plants like tobacco.
-====The Human Genome and Mutation====
 <center>
-[[Image:SMHS_Epidem_Fig_1.png |650px]]
+[[Image:SMHS_Epidem_Fig_1.png |650px|thumb|Figure 1: Illustration of human chromosomal structure, highlighting key features like centromeres, telomeres, and banding patterns.]]
 </center>
-*Chromosomes are highly condensed DNA:
+==== Chromosomal Structure ====
-**Chromosomal banding pattern: condensed chromosomes can be stained to create the appearance of dark and light bands; dark bands represent regions rich in As and Ts; each band contains millions of DNA nucleotides; each chromosome has a unique banding pattern.
+Chromosomes are thread-like structures in the cell nucleus that package and organize DNA for efficient replication and segregation during cell division. Each human cell (except gametes) contains 46 chromosomes.
-**A human karyotype: depicts the entire chromosomal constitutions of a person; normal karyotypes have 46 chromosomes; we get 23 chromosomes from each parent (22 autosomes and 1 sex chromosome).
+* '''Banding''': Cytogenetic staining techniques (e.g., Giemsa staining) produce visible bands on chromosomes. Dark bands (G-bands) are AT-rich, heterochromatic regions with fewer genes, while light bands (R-bands) are GC-rich, euchromatic, and gene-dense. These patterns, spanning millions of nucleotides, aid in identifying chromosomal abnormalities in karyotyping.
-**Chromatin: composed of DNA and proteins that are associated with the chromosomes. (1) Euchromatin: lightly condensed DNA; gene rich, often actively transcribed. (2) Heterochromatin: highly condensed DNA; often composed of repetitive DNA elements; centromeres and telomeres.
+* '''Karyotype''': The complete set of chromosomes arranged by size and shape. A typical human karyotype includes 22 pairs of autosomes (numbered 1–22) and one pair of sex chromosomes (XX in females, XY in males). Abnormal karyotypes, such as trisomy 21 (Down syndrome), can be detected via techniques like fluorescence in situ hybridization (FISH).
-**Centromeres: large arrays of repeated DNA sequences; spindle fibers attach during mitosis to separate sister chromatids.
+* '''Functional Elements''':
-**Telomeres: arrays of repeated DNA sequences that are often thousands of bases in length; a “cap” at the end of chromosome to provide stability; due to the way that chromosomes replicated, telomeres, shorten with each cell division in human somatic cells.
+  * '''Centromeres''': Central constrictions composed of repetitive alpha-satellite DNA sequences. They serve as attachment points for spindle fibers during mitosis and meiosis, ensuring proper chromosome segregation. Centromeric dysfunction can lead to aneuploidy (abnormal chromosome numbers).
-**International system for human cytogenetic nomenclature: short arms of a chromosome are labeled; long arms are labeled; chromosome bands are labeled p11, p12, etc. like a zip code; the terminal ends of the chromosomes are labeled ter; where the arms meet in the middle is the centromere.
+  * '''Telomeres''': Protective caps at chromosome ends, consisting of repetitive TTAGGG sequences (in humans) bound by shelterin proteins. Telomeres shorten with each somatic cell division due to the "end-replication problem," contributing to cellular aging (senescence). In germ cells and stem cells, telomerase enzyme maintains telomere length. Shortened telomeres are linked to diseases like dyskeratosis congenita and increased cancer risk.
-**Genes are located on chromosomes: there are 45 bands on chromosome 5; chromosome 5 contains 1633 genes; chromosome 5 ~ 181000000 bases long; genes are referred to by their chromosomal location.
+  * '''Chromatin''': The complex of DNA and proteins (histones) that forms chromosomes. It exists in two states:
-**Chromosomal abnormalities: there are two types of abnormalities that can occur on a chromosomal level in humans: (1) structural abnormalities – missing, extra, or rearranged genetic material on one particular chromosome; (2) numerical abnormalities – deviations in the total number of chromosomes that an individual has.
+    * '''Euchromatin''': Loosely packed, transcriptionally active, and rich in genes and regulatory elements.
+    * '''Heterochromatin''': Tightly packed, transcriptionally silent, often containing repetitive DNA (e.g., satellites, transposons) near centromeres and telomeres. Epigenetic modifications (e.g., histone methylation) regulate chromatin states.
+==== Mutations and Abnormalities ====
+Mutations are heritable changes in the DNA sequence, occurring at a rate of about 10⁻⁸ per nucleotide per generation. They can be somatic (acquired in body cells, e.g., leading to cancer) or germline (in gametes, passed to offspring). Mutations drive genetic diversity but can cause disease if they affect critical genes.
-Changes in chromosome structure.
 <center>
-[[Image:SMHS_Epidemiology_Fig_2.png|400px]]
+[[Image:SMHS_Epidemiology_Fig_2.png|400px|thumb|Figure 2: Schematic of common structural chromosomal abnormalities, including deletion, duplication, translocation, and inversion.]]
 </center>
+* '''Structural Abnormalities''' (Large-scale changes visible under a microscope, affecting thousands to millions of base pairs):
+  * '''Deletion''': Removal of a chromosomal segment, leading to loss of genes (e.g., cri-du-chat syndrome from 5p deletion).
+  * '''Duplication''': Extra copy of a segment, potentially causing gene dosage imbalances (e.g., Charcot-Marie-Tooth disease from PMP22 duplication).
+  * '''Translocation''': Exchange of segments between non-homologous chromosomes. Balanced translocations may be asymptomatic but increase risks in offspring; unbalanced ones cause disorders (e.g., chronic myeloid leukemia from t(9;22) "Philadelphia chromosome").
+  * '''Inversion''': Reversal of a segment within a chromosome, which can disrupt genes or lead to abnormal recombination (e.g., hemophilia A inversions).
+  * '''Other Types''': Isochromosomes (duplicated arms, e.g., i(Xq) in Turner syndrome) or ring chromosomes (circular formations from deletions at both ends).
+* '''Point Mutations''' (Small-scale changes at the nucleotide level, detected via sequencing):
+  * '''Nucleotide Substitution''': Replacement of one base with another.
+    * '''Silent''': No amino acid change (due to codon degeneracy).
+    * '''Missense''': Amino acid change (e.g., sickle cell anemia from GAG to GTG in beta-globin gene).
+    * '''Nonsense''': Introduces premature stop codon, truncating the protein (e.g., cystic fibrosis).
+  * '''Indels''': Insertions or deletions of nucleotides. Small indels can cause frameshifts, altering the reading frame and producing dysfunctional proteins (e.g., Tay-Sachs disease).
+  * '''Splice Site Variation''': Mutations in intronic regions affecting mRNA splicing, leading to exon skipping or inclusion of introns (e.g., beta-thalassemia).
+* '''Epidemiological Relevance''': Mutations' population distribution informs disease prevalence. Rare mutations cause Mendelian disorders (e.g., Huntington's), while common variants (SNPs) contribute to complex traits via polygenic risk scores. Tools like next-generation sequencing (NGS) enable large-scale mutation detection in cohort studies.
+=== Population Genetics ===
+Population genetics examines how genetic variation is maintained, distributed, and evolves in groups, providing the mathematical foundation for genetic epidemiology. It helps predict disease risks based on allele frequencies and detect deviations signaling selection or population structure.
+==== Allele and Genotype Frequencies ====
+The '''gene pool''' is the total collection of alleles in a population at a given time. Frequencies are key metrics for assessing genetic diversity and disease susceptibility.
+* '''Allele Frequency''': Proportion of a specific allele at a locus. For a biallelic locus, frequencies sum to 1.
+<math>\text{Allele Frequency (p for A)} = \frac{2 \times \text{Number of AA} + \text{Number of Aa}}{2 \times \text{Total individuals}}.</math>
+:  Example: In a population of 100 people with genotypes: 40 AA, 50 Aa, 10 aa. p (A) = (2*40 + 50) / 200 = 0.65; q (a) = (2*10 + 50) / 200 = 0.35.
+* '''Genotype Frequency''': Proportion of individuals with a specific genotype (e.g., P(AA) = number of AA / total individuals).
+* '''Applications''': Used in GWAS to compare frequencies between cases and controls; deviations can indicate associations.
+==== Hardy-Weinberg Equilibrium (HWE) ====
+HWE is a null model assuming no evolutionary forces, predicting genotype frequencies from allele frequencies in a stable population. Assumptions: Large population, random mating, no mutation, no migration, no selection. For a biallelic locus with alleles A (p) and a (q=1-p):
+: <math>P(AA) = p^2\ (homozygous\ dominant)</math>
+: <math>P(Aa) = 2pq\ (heterozygous)</math>
+: <math>P(aa) = q^2\ (homozygous\ recessive)</math>
-* Deletion: 46, XY, del(6) (p16.3)    Terminal deletion with breakpoint at 6p16.3
+* '''Deviations from HWE''': Indicate forces like inbreeding (excess homozygotes), assortative mating, or genotyping errors. In epidemiology, HWE testing filters SNPs in controls to ensure data quality.
-* Duplication: 46, XX, dup(1) (q22q25)  Duplication of chromosome 1 region q22 to q25
-* Insertion: 46, XY, ins(2;5) (p13;q21q31)  An insertion of chromosome 5q21-31 into chromosome 2p13
-* Translocation: 46, XX, t(2;6) (q35;p21.3)  A balanced reciprocal translocation with breakpoints in 2q35 and 6p21.3
-* Inversion: 46, XY, inv(11) (p11p15)  An inversion on chromosome 11 with breakpoints at p11 and p15
-*Mutations: caused by changes in the DNA sequence; there are many different types of mutations; can happen in somatic cells or during development of gametes.
-* Types of mutations:
-:: Nucleotide substitutions, involve an alteration in the sequence but not the number of nucleotides (DNA bases) in a gene;
-:: Insertions & Deletions, involve an alteration in the number of nucleotides in a gene;
-:: Trinucleotide repeats, involves an alteration in the number of times that a certain sequence of three bases repeats itself;
-:: Splice Site Variation, involve an alteration in the non-coding region of a gene, which affects the way that parts of the gene sequence are combined to make RNA.
-* Results of mutations: mutations in exons may result in – misspelling of protein (missense), truncation of the protein (nonsense), no effect; mutations in introns may result in – no effect, altered regulations of gene expression, splice site variation.
-====Genes in Population====
+* '''Testing for HWE''': Use Chi-squared goodness-of-fit test.
-*Gene pool: all available genetic variation in a population; all potential mating combinations.
-*Alleles: the type of genetic variation seen at a particular location on a chromosome. (1) fictional form: big “A” allele, little “a” allele; (2) base pair form: T allele at basepair 71349562 on chromosome 2, C allele at basepair 71349562 on chromosome 2; (3) codon form: Arginine at codon 124, Glycine at codon 124.
-*Genotypes: we inherit one allele from our mother and one allele from our father to form our genotype. Variation in a single gene like AA, Aa or aa.
-*Haplotypes: it is the combination of alleles that an individual has at multiple sites along a chromosome.
-*Allele frequencies: the prevalence of a particular allele in a given population. Allele frequency = $\frac {Number\, of\, alleles\,}{2*(number\, of\, people\,)}$.
-*Genotype frequencies: prevalence of a particular genotype in a given population.
-**Haplotype frequencies: frequency that a haplotype occurs in different ethnic groups.
-====Hardy-Weinberg disequilibrium====
+: Formula:
-When genotype frequencies in a population differ from what would be predicted based on allele frequencies.
+<math>\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i},</math>
-*:Hardy-Weinberg Equilibrium (HWE): in a stable population with random mating, allelic frequencies typically predict genotype frequencies using the law of independent segregation. When allele frequencies can accurately be used to predict genotype frequencies in a population, the population is considered to be in a HWE.
+df=1 for biallelic loci.
-*:HWE: suppose a SNP that can only be A or C $(p_{A}+p_{C}=1)$, and probability of having A allele is $p_{A}$, C allele is $p_{C}$, then under HWE, the probability of AA genotype = $p_{A}^{2}$, probability of having CC = $p_{C}^{2}$, the probability of having AC = $2p_{A}p_{C}$.
-*:Steps to test HWE: (
+: Critical value: >3.84 for p<0.05 (reject HWE).
-:: estimate allele frequencies;
-:: calculate the expected relative genotype frequencies under HWE;
+: Example:
-:: calculate the expected number of people with each genotype;
+:: Observed: 400 AA, 500 Aa, 100 aa (total 1000). p=0.65, q=0.35.
-:: calculate the difference between observed and expected number of people with each genotype using $χ^2$ formula;
+:: Expected: 422.5 AA, 455 Aa, 122.5 aa.
-:: sum up the $χ^2$ components and compare the sum to statistical tables to see if there is significant deviation.
+:: χ² ≈ 10.5 (p<0.05, deviation).
+'''R Implementation for HWE:'''
+This code uses base R for manual calculation but adds input validation, exact test option (for small samples), and clearer output. For advanced use, consider the ''HardyWeinberg'' package.
+<pre>
+# Hardy-Weinberg Equilibrium Test
+# Inputs: Observed genotype counts as a named vector (AA, Aa, aa)
+obs <- c(AA = 400, Aa = 500, aa = 100)  # Example observed counts
+# Input validation
+if (any(obs < 0) || length(obs) != 3) stop("Invalid genotype counts")
+total <- sum(obs)
+if (total == 0) stop("No individuals in population")
+# Calculate allele frequencies
+p <- (2 * obs["AA"] + obs["Aa"]) / (2 * total)  # Frequency of A
+q <- 1 - p  # Frequency of a
+# Expected counts under HWE
+expected <- c(AA = p^2 * total, Aa = 2 * p * q * total, aa = q^2 * total)
+# Chi-squared test (ensure expected >5 for validity)
+if (any(expected < 5)) warning("Expected counts <5; consider exact test")
+chi2 <- sum((obs - expected)^2 / expected)
+df <- 1  # Degrees of freedom for biallelic
+p_val_chi <- pchisq(chi2, df = df, lower.tail = FALSE)
+# Optional: Fisher's exact test for small samples (using contingency table simulation)
+# But for simplicity, we use chi-squared here
+# Output results
+cat("Allele Frequencies: p(A) =", round(p, 3), "q(a) =", round(q, 3), "\n")
+cat("Expected Counts: AA =", round(expected["AA"], 1), "Aa =", round(expected["Aa"], 1), "aa =", round(expected["aa"], 1), "\n")
+cat("Chi-squared =", round(chi2, 4), "df =", df, "p-value =", format.pval(p_val_chi), "\n")
+if (p_val_chi < 0.05) {
+  cat("Reject HWE: Possible deviation due to selection, inbreeding, or error.\n")
+} else {
+  cat("Fail to reject HWE: Population appears in equilibrium.\n")
+}
+</pre>
+=== Pedigree Analysis and Inheritance ===
+Pedigrees are diagrammatic representations of family histories that illustrate the inheritance patterns of traits or diseases across generations. They are essential tools in genetic epidemiology for identifying modes of inheritance, estimating risks, and guiding genetic counseling. Symbols in pedigrees typically include squares for males, circles for females, filled shapes for affected individuals, and slashes for deceased persons. Arrows may indicate the proband (the individual through whom the family is ascertained). Pedigrees help visualize vertical (parent-to-child) or horizontal (sibling) transmission patterns, skips in generations, and sex-specific effects.
 <center>
-{|class="wikitable" style="text-align:center;width:50%" Border="1"
+[[Image:MSHS IntroEpi Fig 5 .png|600px|thumb|Figure 5: Example pedigree illustrating inheritance patterns, with symbols for affected individuals, carriers, and relationships across generations.]]
-|-
-| || Observed|| Expected ||  $X^{2}$ component
-|-
-|AA || $N_{AA}$ || $p^{2}(N_{Total})$ || $\frac{(O_{AA}- E_{AA})^{2}}{E_{AA}}$
-|-
-|Aa || $N_{AS}$ || $2pq^{2}(N_{Total})$ || $\frac{(O_{Aa}- E_{Aa})^{2}}{E_{Aa}}$
-|-
-|aa ||$N_{aa}$ || $q^{2}(N_{Total})$ || $\frac{(O_{aa}- E_{aa})^{2}}{E_{aa}}$
-|-
-| || $N_{Total}$ || $N_{Total}$ || Overall $X^{2}$ Statistics
-|-
-|}
 </center>
-$N_{AA}$ is the actual number of people in the population with AA genotype, $N_{Total}$ is the number of people in the population, $p^{2}(N_{Total})$ calculated expected number of people with AA genotype.
+==== Modes of Inheritance ====
-For $χ^{2}$, the null hypothesis $H_{0}$: the population is in HWE. For two alleles, if the overall $χ^{2}$ is less than or equal to 3.84, then p-value is greater than 0.05 and don’t reject $H_{0}$, population is in HWE; if $χ^{2}$ more than 3.84, p-value is less than 0.05, reject $H_{0}$, population is in HWD (disequilibrium).
+Genetic traits and diseases follow specific inheritance patterns based on the chromosomal location and dominance of alleles. Understanding these modes aids in predicting disease risks and designing targeted epidemiological studies. Below, we detail common modes, including characteristics, pedigree patterns, examples, and epidemiological implications.
+* '''Autosomal Dominant (AD)''':
+  * '''Definition''': A single copy of the dominant allele (D) is sufficient to express the phenotype. Affected individuals are heterozygous (Dd) or homozygous (DD), while dd are unaffected.
+  * '''Pedigree Characteristics''': Vertical transmission (affected individuals in every generation); no skipped generations if fully penetrant. Affects males and females equally; 50% risk to offspring of an affected parent.
+  * '''Example''': Huntington's disease (CAG repeat expansion in HTT gene on chromosome 4). Late-onset, neurodegenerative disorder with high penetrance.
+  * '''Epidemiological Notes''': Often rare; founder effects in isolated populations increase prevalence. Used in linkage studies to map genes.
+  * '''Limitations''': Incomplete penetrance or variable expressivity (different symptom severity) can mimic other patterns.
+* '''Autosomal Recessive (AR)''':
+  * '''Definition''': Requires two copies of the recessive allele (dd) for expression. Heterozygotes (Dd) are asymptomatic carriers.
+  * '''Pedigree Characteristics''': Horizontal transmission (affected siblings from unaffected parents); skips generations. Consanguinity (e.g., cousin marriages) increases risk; equal male-female incidence.
+  * '''Example''': Cystic fibrosis (mutations in CFTR gene on chromosome 7). Affects lung and digestive functions; carrier frequency ~1/25 in Caucasians.
+  * '''Epidemiological Notes''': Higher prevalence in populations with high carrier rates (e.g., due to heterozygote advantage, like sickle cell anemia in malaria-endemic areas). Screening programs target carriers.
+  * '''Limitations''': Phenocopies (environmental mimics) can complicate diagnosis.
+* '''X-Linked Recessive (XR)''':
+  * '''Definition''': Mutation on the X chromosome; males (X^c Y) express with one copy due to hemizygosity, while females (X^C X^c) are carriers (often unaffected due to X-inactivation).
+  * '''Pedigree Characteristics''': No male-to-male transmission; affected males pass to all daughters (carriers) but no sons. Mother-to-son transmission; more males affected.
+  * '''Example''': Hemophilia A (mutations in F8 gene on Xq28). Bleeding disorder; historical prevalence in European royalty.
+  * '''Epidemiological Notes''': Skewed sex ratios in affected individuals; carrier testing in families. Lyonization (random X-inactivation) can cause mild symptoms in females.
+  * '''Limitations''': De novo mutations can appear without family history.
+* '''Additional Modes''':
+  * '''X-Linked Dominant (XD)''': Rare; affects females more (e.g., Rett syndrome, MECP2 gene). Lethal in males or mild; female-to-offspring transmission.
+  * '''Y-Linked''': Male-only transmission (e.g., azoospermia factors on Y chromosome). Rare, as Y has few genes.
+  * '''Mitochondrial''': Maternal inheritance (mtDNA from mother); affects both sexes but no father-to-child. Variable due to heteroplasmy (mixed mtDNA). Example: Leber's hereditary optic neuropathy.
+==== Probability in Pedigrees ====
+Probabilistic models quantify inheritance risks, incorporating prior probabilities, transmission, and penetrance. This is crucial for [[AP_Statistics_Curriculum_2007_Prob_Rules | Bayesian]] risk assessment in genetic counseling.
+* '''Key Formula''': The likelihood of observing a pedigree under a genetic model is:
+  <math>P(\text{pedigree}) = \prod_{i=1}^{n} P(\text{genotype}_i) \times P(\text{phenotype}_i | \text{genotype}_i),</math>
+:where the product is over all individuals, <math>P(\text{genotype}_i)</math> is based on Mendelian laws and parental genotypes, and <math>P(\text{phenotype}_i | \text{genotype}_i)</math> accounts for penetrance.
+* '''Penetrance''': Probability of phenotypic expression given a genotype (e.g., 100% for complete, <100% for incomplete). Incomplete penetrance (e.g., BRCA1 mutations in breast cancer) leads to unaffected gene carriers.
+* '''Variable Expressivity''': Variation in phenotype severity among those with the same genotype (e.g., neurofibromatosis type 1).
+* '''Example''': In an AD pedigree, risk to a child of an affected parent (Dd) is 50% (Dd) × penetrance.
+* '''Applications''': Risk prediction software like BRCAPRO uses these models. In epidemiology, adjusts for ascertainment bias (families selected via affected probands).
+* '''Limitations''': Assumes accurate family history; ignores modifiers like environment.
+=== Linkage Analysis ===
+Linkage analysis identifies genes by studying co-inheritance of markers and traits in families, leveraging the fact that nearby loci on a chromosome are less likely to separate during meiosis.
+* '''Genetic Linkage''': Occurs when genes are close on the same chromosome, violating independent assortment.
+* '''Recombination Fraction (θ)''': Proportion of gametes with recombination between two loci.
+: <math>\theta = 0.5</math>: No linkage (independent, >50 cM apart).
+: <math>\theta < 0.5</math>: Linkage (e.g., θ=0.1 means 10% recombination).
+: <math>\theta = 0</math>: Complete linkage (no recombination, syntenic loci).
+: Units: Measured in centimorgans (cM); 1 cM ≈ 1% recombination ≈ 1 Mb DNA.
+* '''Phases''': Coupling (alleles on same chromosome) vs. repulsion (opposite).
+* '''Applications''': Parametric (assumes model) for Mendelian diseases; non-parametric for complex traits. Historical in positional cloning (e.g., cystic fibrosis gene).
+==== LOD Score ====
+The LOD (Logarithm of Odds) score statistically tests for linkage by comparing likelihoods.
+* '''Formula''': <math>Z(\theta) = \log_{10} \frac{L(\text{data} | \theta=\hat{\theta})}{L(\text{data} | \theta=0.5)}</math>, where \(\hat{\theta}\) is the maximum likelihood estimate of θ, and L is the likelihood function.
+* '''Calculation''': Involves summing over possible phases and recombinants in pedigrees.
+* '''Interpretation''': Positive Z favors linkage; threshold Z≥3 for significance (false positive rate 5%); Z≤-2 excludes linkage.
+  {|class="wikitable" style="text-align:center;width:50%" Border="1"
+  |-
+  ! LOD Score !! Odds Ratio !! Interpretation
+  |-
+  | Z ≤ -2 || ≤1:100 || Strong evidence against linkage (exclusion)
+  |-
+  | -2 < Z < 3 || Indeterminate || Suggestive but not conclusive
+  |-
+  | Z ≥ 3 || ≥1000:1 || Strong evidence for linkage (genome-wide significance)
+  |}
+* '''Example''': In a family with a disease and marker, if data is 1000 times more likely under θ=0.1 than 0.5, Z=3.
+* '''Limitations''': Requires large pedigrees; sensitive to model misspecification (e.g., wrong penetrance). Superseded by GWAS for complex diseases.
-====Pedigree Analysis and Probability in Genetics====
+=== Linkage Disequilibrium (LD) and Association ===
-*Mendel’s Law of Segregation: organism carry two copies of each genetic factor; there is segregation of parental factors during gamete formation; each gamete receives one genetic factor from each parent.
+LD refers to non-random association of alleles at different loci in a population, often due to shared ancestry rather than physical linkage. It decays over generations via recombination and is key in association studies like GWAS, where markers tag causal variants.
-<center>
-[[Image:MSHS IntroEpi Fig 4 .png|400px]]
+* '''Causes of LD''': Mutation, selection, genetic drift, population admixture, or bottlenecks. High LD in isolated populations (e.g., Ashkenazi Jews).
-</center>
+* '''Decay''': LD decreases with genetic distance and time; measured in haplotype blocks.
+* '''Vs. Linkage''': Linkage is family-based (meiotic); LD is population-based (historical).
+* '''Applications''': Imputation in GWAS; fine-mapping; inferring evolutionary history.
+==== Measures of LD ====
+Common metrics quantify deviation from independence. Assume two biallelic loci: A/a (frequencies p_A, 1-p_A) and B/b (p_B, 1-p_B); haplotype frequency p_AB.
+* '''D (Disequilibrium Coefficient)''':
+: Formula: <math>D_{AB} = p_{AB} - p_A p_B.</math>
+: Interpretation: D>0: Excess AB haplotypes; D<0: Deficit. D=0: Equilibrium.
+: Limitations: Sensitive to allele frequencies; not comparable across loci.
+* '''D' (Normalized D)''':
+: Formula: if <math>D<0</math>, <math>D' = \frac{D}{\max(-p_A p_B, -(1-p_A)(1-p_B))}</math>
+::: if  <math>D>0</math>, <math>D' = \frac{D}{\min((1-p_A)p_B, p_A(1-p_B))}</math>.
+: Interpretation: Ranges -1 to 1; |D'|=1: Complete LD (no recombination evidence); |D'|<1: Partial.
+: Use: Detects historical recombination.
+* '''r² (Correlation Coefficient)''':
+: Formula: <math>r^2 = \frac{D^2}{p_A (1-p_A) p_B (1-p_B)}</math>.
+: Interpretation: 0 to 1; r²=1: Perfect correlation (one SNP predicts another); r²>0.8: Strong proxy. Relates to power in association studies (effective sample size reduced by 1/r²).
+: Use: Preferred in GWAS for tag SNP selection; less biased by rare alleles.
+==== Additional Considerations ====
+* '''Haplotypes''': Combinations of alleles (e.g., estimated via EM algorithm in PHASE software).
+* '''Visualization''': LD plots (triangular heatmaps) using tools like Haploview.
+* '''Best Practices''': Account for population structure (e.g., via principal components) to avoid spurious LD. In epidemiology, LD informs polygenic risk scores.
+* '''Limitations''': LD varies by ancestry (e.g., higher in Europeans than Africans); tagging fails for rare variants.
+'''R Implementation for LD:'''
+This code computes D, D', and r² from example haplotype frequencies or counts. It uses base R for simplicity; for real data, consider packages like ''genetics'' or ''snpStats''.
+<pre>
+# Linkage Disequilibrium Measures
+# Input: Haplotype counts (e.g., from population data)
+# Assume biallelic loci A/a, B/b
+count_AB <- 120  # AB haplotypes
+count_Ab <- 80   # Ab
+count_aB <- 60   # aB
+count_ab <- 140  # ab
+total <- sum(c(count_AB, count_Ab, count_aB, count_ab))
+# Haplotype frequencies
+p_AB <- count_AB / total
+p_Ab <- count_Ab / total
+p_aB <- count_aB / total
+p_ab <- count_ab / total
-====Human Pedigree Nomenclature====
+# Allele frequencies
-<center>
+p_A <- p_AB + p_Ab
-[[Image:MSHS IntroEpi Fig 5 .png|600px]]
+p_a <- 1 - p_A
-</center>
+p_B <- p_AB + p_aB
+p_b <- 1 - p_B
+# D
+D <- p_AB - p_A * p_B
-====Modes of Inheritance====
+# D' (normalized)
-*Autosomal dominant: individuals that inherit the dominant disease allele, D will develop the disease. Homozygous (DD: affected), Heterozygous (Dd: affected), Homozygous (dd: normal).
+D_max_pos <- min(p_A * p_b, p_a * p_B)
-'''Traits:''' every affected individuals has an affected parent; there is 50% chance that each affect parent will transmit the trait to any child; the trait is expressed in both males and females is roughly equal numbers; two affected individuals may have unaffected children; the phenotype is often more severe in homozygous affected individuals.
+D_max_neg <- -min(p_A * p_B, p_a * p_b)
-*Autosomal recessive: individuals that inherit two copies of the recessive disease allele, d, will develop the disease. Homozygous (DD: normal), Heterozygous (Dd: normal), Homozygous (dd: affected).
+D_prime <- ifelse(D >= 0, D / D_max_pos, D / D_max_neg)
-====Traits====
+# r-squared
-For rare traits, most affected individuals are the children of unaffected parents; all of the children of two affected individuals are affected; the risk of an affected child from a mating a two heterozygotes is 25%; the trait is expressed in both males and females; for rare traits, the unaffected parents of an affected individual may be related to each other.
+r_sq <- D^2 / (p_A * p_a * p_B * p_b)
-*Sex-linked dominant (X-lined dominant): $X^{C} X^{C}$ affected female, $X^{C}$ Y affected male, $X^{C} X^{c}$ affected female, $X^{c}$ Y normal male, $X^{c} X^{c}$ normal female.
-*Sex-linked recessive (X-linked recessive): female that carry a recessive mutation $X^{C}$ will have affected male children. $X^{C} X^{C}$ normal female, $X^{C}$ Y normal male, $X^{C} X^{c}$ carrier female, $X^{c}$ Y affected male, $X^{c}X^{c}$ affected female.
-'''Traits:''' there is no male to male transmission; there is mother to son transmission; female can be homozygous and have the trait. Examples may be color blindness.
-*Probability in Pedigree Analysis: using Mendel’s Laws, we can estimate the probabilities of an offspring’s genotype if we know (or assume) a mode of inheritance; under Hardy-Weinberg, we can estimate genotype probabilities for parents.
+# Output
-**Steps: (1) choose a mode of inheritance; (2) establish the penetrance of each genotype under that mode of inheritance; (3) determine the potential genotypes of each person under that mode of inheritance; (4) determine the founder genotype probabilities (parent generation); (5) determine the transmission probabilities (offspring generation); (6) calculate the probabilities of each pedigree member given their phenotype, their genotype, and the penetrance of the disease ''P(member)=P(phenotype and genotype)=P(genotype)*P(phenotype|genotype);'' (7) calculate the total probability of the pedigree ''P(pedigree)''$=∏_{i=1}^{n}$''P(genotype)*P(phenotype|genotype)'', n is number of people in the pedigree.
+cat("Allele Frequencies: p_A =", round(p_A, 3), "p_B =", round(p_B, 3), "\n")
+cat("D =", round(D, 4), "\n")
+cat("D' =", round(D_prime, 4), "\n")
+cat("r-squared =", round(r_sq, 4), "\n")
+if (abs(D_prime) == 1) cat("Complete LD detected.\n") else cat("Partial or no LD.\n")
+</pre>
-In step 2, there can be complete penetrance or incomplete penetrance. With complete penetrance, individuals’ phenotype will always match their genotype.
-If a genotype has incomplete penetrance, some individuals with the ‘affected’ genotype will not exhibit the ‘affected’ phenotype. This happens often when the development of the phenotype is controlled by more than one gene or is modified by environmental factors.
-*Founder effect: in small populations, rare recessive alleles present in a member of the original group of settlers is transmitted through successive generations; population expands and remains geographically and culturally isolated. After ~10 generations, children with recessive disease begin to appear, inbreeding is not usually a significant feature of the population.
+==== Genome-Wide Association Studies (GWAS) ====
-Consanguineous mating: inbreeding (consanguinity) = mating between genetically related individuals; degree of inbreeding based on the probability that an individual will inherit two alleles that came from a common ancestor; homozygousity due to inheritance of alleles that are “Identical by Descent” (IBD).
-====Linkage Analysis====
+GWAS tests for correlation between genetic markers (SNPs) and phenotypes across the entire genome in unrelated individuals.
-Genetic linkage refers to the study of the order of genes on chromosomes; distance between genes (aka genetic distance).
-*Recombination fraction: a measure of distance between genes; alleles that are physically very close to one another on a chromosome tend not be separated by recombination as often as alleles that are physically far from one another. The symbol $\theta$ is used to show the probability that the alleles of two genes will recombine during gamete formation, it equals to the proportion of gametes that are recombinant = probability of recombination = recombination fraction.
-*When two loci are inherited independently of each other, recombinants and non-recombinants are found in equal proportions in the offspring: $\theta$=0.5.
-*When two loci are inherited together because of chromosome location, there are more non-recombinants than recombinants in the offspring: $0\le\theta\le0.5$.
-*$\theta=0.5$ implies no linkage; $\theta=0$ implies complete linkage; $0<\theta<0.5$ implies linkage.
-====Parametric Linkage Analysis====
+* Statistical Model: Typically uses logistic regression for case-control studies.
-Parametric Linkage Analysis involves
+: <math>\ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 \cdot \text{SNP} + \text{Covariates}.</math>
-: (1) collecting pedigree data with many meiotic events (need multiple generations or many children);
-: (2) making assumptions about how the disease is inherited (single locus vs. multilocus; dominant vs. recessive; penetrance; allele frequencies of disease susceptibility locus);
-: (3) can be done with phase known (know how the alleles are distributed on parental chromosomes) or phase unknown (know which alleles come from which parent, but not how they are distributed on the chromosomes) pedigrees.
-*For phase known pedigree: recombination fraction $\theta$=$\frac{\#\,recombinants}{\#\,informative\,meiosis'}$, where informative meiosis = parental gamete formation that provides information about recombination between two loci.
+* Manhattan & QQ Plots: Used to visualize results. Because millions of tests are performed, strict significance thresholds (e.g., <math>p < 5 \times 10^{-8}</math>) are required to avoid false positives.
-*For phase unknown pedigree: parent haplotypes are not known, $\theta$ cannot be estimated directly because there is no way to tell whether the offspring are recombinant or non-recombinant.
-*Estimation of $\theta$ by the Maximum Likelihood Method: estimate $\theta$ for any pedigree by using MLE. This equation takes the data in the pedigree and asks the question, what $\theta$ would result in the largest L($\theta$) given the number of recombinant and non-recombinant gametes? L($\theta)=c(1-\theta)^{n-k}\theta^{k}$, where c is a constant, n is the number of informative meioses, k is the number of recombinant meioses, n-k is the number of non-recombinant
-*$\ln \big (L(\theta)\big )= \ln(c)+(n-k)\ln(1-\theta)+k\ln (\theta)$, $\frac{∂lnL\theta}{∂\theta}=\frac {-n-k} {1-\theta}+\frac{k}{\theta}$=0,=$\hat {\theta}$ = $\frac{k}{n}$.
+=== Gene-Environment Interactions ===
-*Logarithm (base 10) Of Odds (LOD) score: The LOD score compares the likelihood of observing the test data if the two loci are indeed linked, to the likelihood of observing the same data purely by chance. LOD>0 indicate the presence of linkage, LOD<0 indicate that linkage is less likely.
+Disease risk is often modeled as a combination of genetics (<math>G</math>), environment (<math>E</math>), and their interaction (<math>G \times E</math>).
+: <math>Y = \beta_0 + \beta_1 G + \beta_2 E + \beta_3 (G \times E) + \epsilon.</math>
-In linkage analysis, the two hypotheses are  $H_{0}$: no linkage, $\theta$=0.5; $H_{1}$linkage, $\theta=\hat{\theta}$.LOD score Z($\theta$)=$log_{10}\frac{L(\theta=\hat{\theta})}{L(\theta=\frac{1}{2})}$,where $L(\theta=\hat{\theta}$ is the likelihood equation when $\theta=\frac{1}{2}$.
+Interaction Models:
+* Synergistic: Genotype exacerbates the risk factor (or vice versa).
+* Independent: Both factors influence risk but do not interact.
 <center>
-{|class="wikitable" style="text-align:center;width:50%" Border="1"
+[[Image:SMHS Epi Figure 11.png| 500 px]]
-|-
+<br>''Model: Genotype exacerbates the effect of the risk factor''
-|   $\theta$     || LOD score
-|-
-| $\theta>0$   || $Z (\theta) = nlog(2) + k * log(\theta) + (N - k)log (1-\theta)$
-|-
-| $\theta = 0$and k = 0  || $Z (\theta) = nlog(2)$
-|-
-| $\theta = 0$ and k > 0  || $Z(\theta) =-\infty$
-|-
-|}
 </center>
-To test one hypothesis on multiple pedigrees, add the LOD scores of each individual pedigree to determine a final LOD score:
+=== Core Epidemiological Measures ===
-$Z(\theta)=\sum Z_{i}(\theta)$ for i=1,…,''n'' pedigrees
+In addition to genetic metrics, standard epidemiological measures provide essential tools for assessing disease risk, evaluating interventions, and guiding public health decisions. These metrics help quantify associations between exposures and outcomes, estimate treatment effects, and inform policy. Below, we outline key measures, including definitions, formulas, interpretations, and examples. Where relevant, we include R code implementations for practical computation.
+==== Absolute Risk Reduction (ARR) ====
+* '''Definition''': The difference in event rates (incidences) between a control (or unexposed) group and a treatment (or exposed) group. ARR measures the absolute effect of an intervention or exposure on outcome risk.
+* '''Formula''': <math>ARR = I_{\text{control}} - I_{\text{treatment}}</math>, where \(I\) represents incidence (proportion of events).
+* '''Interpretation''': A positive ARR indicates risk reduction (benefit); a negative ARR indicates increased risk (harm). It is straightforward but does not account for baseline risk.
+* '''Example''': If the incidence of heart attacks is 10% in the control group and 7% in the treatment group, ARR = 0.10 - 0.07 = 0.03 (3% absolute reduction).
+* '''When to Use''': Prospective studies like randomized controlled trials (RCTs); useful for communicating tangible benefits to patients.
+* '''Limitations''': Sensitive to baseline risk; not ideal for comparing across populations with different event rates.
-$\theta$ is the value of $\theta$ that maximizes L($\theta$) = MLE.
+==== Relative Risk (RR) ====
-LOD scores correspond to the following odds:
+* '''Definition''': The ratio of the incidence of an outcome in the exposed group to that in the unexposed group. RR assesses how much an exposure increases or decreases the probability of an event.
+* '''Formula''': <math>RR = \frac{I_{\text{exposed}}}{I_{\text{unexposed}}}</math>.
+* '''Interpretation''': RR > 1 indicates increased risk due to exposure; RR < 1 indicates protective effect; RR = 1 indicates no association. It is multiplicative and accounts for baseline risk.
+* '''Example''': In a cohort study, smokers have a 20% lung cancer incidence, while non-smokers have 2%. RR = 0.20 / 0.02 = 10 (smokers are 10 times more likely to develop lung cancer).
+* '''When to Use''': Cohort studies or RCTs; preferred for common outcomes.
+* '''Limitations''': Can overestimate associations for rare events; not applicable in case-control studies.
-<center>
+==== Odds Ratio (OR) ====
-{|class="wikitable" style="text-align:center;width:50%" Border="1"
+A 2×2 contingency table with odds ratio formula.
+{| class="wikitable" style="text-align:center; width:60%;"
+|+ '''2 × 2 Contingency Table for Odds Ratio'''
+!
+! Exposed
+! Unexposed
+! Total
 |-
-|$Z(\theta)=-2$  || 100:1 odds against linkage, significantly in favor of no linkages
+! Cases (Diseased)
+| '''a'''
+| '''c'''
+| a + c
 |-
-|$Z(\theta)=-1$  ||  10:1 odds against linkage
+! Controls (Healthy)
+| '''b'''
+| '''d'''
+| b + d
 |-
-|$Z(\theta)=0$   || Not informative
+! Total
+| a + b
+| c + d
+| '''N'''
 |-
-|$Z(\theta)=+1$  || 10:1 odds in favor of linkage
+| colspan="4" style="text-align:left; background:#f9f9f9;" |
-|-
+'''Odds Ratio (OR) Calculation:'''
-|$Z(\theta)=+2$  || 100:1 odds in favor of linkage
+$$OR = \frac{a / c}{b / d} = \frac{ad}{bc}$$
-|-
+|}
-|$Z(\theta)=+3$  || 1000:1 odds in favor of linkage, significantly in favor of linkage
-|}
-</center>
+Formula: <math>OR = \frac{ad}{bc}</math>
+: Odds Ratio = [Exposed Cases × Unexposed Non-cases] / [Exposed Non-cases × Unexposed Cases].
+* '''Definition''': The ratio of the odds of an outcome in the exposed group to the odds in the unexposed group. OR approximates RR when the outcome is rare.
+* '''Formula''': From a 2x2 contingency table (a = exposed cases, b = exposed non-cases, c = unexposed cases, d = unexposed non-cases): <math>OR = \frac{ad}{bc}</math>.
+* '''Interpretation''': OR > 1 suggests positive association; OR < 1 suggests inverse association; OR = 1 suggests no association. It is often used in logistic regression.
+* '''Example''': In a case-control study of diabetes and obesity: 80 obese diabetics (a), 20 obese non-diabetics (b), 30 non-obese diabetics (c), 70 non-obese non-diabetics (d). OR = (80*70) / (20*30) = 9.33 (obesity increases odds of diabetes by over 9 times).
+* '''When to Use''': Case-control studies or when incidence data is unavailable; common in meta-analyses.
+* '''Limitations''': Not directly interpretable as risk for common outcomes; can differ from RR if events are frequent.
+==== Number Needed to Treat (NNT) or Harm (NNH) ====
+* '''Definition''': The average number of patients who need to be treated (or exposed) to prevent (or cause) one additional outcome compared to the control. NNT is based on ARR and translates statistical effects into clinical relevance.
+* '''Formula''': <math>NNT = \frac{1}{|ARR|}</math> (use absolute value for magnitude; sign of ARR determines benefit vs. harm).
+* '''Interpretation''': Lower NNT indicates greater treatment efficacy. If ARR > 0, it's NNT (benefit); if ARR < 0, it's NNH (harm). Infinite NNT means no effect.
+* '''Example (Benefit)''': ARR = 0.03 (as above), NNT = 1 / 0.03 ≈ 33.3 (treat 33 patients to prevent one heart attack).
+* '''Example (Harm)''': If treatment increases incidence from 50% to 80%, ARR = 0.50 - 0.80 = -0.30, NNH = 1 / 0.30 ≈ 3.3 (treat 3 patients to cause one additional bad outcome).
+* '''When to Use''': RCTs or systematic reviews; helps in shared decision-making and cost-benefit analysis.
+* '''Limitations''': Assumes constant ARR; sensitive to time frame and baseline risk. Confidence intervals should be reported for real-world application.
+==== Additional Considerations ====
+* '''Confidence Intervals (CI)''': Always compute 95% CIs for these measures to assess precision (e.g., using bootstrap methods or formulas in R packages like ''epitools'' or ''epiR'').
+* '''Attributable Risk (AR)''': Extends RR; AR = \(I_{\text{exposed}} - I_{\text{unexposed}}\) (absolute risk due to exposure).
+* '''Population Attributable Risk (PAR)''': PAR = \(I_{\text{population}} - I_{\text{unexposed}}\) (proportion of cases attributable to exposure in the population).
+* '''Best Practices''': Adjust for confounders using multivariable models; interpret in context (e.g., RR may seem large for rare events but have small absolute impact).
+* '''Software Tools''': R (with packages like ''epiR'', ''Epi'', or ''survival'' for advanced metrics like Hazard Ratios) or Python (with ''scipy'' or ''lifelines'') are commonly used.
+'''R Implementation for Key Measures:'''
+This code snippet computes ARR, RR, OR, NNT/NNH from example data. It includes error handling and supports both benefit and harm scenarios.
+<pre>
+# Install if needed: install.packages("epiR")  # But assuming it's available or use base R
+# Example data: 2x2 table for OR/RR (cohort study assumption)
+# Rows: Exposed (1) vs Unexposed (0); Columns: Cases vs Non-cases
+a <- 20  # Exposed cases
+b <- 80  # Exposed non-cases
+c <- 2   # Unexposed cases
+d <- 98  # Unexposed non-cases
+# Incidences
+I_exposed <- a / (a + b)
+I_unexposed <- c / (c + d)
+# Absolute Risk Reduction (assuming unexposed = control, exposed = treatment)
+ARR <- I_unexposed - I_exposed  # Positive if treatment reduces risk
+# Relative Risk
+RR <- I_exposed / I_unexposed
+# Odds Ratio
+OR <- (a * d) / (b * c)
+# Number Needed to Treat/Harm
+NNT <- ifelse(ARR != 0, 1 / abs(ARR), Inf)
+type <- ifelse(ARR > 0, "NNT (Benefit)", ifelse(ARR < 0, "NNH (Harm)", "No Effect"))
+# Output
+cat("Incidence Exposed:", round(I_exposed, 3), "\n")
+cat("Incidence Unexposed:", round(I_unexposed, 3), "\n")
+cat("ARR:", round(ARR, 3), "\n")
+cat("RR:", round(RR, 3), "\n")
+cat("OR:", round(OR, 3), "\n")
+cat(type, ":", ifelse(is.finite(NNT), round(NNT, 1), "Infinite"), "\n")
+# Harm example (swap for treatment increasing risk)
+I_control <- 0.50
+I_treatment <- 0.80
+ARR_harm <- I_control - I_treatment
+NNT_harm <- 1 / abs(ARR_harm)
+cat("\nHarm Example - ARR:", round(ARR_harm, 3), "\n")
+cat("NNH:", round(NNT_harm, 1), "\n")
+</pre>
--2 < LOD < 3 provides only weak (non-significant) evidence for or against linkage. LOD scores vary with $\theta$: calculate LOD scores for a range of $\theta's$, find one that maximizes Z($\theta$); vary with data: each pedigree gives different information; are additive across independent pedigrees: sum data from all pedigrees to get final Z score.
+The following code uses a custom function to handle both ''benefit'' (reduction in risk) and ''harm'' (increase in risk) scenarios dynamically.
-====Example 1====
+<pre>
-Suppose the father and mother are both Dd (dd is the recessive affected individual, Dd the heterozygous carrier individual, and DD the homozygous normal individual). The table below shows the Mendelian ration of $\frac{3}{4}$ normal to $\frac{1}{4}$ affected. For most ''autosomal recessive diseases'', the heterozygote cannot be distinguished from the normal homozygote. In the normal phenotype categories of offspring (Dd and DD produce the same normal phenotype), two of the three are heterozygotes (carriers); one of the three is homozygous normal
+# Function to calculate Epidemiological Metrics
+calculate_epi_stats <- function(a, b, c, d) {
+# Input validation
+if (any(c(a, b, c, d) < 0)) stop("Cell counts must be non-negative")
-<center>
+# Basic Incidences
-{|class="wikitable" style="text-align:center;width:50%" Border="1"
+i_exp <- a / (a + b)
-|-
+i_unexp <- c / (c + d)
-| || '''Father'''
-|-
-|'''Mother''' || [[Image:SMHS_Epi_Figure_6.png|200px]]
-|-
-|}
-</center>
-<center>
+# Absolute Risk Reduction (ARR)
-[[Image:SMHS_Epi_Figure_7.png|400px]]
+# Defined as Control Risk - Treated Risk
-</center>
+arr <- i_unexp - i_exp
-This pedigree example illustrates autosomal recessive inheritance. I-1 and I-2 are unrelated but produced an affected offspring (affected offspring have normal parents). By chance, they both must have been carriers. Even though II-2 is affected, she produced no affected offspring (i.e., the phenotype appears in siblings, not parents). As the probable genotype for an outside individual (II-1) is homozygous normal, III-1, III-2 and III-3 must be carriers (heterozygotes). They are not affected but could only have inherited the recessive gene from II-2. Next, II-3, II-5, and II-6 each have a $\frac{2}{3}$ chance of being a carrier and a $\frac{1}{3}$ chance of being homozygous normal. They are not affected, but they are carrier*carrier offsprings. Like I-1, II-4 and II-7 have a high probability of being homozygous normal as they are outside the family. III-4, III-5, III-6, III-7, III-8, and III-9 all have a $\frac{1}{3}$ chance of being carriers and a $\frac{2}{3}$ chance of being homozygous normal. One parent of each is probably homozygous normal, the other has a $\frac{2}{3}$ chance of being a carrier and a 1 in 2 chance of passing on the recessive allele if they were a carrier.
+# Relative Risk (RR) and Odds Ratio (OR)
+# Adding small constant (0.5) if zero to avoid division by zero errors
+rr <- i_exp / i_unexp
+or <- (a * d) / (b * c)
+# NNT / NNH Logic
+nnt_val <- if (arr != 0) 1 / abs(arr) else Inf
+nnt_label <- ifelse(arr > 0, "NNT (Benefit)",
+ifelse(arr < 0, "NNH (Harm)", "No Effect"))
-====Example 2====
+# Return formatted list
-Linkage mapping using pedigrees is the disease linked to the marker given the pedigree below. (Dominant inheritance)
+return(list(
+Incidence_Exposed = round(i_exp, 4),
+Incidence_Unexposed = round(i_unexp, 4),
+ARR = round(arr, 4),
+RR = round(rr, 2),
+OR = round(or, 2),
+Measure = nnt_label,
+Value = round(nnt_val, 1)
+))
+}
-<center>
+# --- Example 1: Beneficial Treatment ---
-[[Image:SMHS_Epi_Figure_8.png|250px]]
+# (e.g., Vaccine trial: Exposed = Vaccinated)
-</center>
-*Two problems: (1) we don’t know the phase, even if the genes are linked, we don’t know arrangement of alleles (cis or trans) on the chromosomes in Dad: D1 d2 or D2 d1. Solution: take the average of the likelihoods of linkage: L($\theta)=\frac{1}{2}L(\theta|phase 1)+\frac {1}{2}L(\theta|phase 2)$; (2) how can we compare the probability of linkage to the probability of no linkage. Solution: take the ratio (i.e. odds) of the likelihood of linkage [L($\theta$)=L(MLE of $\theta$)] versus the likelihood of no linkage [L($\theta$)=L($\theta$=0.5)].
+result_benefit <- calculate_epi_stats(a=20, b=80, c=40, d=60)
+print(result_benefit)
-*Calculating LOD scores: (1) if the phase is D1 d2, then there are 4 non-recombinants and 1 recombinant, L($\theta$)=(1-$\theta)^{4}\theta$; (2) if the phase is D2 d1, then there are 4 recombinants and 1 non-recombinant, L($\theta)=\theta^{4}(1-\theta)$.
+# --- Example 2: Harmful Exposure ---
+# (e.g., Smoking vs Lung Cancer)
+result_harm <- calculate_epi_stats(a=50, b=50, c=10, d=90)
+print(result_harm)
+</pre>
-For $\theta=0.1(10cm)$, phase $1,L(\theta)=(1-0.1)^{4}*0.1=0.9^{4}*0.1$, for phase 2,$\theta =0.1:L(θ)=0.1^{4} (1-0.1)=0.1^{4}*0.9.$ At $\theta=0.5, L (\theta)=0.5^{5}$, $Z(\theta)=log_{10}\frac{{[0.9^{4}0.1+0.1^{4}0.9]}\over{2}}{0.5^{5}} = 0.0217.$
-For other| values of $\theta$, do the similar calculation: so the MLE of $\theta$, $\hat{\theta}= 0.20$.
+To understand the output and interpret the results, we use the following logic.
-<center>
+{| class="wikitable" style="width:100%;"
-{|class="wikitable" style="text-align:center;width:50%" Border="1"
+! Metric
+! Interpretation
+! Calculation
 |-
-|$\theta$ || $L(\theta$) || $L=(\theta=0.5)$ || $Z(\theta)$
+| '''ARR'''
-|-
+| The actual difference in risk between groups.
-|0	||0	||0.03125|| $-\infty$
+| <math>I_{unexposed} - I_{exposed}</math>
-|-
-|0.05	||0.02037||	0.03125	||-0.18586
 |-
-|0.10	||0.03285||	0.03125||	0.02169
+| '''RR'''
+| How many times more likely the outcome is in the exposed group.
+| <math>I_{exposed} / I_{unexposed}</math>
 |-
-|0.15	||0.03937	||0.03125	||0.10032
+| '''OR'''
+| The ratio of the odds of the outcome occurring in the exposed group.
+| <math>\frac{a \times d}{b \times c}</math>
 |-
-|'''0.20'''	||'''0.0416'''	||'''0.03125'''	||'''0.12424'''
+| '''NNT'''
-|-
+| The number of patients needed to treat to prevent '''one''' bad outcome.
-|0.25||	0.04102	||0.03125|| 0.11815
+| <math>1 / ARR</math>
-|-
+|}
-|0.30	||0.0385||	0.03125	||0.09454
-|-
-|}
+''Note on NNT/NNH'': If the ARR is positive (risk decreased), we report '''Number Needed to Treat (NNT)'''. If the ARR is negative (risk increased), we report '''Number Needed to Harm (NNH)'''.
-</center>
+''Confidence Intervals:'' In a real-world study, we typically report 95% CIs. Using `prop.test()` in base R to get CIs for risk differences.
+''Log Transformation:'' For OR and RR, calculations for CIs are typically done on the  scale to ensure symmetry.
+===Rmd/R Epidemiological simulation of a hypothetical viral epidemic===
+We'll try to illustrates genomics tracking, prediction, and spatio-temporal quantification
+of a viral epidemic. For this, we'll build a high-fidelity epidemiological simulation in R,
+integrating three distinct domains: ''compartmental modeling'' (SIR/SEIR), ''phylodynamics'' (genomic tracking), and ''spatial movement''.
+To keep this self-contained and realistic, we will use a ''Stochastic Agent-Based Model (ABM)'' approach. This allows us to track individual "viral sequences" as they mutate and spread across a coordinate system.
+The R simulation requires the following libraries:
+: ''tidyverse'': For data manipulation and plotting.
+: ''gganimate'': For spatio-temporal visualization.
+: ''adegenet'' or ''ape'': For handling ''genomic'' distances.
+: ''sf'': for projecting this onto real-world maps.
+==== The Core Simulation Engine====
+This script generates a synthetic population, simulates transmission based on proximity, and introduces ''mutations'' (modeled as bit-strings or distance matrices).
+<pre>
+library(tidyverse)
+library(gganimate)
+# --- Setup Parameters ---
+set.seed(42)
+n_agents <- 200
+map_size <- 100
+initially_infected <- 3
+mutation_rate <- 0.05  # Probability of a genomic shift per transmission
+timesteps <- 50
+# --- Initialize Population ---
+pop <- tibble(
+  id = 1:n_agents,
+  x = runif(n_agents, 0, map_size),
+  y = runif(n_agents, 0, map_size),
+  state = "S",
+  genome = 0,        # Base viral lineage
+  parent_id = NA,
+  infect_time = NA
+)
-====Linkage Disequilibrium====
+# Infect patient zeros
-*Linkage vs. Linkage Disequilibrium: linkage refers to the observation that two loci are inherited together (rather than be separated by recombination) in a single generation; linkage disequilibrium refers to the pattern of correlation between loci at the population level.
+pop$state[1:initially_infected] <- "I"
-*Linkage and Association: linkage is the relationship between loci, and is examined within families; association is the relationship between alleles and is examined within populations.
+pop$infect_time[1:initially_infected] <- 0
-*Linkage Disequilibrium (LD): describes the tendency of alleles to be inherited together more often than would be expected under random segregation. Extend of LD reflects the population’s history and the distance between markers. LD mapping is a promising approach for mapping genes, especially for complex-trait diseases. It is a population-based concept (not an individual or family-based concept); it has expected and observed values: looks at haplotypes instead of genotypes, observed frequencies are for haplotypes, expected haplotypes frequencies are calculated from allele frequencies.
+pop$genome[1:initially_infected] <- sample(1:100, initially_infected)
-*Forces affecting LD: (1) recombination: breaks up allelic association; (2) gene conversion: during recombination, DNA sequence information is transferred from one chromatid to another; (3) recurrent mutation: same mutation arises on different haplotype backgrounds; (4) natural selection: keeps pairs of genes/SNPs together; (5) demographic history: migration, non-random mating.
+# --- Run Simulation ---
+history <- list()
-*Linkage equilibrium:$p_{ab}=p_{a} p_{b}=(1-p_{A})(1-p_{B})$ ; $p_{AB}=p_{A} p_{B}$; $p_{Ab}=p_{A} p_{b}=p_{A} (1-p_{B})$; $p_{aB}= p_{a}p_{B} = (1-p_{A})p_{B};$
+for (t in 1:timesteps) {
-**Linkage disequilibrium:$p_{ab}\not=p_{a} p_{b}=(1-p_{A})(1-p_{B})$ ; $p_{AB}=p_{A} p_{B}$; $p_{Ab}=p_{A} p_{b}=p_{A} (1-p_{B})$; $p_{aB}= p_{a}p_{B} = (1-p_{A})p_{B};$
+  # 1. Movement (Random Walk)
+  pop$x <- pmax(0, pmin(map_size, pop$x + rnorm(n_agents, 0, 2)))
+  pop$y <- pmax(0, pmin(map_size, pop$y + rnorm(n_agents, 0, 2)))
+  # 2. Transmission
+  infected <- which(pop$state == "I")
+  susceptible <- which(pop$state == "S")
+  for (i in infected) {
+    # Find neighbors in range
+    dists <- sqrt((pop$x[susceptible] - pop$x[i])^2 + (pop$y[susceptible] - pop$y[i])^2)
+    contacts <- susceptible[dists < 5] # Transmission radius
+    if (length(contacts) > 0) {
+      for (target in contacts) {
+        if (runif(1) < 0.3) { # Prob of infection
+          pop$state[target] <- "I"
+          pop$infect_time[target] <- t
+          pop$parent_id[target] <- pop$id[i]
+          # Genomic Mutation: Slight drift from parent genome
+          pop$genome[target] <- pop$genome[i] +
+            sample(c(-1, 0, 1), 1, prob = c(mutation_rate/2, 1-mutation_rate, mutation_rate/2))
+        }
+      }
+    }
+  }
+  # 3. Recovery
+  pop$state[pop$state == "I" & (t - pop$infect_time) > 10] <- "R"
+  history[[t]] <- pop %>% mutate(time = t)
+}
-*Measures of LD: fundamental measure: Disequilibrium coefficient (D); most commonly used measures: D’ and |D’|; $r^{2}$ or $\Delta^{2}$.
+sim_data <- bind_rows(history)
-**$D_{AB}$ is the disequilibrium coefficient for locus A and locus B. D is hard to interpret.
+</pre>
-:: $D_{AB}=p_{AB}-p_{A} p_{B};$   $p_{AB}=p_{A} p_{B}+D_{AB};$
+Here is the structure of the simulated dataset.
-:: $D_{aB}=p_{a} p_{B}-D_{AB};$   $D_{ab}=p_{a} p_{b}+D_{AB};$
-$D'_{AB}=\begin{cases}\frac{D_{AB}}{min(p_{A} p_{B},p_{a} p_{b})},D_{AB} \\\frac{D_{AB}} {min(p_{A} p_{b},p_{a} p_{B})},D_{AB}>0,\end{cases}$
+<pre>
-it ranges between -1 and +1, when allele frequencies are small, D' is more likely to take extreme values, D’ of -1 or +1 implies that least one potential haplotype was not observed (no evidence for recombination between markers).
+> str(sim_data )
+tibble [10,000 × 8] (S3: tbl_df/tbl/data.frame)
+ $ id         : int [1:10000] 1 2 3 4 5 6 7 8 9 10 ...
+ $ x          : num [1:10000] 89 92.5 28 83.5 65.4 ...
+ $ y          : num [1:10000] 86.5 50.2 85.2 47.1 16.4 ...
+ $ state      : chr [1:10000] "I" "I" "I" "S" ...
+ $ genome     : num [1:10000] 80 53 24 0 0 0 0 0 0 0 ...
+ $ parent_id  : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
+ $ infect_time: num [1:10000] 0 0 0 NA NA NA NA NA NA NA ...
+ $ time       : int [1:10000] 1 1 1 1 1 1 1 1 1 1 ...
+</pre>
+==== Genomic Tracking & Phylogeny====
-$r^{2}$  or $\Delta^{2}=\frac{D_{AB}}{p_{A}(1-p_{A})p_{B}(1-p_{B})}$ it ranges between 0 and 1; $r^{2}=1$ when the markers provide identical information; $r^{2}= 0$ when the markers are in perfect equilibrium; not as strongly affected by extreme allele frequencies.
+To visualize the "Genomic Tracking," we treat the ''genome'' value as a proxy for a sequence.
+In a real scenario, this would be a high-dimensional Hamming distance, but here we can visualize the
+''Clade Divergence'' over time.
+* Visualizing the Spatio-Temporal Spread: Using ''gganimate'', we can see how the virus clusters geographically and how the "Genomic Signature" (color) evolves.
-Information from measurements: when D’ is high and $r^{2}$  is high indicates the tendency toward presence of only 2 haplotypes, with similar allele frequencies of the 2 loci; when D’ is high and $r^{2}$ is low indicates the tendency toward presence of only 3 haplotypes (for example, a young SNP ancestrally), with large difference in allele frequencies of the 2 loci; when D’ is low and $r^{2}$ is low indicates the tendency toward presence of all 4 haplotypes and random coupling of alleles.
+<pre>
+anim <- ggplot(sim_data, aes(x = x, y = y, color = as.factor(genome), size = state)) +
+  geom_point(alpha = 0.7) +
+  scale_size_manual(values = c("S" = 1, "I" = 3, "R" = 1.5)) +
+  theme_minimal() +
+  labs(title = "Viral Spread & Genomic Drift: Time {frame}",
+       color = "Genomic Lineage",
+       size = "Infection Status") +
+  transition_time(time)
-*Disequilibrium will decay each generation in a large population, after t generations: $D_{t}=(1-\theta)^{t} D_{0}$.
+# animate(anim) # Uncomment to render
+</pre>
+====Realistic Epidemiological Quantification====
-====Genetic Testing Issues====
+To make this "highly realistic," we should calculate the Effective Reproduction Number
-*Genetic Test: analysis of chromosomes, genes and/or gene products (proteins) to determine whether there is an abnormality present that is causing or will cause a genetic condition/disorder.
+(<math>R_t</math>) and the Spatio-Temporal Kernel
-*Proportion of Genes Shared: first degree relatives (50%): sibs, parents, children,  dizygotic twins; second degree relatives (25%): uncles, aunts, nieces, nephews, grandparents, grandchildren, half-sibs, double first cousins; third degree relatives (12.5%): first cousins, half-uncles/aunts, half-nieces/nephews.
+. Realistic Epidemiological Quantification
-*Genetic tests are different from other medical tests:
-*Medical tests done to detect a current medical condition. It may be done on healthy individuals to determine future risk for a genetic condition (predictive tests).
-*Genetic test may need to be performed on affected family member before patient can be tested.
-*Genetic test results may have implications for healthcare and life decisions of other family members.
-*Possible insurance implications and potential for stigmatization and discrimination.
-*Genetic tests currently offered on a population level: newborn screening; multiple marker screening for pregnant women to screen for increased risk for chromosome abnormalities (e.g. Down syndrome) and neural tube defects (e.g. spina bifida); Cystoic fibrosis screening offered preconceptionally and to pregnant couples; carrier testing for Tay-Sachs disease, Sickle cell anemia; Prenatal testing offered to women 35 and older.
-*Genetic testing is currently offered to individuals with symptoms consistent with a genetic condition to establish, confirm or rule out a diagnosis; family history of a genetic condition.
-====Uses of Genetic Testing====
+<math>R_t=\frac{\text{New Infections at time}\ t}{\text{Infectious Individuals at time}\ t−1}.</math>
-*Diagnostic Testing: used to establish or confirm a diagnosis in a patient who has symptoms suggestive of a genetic condition.
-*Predictive testing: (1) presymptomatic, eventual development of symptoms is certain when the gene mutation is present; (2) predispositional (eventual development of symptoms is likely but not certain when the gene mutation is present, e.g., breast cancer.
-*Carrier testing: offered to patients based on family history of a genetic condition or ethnicity.
-*Prenatal testing: offered during pregnancy to assess fetal status when there is an increased risk of having a child with a genetic condition due to maternal age, family history, ethnicity, abnormal screening test or ultrasound evaluation.
-*Preimplantation genetic diagnosis (PGD): used with IVF and involves genetic testing on early embryos prior to implantation to rule out a genetic condition.
-*Newborn screening: screen for genetic disorders which can result in severe medical problems, metal retardation or even death in newborns. Some of the conditions can benefit form early treatment.
-*Pharmacogenetic testing: to identify gene changes in metabolic pathways that determine drug response, both for therapeutic effects and adverse effects. Examples may be genetic testing for CYP450.
-*Prognostic testing.
-*Forensic testing.
-*Identity testing: who is the parent.
-====Genetic Testing Challenges====
-*Testing for many genetic conditions not yet standard of care. Few practice guidelines exist.
-*Laboratories may offer different testing, even for the same genetic condition.
-*Many genetic tests are costly, low detection rate.
-*Genetic test results may be uninterpretable.
-*Moving target; rapid evolution of information.
+* Quantifying the Clusters: We can use a K-nearest neighbor (KNN) or DBSCAN algorithm on the ''sim_data'' to identify ''hotspots''
+where specific genomic lineages are dominating. This mimics real-world ''Genomic Surveillance'' reports.
-====Genetic Association Studies====
+<pre>
-Genome-wide association studies (GWAS) are observational studies that test for a statistically significant correlation between a genetic marker (the exposure) and a phenotype (the outcome).
+# Example quantification of lineage prevalence
+library(dplyr)
+library(ggplot2)
-*Genetic marker = any measurable genetic, polymorphism: varies across individuals, groups, or populations can occur in coding or non-coding regions of the genome.
+lineage_summary_plot <- sim_data %>%
-*Phenotype = any measurable trait: quantitative (height, blood pressure, glucose levels); qualitative (heart disease, cancer, hair color).
+    filter(state == "I") %>%
-*Study question: need to identify the population of interest (age, race, gender, geographic), the phenotype being studied and whether the study will look at specific genes with biological/positional relevance or agnostically search for new genomic regions of interest.
+    # Use group_by to organize the data by time and genome
-*Measures: methods for measuring both must be characterized (Valid? Reliable?); must be described (quantitative phenotypes such as normally distributed, mean, variance and qualitative phenotypes such as blood pressure vs. hypertensive).
+    group_by(time, genome) %>%
-*Sampling: family based (twin studies, heritability studies, linkage studies) vs. population based sampling
+    summarise(count = n(), .groups = "drop") %>%
-*Statistics: statistical tests for genetic association studies largely determined by type of sampling and type of outcome.
+    ggplot(aes(x = time, y = count, fill = as.factor(genome))) +
-*Inferences: association is not sufficient to prove causation. A positive statistical finding does not definitively mean the polymorphism tested is causing the outcome. A statistical association may be a result of direct causal relationship between SNP and outcome; confounding (linkage disequilibrium; population stratification); spurious association, false positive result.
+    geom_area() +
+    labs(title = "Genomic Lineage Succession", x = "Days", y = "Active Cases")
-====Gene-environment and Gene-gene Interactions====
+# Display the plot
-*Genes and Environment: almost all disease have a genetic component and an environmental component. How do these components interact? How to account for both effects?
+lineage_summary_plot
-*Model 1: Neither the genotype nor the environment alone increase risk
+</pre>
-<center>
-[[Image:SMHS Epi Figure 9.png |400 px]]
-</center>
+To extend this code into an Rmd notebook, we can include
-*Model 2: The genotype exacerbates the effect of the risk factor
+. Parameter Sweeping: Wrap the simulation in a function and run it with different  values.
+. Network Analysis: Use the `parent_id` column to create a transmission tree using the ''igraph'' package.
+. Kernel Density: Add a ''geom_density_2d()'' layer to the map to show spatial intensity.
-<center>
+Let's expand the simulation from a simple proxy to a high-fidelity ''Phylodynamic Model''
-[[Image:SMHS Epi Figure 11.png| 500 px]]
+by simulating a '''founder DNA sequence''' and allow it to evolve through point mutations (substitutions) as it spreads across the population.
-</center>
+* DNA Sequences & Mutation: Instead of a single number, each agent now carries a character string. When transmission occurs, there is a probability that one or more bases in the sequence will mutate.
-*Model 3: The risk factor exacerbates the effect of the genotype.
+<pre>
+library(tidyverse)
+library(igraph)
+library(ggraph)
+library(stringdist) # For Hamming Distance
-<center>
+# --- Configuration ---
-[[Image:SMHS Epi Figure 12.png| 500 px]]
+seq_length <- 50
-</center>
+bases <- c("A", "C", "G", "T")
+mutation_rate <- 0.01 # Probability per base per transmission
-*Model 4: The genotype and the risk factor each influence risk by themselves:
+# Create a starting sequence (Patient Zero)
+founder_seq <- paste(sample(bases, seq_length, replace = TRUE), collapse = "")
-<center>
+# Helper function to mutate DNA
-[[Image:SMHS Epi Figure 13.png| 500 px]]
+mutate_dna <- function(sequence) {
-</center>
+  seq_vec <- strsplit(sequence, "")[[1]]
+  mutation_mask <- runif(seq_length) < mutation_rate
+  if(any(mutation_mask)) {
+    seq_vec[mutation_mask] <- sample(bases, sum(mutation_mask), replace = TRUE)
+  }
+  return(paste(seq_vec, collapse = ""))
+}
+</pre>
-The study of gene-environment interactions, use epidemiologic techniques to identify both components (gene and environment):
+* Genetic Distance Matrix: To quantify the "genomic tracking" aspect, we use the ''Hamming Distance'', which counts the number of positions at which the corresponding nucleotides are different.
-*Case-control studies
-*Cohort studies
-*Case only studies
-*Case-parental control
-*Affected relative pair
-*Twin studies
-To model Gene-Environment interactions: $ outcome = gene + environment + gene-environment\,interaction.  Y=G+E+G\times E.$
+<pre>
+# Extract all unique sequences found in the simulation
+unique_genomes <- unique(pop$genome)
-===Numbers needed to treat (NNT)===
+# Calculate Hamming Distance Matrix
-Numbers needed to treat (NNT): an epidemiological measure to assess the effectiveness of a health-care intervention. It is the average number of patients needed to be treated to prevent one additional bad outcome. It is the inverse of incidence. For example, if we know the incidence of flu is 1 per 10,000, then NNT equals 10,000 (as the flu incidence is 1/10,000). If two treatments A (typically new treatment) and B (typically control or placebo) are considered for flu, with p_{A}, p_{B} representing the probabilities of having flu under treatments A and B, respectively. Then NNT can be computed as $\frac{1}{p_{B} – p_{A}}$. The ideal NNT is 1, and the higher the NNT, the less effective the treatment and more patients need to take the treatment to see a benefit of one. NNT in the range of 2-5 would generally be considered as an indication of an effective therapy. Treatment interventions that yield negative fractions $\frac{1}{p_{B} – p_{A}}$ are considered as harmful and their magnitudes are referred to as numbers needed to harm (NNH). NNH captures the number of patients needed to be exposed to a risk-factor over a time period of time to cause harm in one patient (harm due to intervention). Lower NNH implies worse the risk factor. NNT and NNH are powerful evidence-based measures of healthcare outcomes and aid clinicians quantify whether a particular treatment may expose the patient to harm while providing therapeutic benefits.
+dist_matrix <- stringdistmatrix(unique_genomes, unique_genomes, method = "hamming")
+rownames(dist_matrix) <- unique_genomes
+colnames(dist_matrix) <- unique_genomes
-* '''Example''': consider a Treatment and Control Trial, we have observed the following data: 800 out of 1,000 had disease DD in the treatment group A while 600 out of 1,200 had disease DD in the control group B. Then NNT = $\frac{1}{p_{B} – p_{A}}$ = $\frac{1}{\frac{800}{1000}-\frac{600}{1200}} = 3.3333$, which still indicates an effective treatment.
+# This matrix allows us to see how far 'Variant B' has drifted from 'Variant A'
+</pre>
-===Applications===
+* Integrated Spatio-Temporal & Transmission Model: We will update the simulation to record the transmission pairs to build the `igraph` tree and add the spatial density layer.
-*[http://books.google.com/books?hl=en&lr=&id=ofQrN-fB3kkC&oi=fnd&pg=PA56&dq=epidemiology&ots=18YSP4iZg1&sig=82CUcLdJbPtNyypTfz-0_k6pTjE#v=onepage&q=epidemiology&f=false This article] titled The development epidemiology of anxiety disorders: phenomenology, prevalence, and comorbidity reviewed the prevalence and comorbidity of anxiety disorders in general, and where possible the specifics of separation anxiety disorder (SAD), generalizes anxiety disorder (GAD), specific phobias, panic, social phobia, and panic disorder and commented on the existing problems in current studies. It argues that as the quality of measures used to assess anxiety disorders in the child and adolescent population have improved in the past few years. Several of the instruments developed for epidemiologic research are now being used in clinical settings and further integration of research methods can be expected in the near future.
-*[http://www.jstor.org/discover/10.2307/3702080?uid=3739728&uid=2&uid=4&uid=3739256&sid=21103852741501 This article] focuses on the topic of dose-response and trend analysis in Epidemiology and it presents two classes of simple alternative that can be implemented with any regression software: fractional polynomial regression and spline regression which work especially when important nonlinearities are anticipated and software for more nonparametric regression approaches is not available.
+<pre>
+# --- Initialization Fixes ---
+# Ensure genome column is a character vector to hold DNA strings
+pop$genome <- as.character(pop$genome)
+edges <- data.frame(from = integer(), to = integer())
-===Software===
+for (t in 1:timesteps) {
-*[http://www.distributome.org/V3/calc/StudentCalculator.html Student Calculator]
+  # 1. Movement
-*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html  Normal T Chi-Squared F tables]
+  pop$x <- pmax(0, pmin(map_size, pop$x + rnorm(n_agents, 0, 2)))
+  pop$y <- pmax(0, pmin(map_size, pop$y + rnorm(n_agents, 0, 2)))
+  # 2. Transmission
+  infected <- which(pop$state == "I")
+  susceptible <- which(pop$state == "S")
+  for (i in infected) {
+    # Check distances against only the susceptible population
+    dists <- sqrt((pop$x[susceptible] - pop$x[i])^2 + (pop$y[susceptible] - pop$y[i])^2)
+    contacts <- susceptible[dists < 5]
+    for (target in contacts) {
+      # Important: Check if target is still susceptible (in case they were infected
+      # by someone else earlier in this same timestep loop)
+      if (pop$state[target] == "S" && runif(1) < 0.3) {
+        # Update Status
+        pop$state[target] <- "I"
+        pop$infect_time[target] <- t
+        pop$parent_id[target] <- pop$id[i]
+        # 1. Expand Genomic Logic: DNA Mutation
+        # We replace the numeric addition with our string mutation function
+        new_sequence <- mutate_dna(pop$genome[i])
+        pop$genome[target] <- new_sequence
+        # 3. Track Transmission for igraph
+        edges <- rbind(edges, data.frame(from = pop$id[i], to = pop$id[target]))
+      }
+    }
+  }
+  # 3. Recovery
+  pop$state[pop$state == "I" & (t - pop$infect_time) > 10] <- "R"
+  history[[t]] <- pop %>% mutate(time = t)
+}
-===Problems===
+sim_data <- bind_rows(history)
-A rare recessive disease has recently been mapped to chromosome 11p15.5.  It is found in 2 per every 500,000 people.  All individuals with two copies of the recessive allele develop immediately apparent symptoms.
-a. What is the frequency of the disease-causing allele in the population?
-b. What is the carrier frequency?
+# --- Visualization: Spatial Intensity ---
+ggplot(sim_data %>% filter(state == "I"), aes(x = x, y = y)) +
+  geom_density_2d_filled(
+    aes(fill = after_stat(level)),
+    alpha = 0.6,
+    contour_var = "ndensity"   # optional: normalize for better contrast
+  ) +
+  geom_point(aes(color = genome), alpha = 0.9) +
+  scale_fill_viridis_d() +      # discrete fill for the bands
+  theme_minimal() +
+  labs(title = "Viral Hotspots and Genomic Clustering") +
+  theme(legend.position = "none")
-c. In the U.S. population there are approximately 316 million people.  How many adults potentially carry the disease-causing allele (include only carriers, not homozygotes)?
+</pre>
+* The Transmission Tree (Phylogeny): Using ''igraph'' and ''ggraph'', we can visualize the
+''Who-Infects-Whom'' network. This is the gold standard for contact tracing and understanding super-spreading events.
-Consider the following scenarios:
+<pre>
-people from Population A and 1000 people from Population B were genotyped at a locus that has two alleles (C and T).These two populations are known to each be in HWE at this particular locus.
+# Create graph object
-<center>
+transmission_graph <- graph_from_data_frame(edges, directed = TRUE)
-{|class="wikitable" style="text-align:center;width:50%" Border="1"
-|-
-| Population || Genotype frequency of CC
-|-
-|A||0.64
-|-
-|B||0.25
-|-
-|}
-</center>
-a.Based on the frequency of the CC genotype in the populations, what are the frequencies of the C and T alleles in each population?
+# Plot the tree
+ggraph(transmission_graph, layout = 'tree') +
+  geom_edge_diagonal(alpha = 0.2, color = "grey") +
+  geom_node_point(color = "steelblue", size = 2) +
+  theme_void() +
+  labs(title = "Reconstructed Transmission Tree")
-b.  What are the genotype frequencies of CT and of TT in population A? What are they in population B?
+# An alternative plot_ly() SVG infection graph can be created by
+library(plotly)
-c.  In population A, how many people have the CC, CT, and TT genotypes? How many people have them in population B?
+library(igraph)
+library(dplyr)  # For piping and joins
-d.  1000 people a new population, Population D, were genotyped at the same locus.  This population recently experienced a lot of migration, so we suspect that it may not be in HWE. There are 350 people with the CC genotype, 400 with CT, and 250 with TT. Based on the number of people with each genotype, what are the genotype frequencies of CC, CT, and TT in population D?
+# Extract layout coordinates (tree structure)
+layout <- layout_as_tree(transmission_graph)
-e.  What are the allele frequencies of C and T in Population D?
+# Prepare node data with safe column names
+node_data <- data.frame(
+  node_id = V(transmission_graph)$name,  # Renamed from 'id'
+  x = layout[,1],
+  y = layout[,2]
+)
-f.   After doing a test for HWE, you conclude that Population D is NOT in HWE. You suspect that non-random mating has occurred in this population. If Population D were to mate randomly for one generation, what would the allele frequencies be in the next generation?
+# Calculate in-degree (number of incoming edges) to identify root nodes
+in_degree <- degree(transmission_graph, mode = "in")
-g.  What would the genotype frequencies be in Population D after one generation of random mating?
+# Create nodes dataframe with size based on root status
+nodes <- node_data %>%
+  mutate(
+    size = ifelse(in_degree == 0, 10, 7),  # Root nodes (in-degree=0) are larger
+    label = ifelse(in_degree == 0, "Root", "Descendant")
+  )
+# Prepare edge data with directional arrows
-For the following case-control study, a total of 1200 cases and 1200 controls were recruited. The genotypes of the cases and the controls are given below.
+edges <- as_data_frame(transmission_graph, what = "edges") %>%
+  left_join(nodes, by = c("from" = "node_id")) %>%
+  left_join(nodes, by = c("to" = "node_id"), suffix = c("_from", "_to"))
-<center>
+# 1. Calculate children for each node
-{|class="wikitable" style="text-align:center;width:50%" Border="1"
+children_info <- edges %>%
-|-
+  group_by(from) %>%
-| ||Cases || Controls
+  summarise(children = paste(to, collapse = ", ")) %>%
-|-
+  rename(node_id = from)
-|AA||374||445
-|-
-|AG||550||580
-|-
-|}
-</center>
-a. What are the allele frequencies in the cases?
+# 2. Calculate parents for each node
+parent_info <- edges %>%
+  group_by(to) %>%
+  summarise(parents = paste(from, collapse = ", ")) %>%
+  rename(node_id = to)
-b. What are the allele frequencies in the controls?
+# 3. Join this info back to your nodes dataframe
+nodes <- nodes %>%
+  left_join(children_info, by = "node_id") %>%
+  left_join(parent_info, by = "node_id") %>%
+  mutate(
+    children = ifelse(is.na(children), "None", children),
+    parents = ifelse(is.na(parents), "None", parents)
+  )
-c. What are the expected genotype frequencies under Hardy-Weinberg equilibrium in the cases and in the controls?
+# Create interactive plot
+transmission_plot <- plot_ly() %>%
+  add_segments(
+    data = edges,
+    x = ~x_from, y = ~y_from,
+    xend = ~x_to, yend = ~y_to,
+    line = list(
+      color = ~ifelse(x_from < x_to, "#1f77b4", "#ff7f0e"),
+      width = 2
+    ),
+    hoverinfo = "none"
+  ) %>%
+  add_trace(
+    data = nodes,
+    x = ~x,
+    y = ~y,
+    type = "scatter",
+    mode = "markers",
+    marker = list(
+      size = ~size,
+      opacity = 0.8,
+      line = list(width = 1, color = "black")
+    ),
+    # Added Parents and Children to the hover text
+    text = ~paste0(
+      "<b>Node ID:</b> ", node_id,
+      "<br><b>Type:</b> ", label,
+      "<br><b>Parent(s):</b> ", parents,
+      "<br><b>Children:</b> ", children
+    ),
+    hoverinfo = "text",
+    showlegend = FALSE
+  ) %>%
+  layout(
+    title = "Interactive Viral Transmission Tree (Graph)",
+    xaxis = list(showgrid = FALSE, showticklabels = FALSE, zeroline = FALSE),
+    yaxis = list(showgrid = FALSE, showticklabels = FALSE, zeroline = FALSE),
+    hovermode = "closest"
+  )
-d. Is there evidence of Hardy-Weinberg disequilibrium in either the cases or controls?
+transmission_plot
-e. What do the results of the HWE testing suggest in terms of which allele might be related to the disease?
+</pre>
+* Genetic Distance: By calculating  (distance between sequence  and ), we can identify if a new "cluster" appearing in a specific city is a local evolution or a new introduction from outside.
-You started a new job at the paternity testing lab, and your supervisor has asked you to look at the Southern blots of VNTRs to help determine paternity. For each of the three families below, you genotyped a single VNTR for the mother, the child, and two men who may potentially be the child’s father. Shown below are the Southern blots for each of the three families, with the number of repeats of the VNTR shown on the right- and left-hand sides of the blot.
+* Spatial Kernel: The ''geom_density_2d'' visualizes the spatial transmission kernel, identifying where the force of infection is highest.
-For each of the three families below, does analysis of the single VNTR provide information about whether either of the potential fathers may (or may not) be the true father of the child? Explain your answers, specifically stating the evidence (for example, “Potential Father 1 could be the true father if the child inherited the version with 3 repeats from the mother and version with 5 repeats from Potential Father 1”).
+* Tree Topology: A ''star-like'' phylogeny suggests rapid exponential growth, while a ladder-like
+phylogeny suggests a stable, endemic transmission chain.
-a. Family #1
+<pre>
-<center>
+# Proper Genetic Distance Matrix
-[[Image:SMHS Epi Figure 14.png| 350 px]]
+unique_strains <- unique(sim_data$genome[sim_data$state == "I"])
-</center>
+dist_mat <- stringdistmatrix(unique_strains, unique_strains, method = "hamming")
-b. Family #2
+# Optional: Visualize distance as a heatmap
-<center>
+image(as.matrix(dist_mat), main = "Genetic Distance Heatmap")
-[[Image:SMHS Epi Figure 15.png| 350 px]]
-</center>
-c.  Family #3
+# SVG plot using plot_ly()
-<center>
+library(plotly)
-[[Image:SMHS Epi Figure 16.png| 350 px]]
-</center>
+# 1. Get the unique viral strains
+unique_strains <- unique(sim_data$genome[sim_data$state == "I"])
-Analyze the following pedigree under an autosomal recessive model.
+# 2. Generate clean Short IDs (e.g., "Strain 1", "Strain 2", etc.)
-<center>
+# Or use the first few characters if that's where the ID is hidden
-[[Image:SMHS Epi Figure 17.png| 350 px]]
+short_ids <- paste0("S-", seq_along(unique_strains))
-</center>
-a. Assuming complete penetrance, what is the penetrance function (probability of disease for each genotype) under the autosomal recessive model where ‘d’ is the recessive deleterious allele?
-P(disease|DD) =
+# 3. Calculate the matrix using the full sequences
+dist_mat <- stringdistmatrix(unique_strains, unique_strains, method = "hamming")
+m <- as.matrix(dist_mat)
-P(disease|Dd) =
+# 4. Handle Infinities
+m[is.infinite(m)] <- max(m[is.finite(m)], na.rm = TRUE)
-P(disease|dd) =
+# 5. Assign the CLEAN IDs to the matrix
+rownames(m) <- short_ids
+colnames(m) <- short_ids
+# 6. Plot
+fig <- plot_ly(
+    z = m,
+    x = colnames(m),
+    y = rownames(m),
+    type = "heatmap",
+    colorscale = "Viridis",
+    # This customdata trick lets you see the FULL sequence when you hover!
+    customdata = matrix(unique_strains, nrow=nrow(m), ncol=ncol(m)),
+    hovertemplate = "Row: %{y}<br>Col: %{x}<br>Dist: %{z}<extra></extra>"
+  ) %>%
+  layout(
+    xaxis = list(title = "Strain ID", type = "category"),
+    yaxis = list(title = "Strain ID", type = "category", autorange = "reversed")
+  )
-b. Fill in the potential genotypes of each person under an autosomal recessive model (using the information from part ‘a’ above).
+fig
-Possible Genotypes
+</pre>
-I-1
-I-2
+* Finally, display the transmission tree graph.
-II-1
+<pre>
+inf_graph <- graph_from_data_frame(edges, directed = TRUE)
-II-2
+ggraph(inf_graph, layout = 'kk') +
+  geom_edge_link(alpha = 0.3, arrow = arrow(length = unit(2, 'mm'))) +
+  geom_node_point(color = "red", size = 1) +
+  theme_void() +
+  labs(title = "Epidemiological Transmission Network")
+</pre>
-II-3
+=== Applications and Software ===
+Modern epidemiology relies heavily on computational tools.
+* Key R Packages: ''epiR'' (Epi measures), ''genetics'' (HWE, LD), ''survival'' (time-to-event), ''qqman'' (GWAS visualization).
+* Online Tools: [https://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html SOCR Distribution Tables].
+=== Problems ===
-c. If the deleterious allele occurs at a frequency of 0.05 in the population, what are the probabilities of ANY two parents in the population having genotypes DD, Dd, or dd? (use founder probability concept).
+==== Problem 1: Linkage Mapping ====
+'''Scenario''': Analyze the pedigree below under a Dominant Inheritance model. We need to estimate the recombination fraction <math>\theta</math>.
 <center>
-{|class="wikitable" style="text-align:center;width:50%" Border="1"
+[[Image:SMHS_Epi_Figure_8.png|250px]]
+</center>
+# Calculate LOD Scores:
+Using Maximum Likelihood Estimation (MLE), if the phase is known and the data include 4 non-recombinant and 1 recombinant meioses, the likelihood is:
+:<math>L(\theta) = (1-\theta)^4 \theta</math>.
+The LOD score is defined as:
+:<math>Z(\theta) = \log_{10}\left(\frac{L(\theta)}{L(0.5)}\right)</math>, where <math>L(0.5) = (0.5)^5 = 1/32</math>.
+# Corrected Result Table:
+{|class="wikitable" style="text-align:center;width:50%" border="1"
 |-
-|  ||Possible Genotype || P(Genotype)
+! <math>\theta</math> !! <math>Z(\theta)</math>
 |-
-| rowspan=3|I-1:||DD||
+| 0.0 || <math>-\infty</math>
 |-
-|  Dd ||
+| 0.10 || 0.322
 |-
-|  dd||
+| 0.20 || '''0.418''' (Max LOD)
 |-
-|rowspan=3|I-2:||DD||
+| 0.50 || 0.0
-|-
+|}
-|  Dd||
+The maximum LOD score occurs at <math>\hat{\theta} = 0.20</math>, consistent with 1 recombinant out of 5 informative meioses.
-|-
+> Note: A LOD > 3 is conventionally required for significant linkage; this example is illustrative.
-|  dd||
-|-
+==== Problem 2: Number Needed to Treat (NNT) and Harm (NNH) ====
-|}
+'''Scenario''': A trial reports 800 events among 1000 patients in Treatment Group A and 600 events among 1200 patients in Control Group B.
-</center>
+# Event rates:
+:<math>p_A = 800/1000 = 0.80</math>, <math>p_B = 600/1200 = 0.50</math>.
+# Absolute Risk Increase (ARI):
+:<math>\text{ARI} = p_A - p_B = 0.30</math>.
+# Since the treatment increases risk, we compute the Number Needed to Harm (NNH):
+:<math>\text{NNH} = \frac{1}{\text{ARI}} = \frac{1}{0.30} \approx 3.33</math>.
+'''Interpretation''': For every 3–4 patients treated with intervention A (vs. control), one additional adverse event occurs. The negative NNT is conventionally reported as NNH.
+==== Problem 3: GWAS Power Analysis (R) ====
+'''Scenario''': Simulate a case-control GWAS with 500 cases and 500 controls, Minor Allele Frequency (MAF) = 0.2, and Odds Ratio (OR) = 1.5. Estimate statistical power at <math>\alpha = 0.05</math>.
+The following R function correctly simulates genotypes separately for cases and controls based on genotype relative risks derived from the OR:
+  gwas_power_case_control <- function(n_cases = 200, n_controls = 200,
+                                   maf = 0.2, OR = 1.5,
+                                   alpha = 0.05, n_sims = 100) {
+  # Genotype coding: 0, 1, 2 copies of effect allele
+  # Relative risks under additive log-odds model: 1, OR, OR^2
+  rr <- c(1, OR, OR^2)
+  # Genotype frequencies in controls (Hardy-Weinberg)
+  g_ctrl <- dbinom(0:2, size = 2, prob = maf)
+  # Genotype frequencies in cases (proportional to g_ctrl * rr)
+  g_case_unscaled <- g_ctrl * rr
+  g_case <- g_case_unscaled / sum(g_case_unscaled)
+  significant <- numeric(n_sims)
+  for (i in 1:n_sims) {
+    # Simulate genotypes
+    geno_cases <- sample(0:2, n_cases, replace = TRUE, prob = g_case)
+    geno_ctrls <- sample(0:2, n_controls, replace = TRUE, prob = g_ctrl)
+    geno <- c(geno_cases, geno_ctrls)
+    status <- c(rep(1, n_cases), rep(0, n_controls))
+    # Fit logistic regression
+    model <- glm(status ~ geno, family = binomial)
+    p_val <- summary(model)$coefficients[2, 4]
+    significant[i] <- as.numeric(p_val < alpha)
+  }
+  return(mean(significant))  # Estimated power
+  }
+  # Example usage:
+  set.seed(2026)
+  power_est <- gwas_power_case_control()
+  cat("Estimated power:", round(power_est, 3), "\n")
+==== Problem 4: Hardy-Weinberg Equilibrium (HWE) Testing ====
+'''Scenario''': In a GWAS control group (<math>n = 1000</math>), genotype counts for a SNP are:
+AA = 240, Aa = 120, aa = 40. Test HWE using an exact test.
+  # Install if needed: install.packages("HardyWeinberg")
+  library(HardyWeinberg)
+  # Observed counts: AA, Aa, aa
+  obs <- c(240, 120, 40)
+  # Exact HWE test
+  hwe_test <- HWExact(obs, verbose = FALSE)
+  p_val <- hwe_test$pval
+  cat("HWE exact test p-value:", format.pval(p_val, digits = 4), "\n")
+  # Interpretation: p < 0.001 suggests deviation from HWE (common QC threshold in GWAS)
+==== Problem 5: Assessing Confounding in Observational Data ====
+'''Scenario''': Evaluate whether age and sex confound the association between an exposure and disease outcome.
+We'll use the ''Logistic Distribution'' function '''plogis''', which provides the standard Density, distribution function, quantile function and random generation for the logistic distribution with parameters ''location'' and ''scale'', and '''rbinom'''.
+<pre>
+  set.seed(123)
+  n <- 5000
+  # Simulate confounders and exposure
+  age <- rnorm(n, mean = 50, sd = 10)
+  sex <- rbinom(n, 1, 0.5)
+  exposure <- rbinom(n, 1, plogis(-1 + 0.02 * age + 0.3 * sex))
+  # Simulate disease influenced by exposure and confounders
+  disease <- rbinom(n, 1, plogis(-2 + 0.9 * exposure + 0.03 * age + 0.4 * sex))
+  # Crude model
+  crude_or <- exp(coef(glm(disease ~ exposure, family = binomial))["exposure"])
+  # Adjusted model
+  adj_or <- exp(coef(glm(disease ~ exposure + age + sex, family = binomial))["exposure"])
+  # Percent change due to confounding
+  confounding_pct <- (crude_or - adj_or) / crude_or * 100
+  cat("Crude OR:", round(crude_or, 2), "\n")
+  cat("Adjusted OR:", round(adj_or, 2), "\n")
+  cat("Percent change:", round(confounding_pct, 1), "%\n")
+  if (abs(confounding_pct) > 10) {
+   cat("→ Substantial confounding detected (change > 10%).\n")
+  } else cat("→ No substantial confounding detected (change < 10%).\n")
+</pre>
+==== Problem 6: Diagnostic Test Accuracy via ROC Analysis ====
+'''Scenario''': A continuous biomarker is measured in 100 diseased and 100 non-diseased individuals. Determine the optimal cutoff, sensitivity, and specificity using [https://doi.org/10.1002/bimj.200710415 Youden’s index].
+<pre>
+  # Install if needed: install.packages("pROC")
+  library(pROC)
+  # Simulate biomarker scores
+  set.seed(42)
+  score_healthy <- rnorm(100, mean = 0, sd = 1)
+  score_diseased <- rnorm(100, mean = 1.5, sd = 1)
+  score <- c(score_healthy, score_diseased)
+  status <- c(rep(0, 100), rep(1, 100))
+  # ROC analysis
+  roc_curve <- roc(status, score)
+  auc_val <- auc(roc_curve)
+  # Optimal threshold (Youden's J statistic)
+  opt <- coords(roc_curve, "best", ret = c("threshold", "sensitivity", "specificity"))
+  cat("AUC:", round(auc_val, 3), "\n")
+  cat("Optimal cutoff:", round(opt[["threshold"]], 2), "\n")
+  cat("Sensitivity:", round(opt[["sensitivity"]], 3), "\n")
+  cat("Specificity:", round(opt[["specificity"]], 3), "\n")
+</pre>
+====Problem 7====
+Try to integrate structural fixes, biological realism (fitness selection), and technical optimizations into the above genetics simulation of ''Phylodynamics and Genomic Surveillance in R''.
+Try to demonstrates how viral mutations, spatial movement, and natural selection interact to shape an outbreak.
+The Core Simulation Engine can initialize a population and simulates a ''Random Walk'' movement model,
+by introducing a Fitness Function; mutations at specific genomic loci now increase the transmission probability, simulating the emergence of a ''Variant of Concern.''
+<pre>
+library(tidyverse)
+library(gganimate)
+library(stringdist)
+library(igraph)
+library(ggraph)
+library(plotly)
+# --- Setup Parameters ---
+set.seed(42)
+n_agents <- 200
+map_size <- 100
+initially_infected <- 3
+mutation_rate <- 0.02
+timesteps <- 60
+seq_length <- 50
+bases <- c("A", "C", "G", "T")
+# --- Helper Functions ---
+founder_seq <- paste(sample(bases, seq_length, replace = TRUE), collapse = "")
+mutate_dna <- function(sequence) {
+  seq_vec <- strsplit(sequence, "")[[1]]
+  mutation_mask <- runif(seq_length) < mutation_rate
+  if(any(mutation_mask)) {
+    seq_vec[mutation_mask] <- sample(bases, sum(mutation_mask), replace = TRUE)
+  }
+  return(paste(seq_vec, collapse = ""))
+}
+calc_fitness <- function(sequence) {
-d. Using transmission probabilities, determine the probability of each offspring’s genotype assuming a Dd x dd mating.
+  gc_content <- nchar(gsub("[AT]", "", sequence)) / nchar(sequence)
+  return(0.3 + (gc_content * 0.2))
+}
-<center>
+# --- Initialize Population ---
-{|class="wikitable" style="text-align:center;width:50%" Border="1"
+pop <- tibble(
-|-
+  id = 1:n_agents,
-| || Possible Genotype || P(Genotype)
+  x = runif(n_agents, 0, map_size),
-|-
+  y = runif(n_agents, 0, map_size),
-|rowspan=3|II-1:||DD||
+  state = "S",
-|-
+  genome = founder_seq,
-|  Dd ||
+  parent_id = NA_integer_,
-|-
+  infect_time = NA_real_
-|  dd ||
+)
-|-
-|rowspan=3|II-2: || DD ||
-|-
-|  Dd ||
-|-
-|  dd ||
-|-
-|rowspan=3|II-3: || DD ||
-|-
-| Dd ||
-|-
-|  dd ||
-|-
-|}
-</center>
+# Set state to "I" for initial cases
+pop$state[1:initially_infected] <- "I"
+pop$infect_time[1:initially_infected] <- 0
-e. Compare the penetrance for each member under two models: a completely penetrant autosomal recessive disease, and an incomplete penetrance model $(P(aff|dd)=.9; P(aff|Dd)=0.2; P(aff|DD)=0).$  What are the probabilities of each pedigree member phenotype, given the possible genotypes
+# --- Run Simulation ---
+history <- list()
+edges <- data.frame(from = integer(), to = integer())
-<center>
+for (t in 1:timesteps) {
-{|class="wikitable" style="text-align:center;width:50%" Border="1"
-|-
+  # Assignment operator (<-) to update movement
-| Affected ||Possible|| Complete Penetrance|| Incomplete Penetrance
+  # loaded cyclical harmonics
-|-
+  # pop$x <- pop$x * rep(0.1 + cos(10*pi*t/timesteps), times=n_agents) + rnorm(n_agents, 0, 2)
-|Not Affected||Genotype||P(phenotype/genotype)||P(phenotype/genotype)
+  # pop$y <- pop$y * rep(0.1 + sin(5*pi*t/timesteps), times=n_agents) + rnorm(n_agents, 0, 2)
-|-
-|I-1||Not||DD ||
+  # simple stochastic dynamics
-|-
+  pop$x <- pop$x + rnorm(n_agents, 0, 2)
-|  || Affected || Dd ||
+  pop$y <- pop$y + rnorm(n_agents, 0, 2)
-|-
-| || dd ||  ||
-|-
-|I-2||Affected||DD ||
-|-
-|  || Dd ||  ||
-|-
-| || dd|| ||
-|-
-|II-1||Affected||DD ||
-|-
-|  || || Dd ||
-|-
-| || dd || ||
-|-
-|II-2||Affected||DD  ||
-|-
-|  || Dd || ||
-|-
-| || dd||  ||
-|-
-|II-3||Affected||DD||
-|-
-|  || Dd  || ||
-|-
-| || dd  ||  ||
-|-
-|}
-</center>
-f. Putting it all together, what is the probability of this pedigree under the complete penetrance model in the scenario below? What is the probability of this pedigree under the incomplete penetrance model?
+  # Optional: Keep agents within map boundaries
-<center>
+  pop$x <- pmax(0, pmin(map_size, pop$x))
-[[Image:SMHS Epi Figure 18.png| 300 px]]
+  pop$y <- pmax(0, pmin(map_size, pop$y))
-</center>
-Complete penetrance model, autosomal recessive:
+  # 2. Transmission
+  infected_idx <- which(pop$state == "I")
+  for (i in infected_idx) {
+    susceptible_idx <- which(pop$state == "S")
+    if(length(susceptible_idx) == 0) break
+    dists <- sqrt((pop$x[susceptible_idx] - pop$x[i])^2 + (pop$y[susceptible_idx] - pop$y[i])^2)
+    contacts <- susceptible_idx[dists < 6]
+    for (target in contacts) {
+      if (pop$state[target] == "S" && runif(1) < calc_fitness(pop$genome[i])) {
+        pop$state[target] <- "I"
+        pop$infect_time[target] <- t
+        pop$parent_id[target] <- pop$id[i]
+        pop$genome[target] <- mutate_dna(pop$genome[i])
+        edges <- rbind(edges, data.frame(from = pop$id[i], to = pop$id[target]))
+      }
+    }
+  }
-Incomplete penetrance model, autosomal recessive:
+  # subsetting and assignment for Recovery
+  recovery_idx <- which(pop$state == "I" & (t - pop$infect_time) >= 10)
+  if(length(recovery_idx) > 0) {
+    pop$state[recovery_idx] <- "R"
+  }
+  history[[t]] <- pop %>% mutate(time = t)
+}
+sim_data <- bind_rows(history)
+# str(sim_data)
+</pre>
-Consider a woman who is a known heterozygous carrier of a mutation that causes the recessive disease PKU. She is shown in generation I. For the questions below, briefly explain how you got your answers. (You can assume that individuals entering the pedigree from outside the family are NOT carriers of the PKU-causing allele.)
+. Spatio-Temporal Visualization: By mapping the genomic strings to colors, we can visualize the "clade emergence" over time.
-<center>
-[[Image:SMHS Epi Figure 19.png| 300 px]]
-</center>
-a. What is the probability that her grandson, individual B, will be a heterozygous carrier of this PKU-causing allele?
-b. What is the probability that both of her granddaughters, individuals A and C, will both be heterozygous carriers of this PKU-causing allele?
+<pre>
+anim <- ggplot(sim_data, aes(x = x, y = y, color = genome, size = state)) +
+geom_point(alpha = 0.6) +
+scale_size_manual(values = c("S" = 1, "I" = 3, "R" = 1.5)) +
+theme_minimal() +
+theme(legend.position = "none") +
+labs(title = "Viral Spread & Selection Pressure: Day {frame}") +
+transition_time(time)
-c. What is the probability that all three of her grandchildren will be heterozygous carriers of this PKU-causing allele?
+# animate(anim)
+</pre>
+. Genomic Tracking & Distance Matrices: To understand how much the virus has drifted, we use the **Hamming Distance**. This calculates the number of base substitutions between any two viral samples.
-In some cultures, nieces are encouraged to marry their uncles as shown in the pedigree below. Here, the niece is pedigree member III-1, and the uncle is designated II-3.
+<pre>
-<center>
+# Extract final strains
-[[Image:SMHS Epi Figure 20.png| 300 px]]
+final_strains <- sim_data %>%
-</center>
+filter(state %in% c("I", "R"), !is.na(infect_time)) %>%
+distinct(genome) %>%
+pull(genome)
-a. If these two individuals have a child, what is the probability that the child will have alleles that are identical by descent at any given locus? (Explain your reasoning / show your work)
+dist_mat <- stringdistmatrix(final_strains, final_strains, method = "hamming")
+rownames(dist_mat) <- paste0("Strain_", 1:length(final_strains))
+colnames(dist_mat) <- paste0("Strain_", 1:length(final_strains))
+# Visualize Genetic Distance
+plot_ly(z = ~dist_mat, x = colnames(dist_mat), y = rownames(dist_mat), type = "heatmap", colorscale = "Viridis") %>%
+layout(title = "Genetic Hamming Distance Matrix")
+</pre>
-b. If the paternal common ancestor (designated I-1) is a carrier for an allele that will cause a recessive disease, what is the probability that the child will have the disease?
+. Phylogeny: The Transmission Tree: While a distance matrix shows "how much" change occurred, the Transmission Tree shows "who infected whom." A "star-like" tree indicates rapid expansion, while a "ladder-like" tree indicates a slow, constant chain of transmission.
+<pre>
+# Reconstruct the tree using igraph
+transmission_graph <- graph_from_data_frame(edges, directed = TRUE)
-Analyze the following pedigree. Assume that the disease is autosomal dominant and fully penetrant.
+ggraph(transmission_graph, layout = 'tree') +
-<center>
+geom_edge_diagonal(alpha = 0.3, color = "darkgrey") +
-[[Image:SMHS Epi Figure 21.png| 400 px]]
+geom_node_point(aes(color = name), size = 3, show.legend = FALSE) +
-</center>
+theme_void() +
+labs(title = "Reconstructed Epidemiological Transmission Tree")
+</pre>
-a.  Using “D” to represent the dominant allele and “d” to represent the recessive allele, what are the genotypes of each person in the pedigree?
+. Quantifying the Molecular Clock: A key concept in phylodynamics is the ''Molecular Clock'', the idea that mutations accumulate at a relatively constant rate over time.
-{|class="wikitable" style="text-align:center;width:30%" Border="1"
+<pre>
-|-
+# Calculate divergence from founder
-|I-1 ||
+clock_data <- sim_data %>%
-|-
+filter(!is.na(infect_time)) %>%
-|I-2 ||
+mutate(dist_from_founder = stringdist(genome, founder_seq, method = "hamming"))
-|-
-|II-1 ||
-|-
-|II-2 ||
-|-
-|II-3 ||
-|-
-|II-4 ||
-|-
-|II-5 ||
-|-
-|II-6 ||
-|-
-|II-7 ||
-|-
-|}
+ggplot(clock_data, aes(x = infect_time, y = dist_from_founder)) +
+geom_jitter(alpha = 0.2, color = "steelblue") +
+geom_smooth(method = "lm", color = "red") +
+theme_minimal() +
+labs(title = "The Molecular Clock: Mutation Accumulation",
+x = "Time of Infection (Days)",
+y = "Hamming Distance from Founder")
+</pre>
-b. The number of repeats at a particular VNTR locus were measured on each person in the family, and are given in the table below.
+Here are a couple of alternative plots ...
-{|class="wikitable" style="text-align:center;width:30%" Border="1"
-|-
-|Pedigree Member || Number of VNTR Repeats
-|-
-|I-1 || 125-137
-|-
-|I-2 || 129-141
-|-
-|II-1 || 137-/141
-|-
-|II-2 || 125/129
-|-
-|II-3 || 137/129
-|-
-|II-4 || 125/141
-|-
-|II-5 || 125/129
-|-
-|II-6 || 137/141
-|-
-|II-7 || 125/141
-|-
-|}
-c. How many informative meioses are there in this pedigree? Did the informative meioses happen in the mother, the father, or both?
+<pre>
+library(ggraph)
+library(ggplot2)
+library(plotly)
-What are the haplotypes of the offspring generation?
+# Create the ggraph plot
-II-1
+gg <- ggraph(transmission_graph, layout = 'tree') +
+  geom_edge_diagonal(alpha = 0.3, color = "darkgrey") +
+  geom_node_point(aes(color = name,
+                      text = paste("ID:", name,
+                                   "<br>Click for details")),
+                  size = 5) +
+  theme_void() +
+  labs(title = "Reconstructed Epidemiological Transmission Tree")
-II-2
+# Convert to interactive plotly
+ggplotly(gg, tooltip = "text") %>%
+  layout(
+    hoverlabel = list(bgcolor = "white", font = list(size = 12)),
+    xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
+    yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)
+  )
+</pre>
-II-3
+and ...
-II-4
+<pre>
+library(plotly)
+library(igraph)
+library(networkD3)
-II-5
+# Convert igraph to networkD3 format
+transmission_d3 <- igraph_to_networkD3(transmission_graph)
-II-6
+# Prepare nodes with tooltips
+nodes_d3 <- transmission_d3$nodes
+if(exists("pop")) {
+  nodes_d3 <- nodes_d3 %>%
+    left_join(
+      pop %>%
+        mutate(name = as.character(id)) %>%
+        select(name, state, infect_time, genome),
+      by = "name"
+    )
+} else {
+  nodes_d3$state <- "Unknown"
+  nodes_d3$infect_time <- NA
+  nodes_d3$genome <- NA
+}
-II-7
+# Create custom tooltips
+nodes_d3$tooltip <- sprintf(
+  "ID: %s<br>State: %s<br>Infection Time: %s<br>Genome: %s",
+  nodes_d3$name,
+  nodes_d3$state,
+  ifelse(is.na(nodes_d3$infect_time), "N/A", nodes_d3$infect_time),
+  ifelse(is.na(nodes_d3$genome), "N/A", substr(nodes_d3$genome, 1, 15))
+)
+# Create the force-directed network
+forceNetwork(
+  Links = transmission_d3$links,
+  Nodes = nodes_d3,
+  Source = 'source',
+  Target = 'target',
+  NodeID = 'name',
+  Group = 'state',
+  Nodesize = 5,
+  opacity = 0.9,
+  zoom = TRUE,
+  fontSize = 14,
+  linkDistance = 150,
+  charge = -100,
+  opacityNoHover = 0.5,
+  legend = TRUE,
+  bounded = TRUE,
+  clickAction = 'alert("Node ID: " + d.name + "\\nGroup: " + d.state)'
+)
+</pre>
-d. What are the potential haplotypes of the parental generation?
+and ...
-I-1
-I-2
-e. How many recombinant offspring are there? Give a separate answer for each potential I-1 haplotype.
-f. Using the formula $\theta$=k/n, what is the maximum likelihood estimate of $\theta$? (give a separate answer for each potential I-1 haplotype).  For which I-1 haplotype does $\theta$ make intuitive sense?
+<pre>
+library(visNetwork)
+library(igraph)
+library(tidyverse)
-g. Although we can guess the phase of I-1, assume for the remainder of the question that it is unknown. What is the general form of L($\theta$) for this pedigree?
+# 1. Create the graph object
+transmission_graph <- graph_from_data_frame(edges, directed = TRUE)
-h.  What is the general form of Z($\theta$) for this pedigree?
+# 2. Prepare nodes data
+# Get vertex names
+node_names <- V(transmission_graph)$name
-i.  Calculate the LOD score (Z($\theta$)) for the family above at the following values of $\theta$
+# Merge with pop data for additional information
+nodes <- data.frame(
+  id = node_names,
+  label = node_names,
+  title = NA,  # Will be filled with tooltip info
+  stringsAsFactors = FALSE
+)
-$\theta	       \,\,\,\,\,\,\,\,\,$  LOD Score
+# Join with pop data to get metadata for tooltips
+if(exists("pop")) {
+  nodes <- nodes %>%
+    left_join(
+      pop %>%
+        mutate(id = as.character(id)) %>%
+        select(id, state, infect_time, genome),
+      by = "id"
+    ) %>%
+    mutate(
+      title = paste0(
+        "<b>Agent ID:</b> ", id, "<br>",
+        "<b>State:</b> ", state, "<br>",
+        "<b>Infection Time:</b> ", infect_time, "<br>",
+        "<b>Genome:</b> ", ifelse(!is.na(genome), substr(genome, 1, 20), "N/A")
+      )
+    )
+} else {
+  # Fallback if pop doesn't exist
+  nodes$title <- paste0("<b>Agent ID:</b> ", nodes$id)
+}
+# 3. Prepare edges data
+edges_df <- igraph::as_data_frame(transmission_graph, what = "edges")
+edges_vis <- data.frame(
+  from = edges_df$from,
+  to = edges_df$to,
+  arrows = "to",
+  color = list(color = "gray", opacity = 0.6),
+  width = 1,
+  smooth = TRUE
+)
-.05
+# 4. Optional: Add colors based on state or infection time
+if("state" %in% colnames(nodes)) {
+  # Create color mapping for different states
+  unique_states <- unique(nodes$state)
+  colors <- viridisLite::viridis(length(unique_states))
+  color_map <- setNames(colors, unique_states)
+  nodes$color <- sapply(nodes$state, function(s) {
+    if(is.na(s)) "#808080" else color_map[s]
+  })
+} else if("infect_time" %in% colnames(nodes)) {
+  # Color by infection time
+  min_time <- min(nodes$infect_time, na.rm = TRUE)
+  max_time <- max(nodes$infect_time, na.rm = TRUE)
+  # Normalize infection time for color scale
+  nodes$value <- nodes$infect_time
+  nodes$color <- viridisLite::viridis(100)[
+    cut(nodes$infect_time, 100, labels = FALSE)
+  ]
+} else {
+  nodes$color <- "#2C3E50"  # Default color
+}
-.10
+# 5. Create the interactive network
+visNetwork(nodes, edges_vis,
+           main = "Interactive Transmission Tree",
+           submain = "Hover over nodes for details") %>%
+  visNodes(
+    shape = "dot",
+    size = 15,
+    borderWidth = 2,
+    borderWidthSelected = 3,
+    shadow = list(enabled = TRUE, size = 10)
+  ) %>%
+  visEdges(
+    shadow = TRUE,
+    smooth = list(type = "curvedCW", roundness = 0.1)
+  ) %>%
+  visOptions(
+    highlightNearest = list(
+      enabled = TRUE,
+      degree = 1,
+      hover = TRUE,
+      algorithm = "hierarchical"
+    ),
+    nodesIdSelection = list(
+      enabled = TRUE,
+      values = nodes$id,
+      style = 'width: 150px; height: 26px'
+    )
+  ) %>%
+  visInteraction(
+    hover = TRUE,
+    hoverConnectedEdges = TRUE,
+    tooltipDelay = 0,
+    navigationButtons = TRUE,
+    dragNodes = TRUE,
+    dragView = TRUE,
+    zoomView = TRUE
+  ) %>%
+  visPhysics(
+    stabilization = TRUE,
+    solver = "forceAtlas2Based",
+    forceAtlas2Based = list(
+      gravitationalConstant = -50,
+      centralGravity = 0.01,
+      springLength = 100,
+      springConstant = 0.08,
+      damping = 0.4,
+      avoidOverlap = 1
+    )
+  ) %>%
+  visLayout(randomSeed = 123)  # For reproducible layout
+</pre>
-.20
+As well as ...
-j.  What is the maximum likelihood estimate of $\theta$ from part (i)?
+<pre>
+# Alternative using plotly's network graph
+library(networkD3)
-k.  A larger collection of pedigrees were ascertained that have the disease; listed below are their LOD scores at the maximum likelihood estimate of $\theta$ that you calculated in part (j).  What do you conclude about linkage when you consider these pedigrees in addition to the one you have already been analyzing? Is there significant evidence for linkage?
+# Convert to D3 format
+transmission_d3 <- igraph_to_networkD3(transmission_graph)
-{|class="wikitable" style="text-align:center;width:30%" Border="1"
+# Add node attributes
-|-
+transmission_d3$nodes <- transmission_d3$nodes %>%
-|Pedigrees || Z$\theta$
+  left_join(
-|-
+    pop %>%
-|1 || 0.22
+      mutate(name = as.character(id)) %>%
-|-
+      select(name, state, infect_time, genome),
-|2 || 0.34
+    by = "name"
-|-
+  ) %>%
-|3 || 1.06
+  mutate(
-|-
+    tooltip = paste0(
-|4 ||-0.51
+      "Agent ID: ", name,
-|-
+      "<br>State: ", state,
-|5 || 1.05
+      "<br>Infection Time: ", infect_time,
-|-
+      "<br>Genome: ", ifelse(!is.na(genome), substr(genome, 1, 10), "NA"), "..."
-|6 || 0.65
+    )
-|-
+  )
-|}
+# Create force directed network
+forceNetwork(
+  Links = transmission_d3$links,
+  Nodes = transmission_d3$nodes,
+  Source = 'source',
+  Target = 'target',
+  NodeID = 'name',
+  Group = 'state',
+  Nodesize = 5,
+  opacity = 0.8,
+  zoom = TRUE,
+  fontSize = 12,
+  linkDistance = 100,
+  charge = -30,
+  opacityNoHover = 0.5,
+  legend = TRUE
+)
+</pre>
-l. Now, imagine that there was an error in the lab, and individual II-5 actually has the VNTR alleles 137/129. Calculate the LOD score (Z($\theta$)) for the family above at the following values of $\theta$.
+Also try to experiment with:
-$\theta	       \,\,\,\,\,\,\,\,\,$  LOD Score
+: 1. Examine Selection Pressure: Modify the calc_fitness() function to give a 50% transmission boost to any sequence containing the pattern GAA. How does this change the speed of the outbreak?
+: 2. Are there Bottleneck Effects? Reduce the ''initially_infected' to 1 and observe how the genetic diversity of the final population changes compared to starting with 10 infected individuals.
-.05
+: 3. Contact Tracing: Use the ''transmission_graph'' to identify '''Super-spreaders''', i.e., nodes with high out-degree.
-.10
+=== References ===
-m.  Now, given this lab error, what do you conclude about linkage given this pedigree and the other pedigrees in part (k)? Use the same MLE that you calculated in part (j).
+# [https://socr.umich.edu/DSPA2/DSPA2_notes/DSPA_Appendix_26_SMHS_Epidemiology.html DSPA2 Appendix 26: Eidemiology]
+# Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. 3rd ed. Lippincott, 2008.
+# Clayton D, Hills M. Statistical Models in Epidemiology. Oxford, 2013.
+# Ziegler A, König IR. A Statistical Approach to Genetic Epidemiology. Wiley, 2010.
-===References===
-* [http://en.wikipedia.org/wiki/Epidemiology Epidemiology Wikipedia]
-* [http://en.wikipedia.org/wiki/Number_needed_to_treat  NNT Wikipedia]
-* [http://en.wikipedia.org/wiki/Number_needed_to_harm  NNH Wikipedia]
 <hr>
-* SOCR Home page: http://www.socr.umich.edu
+* SOCR Home page: https://socr.umich.edu
-{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_Epidemiology}}
+{{translate|pageName=https://wiki.socr.umich.edu/index.php?title=SMHS_Epidemiology}}

Metric	Interpretation	Calculation
ARR	The actual difference in risk between groups.	\(I_{unexposed} - I_{exposed}\)
RR	How many times more likely the outcome is in the exposed group.	\(I_{exposed} / I_{unexposed}\)
OR	The ratio of the odds of the outcome occurring in the exposed group.	\(\frac{a \times d}{b \times c}\)
NNT	The number of patients needed to treat to prevent one bad outcome.	\(1 / ARR\)

Difference between revisions of "SMHS Epidemiology"

Latest revision as of 15:49, 17 February 2026

Contents

Scientific Methods for Health Sciences - Epidemiology

Overview

Motivation

Theory: The Human Genome and Mutation

Chromosomal Structure

Mutations and Abnormalities

Population Genetics

Allele and Genotype Frequencies

Hardy-Weinberg Equilibrium (HWE)

Pedigree Analysis and Inheritance

Modes of Inheritance

Probability in Pedigrees

Linkage Analysis

LOD Score

Linkage Disequilibrium (LD) and Association

Measures of LD

Additional Considerations

Genome-Wide Association Studies (GWAS)

Gene-Environment Interactions

Core Epidemiological Measures

Absolute Risk Reduction (ARR)

Relative Risk (RR)

Odds Ratio (OR)

Number Needed to Treat (NNT) or Harm (NNH)

Additional Considerations

Rmd/R Epidemiological simulation of a hypothetical viral epidemic

The Core Simulation Engine

Genomic Tracking & Phylogeny

Realistic Epidemiological Quantification

Applications and Software

Problems

Problem 1: Linkage Mapping

Problem 2: Number Needed to Treat (NNT) and Harm (NNH)

Problem 3: GWAS Power Analysis (R)

Problem 4: Hardy-Weinberg Equilibrium (HWE) Testing

Problem 5: Assessing Confounding in Observational Data

Problem 6: Diagnostic Test Accuracy via ROC Analysis

Problem 7

References

Navigation menu

Search