Difference between revisions of "SMHS Epidemiology"

From SOCR
Jump to: navigation, search
(Realistic Epidemiological Quantification)
 
(113 intermediate revisions by 3 users not shown)
Line 1: Line 1:
<hr>
+
== [[SMHS|Scientific Methods for Health Sciences]] - Epidemiology ==
* SOCR Home page: http://www.socr.umich.edu
 
==[[SMHS| Scientific Methods for Health Sciences]] - Epidemiology ==
 
  
===Overview:===
+
=== Overview ===
 +
Epidemiology is the scientific discipline that investigates the distribution, determinants, and control of health-related states or events (including diseases) in specified populations. It applies this knowledge to control health problems and improve public health outcomes. Historically, epidemiology originated from the study of infectious disease outbreaks, such as John Snow's investigation of the 1854 cholera epidemic in London, which linked contaminated water sources to disease spread. In modern times, the field has broadened to include non-infectious conditions like chronic diseases (e.g., cancer, diabetes), environmental exposures (e.g., air pollution, toxins), behavioral factors (e.g., smoking, diet), and genetic influences. This helps with understanding that health outcomes arise from complex interactions between genetic predispositions, environmental factors, and social determinants.
  
After a general introduction to the filed of Epidemiology, students can have a basic idea of the language of Epidemiology. In this course, we want to identify and describe the population patterns of health-related risk factors and health-related outcomes in terms of persons, place and time. We are interested in exploring current major public health issues and try to identify and evaluate the main determinants of such public health issues (e.g. demographic, genetic, infectious, behavioral, and social). With all the concepts and methodologies of analysis in Epidemiology, application would be the next step. Here we examine and apply analytical approaches to data from different epidemiologic study designs (e.g., cross-sectional, cohort, randomized studies) and to critically appraise epidemiological findings.
+
A core framework in epidemiology is the "epidemiologic triad" of agent, host, and environment, but contemporary approaches emphasize the "person, place, and time" dimensions:
 +
* '''Person''': Characteristics of individuals, such as age, sex, genetics, socioeconomic status, and behaviors that influence susceptibility or exposure.
 +
* '''Place''': Geographic variations, including urban vs. rural settings, climate, or access to healthcare, which can reveal environmental or social risk factors.
 +
* '''Time''': Temporal patterns, such as seasonal trends, secular changes over years, or sudden outbreaks, helping identify emerging threats or intervention effects.
  
===Motivation:===
+
This section delves into genetic epidemiology, bridging molecular biology with population-level analysis to identify risk factors and outcomes. It explores how genetic variations contribute to disease patterns and how computational tools can quantify these relationships.
  
Goals of this course:
+
=== Motivation ===
 +
This module equips learners with foundational knowledge in genetic epidemiology, enabling them to integrate genetic data into public health practice. By the end of this module, learners should be able to:
 +
* '''Understand Genetic Foundations''': Describe the structure and key features of the human genome, explain the types and distributions of mutations, and apply principles of Mendelian inheritance, including segregation (independent assortment of alleles) and linkage (genes on the same chromosome tending to be inherited together unless separated by recombination).
 +
* '''Analyze Population Dynamics''': Utilize quantitative genetic concepts to examine the interplay between genetic variation and phenotypic (disease) variation in populations, including calculations of allele and genotype frequencies and testing for Hardy-Weinberg equilibrium to detect deviations indicative of evolutionary pressures.
 +
* '''Evaluate Associations''': Identify common gene-disease relationships (e.g., monogenic vs. polygenic traits), interpret results from Genome-Wide Association Studies (GWAS) to pinpoint susceptibility loci, and recognize gene-environment interactions (e.g., how smoking exacerbates genetic risks for lung cancer).
 +
* '''Apply Computational Methods''': Conduct basic genetic association analyses using statistical software like R, interpret epidemiological measures such as Number Needed to Treat (NNT), Odds Ratio (OR), and Relative Risk (RR), and understand their implications for clinical and public health decision-making.
 +
* '''Additional Skills''': Critically evaluate study designs (e.g., cohort vs. case-control), account for confounders like population stratification in genetic studies, and discuss ethical considerations in genetic epidemiology, such as privacy in genomic data.
  
*To understand basic features of the human genome and the distribution of mutations among individuals.
+
These objectives align with real-world applications, such as designing targeted interventions (e.g., pharmacogenomics) or predicting disease outbreaks through genomic surveillance.
*To understand the principles of segregation and linkage as they apply to human pedigree analysis and the identification of genetic variations associated with disease.
 
*To learn population and quantitative genetic concepts that are necessary in order to study the relationship between genetic variation and disease variation in populations.
 
*To learn about prototypical gene-disease relationships that are important to public health.
 
*To understand the key issues in genetic testing in populations.
 
*To understand the genetic complexity of common chronic disease.
 
*To have a basic understanding of the importance and biological basis of epigenetic mechanisms, gene-environment interactions, and gene-gene interactions.
 
  
===Theory===
+
=== Theory: The Human Genome and Mutation ===
*'''Public Heath Genetics:'''some current and potential applications of genome research include:
+
The human genome comprises approximately 3 billion base pairs of DNA, encoding around 20,000–25,000 genes that orchestrate cellular functions, development, and responses to the environment. ''Mutations'', changes in this DNA sequence, can arise spontaneously or from external factors (e.g., radiation, chemicals) and may lead to diseases if they disrupt gene function. Understanding genomic structure and mutations is crucial for identifying genetic risk factors in epidemiological studies.
**Molecular medicine: improved diagnosis of disease; earlier detection of genetic predispositions to disease; rational drug design; gene therapy and control systems for drugs; pharmacogenomics “custom drugs”.
 
**Microbial genomics: new energy sources (biofuels); environmental monitoring to detect pollutants; protection from biological and chemical warfare; safe, efficient toxic waste cleanup.
 
**Risk assessment: assess health damage and risks caused by radiation exposure, include low-dose exposures; assess health damage and risks caused by exposure to mutagenic chemicals and cancer causing toxins.
 
**Bio-archaeology, anthropology, evolution, and human migration: study evolution through germline mutations in lineages; study migration of different population groups based on X chromosome or Y chromosome; compare breakpoints in the evolution of mutations with ages of populations and historical events.
 
**DNA forensics (identification): identify potential suspects whose DNA may match evidence left at crime scenes; exonerate persons wrongly accused of crimes; identify crime and catastrophe victims; establish paternity and other family relationships; determine pedigree for seed or livestock breeds.
 
**Agriculture, livestock breeding, and bioprocessing: more nutritious produce; Biopesticides; healthier, more productive, disease-resistant farm animals; new environmental cleanup uses for plants like tobacco.
 
  
*'''The Human Genome and Mutation'''
 
 
<center>
 
<center>
[[Image:SMHS_Epidem_Fig_1.png |400px]]
+
[[Image:SMHS_Epidem_Fig_1.png |650px|thumb|Figure 1: Illustration of human chromosomal structure, highlighting key features like centromeres, telomeres, and banding patterns.]]
 
</center>
 
</center>
  
*Chromosomes are highly condensed DNA:
+
==== Chromosomal Structure ====
**Chromosomal banding pattern: condensed chromosomes can be stained to create the appearance of dark and light bands; dark bands represent regions rich in As and Ts; each band contains millions of DNA nucleotides; each chromosome has a unique banding pattern.
+
Chromosomes are thread-like structures in the cell nucleus that package and organize DNA for efficient replication and segregation during cell division. Each human cell (except gametes) contains 46 chromosomes.
**A human karyotype: depicts the entire chromosomal constitutions of a person; normal karyotypes have 46 chromosomes; we get 23 chromosomes from each parent (22 autosomes and 1 sex chromosome).
+
* '''Banding''': Cytogenetic staining techniques (e.g., Giemsa staining) produce visible bands on chromosomes. Dark bands (G-bands) are AT-rich, heterochromatic regions with fewer genes, while light bands (R-bands) are GC-rich, euchromatic, and gene-dense. These patterns, spanning millions of nucleotides, aid in identifying chromosomal abnormalities in karyotyping.
**Chromatin: composed of DNA and proteins that are associated with the chromosomes. (1) Euchromatin: lightly condensed DNA; gene rich, often actively transcribed. (2) Heterochromatin: highly condensed DNA; often composed of repetitive DNA elements; centromeres and telomeres.
+
* '''Karyotype''': The complete set of chromosomes arranged by size and shape. A typical human karyotype includes 22 pairs of autosomes (numbered 1–22) and one pair of sex chromosomes (XX in females, XY in males). Abnormal karyotypes, such as trisomy 21 (Down syndrome), can be detected via techniques like fluorescence in situ hybridization (FISH).
**Centromeres: large arrays of repeated DNA sequences; spindle fibers attach during mitosis to separate sister chromatids.
+
* '''Functional Elements''':
**Telomeres: arrays of repeated DNA sequences that are often thousands of bases in length; a “cap” at the end of chromosome to provide stability; due to the way that chromosomes replicated, telomeres, shorten with each cell division in human somatic cells.
+
  * '''Centromeres''': Central constrictions composed of repetitive alpha-satellite DNA sequences. They serve as attachment points for spindle fibers during mitosis and meiosis, ensuring proper chromosome segregation. Centromeric dysfunction can lead to aneuploidy (abnormal chromosome numbers).
**International system for human cytogenetic nomenclature: short arms of a chromosome are labeled; long arms are labeled; chromosome bands are labeled p11, p12, etc. like a zip code; the terminal ends of the chromosomes are labeled ter; where the arms meet in the middle is the centromere.
+
  * '''Telomeres''': Protective caps at chromosome ends, consisting of repetitive TTAGGG sequences (in humans) bound by shelterin proteins. Telomeres shorten with each somatic cell division due to the "end-replication problem," contributing to cellular aging (senescence). In germ cells and stem cells, telomerase enzyme maintains telomere length. Shortened telomeres are linked to diseases like dyskeratosis congenita and increased cancer risk.
**Genes are located on chromosomes: there are 45 bands on chromosome 5; chromosome 5 contains 1633 genes; chromosome 5 ~ 181000000 bases long; genes are referred to by their chromosomal location.
+
  * '''Chromatin''': The complex of DNA and proteins (histones) that forms chromosomes. It exists in two states:
**Chromosomal abnormalities: there are two types of abnormalities that can occur on a chromosomal level in humans: (1) structural abnormalities – missing, extra, or rearranged genetic material on one particular chromosome; (2) numerical abnormalities – deviations in the total number of chromosomes that an individual has.
+
    * '''Euchromatin''': Loosely packed, transcriptionally active, and rich in genes and regulatory elements.
 +
    * '''Heterochromatin''': Tightly packed, transcriptionally silent, often containing repetitive DNA (e.g., satellites, transposons) near centromeres and telomeres. Epigenetic modifications (e.g., histone methylation) regulate chromatin states.
 +
 
 +
==== Mutations and Abnormalities ====
 +
Mutations are heritable changes in the DNA sequence, occurring at a rate of about 10⁻⁸ per nucleotide per generation. They can be somatic (acquired in body cells, e.g., leading to cancer) or germline (in gametes, passed to offspring). Mutations drive genetic diversity but can cause disease if they affect critical genes.
  
Changes in chromosome structure.
 
 
<center>
 
<center>
[[Image:SMHS_Epidemiology_Fig_2.png|400px]]
+
[[Image:SMHS_Epidemiology_Fig_2.png|400px|thumb|Figure 2: Schematic of common structural chromosomal abnormalities, including deletion, duplication, translocation, and inversion.]]
 
</center>
 
</center>
  
 +
* '''Structural Abnormalities''' (Large-scale changes visible under a microscope, affecting thousands to millions of base pairs):
 +
  * '''Deletion''': Removal of a chromosomal segment, leading to loss of genes (e.g., cri-du-chat syndrome from 5p deletion).
 +
  * '''Duplication''': Extra copy of a segment, potentially causing gene dosage imbalances (e.g., Charcot-Marie-Tooth disease from PMP22 duplication).
 +
  * '''Translocation''': Exchange of segments between non-homologous chromosomes. Balanced translocations may be asymptomatic but increase risks in offspring; unbalanced ones cause disorders (e.g., chronic myeloid leukemia from t(9;22) "Philadelphia chromosome").
 +
  * '''Inversion''': Reversal of a segment within a chromosome, which can disrupt genes or lead to abnormal recombination (e.g., hemophilia A inversions).
 +
  * '''Other Types''': Isochromosomes (duplicated arms, e.g., i(Xq) in Turner syndrome) or ring chromosomes (circular formations from deletions at both ends).
 +
* '''Point Mutations''' (Small-scale changes at the nucleotide level, detected via sequencing):
 +
  * '''Nucleotide Substitution''': Replacement of one base with another.
 +
    * '''Silent''': No amino acid change (due to codon degeneracy).
 +
    * '''Missense''': Amino acid change (e.g., sickle cell anemia from GAG to GTG in beta-globin gene).
 +
    * '''Nonsense''': Introduces premature stop codon, truncating the protein (e.g., cystic fibrosis).
 +
  * '''Indels''': Insertions or deletions of nucleotides. Small indels can cause frameshifts, altering the reading frame and producing dysfunctional proteins (e.g., Tay-Sachs disease).
 +
  * '''Splice Site Variation''': Mutations in intronic regions affecting mRNA splicing, leading to exon skipping or inclusion of introns (e.g., beta-thalassemia).
 +
* '''Epidemiological Relevance''': Mutations' population distribution informs disease prevalence. Rare mutations cause Mendelian disorders (e.g., Huntington's), while common variants (SNPs) contribute to complex traits via polygenic risk scores. Tools like next-generation sequencing (NGS) enable large-scale mutation detection in cohort studies.
 +
 +
=== Population Genetics ===
 +
Population genetics examines how genetic variation is maintained, distributed, and evolves in groups, providing the mathematical foundation for genetic epidemiology. It helps predict disease risks based on allele frequencies and detect deviations signaling selection or population structure.
 +
 +
==== Allele and Genotype Frequencies ====
 +
The '''gene pool''' is the total collection of alleles in a population at a given time. Frequencies are key metrics for assessing genetic diversity and disease susceptibility.
 +
* '''Allele Frequency''': Proportion of a specific allele at a locus. For a biallelic locus, frequencies sum to 1.
 +
 +
<math>\text{Allele Frequency (p for A)} = \frac{2 \times \text{Number of AA} + \text{Number of Aa}}{2 \times \text{Total individuals}}.</math>
 +
 +
:  Example: In a population of 100 people with genotypes: 40 AA, 50 Aa, 10 aa. p (A) = (2*40 + 50) / 200 = 0.65; q (a) = (2*10 + 50) / 200 = 0.35.
 +
* '''Genotype Frequency''': Proportion of individuals with a specific genotype (e.g., P(AA) = number of AA / total individuals).
 +
* '''Applications''': Used in GWAS to compare frequencies between cases and controls; deviations can indicate associations.
 +
 +
==== Hardy-Weinberg Equilibrium (HWE) ====
 +
HWE is a null model assuming no evolutionary forces, predicting genotype frequencies from allele frequencies in a stable population. Assumptions: Large population, random mating, no mutation, no migration, no selection. For a biallelic locus with alleles A (p) and a (q=1-p):
 +
 +
: <math>P(AA) = p^2\ (homozygous\ dominant)</math>
 +
 +
: <math>P(Aa) = 2pq\ (heterozygous)</math>
 +
 +
: <math>P(aa) = q^2\ (homozygous\ recessive)</math>
 +
 +
* '''Deviations from HWE''': Indicate forces like inbreeding (excess homozygotes), assortative mating, or genotyping errors. In epidemiology, HWE testing filters SNPs in controls to ensure data quality.
 +
 +
* '''Testing for HWE''': Use Chi-squared goodness-of-fit test.
 +
 +
: Formula:
 +
<math>\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i},</math>
 +
df=1 for biallelic loci.
 +
 +
: Critical value: >3.84 for p<0.05 (reject HWE).
 +
 +
: Example:
 +
:: Observed: 400 AA, 500 Aa, 100 aa (total 1000). p=0.65, q=0.35.
 +
:: Expected: 422.5 AA, 455 Aa, 122.5 aa.
 +
:: χ² ≈ 10.5 (p<0.05, deviation).
 +
 +
'''R Implementation for HWE:'''
 +
This code uses base R for manual calculation but adds input validation, exact test option (for small samples), and clearer output. For advanced use, consider the ''HardyWeinberg'' package.
 +
 +
<pre>
 +
# Hardy-Weinberg Equilibrium Test
 +
# Inputs: Observed genotype counts as a named vector (AA, Aa, aa)
 +
obs <- c(AA = 400, Aa = 500, aa = 100)  # Example observed counts
 +
 +
# Input validation
 +
if (any(obs < 0) || length(obs) != 3) stop("Invalid genotype counts")
  
Deletion: 46, XY, del(6) (p16.3)    Terminal deletion with breakpoint at 6p16.3
+
total <- sum(obs)
Duplication: 46, XX, dup(1) (q22q25)  Duplication of chromosome 1 region q22 to q25
+
if (total == 0) stop("No individuals in population")
Insertion: 46, XY, ins(2;5) (p13;q21q31)  An insertion of chromosome 5q21-31 into chromosome 2p13
 
Translocation: 46, XX, t(2;6) (q35;p21.3) A balanced reciprocal translocation with breakpoints in 2q35 and 6p21.3
 
Inversion: 46, XY, inv(11) (p11p15)  An inversion on chromosome 11 with breakpoints at p11 and p15
 
*Mutations: caused by changes in the DNA sequence; there are many different types of mutations; can happen in somatic cells or during development of gametes.
 
Types of mutations: (1) Nucleotide substitutions, involve an alteration in the sequence but not the number of nucleotides (DNA bases) in a gene; (2) Insertions & Deletions, involve an alteration in the number of nucleotides in a gene; (3) trinucleotide repeats, involves an alteration in the number of times that a certain sequence of three bases repeats itself; (4) Splice Site Variation, involve an alteration in the non-coding region of a gene, which affects the way that parts of the gene sequence are combined to make RNA.
 
Results of mutations: mutations in exons may result in – misspelling of protein (missense), truncation of the protein (nonsense), no effect; mutations in introns may result in – no effect, altered regulations of gene expression, splice site variation.
 
  
*'''Genes in Population:'''
+
# Calculate allele frequencies
*Gene pool: all available genetic variation in a population; all potential mating combinations.
+
p <- (2 * obs["AA"] + obs["Aa"]) / (2 * total)  # Frequency of A
*Basic concepts:
+
q <- 1 - p  # Frequency of a
**Alleles: the type of genetic variation seen at a particular location on a chromosome. (1) fictional form: big “A” allele, little “a” allele; (2) base pair form: T allele at basepair 71349562 on chromosome 2, C allele at basepair 71349562 on chromosome 2; (3) codon form: Arginine at codon 124, Glycine at codon 124.
+
 
**Genotypes: we inherit one allele from our mother and one allele from our father to form our genotype. Variation in a single gene like AA, Aa or aa.
+
# Expected counts under HWE
**Haplotypes: it is the combination of alleles that an individual has at multiple sites along a chromosome.
+
expected <- c(AA = p^2 * total, Aa = 2 * p * q * total, aa = q^2 * total)
**Allele frequencies: the prevalence of a particular allele in a given population. Allele frequency = $\frac {Number\, of\, alleles\,}{2*(number\, of\, people\,)}$.
+
 
*Genotype frequencies: prevalence of a particular genotype in a given population.  
+
# Chi-squared test (ensure expected >5 for validity)
**Haplotype frequencies: frequency that a haplotype occurs in different ethnic groups.
+
if (any(expected < 5)) warning("Expected counts <5; consider exact test")
**Hardy-Weinberg disequilibrium: when genotype frequencies in a population differ from what would be predicted based on allele frequencies.
+
 
**:Hardy-Weinberg Equilibrium (HWE): in a stable population with random mating, allelic frequencies typically predict genotype frequencies using the law of independent segregation. When allele frequencies can accurately be used to predict genotype frequencies in a population, the population is considered to be in a HWE.
+
chi2 <- sum((obs - expected)^2 / expected)
**:HWE: suppose a SNP that can only be A or C $(p_{A}+p_{C}=1)$, and probability of having A allele is $p_{A}$, C allele is $p_{C}$, then under HWE, the probability of AA genotype = $p_{A}^{2}$, probability of having CC = $p_{C}^{2}$, the probability of having AC = $2p_{A}p_{C}$.
+
df <- 1  # Degrees of freedom for biallelic
**:Steps to test HWE: (1) estimate allele frequencies; (2) calculate the expected relative genotype frequencies under HWE; (3) calculate the expected number of people with each genotype; (4) calculate the difference between observed and expected number of people with each genotype using $χ^2$ formula; (5) sum up the $χ^2$ components and compare the sum to statistical tables to see if there is significant deviation.
+
p_val_chi <- pchisq(chi2, df = df, lower.tail = FALSE)
 +
 
 +
# Optional: Fisher's exact test for small samples (using contingency table simulation)
 +
# But for simplicity, we use chi-squared here
 +
 
 +
# Output results
 +
cat("Allele Frequencies: p(A) =", round(p, 3), "q(a) =", round(q, 3), "\n")
 +
cat("Expected Counts: AA =", round(expected["AA"], 1), "Aa =", round(expected["Aa"], 1), "aa =", round(expected["aa"], 1), "\n")
 +
cat("Chi-squared =", round(chi2, 4), "df =", df, "p-value =", format.pval(p_val_chi), "\n")
 +
if (p_val_chi < 0.05) {
 +
  cat("Reject HWE: Possible deviation due to selection, inbreeding, or error.\n")
 +
} else {
 +
  cat("Fail to reject HWE: Population appears in equilibrium.\n")
 +
}
 +
</pre>
 +
 
 +
=== Pedigree Analysis and Inheritance ===
 +
Pedigrees are diagrammatic representations of family histories that illustrate the inheritance patterns of traits or diseases across generations. They are essential tools in genetic epidemiology for identifying modes of inheritance, estimating risks, and guiding genetic counseling. Symbols in pedigrees typically include squares for males, circles for females, filled shapes for affected individuals, and slashes for deceased persons. Arrows may indicate the proband (the individual through whom the family is ascertained). Pedigrees help visualize vertical (parent-to-child) or horizontal (sibling) transmission patterns, skips in generations, and sex-specific effects.
  
 
<center>
 
<center>
{|class="wikitable" style="text-align:center;width:50%" Border="1"
+
[[Image:MSHS IntroEpi Fig 5 .png|600px|thumb|Figure 5: Example pedigree illustrating inheritance patterns, with symbols for affected individuals, carriers, and relationships across generations.]]
|-
 
| || Observed|| Expected ||  $X^{2}$ component
 
|-
 
|AA || $N_{AA}$ || $p^{2}(N_{Total})$ || $\frac{(O_{AA}- E_{AA})^{2}}{E_{AA}}$
 
|-
 
|Aa || $N_{AS}$ || $2pq^{2}(N_{Total})$ || $\frac{(O_{Aa}- E_{Aa})^{2}}{E_{Aa}}$
 
|-
 
|aa ||$N_{aa}$ || $q^{2}(N_{Total})$ || $\frac{(O_{aa}- E_{aa})^{2}}{E_{aa}}$
 
|-
 
| || $N_{Total}$ || $N_{Total}$ || Overall $X^{2}$ Statistics
 
|-
 
|}
 
 
</center>
 
</center>
  
$N_{AA}$ is the actual number of people in the population with AA genotype, $N_{Total}$ is the number of people in the population, $p^{2}(N_{Total})$ calculated expected number of people with AA genotype.
+
==== Modes of Inheritance ====
For $χ^{2}$, the null hypothesis $H_{0}$: the population is in HWE. For two alleles, if the overall $χ^{2}$ is less than or equal to 3.84, then p-value is greater than 0.05 and don’t reject $H_{0}$, population is in HWE; if $χ^{2}$ more than 3.84, p-value is less than 0.05, reject $H_{0}$, population is in HWD (disequilibrium).
+
Genetic traits and diseases follow specific inheritance patterns based on the chromosomal location and dominance of alleles. Understanding these modes aids in predicting disease risks and designing targeted epidemiological studies. Below, we detail common modes, including characteristics, pedigree patterns, examples, and epidemiological implications.
 +
 
 +
* '''Autosomal Dominant (AD)''':
 +
  * '''Definition''': A single copy of the dominant allele (D) is sufficient to express the phenotype. Affected individuals are heterozygous (Dd) or homozygous (DD), while dd are unaffected.
 +
  * '''Pedigree Characteristics''': Vertical transmission (affected individuals in every generation); no skipped generations if fully penetrant. Affects males and females equally; 50% risk to offspring of an affected parent.
 +
  * '''Example''': Huntington's disease (CAG repeat expansion in HTT gene on chromosome 4). Late-onset, neurodegenerative disorder with high penetrance.
 +
  * '''Epidemiological Notes''': Often rare; founder effects in isolated populations increase prevalence. Used in linkage studies to map genes.
 +
  * '''Limitations''': Incomplete penetrance or variable expressivity (different symptom severity) can mimic other patterns.
 +
 
 +
* '''Autosomal Recessive (AR)''':
 +
  * '''Definition''': Requires two copies of the recessive allele (dd) for expression. Heterozygotes (Dd) are asymptomatic carriers.
 +
  * '''Pedigree Characteristics''': Horizontal transmission (affected siblings from unaffected parents); skips generations. Consanguinity (e.g., cousin marriages) increases risk; equal male-female incidence.
 +
  * '''Example''': Cystic fibrosis (mutations in CFTR gene on chromosome 7). Affects lung and digestive functions; carrier frequency ~1/25 in Caucasians.
 +
  * '''Epidemiological Notes''': Higher prevalence in populations with high carrier rates (e.g., due to heterozygote advantage, like sickle cell anemia in malaria-endemic areas). Screening programs target carriers.
 +
  * '''Limitations''': Phenocopies (environmental mimics) can complicate diagnosis.
 +
 
 +
* '''X-Linked Recessive (XR)''':
 +
  * '''Definition''': Mutation on the X chromosome; males (X^c Y) express with one copy due to hemizygosity, while females (X^C X^c) are carriers (often unaffected due to X-inactivation).
 +
  * '''Pedigree Characteristics''': No male-to-male transmission; affected males pass to all daughters (carriers) but no sons. Mother-to-son transmission; more males affected.
 +
  * '''Example''': Hemophilia A (mutations in F8 gene on Xq28). Bleeding disorder; historical prevalence in European royalty.
 +
  * '''Epidemiological Notes''': Skewed sex ratios in affected individuals; carrier testing in families. Lyonization (random X-inactivation) can cause mild symptoms in females.
 +
  * '''Limitations''': De novo mutations can appear without family history.
 +
 
 +
* '''Additional Modes''':
 +
  * '''X-Linked Dominant (XD)''': Rare; affects females more (e.g., Rett syndrome, MECP2 gene). Lethal in males or mild; female-to-offspring transmission.
 +
  * '''Y-Linked''': Male-only transmission (e.g., azoospermia factors on Y chromosome). Rare, as Y has few genes.
 +
  * '''Mitochondrial''': Maternal inheritance (mtDNA from mother); affects both sexes but no father-to-child. Variable due to heteroplasmy (mixed mtDNA). Example: Leber's hereditary optic neuropathy.
 +
 
 +
==== Probability in Pedigrees ====
 +
Probabilistic models quantify inheritance risks, incorporating prior probabilities, transmission, and penetrance. This is crucial for [[AP_Statistics_Curriculum_2007_Prob_Rules | Bayesian]] risk assessment in genetic counseling.
 +
 
 +
* '''Key Formula''': The likelihood of observing a pedigree under a genetic model is:
 +
  <math>P(\text{pedigree}) = \prod_{i=1}^{n} P(\text{genotype}_i) \times P(\text{phenotype}_i | \text{genotype}_i),</math>
 +
 
 +
:where the product is over all individuals, <math>P(\text{genotype}_i)</math> is based on Mendelian laws and parental genotypes, and <math>P(\text{phenotype}_i | \text{genotype}_i)</math> accounts for penetrance.
 +
 
 +
* '''Penetrance''': Probability of phenotypic expression given a genotype (e.g., 100% for complete, <100% for incomplete). Incomplete penetrance (e.g., BRCA1 mutations in breast cancer) leads to unaffected gene carriers.
 +
 
 +
* '''Variable Expressivity''': Variation in phenotype severity among those with the same genotype (e.g., neurofibromatosis type 1).
 +
 
 +
* '''Example''': In an AD pedigree, risk to a child of an affected parent (Dd) is 50% (Dd) × penetrance.
 +
 
 +
* '''Applications''': Risk prediction software like BRCAPRO uses these models. In epidemiology, adjusts for ascertainment bias (families selected via affected probands).
 +
 
 +
* '''Limitations''': Assumes accurate family history; ignores modifiers like environment.
 +
 
 +
=== Linkage Analysis ===
 +
Linkage analysis identifies genes by studying co-inheritance of markers and traits in families, leveraging the fact that nearby loci on a chromosome are less likely to separate during meiosis.
 +
 
 +
* '''Genetic Linkage''': Occurs when genes are close on the same chromosome, violating independent assortment.
 +
* '''Recombination Fraction (θ)''': Proportion of gametes with recombination between two loci.
 +
: <math>\theta = 0.5</math>: No linkage (independent, >50 cM apart).
 +
: <math>\theta < 0.5</math>: Linkage (e.g., θ=0.1 means 10% recombination).
 +
: <math>\theta = 0</math>: Complete linkage (no recombination, syntenic loci).
 +
: Units: Measured in centimorgans (cM); 1 cM ≈ 1% recombination ≈ 1 Mb DNA.
 +
 
 +
* '''Phases''': Coupling (alleles on same chromosome) vs. repulsion (opposite).
 +
 
 +
* '''Applications''': Parametric (assumes model) for Mendelian diseases; non-parametric for complex traits. Historical in positional cloning (e.g., cystic fibrosis gene).
 +
 
 +
==== LOD Score ====
 +
The LOD (Logarithm of Odds) score statistically tests for linkage by comparing likelihoods.
 +
* '''Formula''': <math>Z(\theta) = \log_{10} \frac{L(\text{data} | \theta=\hat{\theta})}{L(\text{data} | \theta=0.5)}</math>, where \(\hat{\theta}\) is the maximum likelihood estimate of θ, and L is the likelihood function.
 +
* '''Calculation''': Involves summing over possible phases and recombinants in pedigrees.
 +
* '''Interpretation''': Positive Z favors linkage; threshold Z≥3 for significance (false positive rate 5%); Z≤-2 excludes linkage.
 +
  {|class="wikitable" style="text-align:center;width:50%" Border="1"
 +
  |-
 +
  ! LOD Score !! Odds Ratio !! Interpretation
 +
  |-
 +
  | Z ≤ -2 || ≤1:100 || Strong evidence against linkage (exclusion)
 +
  |-
 +
  | -2 < Z < 3 || Indeterminate || Suggestive but not conclusive
 +
  |-
 +
  | Z ≥ 3 || ≥1000:1 || Strong evidence for linkage (genome-wide significance)
 +
  |}
 +
* '''Example''': In a family with a disease and marker, if data is 1000 times more likely under θ=0.1 than 0.5, Z=3.
 +
* '''Limitations''': Requires large pedigrees; sensitive to model misspecification (e.g., wrong penetrance). Superseded by GWAS for complex diseases.
 +
 
 +
=== Linkage Disequilibrium (LD) and Association ===
 +
LD refers to non-random association of alleles at different loci in a population, often due to shared ancestry rather than physical linkage. It decays over generations via recombination and is key in association studies like GWAS, where markers tag causal variants.
 +
 
 +
* '''Causes of LD''': Mutation, selection, genetic drift, population admixture, or bottlenecks. High LD in isolated populations (e.g., Ashkenazi Jews).
 +
* '''Decay''': LD decreases with genetic distance and time; measured in haplotype blocks.
 +
* '''Vs. Linkage''': Linkage is family-based (meiotic); LD is population-based (historical).
 +
* '''Applications''': Imputation in GWAS; fine-mapping; inferring evolutionary history.
 +
 
 +
==== Measures of LD ====
 +
Common metrics quantify deviation from independence. Assume two biallelic loci: A/a (frequencies p_A, 1-p_A) and B/b (p_B, 1-p_B); haplotype frequency p_AB.
 +
 
 +
* '''D (Disequilibrium Coefficient)''':
 +
 
 +
: Formula: <math>D_{AB} = p_{AB} - p_A p_B.</math>
 +
 
 +
: Interpretation: D>0: Excess AB haplotypes; D<0: Deficit. D=0: Equilibrium.
  
*'''Pedigree Analysis and Probability in Genetics'''
+
: Limitations: Sensitive to allele frequencies; not comparable across loci.
**Mendel’s Law of Segregation: organism carry two copies of each genetic factor; there is segregation of parental factors during gamete formation; each gamete receives one genetic factor from each parent.
+
 
<center>
+
* '''D' (Normalized D)''':
[[Image:MSHS IntroEpi Fig 4 .png|400px]]
+
 
</center>
+
: Formula: if <math>D<0</math>, <math>D' = \frac{D}{\max(-p_A p_B, -(1-p_A)(1-p_B))}</math>
 +
::: if  <math>D>0</math>, <math>D' = \frac{D}{\min((1-p_A)p_B, p_A(1-p_B))}</math>.
 +
 
 +
: Interpretation: Ranges -1 to 1; |D'|=1: Complete LD (no recombination evidence); |D'|<1: Partial.
 +
 
 +
: Use: Detects historical recombination.
 +
 
 +
* '''r² (Correlation Coefficient)''':
 +
 
 +
: Formula: <math>r^2 = \frac{D^2}{p_A (1-p_A) p_B (1-p_B)}</math>.
 +
 
 +
: Interpretation: 0 to 1; r²=1: Perfect correlation (one SNP predicts another); r²>0.8: Strong proxy. Relates to power in association studies (effective sample size reduced by 1/r²).
 +
 
 +
: Use: Preferred in GWAS for tag SNP selection; less biased by rare alleles.
 +
 
 +
==== Additional Considerations ====
 +
* '''Haplotypes''': Combinations of alleles (e.g., estimated via EM algorithm in PHASE software).
 +
* '''Visualization''': LD plots (triangular heatmaps) using tools like Haploview.
 +
* '''Best Practices''': Account for population structure (e.g., via principal components) to avoid spurious LD. In epidemiology, LD informs polygenic risk scores.
 +
* '''Limitations''': LD varies by ancestry (e.g., higher in Europeans than Africans); tagging fails for rare variants.
 +
 
 +
'''R Implementation for LD:'''
 +
This code computes D, D', and r² from example haplotype frequencies or counts. It uses base R for simplicity; for real data, consider packages like ''genetics'' or ''snpStats''.
 +
 
 +
<pre>
 +
# Linkage Disequilibrium Measures
 +
# Input: Haplotype counts (e.g., from population data)
 +
# Assume biallelic loci A/a, B/b
 +
count_AB <- 120  # AB haplotypes
 +
count_Ab <- 80  # Ab
 +
count_aB <- 60  # aB
 +
count_ab <- 140  # ab
 +
total <- sum(c(count_AB, count_Ab, count_aB, count_ab))
 +
 
 +
# Haplotype frequencies
 +
p_AB <- count_AB / total
 +
p_Ab <- count_Ab / total
 +
p_aB <- count_aB / total
 +
p_ab <- count_ab / total
  
 +
# Allele frequencies
 +
p_A <- p_AB + p_Ab
 +
p_a <- 1 - p_A
 +
p_B <- p_AB + p_aB
 +
p_b <- 1 - p_B
  
*'''Human Pedigree Nomenclature:'''
+
# D
<center>
+
D <- p_AB - p_A * p_B
[[Image:MSHS IntroEpi Fig 5 .png|400px]]
 
</center>
 
  
 +
# D' (normalized)
 +
D_max_pos <- min(p_A * p_b, p_a * p_B)
 +
D_max_neg <- -min(p_A * p_B, p_a * p_b)
 +
D_prime <- ifelse(D >= 0, D / D_max_pos, D / D_max_neg)
  
'''Modes of Inheritance:'''
+
# r-squared
*Autosomal dominant: individuals that inherit the dominant disease allele, D will develop the disease. Homozygous (DD: affected), Heterozygous (Dd: affected), Homozygous (dd: normal).
+
r_sq <- D^2 / (p_A * p_a * p_B * p_b)
'''Traits:''' every affected individuals has an affected parent; there is 50% chance that each affect parent will transmit the trait to any child; the trait is expressed in both males and females is roughly equal numbers; two affected individuals may have unaffected children; the phenotype is often more severe in homozygous affected individuals.
 
*Autosomal recessive: individuals that inherit two copies of the recessive disease allele, d, will develop the disease. Homozygous (DD: normal), Heterozygous (Dd: normal), Homozygous (dd: affected).
 
'''Traits:''' for rare traits, most affected individuals are the children of unaffected parents; all of the children of two affected individuals are affected; the risk of an affected child from a mating a two heterozygotes is 25%; the trait is expressed in both males and females; for rare traits, the unaffected parents of an affected individual may be related to each other.
 
*Sex-linked dominant (X-lined dominant): $X^{C} X^{C}$ affected female, $X^{C}$ Y affected male, $X^{C} X^{c}$ affected female, $X^{c}$ Y normal male, $X^{c} X^{c}$ normal female.
 
*Sex-linked recessive (X-linked recessive): female that carry a recessive mutation $X^{C}$ will have affected male children. $X^{C} X^{C}$ normal female, $X^{C}$ Y normal male, $X^{C} X^{c}$ carrier female, $X^{c}$ Y affected male, $X^{c}X^{c}$ affected female.
 
'''Traits:''' there is no male to male transmission; there is mother to son transmission; female can be homozygous and have the trait. Examples may be color blindness.
 
  
*Probability in Pedigree Analysis: using Mendel’s Laws, we can estimate the probabilities of an offspring’s genotype if we know (or assume) a mode of inheritance; under Hardy-Weinberg, we can estimate genotype probabilities for parents.
+
# Output
**Steps: (1) choose a mode of inheritance; (2) establish the penetrance of each genotype under that mode of inheritance; (3) determine the potential genotypes of each person under that mode of inheritance; (4) determine the founder genotype probabilities (parent generation); (5) determine the transmission probabilities (offspring generation); (6) calculate the probabilities of each pedigree member given their phenotype, their genotype, and the penetrance of the disease ''P(member)=P(phenotype and genotype)=P(genotype)*P(phenotype|genotype);'' (7) calculate the total probability of the pedigree ''P(pedigree)''$=∏_{i=1}^{n}$''P(genotype)*P(phenotype|genotype)'', n is number of people in the pedigree.
+
cat("Allele Frequencies: p_A =", round(p_A, 3), "p_B =", round(p_B, 3), "\n")
 +
cat("D =", round(D, 4), "\n")
 +
cat("D' =", round(D_prime, 4), "\n")
 +
cat("r-squared =", round(r_sq, 4), "\n")
 +
if (abs(D_prime) == 1) cat("Complete LD detected.\n") else cat("Partial or no LD.\n")
 +
</pre>
  
In step 2, there can be complete penetrance or incomplete penetrance. With complete penetrance, individuals’ phenotype will always match their genotype.
 
If a genotype has incomplete penetrance, some individuals with the ‘affected’ genotype will not exhibit the ‘affected’ phenotype. This happens often when the development of the phenotype is controlled by more than one gene or is modified by environmental factors.
 
  
*Founder effect: in small populations, rare recessive alleles present in a member of the original group of settlers is transmitted through successive generations; population expands and remains geographically and culturally isolated. After ~10 generations, children with recessive disease begin to appear, inbreeding is not usually a significant feature of the population.
+
==== Genome-Wide Association Studies (GWAS) ====
Consanguineous mating: inbreeding (consanguinity) = mating between genetically related individuals; degree of inbreeding based on the probability that an individual will inherit two alleles that came from a common ancestor; homozygousity due to inheritance of alleles that are “Identical by Descent” (IBD).
 
  
 +
GWAS tests for correlation between genetic markers (SNPs) and phenotypes across the entire genome in unrelated individuals.
  
'''Linkage Analysis'''
+
* Statistical Model: Typically uses logistic regression for case-control studies.
*Linkage concepts: genetic linkage refers to the study of the order of genes on chromosomes; distance between genes (aka genetic distance).
+
: <math>\ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 \cdot \text{SNP} + \text{Covariates}.</math>
*Recombination fraction: a measure of distance between genes; alleles that are physically very close to one another on a chromosome tend not be separated by recombination as often as alleles that are physically far from one another. The symbol $\theta$ is used to show the probability that the alleles of two genes will recombine during gamete formation, it equals to the proportion of gametes that are recombinant = probability of recombination = recombination fraction.
 
*When two loci are inherited independently of each other, recombinants and non-recombinants are found in equal proportions in the offspring: $\theta$=0.5.
 
*When two loci are inherited together because of chromosome location, there are more non-recombinants than recombinants in the offspring: $0\le\theta\le0.5$.
 
*$\theta=0.5$ implies no linkage; $\theta=0$ implies complete linkage; $0<\theta<0.5$ implies linkage.
 
  
'''Parametric Linkage Analysis:''' requires to (1) collect pedigree data with many meiotic events (need multiple generations or many children); (2) make assumptions about how the disease is inherited (single locus vs. multilocus; dominant vs. recessive; penetrance; allele frequencies of disease susceptibility locus); (3) can be done with phase known (know how the alleles are distributed on parental chromosomes) or phase unknown (know which alleles come from which parent, but not how they are distributed on the chromosomes) pedigrees.
+
* Manhattan & QQ Plots: Used to visualize results. Because millions of tests are performed, strict significance thresholds (e.g., <math>p < 5 \times 10^{-8}</math>) are required to avoid false positives.
*For phase known pedigree: recombination fraction $\theta$=$\frac{\#\,recombinants}{\#\,informative\,meiosis'}$, where informative meiosis = parental gamete formation that provides information about recombination between two loci.
 
*For phase unknown pedigree: parent haplotypes are not known, $\theta$ cannot be estimated directly because there is no way to tell whether the offspring are recombinant or non-recombinant.
 
*Estimation of $\theta$ by the Maximum Likelihood Method: estimate $\theta$ for any pedigree by using MLE. This equation takes the data in the pedigree and asks the question, what $\theta$ would result in the largest L($\theta$) given the number of recombinant and non-recombinant gametes? L($\theta)=c(1-\theta)^{n-k}\theta^{k}$, where c is a constant, n is the number of informative meioses, k is the number of recombinant meioses, n-k is the number of non-recombinant
 
  
*''lnL''($\theta$)=''lnc+(n-k'') ln(1-$\theta$)+''kln$\theta$'', $\frac{∂lnL\theta}{∂\theta}=\frac {-n-k} {1-\theta}+\frac{k}{\theta}$=0,=$\hat {\theta}$ = $\frac{k}{n}$.
+
=== Gene-Environment Interactions ===
*Logarithm (base 10) Of Odds (LOD) score: The LOD score compares the likelihood of observing the test data if the two loci are indeed linked, to the likelihood of observing the same data purely by chance. LOD>0 indicate the presence of linkage, LOD<0 indicate that linkage is less likely.
+
Disease risk is often modeled as a combination of genetics (<math>G</math>), environment (<math>E</math>), and their interaction (<math>G \times E</math>).
 +
: <math>Y = \beta_0 + \beta_1 G + \beta_2 E + \beta_3 (G \times E) + \epsilon.</math>
  
In linkage analysis, the two hypotheses are  $H_{0}$: no linkage, $\theta$=0.5; $H_{1}$linkage, $\theta=\hat{\theta}$.LOD score Z($\theta$)=$log_{10}\frac{L(\theta=\hat{\theta})}{L(\theta=\frac{1}{2})}$,where $L(\theta=\hat{\theta}$ is the likelihood equation when $\theta=\frac{1}{2}$.
+
Interaction Models:
 +
* Synergistic: Genotype exacerbates the risk factor (or vice versa).
 +
* Independent: Both factors influence risk but do not interact.
  
 
<center>
 
<center>
{|class="wikitable" style="text-align:center;width:50%" Border="1"
+
[[Image:SMHS Epi Figure 11.png| 500 px]]
 +
<br>''Model: Genotype exacerbates the effect of the risk factor''
 +
</center>
 +
 
 +
=== Core Epidemiological Measures ===
 +
 
 +
In addition to genetic metrics, standard epidemiological measures provide essential tools for assessing disease risk, evaluating interventions, and guiding public health decisions. These metrics help quantify associations between exposures and outcomes, estimate treatment effects, and inform policy. Below, we outline key measures, including definitions, formulas, interpretations, and examples. Where relevant, we include R code implementations for practical computation.
 +
 
 +
==== Absolute Risk Reduction (ARR) ====
 +
* '''Definition''': The difference in event rates (incidences) between a control (or unexposed) group and a treatment (or exposed) group. ARR measures the absolute effect of an intervention or exposure on outcome risk.
 +
* '''Formula''': <math>ARR = I_{\text{control}} - I_{\text{treatment}}</math>, where \(I\) represents incidence (proportion of events).
 +
* '''Interpretation''': A positive ARR indicates risk reduction (benefit); a negative ARR indicates increased risk (harm). It is straightforward but does not account for baseline risk.
 +
* '''Example''': If the incidence of heart attacks is 10% in the control group and 7% in the treatment group, ARR = 0.10 - 0.07 = 0.03 (3% absolute reduction).
 +
* '''When to Use''': Prospective studies like randomized controlled trials (RCTs); useful for communicating tangible benefits to patients.
 +
* '''Limitations''': Sensitive to baseline risk; not ideal for comparing across populations with different event rates.
 +
 
 +
==== Relative Risk (RR) ====
 +
* '''Definition''': The ratio of the incidence of an outcome in the exposed group to that in the unexposed group. RR assesses how much an exposure increases or decreases the probability of an event.
 +
* '''Formula''': <math>RR = \frac{I_{\text{exposed}}}{I_{\text{unexposed}}}</math>.
 +
* '''Interpretation''': RR > 1 indicates increased risk due to exposure; RR < 1 indicates protective effect; RR = 1 indicates no association. It is multiplicative and accounts for baseline risk.
 +
* '''Example''': In a cohort study, smokers have a 20% lung cancer incidence, while non-smokers have 2%. RR = 0.20 / 0.02 = 10 (smokers are 10 times more likely to develop lung cancer).
 +
* '''When to Use''': Cohort studies or RCTs; preferred for common outcomes.
 +
* '''Limitations''': Can overestimate associations for rare events; not applicable in case-control studies.
 +
 
 +
==== Odds Ratio (OR) ====
 +
 
 +
A 2×2 contingency table with odds ratio formula.
 +
 
 +
{| class="wikitable" style="text-align:center; width:60%;"
 +
|+ '''2 × 2 Contingency Table for Odds Ratio'''
 +
!
 +
! Exposed
 +
! Unexposed
 +
! Total
 
|-
 
|-
|   $\theta$    || LOD score
+
! Cases (Diseased)
 +
| '''a'''
 +
| '''c'''
 +
| a + c
 
|-
 
|-
| $\theta>0$  || $Z (\theta) = nlog(2) + k * log(\theta) + (N - k)log (1-\theta)$
+
! Controls (Healthy)
 +
| '''b'''
 +
| '''d'''
 +
| b + d
 
|-
 
|-
| $\theta = 0$and k = 0  || $Z (\theta) = nlog(2)$
+
! Total
 +
| a + b
 +
| c + d
 +
| '''N'''
 
|-
 
|-
| $\theta = 0$ and k > || $Z(\theta) =-\infty$
+
| colspan="4" style="text-align:left; background:#f9f9f9;" |
|-
+
'''Odds Ratio (OR) Calculation:'''
|}
+
$$OR = \frac{a / c}{b / d} = \frac{ad}{bc}$$
</center>
+
|}
 +
 
 +
 
 +
Formula: <math>OR = \frac{ad}{bc}</math>
 +
 
 +
: Odds Ratio = [Exposed Cases × Unexposed Non-cases] / [Exposed Non-cases × Unexposed Cases].
 +
 
 +
* '''Definition''': The ratio of the odds of an outcome in the exposed group to the odds in the unexposed group. OR approximates RR when the outcome is rare.
 +
* '''Formula''': From a 2x2 contingency table (a = exposed cases, b = exposed non-cases, c = unexposed cases, d = unexposed non-cases): <math>OR = \frac{ad}{bc}</math>.
 +
* '''Interpretation''': OR > 1 suggests positive association; OR < 1 suggests inverse association; OR = 1 suggests no association. It is often used in logistic regression.
 +
* '''Example''': In a case-control study of diabetes and obesity: 80 obese diabetics (a), 20 obese non-diabetics (b), 30 non-obese diabetics (c), 70 non-obese non-diabetics (d). OR = (80*70) / (20*30) = 9.33 (obesity increases odds of diabetes by over 9 times).
 +
* '''When to Use''': Case-control studies or when incidence data is unavailable; common in meta-analyses.
 +
* '''Limitations''': Not directly interpretable as risk for common outcomes; can differ from RR if events are frequent.
 +
 
 +
==== Number Needed to Treat (NNT) or Harm (NNH) ====
 +
* '''Definition''': The average number of patients who need to be treated (or exposed) to prevent (or cause) one additional outcome compared to the control. NNT is based on ARR and translates statistical effects into clinical relevance.
 +
* '''Formula''': <math>NNT = \frac{1}{|ARR|}</math> (use absolute value for magnitude; sign of ARR determines benefit vs. harm).
 +
* '''Interpretation''': Lower NNT indicates greater treatment efficacy. If ARR > 0, it's NNT (benefit); if ARR < 0, it's NNH (harm). Infinite NNT means no effect.
 +
* '''Example (Benefit)''': ARR = 0.03 (as above), NNT = 1 / 0.03 ≈ 33.3 (treat 33 patients to prevent one heart attack).
 +
* '''Example (Harm)''': If treatment increases incidence from 50% to 80%, ARR = 0.50 - 0.80 = -0.30, NNH = 1 / 0.30 ≈ 3.3 (treat 3 patients to cause one additional bad outcome).
 +
* '''When to Use''': RCTs or systematic reviews; helps in shared decision-making and cost-benefit analysis.
 +
* '''Limitations''': Assumes constant ARR; sensitive to time frame and baseline risk. Confidence intervals should be reported for real-world application.
 +
 
 +
==== Additional Considerations ====
 +
* '''Confidence Intervals (CI)''': Always compute 95% CIs for these measures to assess precision (e.g., using bootstrap methods or formulas in R packages like ''epitools'' or ''epiR'').
 +
* '''Attributable Risk (AR)''': Extends RR; AR = \(I_{\text{exposed}} - I_{\text{unexposed}}\) (absolute risk due to exposure).
 +
* '''Population Attributable Risk (PAR)''': PAR = \(I_{\text{population}} - I_{\text{unexposed}}\) (proportion of cases attributable to exposure in the population).
 +
* '''Best Practices''': Adjust for confounders using multivariable models; interpret in context (e.g., RR may seem large for rare events but have small absolute impact).
 +
* '''Software Tools''': R (with packages like ''epiR'', ''Epi'', or ''survival'' for advanced metrics like Hazard Ratios) or Python (with ''scipy'' or ''lifelines'') are commonly used.
 +
 
 +
'''R Implementation for Key Measures:'''
 +
This code snippet computes ARR, RR, OR, NNT/NNH from example data. It includes error handling and supports both benefit and harm scenarios.
 +
 
 +
<pre>
 +
# Install if needed: install.packages("epiR")  # But assuming it's available or use base R
 +
 
 +
# Example data: 2x2 table for OR/RR (cohort study assumption)
 +
# Rows: Exposed (1) vs Unexposed (0); Columns: Cases vs Non-cases
 +
a <- 20  # Exposed cases
 +
b <- 80  # Exposed non-cases
 +
c <- 2  # Unexposed cases
 +
d <- 98  # Unexposed non-cases
 +
 
 +
# Incidences
 +
I_exposed <- a / (a + b)
 +
I_unexposed <- c / (c + d)
 +
 
 +
# Absolute Risk Reduction (assuming unexposed = control, exposed = treatment)
 +
ARR <- I_unexposed - I_exposed  # Positive if treatment reduces risk
 +
 
 +
# Relative Risk
 +
RR <- I_exposed / I_unexposed
 +
 
 +
# Odds Ratio
 +
OR <- (a * d) / (b * c)
 +
 
 +
# Number Needed to Treat/Harm
 +
NNT <- ifelse(ARR != 0, 1 / abs(ARR), Inf)
 +
type <- ifelse(ARR > 0, "NNT (Benefit)", ifelse(ARR < 0, "NNH (Harm)", "No Effect"))
 +
 
 +
# Output
 +
cat("Incidence Exposed:", round(I_exposed, 3), "\n")
 +
cat("Incidence Unexposed:", round(I_unexposed, 3), "\n")
 +
cat("ARR:", round(ARR, 3), "\n")
 +
cat("RR:", round(RR, 3), "\n")
 +
cat("OR:", round(OR, 3), "\n")
 +
cat(type, ":", ifelse(is.finite(NNT), round(NNT, 1), "Infinite"), "\n")
 +
 
 +
# Harm example (swap for treatment increasing risk)
 +
I_control <- 0.50
 +
I_treatment <- 0.80
 +
ARR_harm <- I_control - I_treatment
 +
NNT_harm <- 1 / abs(ARR_harm)
 +
cat("\nHarm Example - ARR:", round(ARR_harm, 3), "\n")
 +
cat("NNH:", round(NNT_harm, 1), "\n")
 +
</pre>
 +
 
 +
 
 +
 
 +
The following code uses a custom function to handle both ''benefit'' (reduction in risk) and ''harm'' (increase in risk) scenarios dynamically.
 +
 
 +
<pre>
 +
# Function to calculate Epidemiological Metrics
 +
calculate_epi_stats <- function(a, b, c, d) {
 +
# Input validation
 +
if (any(c(a, b, c, d) < 0)) stop("Cell counts must be non-negative")
 +
 
 +
# Basic Incidences
 +
i_exp <- a / (a + b)
 +
i_unexp <- c / (c + d)
 +
 
 +
# Absolute Risk Reduction (ARR)
 +
# Defined as Control Risk - Treated Risk
 +
arr <- i_unexp - i_exp
 +
 
 +
# Relative Risk (RR) and Odds Ratio (OR)
 +
# Adding small constant (0.5) if zero to avoid division by zero errors
 +
rr <- i_exp / i_unexp
 +
or <- (a * d) / (b * c)
 +
 
 +
# NNT / NNH Logic
 +
nnt_val <- if (arr != 0) 1 / abs(arr) else Inf
 +
nnt_label <- ifelse(arr > 0, "NNT (Benefit)",
 +
ifelse(arr < 0, "NNH (Harm)", "No Effect"))
 +
 
 +
# Return formatted list
 +
return(list(
 +
Incidence_Exposed = round(i_exp, 4),
 +
Incidence_Unexposed = round(i_unexp, 4),
 +
ARR = round(arr, 4),
 +
RR = round(rr, 2),
 +
OR = round(or, 2),
 +
Measure = nnt_label,
 +
Value = round(nnt_val, 1)
 +
))
 +
}
 +
 
 +
# --- Example 1: Beneficial Treatment ---
 +
# (e.g., Vaccine trial: Exposed = Vaccinated)
  
To test one hypothesis on multiple pedigrees, add the LOD scores of each individual pedigree to determine a final LOD score:
+
result_benefit <- calculate_epi_stats(a=20, b=80, c=40, d=60)
 +
print(result_benefit)
  
$Z(\theta)=\sum Z_{i}(\theta)$ for i=1,,''n'' pedigrees
+
# --- Example 2: Harmful Exposure ---
 +
# (e.g., Smoking vs Lung Cancer)
 +
result_harm <- calculate_epi_stats(a=50, b=50, c=10, d=90)
 +
print(result_harm)
 +
</pre>
  
  
$\theta$ is the value of $\theta$ that maximizes L($\theta$) = MLE.
+
To understand the output and interpret the results, we use the following logic.
LOD scores correspond to the following odds:
 
  
<center>
+
{| class="wikitable" style="width:100%;"
{|class="wikitable" style="text-align:center;width:50%" Border="1"
+
! Metric
 +
! Interpretation
 +
! Calculation
 
|-
 
|-
|$Z(\theta)=-2$  || 100:1 odds against linkage, significantly in favor of no linkages
+
| '''ARR'''
 +
| The actual difference in risk between groups.
 +
| <math>I_{unexposed} - I_{exposed}</math>
 
|-
 
|-
|$Z(\theta)=-1$  || 10:1 odds against linkage
+
| '''RR'''
 +
| How many times more likely the outcome is in the exposed group.
 +
| <math>I_{exposed} / I_{unexposed}</math>
 
|-
 
|-
|$Z(\theta)=0$  || Not informative
+
| '''OR'''
 +
| The ratio of the odds of the outcome occurring in the exposed group.
 +
| <math>\frac{a \times d}{b \times c}</math>
 
|-
 
|-
|$Z(\theta)=+1$  || 10:1 odds in favor of linkage
+
| '''NNT'''
|-
+
| The number of patients needed to treat to prevent '''one''' bad outcome.
|$Z(\theta)=+2$ || 100:1 odds in favor of linkage
+
| <math>1 / ARR</math>
|-
+
|}
|$Z(\theta)=+3|| 1000:1 odds in favor of linkage, significantly in favor of linkage
+
 
|}
+
 
</center>
+
''Note on NNT/NNH'': If the ARR is positive (risk decreased), we report '''Number Needed to Treat (NNT)'''. If the ARR is negative (risk increased), we report '''Number Needed to Harm (NNH)'''.
 +
 
 +
 
 +
''Confidence Intervals:'' In a real-world study, we typically report 95% CIs. Using `prop.test()` in base R to get CIs for risk differences.
 +
 
 +
''Log Transformation:'' For OR and RR, calculations for CIs are typically done on the  scale to ensure symmetry.
 +
 
 +
===Rmd/R Epidemiological simulation of a hypothetical viral epidemic===
 +
 
 +
We'll try to illustrates genomics tracking, prediction, and spatio-temporal quantification
 +
of a viral epidemic. For this, we'll build a high-fidelity epidemiological simulation in R,
 +
integrating three distinct domains: ''compartmental modeling'' (SIR/SEIR), ''phylodynamics'' (genomic tracking), and ''spatial movement''.
 +
 
 +
To keep this self-contained and realistic, we will use a ''Stochastic Agent-Based Model (ABM)'' approach. This allows us to track individual "viral sequences" as they mutate and spread across a coordinate system.
 +
 
 +
The R simulation requires the following libraries:
 +
 
 +
: ''tidyverse'': For data manipulation and plotting.
 +
: ''gganimate'': For spatio-temporal visualization.
 +
: ''adegenet'' or ''ape'': For handling ''genomic'' distances.
 +
: ''sf'': for projecting this onto real-world maps.
 +
 
 +
 
 +
==== The Core Simulation Engine====
 +
 
 +
This script generates a synthetic population, simulates transmission based on proximity, and introduces ''mutations'' (modeled as bit-strings or distance matrices).
 +
 
 +
<pre>
 +
library(tidyverse)
 +
library(gganimate)
 +
 
 +
# --- Setup Parameters ---
 +
set.seed(42)
 +
n_agents <- 200
 +
map_size <- 100
 +
initially_infected <- 3
 +
mutation_rate <- 0.05  # Probability of a genomic shift per transmission
 +
timesteps <- 50
 +
 
 +
# --- Initialize Population ---
 +
pop <- tibble(
 +
  id = 1:n_agents,
 +
  x = runif(n_agents, 0, map_size),
 +
  y = runif(n_agents, 0, map_size),
 +
  state = "S",
 +
  genome = 0,        # Base viral lineage
 +
  parent_id = NA,
 +
  infect_time = NA
 +
)
 +
 
 +
# Infect patient zeros
 +
pop$state[1:initially_infected] <- "I"
 +
pop$infect_time[1:initially_infected] <- 0
 +
pop$genome[1:initially_infected] <- sample(1:100, initially_infected)
 +
 
 +
# --- Run Simulation ---
 +
history <- list()
 +
 
 +
for (t in 1:timesteps) {
 +
  # 1. Movement (Random Walk)
 +
  pop$x <- pmax(0, pmin(map_size, pop$x + rnorm(n_agents, 0, 2)))
 +
  pop$y <- pmax(0, pmin(map_size, pop$y + rnorm(n_agents, 0, 2)))
 +
 
 +
  # 2. Transmission
 +
  infected <- which(pop$state == "I")
 +
  susceptible <- which(pop$state == "S")
 +
 
 +
  for (i in infected) {
 +
    # Find neighbors in range
 +
    dists <- sqrt((pop$x[susceptible] - pop$x[i])^2 + (pop$y[susceptible] - pop$y[i])^2)
 +
    contacts <- susceptible[dists < 5] # Transmission radius
 +
   
 +
    if (length(contacts) > 0) {
 +
      for (target in contacts) {
 +
        if (runif(1) < 0.3) { # Prob of infection
 +
          pop$state[target] <- "I"
 +
          pop$infect_time[target] <- t
 +
          pop$parent_id[target] <- pop$id[i]
 +
          # Genomic Mutation: Slight drift from parent genome
 +
          pop$genome[target] <- pop$genome[i] +  
 +
            sample(c(-1, 0, 1), 1, prob = c(mutation_rate/2, 1-mutation_rate, mutation_rate/2))
 +
        }
 +
      }
 +
    }
 +
  }
 +
 
 +
  # 3. Recovery
 +
  pop$state[pop$state == "I" & (t - pop$infect_time) > 10] <- "R"
 +
 
 +
  history[[t]] <- pop %>% mutate(time = t)
 +
}
 +
 
 +
sim_data <- bind_rows(history)
 +
</pre>
 +
 
 +
Here is the structure of the simulated dataset.
 +
 
 +
<pre>
 +
> str(sim_data )
 +
tibble [10,000 × 8] (S3: tbl_df/tbl/data.frame)
 +
  $ id        : int [1:10000] 1 2 3 4 5 6 7 8 9 10 ...
 +
$ x          : num [1:10000] 89 92.5 28 83.5 65.4 ...
 +
$ y          : num [1:10000] 86.5 50.2 85.2 47.1 16.4 ...
 +
$ state      : chr [1:10000] "I" "I" "I" "S" ...
 +
$ genome    : num [1:10000] 80 53 24 0 0 0 0 0 0 0 ...
 +
$ parent_id  : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 +
$ infect_time: num [1:10000] 0 0 0 NA NA NA NA NA NA NA ...
 +
$ time      : int [1:10000] 1 1 1 1 1 1 1 1 1 1 ...
 +
</pre>
 +
 
 +
==== Genomic Tracking & Phylogeny====
 +
 
 +
To visualize the "Genomic Tracking," we treat the ''genome'' value as a proxy for a sequence.
 +
In a real scenario, this would be a high-dimensional Hamming distance, but here we can visualize the
 +
''Clade Divergence'' over time.
 +
 
 +
* Visualizing the Spatio-Temporal Spread: Using ''gganimate'', we can see how the virus clusters geographically and how the "Genomic Signature" (color) evolves.
 +
 
 +
<pre>
 +
anim <- ggplot(sim_data, aes(x = x, y = y, color = as.factor(genome), size = state)) +
 +
  geom_point(alpha = 0.7) +
 +
  scale_size_manual(values = c("S" = 1, "I" = 3, "R" = 1.5)) +
 +
  theme_minimal() +
 +
  labs(title = "Viral Spread & Genomic Drift: Time {frame}",
 +
      color = "Genomic Lineage",
 +
      size = "Infection Status") +
 +
  transition_time(time)
 +
 
 +
# animate(anim) # Uncomment to render
 +
</pre>
 +
 
 +
====Realistic Epidemiological Quantification====
 +
 
 +
To make this "highly realistic," we should calculate the Effective Reproduction Number
 +
(<math>R_t</math>) and the Spatio-Temporal Kernel
 +
3. Realistic Epidemiological Quantification
 +
 
 +
<math>R_t​=\frac{\text{New Infections at time}\ t}{\text{Infectious Individuals at time}\ t−1}.</math>​
 +
 
 +
 
 +
* Quantifying the Clusters: We can use a K-nearest neighbor (KNN) or DBSCAN algorithm on the ''sim_data'' to identify ''hotspots''
 +
where specific genomic lineages are dominating. This mimics real-world ''Genomic Surveillance'' reports.
 +
 
 +
<pre>
 +
# Example quantification of lineage prevalence
 +
library(dplyr)
 +
library(ggplot2)
 +
 
 +
lineage_summary_plot <- sim_data %>%
 +
    filter(state == "I") %>%
 +
    # Use group_by to organize the data by time and genome
 +
    group_by(time, genome) %>%
 +
    summarise(count = n(), .groups = "drop") %>%
 +
    ggplot(aes(x = time, y = count, fill = as.factor(genome))) +
 +
    geom_area() +
 +
    labs(title = "Genomic Lineage Succession", x = "Days", y = "Active Cases")
 +
 
 +
# Display the plot
 +
lineage_summary_plot
 +
</pre>
 +
 
 +
 
 +
To extend this code into an Rmd notebook, we can include
 +
 
 +
  1. Parameter Sweeping: Wrap the simulation in a function and run it with different  values.
 +
  2. Network Analysis: Use the `parent_id` column to create a transmission tree using the ''igraph'' package.
 +
  3. Kernel Density: Add a ''geom_density_2d()'' layer to the map to show spatial intensity.
 +
 
 +
Let's expand the simulation from a simple proxy to a high-fidelity ''Phylodynamic Model''
 +
by simulating a '''founder DNA sequence''' and allow it to evolve through point mutations (substitutions) as it spreads across the population.
 +
 
 +
* DNA Sequences & Mutation: Instead of a single number, each agent now carries a character string. When transmission occurs, there is a probability that one or more bases in the sequence will mutate.
 +
 
 +
<pre>
 +
library(tidyverse)
 +
library(igraph)
 +
library(ggraph)
 +
library(stringdist) # For Hamming Distance
 +
 
 +
# --- Configuration ---
 +
seq_length <- 50
 +
bases <- c("A", "C", "G", "T")
 +
mutation_rate <- 0.01 # Probability per base per transmission
 +
 
 +
# Create a starting sequence (Patient Zero)
 +
founder_seq <- paste(sample(bases, seq_length, replace = TRUE), collapse = "")
 +
 
 +
# Helper function to mutate DNA
 +
mutate_dna <- function(sequence) {
 +
  seq_vec <- strsplit(sequence, "")[[1]]
 +
  mutation_mask <- runif(seq_length) < mutation_rate
 +
  if(any(mutation_mask)) {
 +
    seq_vec[mutation_mask] <- sample(bases, sum(mutation_mask), replace = TRUE)
 +
  }
 +
  return(paste(seq_vec, collapse = ""))
 +
}
 +
 
 +
</pre>
 +
 
 +
* Genetic Distance Matrix: To quantify the "genomic tracking" aspect, we use the ''Hamming Distance'', which counts the number of positions at which the corresponding nucleotides are different.
 +
 
 +
<pre>
 +
# Extract all unique sequences found in the simulation
 +
unique_genomes <- unique(pop$genome)
 +
 
 +
# Calculate Hamming Distance Matrix
 +
dist_matrix <- stringdistmatrix(unique_genomes, unique_genomes, method = "hamming")
 +
rownames(dist_matrix) <- unique_genomes
 +
colnames(dist_matrix) <- unique_genomes
 +
 
 +
# This matrix allows us to see how far 'Variant B' has drifted from 'Variant A'
 +
</pre>
 +
 
 +
* Integrated Spatio-Temporal & Transmission Model: We will update the simulation to record the transmission pairs to build the `igraph` tree and add the spatial density layer.
 +
 
 +
<pre>
 +
# --- Initialization Fixes ---
 +
# Ensure genome column is a character vector to hold DNA strings
 +
pop$genome <- as.character(pop$genome)
 +
edges <- data.frame(from = integer(), to = integer())
 +
 
 +
for (t in 1:timesteps) {
 +
  # 1. Movement
 +
  pop$x <- pmax(0, pmin(map_size, pop$x + rnorm(n_agents, 0, 2)))
 +
  pop$y <- pmax(0, pmin(map_size, pop$y + rnorm(n_agents, 0, 2)))
 +
 
 +
  # 2. Transmission
 +
  infected <- which(pop$state == "I")
 +
  susceptible <- which(pop$state == "S")
 +
 
 +
  for (i in infected) {
 +
    # Check distances against only the susceptible population
 +
    dists <- sqrt((pop$x[susceptible] - pop$x[i])^2 + (pop$y[susceptible] - pop$y[i])^2)
 +
    contacts <- susceptible[dists < 5]
 +
   
 +
    for (target in contacts) {
 +
      # Important: Check if target is still susceptible (in case they were infected
 +
      # by someone else earlier in this same timestep loop)
 +
      if (pop$state[target] == "S" && runif(1) < 0.3) {
 +
       
 +
        # Update Status
 +
        pop$state[target] <- "I"
 +
        pop$infect_time[target] <- t
 +
        pop$parent_id[target] <- pop$id[i]
 +
       
 +
        # 1. Expand Genomic Logic: DNA Mutation
 +
        # We replace the numeric addition with our string mutation function
 +
        new_sequence <- mutate_dna(pop$genome[i])
 +
        pop$genome[target] <- new_sequence
 +
       
 +
        # 3. Track Transmission for igraph
 +
        edges <- rbind(edges, data.frame(from = pop$id[i], to = pop$id[target]))
 +
      }
 +
    }
 +
  }
 +
 
 +
  # 3. Recovery
 +
  pop$state[pop$state == "I" & (t - pop$infect_time) > 10] <- "R"
 +
 
 +
  history[[t]] <- pop %>% mutate(time = t)
 +
}
 +
 
 +
sim_data <- bind_rows(history)
 +
 
 +
# --- Visualization: Spatial Intensity ---
 +
ggplot(sim_data %>% filter(state == "I"), aes(x = x, y = y)) +
 +
  geom_density_2d_filled(
 +
    aes(fill = after_stat(level)),
 +
    alpha = 0.6,
 +
    contour_var = "ndensity"  # optional: normalize for better contrast
 +
  ) +
 +
  geom_point(aes(color = genome), alpha = 0.9) +
 +
  scale_fill_viridis_d() +      # discrete fill for the bands
 +
  theme_minimal() +
 +
  labs(title = "Viral Hotspots and Genomic Clustering") +
 +
  theme(legend.position = "none")
 +
 
 +
</pre>
 +
 
 +
* The Transmission Tree (Phylogeny): Using ''igraph'' and ''ggraph'', we can visualize the
 +
''Who-Infects-Whom'' network. This is the gold standard for contact tracing and understanding super-spreading events.
 +
 
 +
<pre>
 +
# Create graph object
 +
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)
 +
 
 +
# Plot the tree
 +
ggraph(transmission_graph, layout = 'tree') +
 +
  geom_edge_diagonal(alpha = 0.2, color = "grey") +
 +
  geom_node_point(color = "steelblue", size = 2) +
 +
  theme_void() +
 +
  labs(title = "Reconstructed Transmission Tree")
 +
 
 +
# An alternative plot_ly() SVG infection graph can be created by
 +
library(plotly)
 +
library(igraph)
 +
library(dplyr)  # For piping and joins
 +
 
 +
# Extract layout coordinates (tree structure)
 +
layout <- layout_as_tree(transmission_graph)
 +
 
 +
# Prepare node data with safe column names
 +
node_data <- data.frame(
 +
  node_id = V(transmission_graph)$name, # Renamed from 'id'
 +
  x = layout[,1],
 +
  y = layout[,2]
 +
)
 +
 
 +
# Calculate in-degree (number of incoming edges) to identify root nodes
 +
in_degree <- degree(transmission_graph, mode = "in")
 +
 
 +
# Create nodes dataframe with size based on root status
 +
nodes <- node_data %>%
 +
  mutate(
 +
    size = ifelse(in_degree == 0, 10, 7), # Root nodes (in-degree=0) are larger
 +
    label = ifelse(in_degree == 0, "Root", "Descendant")
 +
  )
 +
 
 +
# Prepare edge data with directional arrows
 +
edges <- as_data_frame(transmission_graph, what = "edges") %>%
 +
  left_join(nodes, by = c("from" = "node_id")) %>%
 +
  left_join(nodes, by = c("to" = "node_id"), suffix = c("_from", "_to"))
 +
 
 +
# 1. Calculate children for each node
 +
children_info <- edges %>%
 +
  group_by(from) %>%
 +
  summarise(children = paste(to, collapse = ", ")) %>%
 +
  rename(node_id = from)
 +
 
 +
# 2. Calculate parents for each node
 +
parent_info <- edges %>%
 +
  group_by(to) %>%
 +
  summarise(parents = paste(from, collapse = ", ")) %>%
 +
  rename(node_id = to)
 +
 
 +
# 3. Join this info back to your nodes dataframe
 +
nodes <- nodes %>%
 +
  left_join(children_info, by = "node_id") %>%
 +
  left_join(parent_info, by = "node_id") %>%
 +
  mutate(
 +
    children = ifelse(is.na(children), "None", children),
 +
    parents = ifelse(is.na(parents), "None", parents)
 +
  )
 +
 
 +
# Create interactive plot
 +
transmission_plot <- plot_ly() %>%
 +
  add_segments(
 +
    data = edges,
 +
    x = ~x_from, y = ~y_from,
 +
    xend = ~x_to, yend = ~y_to,
 +
    line = list(
 +
      color = ~ifelse(x_from < x_to, "#1f77b4", "#ff7f0e"),
 +
      width = 2
 +
    ),
 +
    hoverinfo = "none"
 +
  ) %>%
 +
  add_trace(
 +
    data = nodes,
 +
    x = ~x,
 +
    y = ~y,
 +
    type = "scatter",
 +
    mode = "markers",
 +
    marker = list(
 +
      size = ~size,
 +
      opacity = 0.8,
 +
      line = list(width = 1, color = "black")
 +
    ),
 +
    # Added Parents and Children to the hover text
 +
    text = ~paste0(
 +
      "<b>Node ID:</b> ", node_id,
 +
      "<br><b>Type:</b> ", label,
 +
      "<br><b>Parent(s):</b> ", parents,
 +
      "<br><b>Children:</b> ", children
 +
    ),
 +
    hoverinfo = "text",
 +
    showlegend = FALSE
 +
  ) %>%
 +
  layout(
 +
    title = "Interactive Viral Transmission Tree (Graph)",
 +
    xaxis = list(showgrid = FALSE, showticklabels = FALSE, zeroline = FALSE),
 +
    yaxis = list(showgrid = FALSE, showticklabels = FALSE, zeroline = FALSE),
 +
    hovermode = "closest"
 +
  )
 +
 
 +
transmission_plot
 +
</pre>
 +
 
 +
* Genetic Distance: By calculating  (distance between sequence  and ), we can identify if a new "cluster" appearing in a specific city is a local evolution or a new introduction from outside.
  
-2 < LOD < 3 provides only weak (non-significant) evidence for or against linkage. LOD scores vary with $\theta$: calculate LOD scores for a range of $\theta's$, find one that maximizes Z($\theta$); vary with data: each pedigree gives different information; are additive across independent pedigrees: sum data from all pedigrees to get final Z score.
+
* Spatial Kernel: The ''geom_density_2d'' visualizes the spatial transmission kernel, identifying where the force of infection is highest.
  
 +
* Tree Topology: A ''star-like'' phylogeny suggests rapid exponential growth, while a ladder-like
 +
phylogeny suggests a stable, endemic transmission chain.
  
'''Example 1:''' Suppose the father and mother are both Dd (dd is the recessive affected individual, Dd the heterozygous carrier individual, and DD the homozygous normal individual). The table below shows the Mendelian ration of $\frac{3}{4}$ normal to $\frac{1}{4}$ affected. For most ''autosomal recessive diseases'', the heterozygote cannot be distinguished from the normal homozygote. In the normal phenotype categories of offspring (Dd and DD produce the same normal phenotype), two of the three are heterozygotes (carriers); one of the three is homozygous normal
+
<pre>
 +
# Proper Genetic Distance Matrix
 +
unique_strains <- unique(sim_data$genome[sim_data$state == "I"])
 +
dist_mat <- stringdistmatrix(unique_strains, unique_strains, method = "hamming")
  
<center>
+
# Optional: Visualize distance as a heatmap
{|class="wikitable" style="text-align:center;width:50%" Border="1"
+
image(as.matrix(dist_mat), main = "Genetic Distance Heatmap")
|-
 
| || '''Father'''
 
|-
 
|'''Mother''' || [[Image:SMHS_Epi_Figure_6.png|200px]]
 
|-
 
|}
 
</center>
 
  
 +
# SVG plot using plot_ly()
 +
library(plotly)
  
 +
# 1. Get the unique viral strains
 +
unique_strains <- unique(sim_data$genome[sim_data$state == "I"])
  
 +
# 2. Generate clean Short IDs (e.g., "Strain 1", "Strain 2", etc.)
 +
# Or use the first few characters if that's where the ID is hidden
 +
short_ids <- paste0("S-", seq_along(unique_strains))
  
<center>
+
# 3. Calculate the matrix using the full sequences
[[Image:SMHS_Epi_Figure_7.png|400px]]
+
dist_mat <- stringdistmatrix(unique_strains, unique_strains, method = "hamming")
</center>
+
m <- as.matrix(dist_mat)
This pedigree example illustrates autosomal recessive inheritance. I-1 and I-2 are unrelated but produced an affected offspring (affected offspring have normal parents). By chance, they both must have been carriers. Even though II-2 is affected, she produced no affected offspring (i.e., the phenotype appears in siblings, not parents). As the probable genotype for an outside individual (II-1) is homozygous normal, III-1, III-2 and III-3 must be carriers (heterozygotes). They are not affected but could only have inherited the recessive gene from II-2. Next, II-3, II-5, and II-6 each have a $\frac{2}{3}$ chance of being a carrier and a $\frac{1}{3}$ chance of being homozygous normal. They are not affected, but they are carrier*carrier offsprings. Like I-1, II-4 and II-7 have a high probability of being homozygous normal as they are outside the family. III-4, III-5, III-6, III-7, III-8, and III-9 all have a $\frac{1}{3}$ chance of being carriers and a $\frac{2}{3}$ chance of being homozygous normal. One parent of each is probably homozygous normal, the other has a $\frac{2}{3}$ chance of being a carrier and a 1 in 2 chance of passing on the recessive allele if they were a carrier.
 
  
 +
# 4. Handle Infinities
 +
m[is.infinite(m)] <- max(m[is.finite(m)], na.rm = TRUE)
  
'''Example 2:''' Linkage mapping using pedigrees is the disease linked to the marker given the pedigree below. (Dominant inheritance)
+
# 5. Assign the CLEAN IDs to the matrix
 +
rownames(m) <- short_ids
 +
colnames(m) <- short_ids
  
<center>
+
# 6. Plot
[[Image:SMHS_Epi_Figure_8.png|250px]]
+
fig <- plot_ly(
</center>
+
    z = m,
 +
    x = colnames(m),
 +
    y = rownames(m),
 +
    type = "heatmap",
 +
    colorscale = "Viridis",
 +
    # This customdata trick lets you see the FULL sequence when you hover!
 +
    customdata = matrix(unique_strains, nrow=nrow(m), ncol=ncol(m)),
 +
    hovertemplate = "Row: %{y}<br>Col: %{x}<br>Dist: %{z}<extra></extra>"
 +
  ) %>%
 +
  layout(
 +
    xaxis = list(title = "Strain ID", type = "category"),
 +
    yaxis = list(title = "Strain ID", type = "category", autorange = "reversed")
 +
  )
  
*Two problems: (1) we don’t know the phase, even if the genes are linked, we don’t know arrangement of alleles (cis or trans) on the chromosomes in Dad: D1 d2 or D2 d1. Solution: take the average of the likelihoods of linkage: L($\theta)=\frac{1}{2}L(\theta|phase 1)+\frac {1}{2}L(\theta|phase 2)$; (2) how can we compare the probability of linkage to the probability of no linkage. Solution: take the ratio (i.e. odds) of the likelihood of linkage [L($\theta$)=L(MLE of $\theta$)] versus the likelihood of no linkage [L($\theta$)=L($\theta$=0.5)].
+
fig
  
*Calculating LOD scores: (1) if the phase is D1 d2, then there are 4 non-recombinants and 1 recombinant, L($\theta$)=(1-$\theta)^{4}\theta$; (2) if the phase is D2 d1, then there are 4 recombinants and 1 non-recombinant, L($\theta)=\theta^{4}(1-\theta)$.
+
</pre>
  
For $\theta=0.1(10cm)$, phase $1,L(\theta)=(1-0.1)^{4}*0.1=0.9^{4}*0.1$, for phase 2,$\theta =0.1:L(θ)=0.1^{4} (1-0.1)=0.1^{4}*0.9.$ At $\theta=0.5, L (\theta)=0.5^{5}$, $Z(\theta)=log_{10}\frac{{[0.9^{4}0.1+0.1^{4}0.9]}{2}}{0.5^{5}} = 0.0217.$
+
* Finally, display the transmission tree graph.
  
For other values of $\theta$, do the similar calculation: so the MLE of $\theta$, $\hat{\theta}= 0.20$.
+
<pre>
 +
inf_graph <- graph_from_data_frame(edges, directed = TRUE)
  
 +
ggraph(inf_graph, layout = 'kk') +
 +
  geom_edge_link(alpha = 0.3, arrow = arrow(length = unit(2, 'mm'))) +
 +
  geom_node_point(color = "red", size = 1) +
 +
  theme_void() +
 +
  labs(title = "Epidemiological Transmission Network")
 +
</pre>
  
 +
=== Applications and Software ===
 +
Modern epidemiology relies heavily on computational tools.
 +
* Key R Packages: ''epiR'' (Epi measures), ''genetics'' (HWE, LD), ''survival'' (time-to-event), ''qqman'' (GWAS visualization).
 +
* Online Tools: [https://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html SOCR Distribution Tables].
  
 +
=== Problems ===
  
 +
==== Problem 1: Linkage Mapping ====
 +
'''Scenario''': Analyze the pedigree below under a Dominant Inheritance model. We need to estimate the recombination fraction <math>\theta</math>.
 
<center>
 
<center>
{|class="wikitable" style="text-align:center;width:50%" Border="1"
+
[[Image:SMHS_Epi_Figure_8.png|250px]]
 +
</center>
 +
 
 +
# Calculate LOD Scores: 
 +
Using Maximum Likelihood Estimation (MLE), if the phase is known and the data include 4 non-recombinant and 1 recombinant meioses, the likelihood is: 
 +
:<math>L(\theta) = (1-\theta)^4 \theta</math>. 
 +
The LOD score is defined as: 
 +
:<math>Z(\theta) = \log_{10}\left(\frac{L(\theta)}{L(0.5)}\right)</math>, where <math>L(0.5) = (0.5)^5 = 1/32</math>.
 +
 
 +
# Corrected Result Table:
 +
{|class="wikitable" style="text-align:center;width:50%" border="1"
 
|-
 
|-
|$\theta$ || $L(\theta$) || $L=(\theta=0.5)$ || $Z(\theta)$
+
! <math>\theta</math> !! <math>Z(\theta)</math>
|-
 
|0 ||0 ||0.03125|| $-\infty$
 
|-
 
|0.05 ||0.02037|| 0.03125 ||-0.18586
 
 
|-
 
|-
|0.10 ||0.03285|| 0.03125|| 0.02169
+
| 0.0 || <math>-\infty</math>
 
|-
 
|-
|0.15 ||0.03937 ||0.03125 ||0.10032
+
| 0.10 || 0.322
 
|-
 
|-
|'''0.20''' ||'''0.0416''' ||'''0.03125''' ||'''0.12424'''
+
| 0.20 || '''0.418''' (Max LOD)
 
|-
 
|-
|0.25|| 0.04102 ||0.03125|| 0.11815
+
| 0.50 || 0.0
|-
+
|}
|0.30 ||0.0385|| 0.03125 ||0.09454
+
The maximum LOD score occurs at <math>\hat{\theta} = 0.20</math>, consistent with 1 recombinant out of 5 informative meioses. 
|-
+
> Note: A LOD > 3 is conventionally required for significant linkage; this example is illustrative.
|}
+
 
</center>
+
==== Problem 2: Number Needed to Treat (NNT) and Harm (NNH) ====
 +
'''Scenario''': A trial reports 800 events among 1000 patients in Treatment Group A and 600 events among 1200 patients in Control Group B.
 +
 
 +
# Event rates: 
 +
:<math>p_A = 800/1000 = 0.80</math>, <math>p_B = 600/1200 = 0.50</math>.
 +
 
 +
# Absolute Risk Increase (ARI): 
 +
:<math>\text{ARI} = p_A - p_B = 0.30</math>.
 +
 
 +
# Since the treatment increases risk, we compute the Number Needed to Harm (NNH): 
 +
:<math>\text{NNH} = \frac{1}{\text{ARI}} = \frac{1}{0.30} \approx 3.33</math>.
 +
 
 +
'''Interpretation''': For every 3–4 patients treated with intervention A (vs. control), one additional adverse event occurs. The negative NNT is conventionally reported as NNH.
 +
 
 +
==== Problem 3: GWAS Power Analysis (R) ====
 +
'''Scenario''': Simulate a case-control GWAS with 500 cases and 500 controls, Minor Allele Frequency (MAF) = 0.2, and Odds Ratio (OR) = 1.5. Estimate statistical power at <math>\alpha = 0.05</math>.
 +
 
 +
The following R function correctly simulates genotypes separately for cases and controls based on genotype relative risks derived from the OR:
 +
 
 +
  gwas_power_case_control <- function(n_cases = 200, n_controls = 200,
 +
                                  maf = 0.2, OR = 1.5,
 +
                                  alpha = 0.05, n_sims = 100) {
 +
  # Genotype coding: 0, 1, 2 copies of effect allele
 +
  # Relative risks under additive log-odds model: 1, OR, OR^2
 +
  rr <- c(1, OR, OR^2)
 +
 
 +
  # Genotype frequencies in controls (Hardy-Weinberg)
 +
  g_ctrl <- dbinom(0:2, size = 2, prob = maf)
 +
 
 +
  # Genotype frequencies in cases (proportional to g_ctrl * rr)
 +
  g_case_unscaled <- g_ctrl * rr
 +
  g_case <- g_case_unscaled / sum(g_case_unscaled)
 +
 
 +
  significant <- numeric(n_sims)
 +
 
 +
  for (i in 1:n_sims) {
 +
    # Simulate genotypes
 +
    geno_cases <- sample(0:2, n_cases, replace = TRUE, prob = g_case)
 +
    geno_ctrls <- sample(0:2, n_controls, replace = TRUE, prob = g_ctrl)
 +
   
 +
    geno <- c(geno_cases, geno_ctrls)
 +
    status <- c(rep(1, n_cases), rep(0, n_controls))
 +
   
 +
    # Fit logistic regression
 +
    model <- glm(status ~ geno, family = binomial)
 +
    p_val <- summary(model)$coefficients[2, 4]
 +
    significant[i] <- as.numeric(p_val < alpha)
 +
  }
 +
 
 +
  return(mean(significant))  # Estimated power
 +
  }
 +
 
 +
  # Example usage:
 +
  set.seed(2026)
 +
  power_est <- gwas_power_case_control()
 +
  cat("Estimated power:", round(power_est, 3), "\n")
 +
 
 +
==== Problem 4: Hardy-Weinberg Equilibrium (HWE) Testing ====
 +
'''Scenario''': In a GWAS control group (<math>n = 1000</math>), genotype counts for a SNP are: 
 +
AA = 240, Aa = 120, aa = 40. Test HWE using an exact test.
 +
 
 +
 
 +
  # Install if needed: install.packages("HardyWeinberg")
 +
  library(HardyWeinberg)
 +
 
 +
  # Observed counts: AA, Aa, aa
 +
  obs <- c(240, 120, 40)
 +
 
 +
  # Exact HWE test
 +
  hwe_test <- HWExact(obs, verbose = FALSE)
 +
  p_val <- hwe_test$pval
 +
 
 +
  cat("HWE exact test p-value:", format.pval(p_val, digits = 4), "\n")
 +
 
 +
  # Interpretation: p < 0.001 suggests deviation from HWE (common QC threshold in GWAS)
 +
 
 +
==== Problem 5: Assessing Confounding in Observational Data ====
 +
'''Scenario''': Evaluate whether age and sex confound the association between an exposure and disease outcome.
 +
 
 +
We'll use the ''Logistic Distribution'' function '''plogis''', which provides the standard Density, distribution function, quantile function and random generation for the logistic distribution with parameters ''location'' and ''scale'', and '''rbinom'''.
 +
 
 +
<pre>
 +
  set.seed(123)
 +
  n <- 5000
 +
  # Simulate confounders and exposure
 +
 
 +
  age <- rnorm(n, mean = 50, sd = 10)
 +
  sex <- rbinom(n, 1, 0.5)
 +
  exposure <- rbinom(n, 1, plogis(-1 + 0.02 * age + 0.3 * sex))
 +
 
 +
  # Simulate disease influenced by exposure and confounders
 +
  disease <- rbinom(n, 1, plogis(-2 + 0.9 * exposure + 0.03 * age + 0.4 * sex))
 +
 
 +
 
 +
  # Crude model
 +
  crude_or <- exp(coef(glm(disease ~ exposure, family = binomial))["exposure"])
 +
 
 +
 
 +
  # Adjusted model
 +
  adj_or <- exp(coef(glm(disease ~ exposure + age + sex, family = binomial))["exposure"])
 +
 
 +
  # Percent change due to confounding
 +
  confounding_pct <- (crude_or - adj_or) / crude_or * 100
 +
  cat("Crude OR:", round(crude_or, 2), "\n")
 +
  cat("Adjusted OR:", round(adj_or, 2), "\n")
 +
  cat("Percent change:", round(confounding_pct, 1), "%\n")
 +
 
 +
  if (abs(confounding_pct) > 10) {
 +
  cat("→ Substantial confounding detected (change > 10%).\n")
 +
  } else cat("→ No substantial confounding detected (change < 10%).\n")
 +
</pre>
 +
 
 +
==== Problem 6: Diagnostic Test Accuracy via ROC Analysis ====
 +
'''Scenario''': A continuous biomarker is measured in 100 diseased and 100 non-diseased individuals. Determine the optimal cutoff, sensitivity, and specificity using [https://doi.org/10.1002/bimj.200710415 Youden’s index].
 +
 
 +
<pre>
 +
  # Install if needed: install.packages("pROC")
 +
  library(pROC)
 +
 
 +
  # Simulate biomarker scores
 +
  set.seed(42)
 +
  score_healthy <- rnorm(100, mean = 0, sd = 1)
 +
  score_diseased <- rnorm(100, mean = 1.5, sd = 1)
 +
 
 +
  score <- c(score_healthy, score_diseased)
 +
  status <- c(rep(0, 100), rep(1, 100))
 +
 
 +
  # ROC analysis
 +
  roc_curve <- roc(status, score)
 +
  auc_val <- auc(roc_curve)
 +
 
 +
  # Optimal threshold (Youden's J statistic)
 +
  opt <- coords(roc_curve, "best", ret = c("threshold", "sensitivity", "specificity"))
 +
 
 +
  cat("AUC:", round(auc_val, 3), "\n")
 +
  cat("Optimal cutoff:", round(opt[["threshold"]], 2), "\n")
 +
  cat("Sensitivity:", round(opt[["sensitivity"]], 3), "\n")
 +
  cat("Specificity:", round(opt[["specificity"]], 3), "\n")
 +
</pre>
 +
 
 +
====Problem 7====
 +
 
 +
Try to integrate structural fixes, biological realism (fitness selection), and technical optimizations into the above genetics simulation of ''Phylodynamics and Genomic Surveillance in R''.
 +
Try to demonstrates how viral mutations, spatial movement, and natural selection interact to shape an outbreak.
 +
 
 +
The Core Simulation Engine can initialize a population and simulates a ''Random Walk'' movement model,
 +
by introducing a Fitness Function; mutations at specific genomic loci now increase the transmission probability, simulating the emergence of a ''Variant of Concern.''
 +
 
 +
<pre>
 +
library(tidyverse)
 +
library(gganimate)
 +
library(stringdist)
 +
library(igraph)
 +
library(ggraph)
 +
library(plotly)
 +
 
 +
# --- Setup Parameters ---
 +
set.seed(42)
 +
n_agents <- 200
 +
map_size <- 100
 +
initially_infected <- 3
 +
mutation_rate <- 0.02
 +
timesteps <- 60
 +
seq_length <- 50
 +
bases <- c("A", "C", "G", "T")
 +
 
 +
# --- Helper Functions ---
 +
founder_seq <- paste(sample(bases, seq_length, replace = TRUE), collapse = "")
 +
 
 +
mutate_dna <- function(sequence) {
 +
  seq_vec <- strsplit(sequence, "")[[1]]
 +
  mutation_mask <- runif(seq_length) < mutation_rate
 +
  if(any(mutation_mask)) {
 +
    seq_vec[mutation_mask] <- sample(bases, sum(mutation_mask), replace = TRUE)
 +
  }
 +
  return(paste(seq_vec, collapse = ""))
 +
}
 +
 
 +
calc_fitness <- function(sequence) {
 +
  gc_content <- nchar(gsub("[AT]", "", sequence)) / nchar(sequence)
 +
  return(0.3 + (gc_content * 0.2))
 +
}
 +
 
 +
# --- Initialize Population ---
 +
pop <- tibble(
 +
  id = 1:n_agents,
 +
  x = runif(n_agents, 0, map_size),
 +
  y = runif(n_agents, 0, map_size),
 +
  state = "S",
 +
  genome = founder_seq,
 +
  parent_id = NA_integer_,
 +
  infect_time = NA_real_
 +
)
 +
 
 +
# Set state to "I" for initial cases
 +
pop$state[1:initially_infected] <- "I"
 +
pop$infect_time[1:initially_infected] <- 0
 +
 
 +
# --- Run Simulation ---
 +
history <- list()
 +
edges <- data.frame(from = integer(), to = integer())
 +
 
 +
for (t in 1:timesteps) {
 +
 
 +
  # Assignment operator (<-) to update movement
 +
  # loaded cyclical harmonics
 +
  # pop$x <- pop$x * rep(0.1 + cos(10*pi*t/timesteps), times=n_agents) + rnorm(n_agents, 0, 2)
 +
  # pop$y <- pop$y * rep(0.1 + sin(5*pi*t/timesteps), times=n_agents) + rnorm(n_agents, 0, 2)
 +
 
 +
  # simple stochastic dynamics
 +
  pop$x <- pop$x + rnorm(n_agents, 0, 2)
 +
  pop$y <- pop$y + rnorm(n_agents, 0, 2)
 +
 
 +
  # Optional: Keep agents within map boundaries
 +
  pop$x <- pmax(0, pmin(map_size, pop$x))
 +
  pop$y <- pmax(0, pmin(map_size, pop$y))
 +
 
 +
  # 2. Transmission
 +
  infected_idx <- which(pop$state == "I")
 +
 
 +
  for (i in infected_idx) {
 +
    susceptible_idx <- which(pop$state == "S")
 +
    if(length(susceptible_idx) == 0) break
 +
   
 +
    dists <- sqrt((pop$x[susceptible_idx] - pop$x[i])^2 + (pop$y[susceptible_idx] - pop$y[i])^2)
 +
    contacts <- susceptible_idx[dists < 6]
 +
   
 +
    for (target in contacts) {
 +
      if (pop$state[target] == "S" && runif(1) < calc_fitness(pop$genome[i])) {
 +
        pop$state[target] <- "I"
 +
        pop$infect_time[target] <- t
 +
        pop$parent_id[target] <- pop$id[i]
 +
        pop$genome[target] <- mutate_dna(pop$genome[i])
 +
        edges <- rbind(edges, data.frame(from = pop$id[i], to = pop$id[target]))
 +
      }
 +
    }
 +
  }
 +
 
 +
  # subsetting and assignment for Recovery
 +
  recovery_idx <- which(pop$state == "I" & (t - pop$infect_time) >= 10)
 +
  if(length(recovery_idx) > 0) {
 +
    pop$state[recovery_idx] <- "R"
 +
  }
 +
 
 +
  history[[t]] <- pop %>% mutate(time = t)
 +
}
 +
 
 +
sim_data <- bind_rows(history)
 +
# str(sim_data)
 +
</pre>
 +
 
 +
2. Spatio-Temporal Visualization: By mapping the genomic strings to colors, we can visualize the "clade emergence" over time.
 +
 
 +
<pre>
 +
anim <- ggplot(sim_data, aes(x = x, y = y, color = genome, size = state)) +
 +
geom_point(alpha = 0.6) +
 +
scale_size_manual(values = c("S" = 1, "I" = 3, "R" = 1.5)) +
 +
theme_minimal() +
 +
theme(legend.position = "none") +
 +
labs(title = "Viral Spread & Selection Pressure: Day {frame}") +
 +
transition_time(time)
 +
 
 +
# animate(anim)
 +
 
 +
</pre>
 +
 
 +
3. Genomic Tracking & Distance Matrices: To understand how much the virus has drifted, we use the **Hamming Distance**. This calculates the number of base substitutions between any two viral samples.
 +
 
 +
<pre>
 +
# Extract final strains
 +
final_strains <- sim_data %>%
 +
filter(state %in% c("I", "R"), !is.na(infect_time)) %>%
 +
distinct(genome) %>%
 +
pull(genome)
 +
 
 +
dist_mat <- stringdistmatrix(final_strains, final_strains, method = "hamming")
 +
rownames(dist_mat) <- paste0("Strain_", 1:length(final_strains))
 +
colnames(dist_mat) <- paste0("Strain_", 1:length(final_strains))
 +
 
 +
# Visualize Genetic Distance
 +
plot_ly(z = ~dist_mat, x = colnames(dist_mat), y = rownames(dist_mat), type = "heatmap", colorscale = "Viridis") %>%
 +
layout(title = "Genetic Hamming Distance Matrix")
 +
</pre>
 +
 
 +
4. Phylogeny: The Transmission Tree: While a distance matrix shows "how much" change occurred, the Transmission Tree shows "who infected whom." A "star-like" tree indicates rapid expansion, while a "ladder-like" tree indicates a slow, constant chain of transmission.
 +
 
 +
<pre>
 +
# Reconstruct the tree using igraph
 +
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)
 +
 
 +
ggraph(transmission_graph, layout = 'tree') +
 +
geom_edge_diagonal(alpha = 0.3, color = "darkgrey") +
 +
geom_node_point(aes(color = name), size = 3, show.legend = FALSE) +
 +
theme_void() +
 +
labs(title = "Reconstructed Epidemiological Transmission Tree")
 +
</pre>
 +
 
 +
5. Quantifying the Molecular Clock: A key concept in phylodynamics is the ''Molecular Clock'', the idea that mutations accumulate at a relatively constant rate over time.
 +
 
 +
<pre>
 +
# Calculate divergence from founder
 +
clock_data <- sim_data %>%
 +
filter(!is.na(infect_time)) %>%
 +
mutate(dist_from_founder = stringdist(genome, founder_seq, method = "hamming"))
 +
 
 +
ggplot(clock_data, aes(x = infect_time, y = dist_from_founder)) +
 +
geom_jitter(alpha = 0.2, color = "steelblue") +
 +
geom_smooth(method = "lm", color = "red") +
 +
theme_minimal() +
 +
labs(title = "The Molecular Clock: Mutation Accumulation",
 +
x = "Time of Infection (Days)",
 +
y = "Hamming Distance from Founder")
 +
</pre>
 +
 
 +
Here are a couple of alternative plots ...
  
*'''Linkage Disequilibrium'''
+
<pre>
**Linkage vs. Linkage Disequilibrium: linkage refers to the observation that two loci are inherited together (rather than be separated by recombination) in a single generation; linkage disequilibrium refers to the pattern of correlation between loci at the population level.
+
library(ggraph)
**Linkage and Association: linkage is the relationship between loci, and is examined within families; association is the relationship between alleles and is examined within populations.
+
library(ggplot2)
**Linkage Disequilibrium (LD): describes the tendency of alleles to be inherited together more often than would be expected under random segregation. Extend of LD reflects the population’s history and the distance between markers. LD mapping is a promising approach for mapping genes, especially for complex-trait diseases. It is a population-based concept (not an individual or family-based concept); it has expected and observed values: looks at haplotypes instead of genotypes, observed frequencies are for haplotypes, expected haplotypes frequencies are calculated from allele frequencies.
+
library(plotly)
***Forces affecting LD: (1) recombination: breaks up allelic association; (2) gene conversion: during recombination, DNA sequence information is transferred from one chromatid to another; (3) recurrent mutation: same mutation arises on different haplotype backgrounds; (4) natural selection: keeps pairs of genes/SNPs together; (5) demographic history: migration, non-random mating.
 
**Linkage equilibrium:$p_{ab}=p_{a} p_{b}=(1-p_{A})(1-p_{B})$ ; $p_{AB}=p_{A} p_{B}$; $p_{Ab}=p_{A} p_{b}=p_{A} (1-p_{B})$; $p_{aB}= p_{a}p_{B} = (1-p_{A})p_{B};$
 
**Linkage disequilibrium:$p_{ab}\not=p_{a} p_{b}=(1-p_{A})(1-p_{B})$ ; $p_{AB}=p_{A} p_{B}$; $p_{Ab}=p_{A} p_{b}=p_{A} (1-p_{B})$; $p_{aB}= p_{a}p_{B} = (1-p_{A})p_{B};$
 
**Measures of LD: fundamental measure: Disequilibrium coefficient (D); most commonly used measures: D’ and |D’|; $r^{2}$ or $\Delta^{2}$.
 
***$D_{AB}$ is the disequilibrium coefficient for locus A and locus B. D is hard to interpret.
 
  
$D_{AB}=p_{AB}-p_{A} p_{B};$   $p_{AB}=p_{A} p_{B}+D_{AB};$  $D_{aB}=p_{a} p_{B}-D_{AB};$   $D_{ab}=p_{a} p_{b}+D_{AB};$
+
# Create the ggraph plot
 +
gg <- ggraph(transmission_graph, layout = 'tree') +
 +
   geom_edge_diagonal(alpha = 0.3, color = "darkgrey") +
 +
  geom_node_point(aes(color = name,
 +
                      text = paste("ID:", name,
 +
                                  "<br>Click for details")),
 +
                  size = 5) +
 +
  theme_void() +
 +
   labs(title = "Reconstructed Epidemiological Transmission Tree")
  
$D'_{AB}=\begin{cases}\frac{D_{AB}}{min(p_{A} p_{B},p_{a} p_{b})},D_{AB} \\\frac{D_{AB}} {min(p_{A} p_{b},p_{a} p_{B})},D_{AB}>0,\end{cases}$
+
# Convert to interactive plotly
it ranges between -1 and +1, when allele frequencies are small, D' is more likely to take extreme values, D’ of -1 or +1 implies that least one potential haplotype was not observed (no evidence for recombination between markers).
+
ggplotly(gg, tooltip = "text") %>%
 +
  layout(
 +
    hoverlabel = list(bgcolor = "white", font = list(size = 12)),
 +
    xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
 +
    yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)
 +
  )
 +
</pre>
  
 +
and ...
  
$r^{2}$  or $\Delta^{2}=\frac{D_{AB}}{p_{A}(1-p_{A})p_{B}(1-p_{B})}$ it ranges between 0 and 1; $r^{2}=1$ when the markers provide identical information; $r^{2}= 0$ when the markers are in perfect equilibrium; not as strongly affected by extreme allele frequencies.
+
<pre>
 +
library(plotly)
 +
library(igraph)
 +
library(networkD3)
  
 +
# Convert igraph to networkD3 format
 +
transmission_d3 <- igraph_to_networkD3(transmission_graph)
  
Information from measurements: when D’ is high and $r^{2}$  is high indicates the tendency toward presence of only 2 haplotypes, with similar allele frequencies of the 2 loci; when D’ is high and $r^{2}$ is low indicates the tendency toward presence of only 3 haplotypes (for example, a young SNP ancestrally), with large difference in allele frequencies of the 2 loci; when D’ is low and $r^{2}$ is low indicates the tendency toward presence of all 4 haplotypes and random coupling of alleles.
+
# Prepare nodes with tooltips
 +
nodes_d3 <- transmission_d3$nodes
 +
if(exists("pop")) {
 +
  nodes_d3 <- nodes_d3 %>%
 +
    left_join(
 +
      pop %>%
 +
        mutate(name = as.character(id)) %>%
 +
        select(name, state, infect_time, genome),
 +
      by = "name"
 +
    )
 +
} else {
 +
  nodes_d3$state <- "Unknown"
 +
  nodes_d3$infect_time <- NA
 +
  nodes_d3$genome <- NA
 +
}
  
*Disequilibrium will decay each generation in a large population, after t generations: $D_{t}=(1-\theta)^{t} D_{0}$.
+
# Create custom tooltips
 +
nodes_d3$tooltip <- sprintf(
 +
  "ID: %s<br>State: %s<br>Infection Time: %s<br>Genome: %s",
 +
  nodes_d3$name,
 +
  nodes_d3$state,
 +
  ifelse(is.na(nodes_d3$infect_time), "N/A", nodes_d3$infect_time),
 +
  ifelse(is.na(nodes_d3$genome), "N/A", substr(nodes_d3$genome, 1, 15))
 +
)
  
 +
# Create the force-directed network
 +
forceNetwork(
 +
  Links = transmission_d3$links,
 +
  Nodes = nodes_d3,
 +
  Source = 'source',
 +
  Target = 'target',
 +
  NodeID = 'name',
 +
  Group = 'state',
 +
  Nodesize = 5,
 +
  opacity = 0.9,
 +
  zoom = TRUE,
 +
  fontSize = 14,
 +
  linkDistance = 150,
 +
  charge = -100,
 +
  opacityNoHover = 0.5,
 +
  legend = TRUE,
 +
  bounded = TRUE,
 +
  clickAction = 'alert("Node ID: " + d.name + "\\nGroup: " + d.state)'
 +
)
 +
</pre>
  
*'''Genetic Testing Issues'''
+
and ...
**Genetic Test: analysis of chromosomes, genes and/or gene products (proteins) to determine whether there is an abnormality present that is causing or will cause a genetic condition/disorder.
 
**Proportion of Genes Shared: first degree relatives (50%): sibs, parents, children,  dizygotic twins; second degree relatives (25%): uncles, aunts, nieces, nephews, grandparents, grandchildren, half-sibs, double first cousins; third degree relatives (12.5%): first cousins, half-uncles/aunts, half-nieces/nephews.
 
**Genetic tests are different from other medical tests:
 
**Medical tests done to detect a current medical condition. It may be done on healthy individuals to determine future risk for a genetic condition (predictive tests).
 
**Genetic test may need to be performed on affected family member before patient can be tested.
 
**Genetic test results may have implications for healthcare and life decisions of other family members.
 
**Possible insurance implications and potential for stigmatization and discrimination.
 
**Genetic tests currently offered on a population level: newborn screening; multiple marker screening for pregnant women to screen for increased risk for chromosome abnormalities (e.g. Down syndrome) and neural tube defects (e.g. spina bifida); Cystoic fibrosis screening offered preconceptionally and to pregnant couples; carrier testing for Tay-Sachs disease, Sickle cell anemia; Prenatal testing offered to women 35 and older.
 
**Genetic testing is currently offered to individuals with symptoms consistent with a genetic condition to establish, confirm or rule out a diagnosis; family history of a genetic condition.
 
*Uses of Genetic Testing:
 
**Diagnostic Testing: used to establish or confirm a diagnosis in a patient who has symptoms suggestive of a genetic condition.
 
**Predictive testing: (1) presymptomatic, eventual development of symptoms is certain when the gene mutation is present; (2) predispositional (eventual development of symptoms is likely but not certain when the gene mutation is present, e.g., breast cancer.
 
**Carrier testing: offered to patients based on family history of a genetic condition or ethnicity.
 
**Prenatal testing: offered during pregnancy to assess fetal status when there is an increased risk of having a child with a genetic condition due to maternal age, family history, ethnicity, abnormal screening test or ultrasound evaluation.
 
**Preimplantation genetic diagnosis (PGD): used with IVF and involves genetic testing on early embryos prior to implantation to rule out a genetic condition.
 
**Newborn screening: screen for genetic disorders which can result in severe medical problems, metal retardation or even death in newborns. Some of the conditions can benefit form early treatment.
 
**Pharmacogenetic testing: to identify gene changes in metabolic pathways that determine drug response, both for therapeutic effects and adverse effects. Examples may be genetic testing for CYP450.
 
**Prognostic testing.
 
**Forensic testing.
 
**Identity testing: who is the parent.
 
  
*Genetic Testing Challenges:
+
<pre>
**Testing for many genetic conditions not yet standard of care. Few practice guidelines exist.
+
library(visNetwork)
**Laboratories may offer different testing, even for the same genetic condition.
+
library(igraph)
**Many genetic tests are costly, low detection rate.
+
library(tidyverse)
**Genetic test results may be uninterpretable.
 
**Moving target; rapid evolution of information.
 
  
 +
# 1. Create the graph object
 +
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)
  
 +
# 2. Prepare nodes data
 +
# Get vertex names
 +
node_names <- V(transmission_graph)$name
  
*'''Genetic Association Studies:''' an observational study that tests for a statistically significant correlation between a genetic marker (the exposure) and a phenotype (the outcome).
+
# Merge with pop data for additional information
*Genetic markers and phenotypes
+
nodes <- data.frame(
**Genetic marker = any measurable genetic, polymorphism: varies across individuals, groups, or populations can occur in coding or non-coding regions of the genome.
+
  id = node_names,
**Phenotype = any measurable trait: quantitative (height, blood pressure, glucose levels); qualitative (heart disease, cancer, hair color).
+
  label = node_names,
*Genetic association studies can be planned and analyzed using the QMSSI framework: question, measure, sampling, statistics, and inferences.
+
  title = NA, # Will be filled with tooltip info
*Study question: need to identify the population of interest (age, race, gender, geographic), the phenotype being studied and whether the study will look at specific genes with biological/positional relevance or agnostically search for new genomic regions of interest.
+
  stringsAsFactors = FALSE
*Measures: methods for measuring both must be characterized (Valid? Reliable?); must be described (quantitative phenotypes such as normally distributed, mean, variance and qualitative phenotypes such as blood pressure vs. hypertensive).
+
)
*Sampling: family based (twin studies, heritability studies, linkage studies) vs. population based sampling
 
*Statistics: statistical tests for genetic association studies largely determined by type of sampling and type of outcome.
 
*Inferences: association is not sufficient to prove causation. A positive statistical finding does not definitively mean the polymorphism tested is causing the outcome. A statistical association may be a result of direct causal relationship between SNP and outcome; confounding (linkage disequilibrium; population stratification); spurious association, false positive result.
 
  
*'''Gene-environment and Gene-gene Interactions'''
+
# Join with pop data to get metadata for tooltips
*Genes and Environment: almost all disease have a genetic component and an environmental component. How do these components interact? How to account for both effects?
+
if(exists("pop")) {
*Model 1: Neither the genotype nor the environment along increase risk
+
  nodes <- nodes %>%
 +
    left_join(
 +
      pop %>%
 +
        mutate(id = as.character(id)) %>%
 +
        select(id, state, infect_time, genome),
 +
      by = "id"
 +
    ) %>%
 +
    mutate(
 +
      title = paste0(
 +
        "<b>Agent ID:</b> ", id, "<br>",
 +
        "<b>State:</b> ", state, "<br>",
 +
        "<b>Infection Time:</b> ", infect_time, "<br>",
 +
        "<b>Genome:</b> ", ifelse(!is.na(genome), substr(genome, 1, 20), "N/A")
 +
      )
 +
    )
 +
} else {
 +
  # Fallback if pop doesn't exist
 +
  nodes$title <- paste0("<b>Agent ID:</b> ", nodes$id)
 +
}
  
<center>
+
# 3. Prepare edges data
[[Image:SMHS Epi Figure 9.png |400 px]]
+
edges_df <- igraph::as_data_frame(transmission_graph, what = "edges")
</center>
+
edges_vis <- data.frame(
 +
  from = edges_df$from,
 +
  to = edges_df$to,
 +
  arrows = "to",
 +
  color = list(color = "gray", opacity = 0.6),
 +
  width = 1,
 +
  smooth = TRUE
 +
)
  
 +
# 4. Optional: Add colors based on state or infection time
 +
if("state" %in% colnames(nodes)) {
 +
  # Create color mapping for different states
 +
  unique_states <- unique(nodes$state)
 +
  colors <- viridisLite::viridis(length(unique_states))
 +
  color_map <- setNames(colors, unique_states)
 +
 
 +
  nodes$color <- sapply(nodes$state, function(s) {
 +
    if(is.na(s)) "#808080" else color_map[s]
 +
  })
 +
} else if("infect_time" %in% colnames(nodes)) {
 +
  # Color by infection time
 +
  min_time <- min(nodes$infect_time, na.rm = TRUE)
 +
  max_time <- max(nodes$infect_time, na.rm = TRUE)
 +
 
 +
  # Normalize infection time for color scale
 +
  nodes$value <- nodes$infect_time
 +
  nodes$color <- viridisLite::viridis(100)[
 +
    cut(nodes$infect_time, 100, labels = FALSE)
 +
  ]
 +
} else {
 +
  nodes$color <- "#2C3E50"  # Default color
 +
}
  
*Model 2: The genotype exacerbates the effect of the risk factor
+
# 5. Create the interactive network
 +
visNetwork(nodes, edges_vis,
 +
          main = "Interactive Transmission Tree",
 +
          submain = "Hover over nodes for details") %>%
 +
  visNodes(
 +
    shape = "dot",
 +
    size = 15,
 +
    borderWidth = 2,
 +
    borderWidthSelected = 3,
 +
    shadow = list(enabled = TRUE, size = 10)
 +
  ) %>%
 +
  visEdges(
 +
    shadow = TRUE,
 +
    smooth = list(type = "curvedCW", roundness = 0.1)
 +
  ) %>%
 +
  visOptions(
 +
    highlightNearest = list(
 +
      enabled = TRUE,
 +
      degree = 1,
 +
      hover = TRUE,
 +
      algorithm = "hierarchical"
 +
    ),
 +
    nodesIdSelection = list(
 +
      enabled = TRUE,
 +
      values = nodes$id,
 +
      style = 'width: 150px; height: 26px'
 +
    )
 +
  ) %>%
 +
  visInteraction(
 +
    hover = TRUE,
 +
    hoverConnectedEdges = TRUE,
 +
    tooltipDelay = 0,
 +
    navigationButtons = TRUE,
 +
    dragNodes = TRUE,
 +
    dragView = TRUE,
 +
    zoomView = TRUE
 +
  ) %>%
 +
  visPhysics(
 +
    stabilization = TRUE,
 +
    solver = "forceAtlas2Based",
 +
    forceAtlas2Based = list(
 +
      gravitationalConstant = -50,
 +
      centralGravity = 0.01,
 +
      springLength = 100,
 +
      springConstant = 0.08,
 +
      damping = 0.4,
 +
      avoidOverlap = 1
 +
    )
 +
  ) %>%
 +
  visLayout(randomSeed = 123)  # For reproducible layout
 +
</pre>
  
<center>
+
As well as ...
[[Image:SMHS Epi Figure 11.png| 500 px]]
 
</center>
 
  
 +
<pre>
 +
# Alternative using plotly's network graph
 +
library(networkD3)
  
*Model 3: The risk factor exacerbates the effect of the genotype.
+
# Convert to D3 format
 +
transmission_d3 <- igraph_to_networkD3(transmission_graph)
  
<center>
+
# Add node attributes
[[Image:SMHS Epi Figure 12.png| 500 px]]
+
transmission_d3$nodes <- transmission_d3$nodes %>%
</center>
+
  left_join(
 +
    pop %>%
 +
      mutate(name = as.character(id)) %>%
 +
      select(name, state, infect_time, genome),
 +
    by = "name"
 +
  ) %>%
 +
  mutate(
 +
    tooltip = paste0(
 +
      "Agent ID: ", name,
 +
      "<br>State: ", state,
 +
      "<br>Infection Time: ", infect_time,
 +
      "<br>Genome: ", ifelse(!is.na(genome), substr(genome, 1, 10), "NA"), "..."
 +
    )
 +
  )
  
*Model 4: The genotype and the risk factor each influence risk by themselves:
+
# Create force directed network
 +
forceNetwork(
 +
  Links = transmission_d3$links,
 +
  Nodes = transmission_d3$nodes,
 +
  Source = 'source',
 +
  Target = 'target',
 +
  NodeID = 'name',
 +
  Group = 'state',
 +
  Nodesize = 5,
 +
  opacity = 0.8,
 +
  zoom = TRUE,
 +
  fontSize = 12,
 +
  linkDistance = 100,
 +
  charge = -30,
 +
  opacityNoHover = 0.5,
 +
  legend = TRUE
 +
)
 +
</pre>
  
<center>
+
Also try to experiment with:
[[Image:SMHS Epi Figure 13.png| 500 px]]
 
</center>
 
  
 +
: 1. Examine Selection Pressure: Modify the calc_fitness() function to give a 50% transmission boost to any sequence containing the pattern GAA. How does this change the speed of the outbreak?
  
*The study of gene-environment interactions, use epidemiologic techniques to identify both components (gene and environment):
+
: 2. Are there Bottleneck Effects? Reduce the ''initially_infected' to 1 and observe how the genetic diversity of the final population changes compared to starting with 10 infected individuals.
**Case-control studies
 
**Cohort studies
 
**Case only studies
 
**Case-parental control
 
**Affected relative pair
 
**Twin studies
 
  
To model Gene-Environment interactions:$ outcome = gene + environment + gene-environment interaction. Y=G+E+GXE $
+
: 3. Contact Tracing: Use the ''transmission_graph'' to identify '''Super-spreaders''', i.e., nodes with high out-degree.
  
 +
=== References ===
  
 +
# [https://socr.umich.edu/DSPA2/DSPA2_notes/DSPA_Appendix_26_SMHS_Epidemiology.html DSPA2 Appendix 26: Eidemiology]
 +
# Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. 3rd ed. Lippincott, 2008.
 +
# Clayton D, Hills M. Statistical Models in Epidemiology. Oxford, 2013.
 +
# Ziegler A, König IR. A Statistical Approach to Genetic Epidemiology. Wiley, 2010.
  
  
  
 
<hr>
 
<hr>
* SOCR Home page: http://www.socr.umich.edu
+
* SOCR Home page: https://socr.umich.edu
  
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_Epidemiology}}
+
{{translate|pageName=https://wiki.socr.umich.edu/index.php?title=SMHS_Epidemiology}}

Latest revision as of 15:49, 17 February 2026

Scientific Methods for Health Sciences - Epidemiology

Overview

Epidemiology is the scientific discipline that investigates the distribution, determinants, and control of health-related states or events (including diseases) in specified populations. It applies this knowledge to control health problems and improve public health outcomes. Historically, epidemiology originated from the study of infectious disease outbreaks, such as John Snow's investigation of the 1854 cholera epidemic in London, which linked contaminated water sources to disease spread. In modern times, the field has broadened to include non-infectious conditions like chronic diseases (e.g., cancer, diabetes), environmental exposures (e.g., air pollution, toxins), behavioral factors (e.g., smoking, diet), and genetic influences. This helps with understanding that health outcomes arise from complex interactions between genetic predispositions, environmental factors, and social determinants.

A core framework in epidemiology is the "epidemiologic triad" of agent, host, and environment, but contemporary approaches emphasize the "person, place, and time" dimensions:

  • Person: Characteristics of individuals, such as age, sex, genetics, socioeconomic status, and behaviors that influence susceptibility or exposure.
  • Place: Geographic variations, including urban vs. rural settings, climate, or access to healthcare, which can reveal environmental or social risk factors.
  • Time: Temporal patterns, such as seasonal trends, secular changes over years, or sudden outbreaks, helping identify emerging threats or intervention effects.

This section delves into genetic epidemiology, bridging molecular biology with population-level analysis to identify risk factors and outcomes. It explores how genetic variations contribute to disease patterns and how computational tools can quantify these relationships.

Motivation

This module equips learners with foundational knowledge in genetic epidemiology, enabling them to integrate genetic data into public health practice. By the end of this module, learners should be able to:

  • Understand Genetic Foundations: Describe the structure and key features of the human genome, explain the types and distributions of mutations, and apply principles of Mendelian inheritance, including segregation (independent assortment of alleles) and linkage (genes on the same chromosome tending to be inherited together unless separated by recombination).
  • Analyze Population Dynamics: Utilize quantitative genetic concepts to examine the interplay between genetic variation and phenotypic (disease) variation in populations, including calculations of allele and genotype frequencies and testing for Hardy-Weinberg equilibrium to detect deviations indicative of evolutionary pressures.
  • Evaluate Associations: Identify common gene-disease relationships (e.g., monogenic vs. polygenic traits), interpret results from Genome-Wide Association Studies (GWAS) to pinpoint susceptibility loci, and recognize gene-environment interactions (e.g., how smoking exacerbates genetic risks for lung cancer).
  • Apply Computational Methods: Conduct basic genetic association analyses using statistical software like R, interpret epidemiological measures such as Number Needed to Treat (NNT), Odds Ratio (OR), and Relative Risk (RR), and understand their implications for clinical and public health decision-making.
  • Additional Skills: Critically evaluate study designs (e.g., cohort vs. case-control), account for confounders like population stratification in genetic studies, and discuss ethical considerations in genetic epidemiology, such as privacy in genomic data.

These objectives align with real-world applications, such as designing targeted interventions (e.g., pharmacogenomics) or predicting disease outbreaks through genomic surveillance.

Theory: The Human Genome and Mutation

The human genome comprises approximately 3 billion base pairs of DNA, encoding around 20,000–25,000 genes that orchestrate cellular functions, development, and responses to the environment. Mutations, changes in this DNA sequence, can arise spontaneously or from external factors (e.g., radiation, chemicals) and may lead to diseases if they disrupt gene function. Understanding genomic structure and mutations is crucial for identifying genetic risk factors in epidemiological studies.

Figure 1: Illustration of human chromosomal structure, highlighting key features like centromeres, telomeres, and banding patterns.

Chromosomal Structure

Chromosomes are thread-like structures in the cell nucleus that package and organize DNA for efficient replication and segregation during cell division. Each human cell (except gametes) contains 46 chromosomes.

  • Banding: Cytogenetic staining techniques (e.g., Giemsa staining) produce visible bands on chromosomes. Dark bands (G-bands) are AT-rich, heterochromatic regions with fewer genes, while light bands (R-bands) are GC-rich, euchromatic, and gene-dense. These patterns, spanning millions of nucleotides, aid in identifying chromosomal abnormalities in karyotyping.
  • Karyotype: The complete set of chromosomes arranged by size and shape. A typical human karyotype includes 22 pairs of autosomes (numbered 1–22) and one pair of sex chromosomes (XX in females, XY in males). Abnormal karyotypes, such as trisomy 21 (Down syndrome), can be detected via techniques like fluorescence in situ hybridization (FISH).
  • Functional Elements:
 * Centromeres: Central constrictions composed of repetitive alpha-satellite DNA sequences. They serve as attachment points for spindle fibers during mitosis and meiosis, ensuring proper chromosome segregation. Centromeric dysfunction can lead to aneuploidy (abnormal chromosome numbers).
 * Telomeres: Protective caps at chromosome ends, consisting of repetitive TTAGGG sequences (in humans) bound by shelterin proteins. Telomeres shorten with each somatic cell division due to the "end-replication problem," contributing to cellular aging (senescence). In germ cells and stem cells, telomerase enzyme maintains telomere length. Shortened telomeres are linked to diseases like dyskeratosis congenita and increased cancer risk.
 * Chromatin: The complex of DNA and proteins (histones) that forms chromosomes. It exists in two states:
   * Euchromatin: Loosely packed, transcriptionally active, and rich in genes and regulatory elements.
   * Heterochromatin: Tightly packed, transcriptionally silent, often containing repetitive DNA (e.g., satellites, transposons) near centromeres and telomeres. Epigenetic modifications (e.g., histone methylation) regulate chromatin states.

Mutations and Abnormalities

Mutations are heritable changes in the DNA sequence, occurring at a rate of about 10⁻⁸ per nucleotide per generation. They can be somatic (acquired in body cells, e.g., leading to cancer) or germline (in gametes, passed to offspring). Mutations drive genetic diversity but can cause disease if they affect critical genes.

Figure 2: Schematic of common structural chromosomal abnormalities, including deletion, duplication, translocation, and inversion.
  • Structural Abnormalities (Large-scale changes visible under a microscope, affecting thousands to millions of base pairs):
 * Deletion: Removal of a chromosomal segment, leading to loss of genes (e.g., cri-du-chat syndrome from 5p deletion).
 * Duplication: Extra copy of a segment, potentially causing gene dosage imbalances (e.g., Charcot-Marie-Tooth disease from PMP22 duplication).
 * Translocation: Exchange of segments between non-homologous chromosomes. Balanced translocations may be asymptomatic but increase risks in offspring; unbalanced ones cause disorders (e.g., chronic myeloid leukemia from t(9;22) "Philadelphia chromosome").
 * Inversion: Reversal of a segment within a chromosome, which can disrupt genes or lead to abnormal recombination (e.g., hemophilia A inversions).
 * Other Types: Isochromosomes (duplicated arms, e.g., i(Xq) in Turner syndrome) or ring chromosomes (circular formations from deletions at both ends).
  • Point Mutations (Small-scale changes at the nucleotide level, detected via sequencing):
 * Nucleotide Substitution: Replacement of one base with another.
   * Silent: No amino acid change (due to codon degeneracy).
   * Missense: Amino acid change (e.g., sickle cell anemia from GAG to GTG in beta-globin gene).
   * Nonsense: Introduces premature stop codon, truncating the protein (e.g., cystic fibrosis).
 * Indels: Insertions or deletions of nucleotides. Small indels can cause frameshifts, altering the reading frame and producing dysfunctional proteins (e.g., Tay-Sachs disease).
 * Splice Site Variation: Mutations in intronic regions affecting mRNA splicing, leading to exon skipping or inclusion of introns (e.g., beta-thalassemia).
  • Epidemiological Relevance: Mutations' population distribution informs disease prevalence. Rare mutations cause Mendelian disorders (e.g., Huntington's), while common variants (SNPs) contribute to complex traits via polygenic risk scores. Tools like next-generation sequencing (NGS) enable large-scale mutation detection in cohort studies.

Population Genetics

Population genetics examines how genetic variation is maintained, distributed, and evolves in groups, providing the mathematical foundation for genetic epidemiology. It helps predict disease risks based on allele frequencies and detect deviations signaling selection or population structure.

Allele and Genotype Frequencies

The gene pool is the total collection of alleles in a population at a given time. Frequencies are key metrics for assessing genetic diversity and disease susceptibility.

  • Allele Frequency: Proportion of a specific allele at a locus. For a biallelic locus, frequencies sum to 1.

\(\text{Allele Frequency (p for A)} = \frac{2 \times \text{Number of AA} + \text{Number of Aa}}{2 \times \text{Total individuals}}.\)

Example: In a population of 100 people with genotypes: 40 AA, 50 Aa, 10 aa. p (A) = (2*40 + 50) / 200 = 0.65; q (a) = (2*10 + 50) / 200 = 0.35.
  • Genotype Frequency: Proportion of individuals with a specific genotype (e.g., P(AA) = number of AA / total individuals).
  • Applications: Used in GWAS to compare frequencies between cases and controls; deviations can indicate associations.

Hardy-Weinberg Equilibrium (HWE)

HWE is a null model assuming no evolutionary forces, predicting genotype frequencies from allele frequencies in a stable population. Assumptions: Large population, random mating, no mutation, no migration, no selection. For a biallelic locus with alleles A (p) and a (q=1-p):

\[P(AA) = p^2\ (homozygous\ dominant)\]

\[P(Aa) = 2pq\ (heterozygous)\]

\[P(aa) = q^2\ (homozygous\ recessive)\]

  • Deviations from HWE: Indicate forces like inbreeding (excess homozygotes), assortative mating, or genotyping errors. In epidemiology, HWE testing filters SNPs in controls to ensure data quality.
  • Testing for HWE: Use Chi-squared goodness-of-fit test.
Formula\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i},\]

df=1 for biallelic loci.

Critical value: >3.84 for p<0.05 (reject HWE).
Example:
Observed: 400 AA, 500 Aa, 100 aa (total 1000). p=0.65, q=0.35.
Expected: 422.5 AA, 455 Aa, 122.5 aa.
χ² ≈ 10.5 (p<0.05, deviation).

R Implementation for HWE: This code uses base R for manual calculation but adds input validation, exact test option (for small samples), and clearer output. For advanced use, consider the HardyWeinberg package.

# Hardy-Weinberg Equilibrium Test
# Inputs: Observed genotype counts as a named vector (AA, Aa, aa)
obs <- c(AA = 400, Aa = 500, aa = 100)  # Example observed counts

# Input validation
if (any(obs < 0) || length(obs) != 3) stop("Invalid genotype counts")

total <- sum(obs)
if (total == 0) stop("No individuals in population")

# Calculate allele frequencies
p <- (2 * obs["AA"] + obs["Aa"]) / (2 * total)  # Frequency of A
q <- 1 - p  # Frequency of a

# Expected counts under HWE
expected <- c(AA = p^2 * total, Aa = 2 * p * q * total, aa = q^2 * total)

# Chi-squared test (ensure expected >5 for validity)
if (any(expected < 5)) warning("Expected counts <5; consider exact test")

chi2 <- sum((obs - expected)^2 / expected)
df <- 1  # Degrees of freedom for biallelic
p_val_chi <- pchisq(chi2, df = df, lower.tail = FALSE)

# Optional: Fisher's exact test for small samples (using contingency table simulation)
# But for simplicity, we use chi-squared here

# Output results
cat("Allele Frequencies: p(A) =", round(p, 3), "q(a) =", round(q, 3), "\n")
cat("Expected Counts: AA =", round(expected["AA"], 1), "Aa =", round(expected["Aa"], 1), "aa =", round(expected["aa"], 1), "\n")
cat("Chi-squared =", round(chi2, 4), "df =", df, "p-value =", format.pval(p_val_chi), "\n")
if (p_val_chi < 0.05) {
  cat("Reject HWE: Possible deviation due to selection, inbreeding, or error.\n")
} else {
  cat("Fail to reject HWE: Population appears in equilibrium.\n")
}

Pedigree Analysis and Inheritance

Pedigrees are diagrammatic representations of family histories that illustrate the inheritance patterns of traits or diseases across generations. They are essential tools in genetic epidemiology for identifying modes of inheritance, estimating risks, and guiding genetic counseling. Symbols in pedigrees typically include squares for males, circles for females, filled shapes for affected individuals, and slashes for deceased persons. Arrows may indicate the proband (the individual through whom the family is ascertained). Pedigrees help visualize vertical (parent-to-child) or horizontal (sibling) transmission patterns, skips in generations, and sex-specific effects.

Figure 5: Example pedigree illustrating inheritance patterns, with symbols for affected individuals, carriers, and relationships across generations.

Modes of Inheritance

Genetic traits and diseases follow specific inheritance patterns based on the chromosomal location and dominance of alleles. Understanding these modes aids in predicting disease risks and designing targeted epidemiological studies. Below, we detail common modes, including characteristics, pedigree patterns, examples, and epidemiological implications.

  • Autosomal Dominant (AD):
 * Definition: A single copy of the dominant allele (D) is sufficient to express the phenotype. Affected individuals are heterozygous (Dd) or homozygous (DD), while dd are unaffected.
 * Pedigree Characteristics: Vertical transmission (affected individuals in every generation); no skipped generations if fully penetrant. Affects males and females equally; 50% risk to offspring of an affected parent.
 * Example: Huntington's disease (CAG repeat expansion in HTT gene on chromosome 4). Late-onset, neurodegenerative disorder with high penetrance.
 * Epidemiological Notes: Often rare; founder effects in isolated populations increase prevalence. Used in linkage studies to map genes.
 * Limitations: Incomplete penetrance or variable expressivity (different symptom severity) can mimic other patterns.
  • Autosomal Recessive (AR):
 * Definition: Requires two copies of the recessive allele (dd) for expression. Heterozygotes (Dd) are asymptomatic carriers.
 * Pedigree Characteristics: Horizontal transmission (affected siblings from unaffected parents); skips generations. Consanguinity (e.g., cousin marriages) increases risk; equal male-female incidence.
 * Example: Cystic fibrosis (mutations in CFTR gene on chromosome 7). Affects lung and digestive functions; carrier frequency ~1/25 in Caucasians.
 * Epidemiological Notes: Higher prevalence in populations with high carrier rates (e.g., due to heterozygote advantage, like sickle cell anemia in malaria-endemic areas). Screening programs target carriers.
 * Limitations: Phenocopies (environmental mimics) can complicate diagnosis.
  • X-Linked Recessive (XR):
 * Definition: Mutation on the X chromosome; males (X^c Y) express with one copy due to hemizygosity, while females (X^C X^c) are carriers (often unaffected due to X-inactivation).
 * Pedigree Characteristics: No male-to-male transmission; affected males pass to all daughters (carriers) but no sons. Mother-to-son transmission; more males affected.
 * Example: Hemophilia A (mutations in F8 gene on Xq28). Bleeding disorder; historical prevalence in European royalty.
 * Epidemiological Notes: Skewed sex ratios in affected individuals; carrier testing in families. Lyonization (random X-inactivation) can cause mild symptoms in females.
 * Limitations: De novo mutations can appear without family history.
  • Additional Modes:
 * X-Linked Dominant (XD): Rare; affects females more (e.g., Rett syndrome, MECP2 gene). Lethal in males or mild; female-to-offspring transmission.
 * Y-Linked: Male-only transmission (e.g., azoospermia factors on Y chromosome). Rare, as Y has few genes.
 * Mitochondrial: Maternal inheritance (mtDNA from mother); affects both sexes but no father-to-child. Variable due to heteroplasmy (mixed mtDNA). Example: Leber's hereditary optic neuropathy.

Probability in Pedigrees

Probabilistic models quantify inheritance risks, incorporating prior probabilities, transmission, and penetrance. This is crucial for Bayesian risk assessment in genetic counseling.

  • Key Formula: The likelihood of observing a pedigree under a genetic model is\[P(\text{pedigree}) = \prod_{i=1}^{n} P(\text{genotype}_i) \times P(\text{phenotype}_i | \text{genotype}_i),\]
where the product is over all individuals, \(P(\text{genotype}_i)\) is based on Mendelian laws and parental genotypes, and \(P(\text{phenotype}_i | \text{genotype}_i)\) accounts for penetrance.
  • Penetrance: Probability of phenotypic expression given a genotype (e.g., 100% for complete, <100% for incomplete). Incomplete penetrance (e.g., BRCA1 mutations in breast cancer) leads to unaffected gene carriers.
  • Variable Expressivity: Variation in phenotype severity among those with the same genotype (e.g., neurofibromatosis type 1).
  • Example: In an AD pedigree, risk to a child of an affected parent (Dd) is 50% (Dd) × penetrance.
  • Applications: Risk prediction software like BRCAPRO uses these models. In epidemiology, adjusts for ascertainment bias (families selected via affected probands).
  • Limitations: Assumes accurate family history; ignores modifiers like environment.

Linkage Analysis

Linkage analysis identifies genes by studying co-inheritance of markers and traits in families, leveraging the fact that nearby loci on a chromosome are less likely to separate during meiosis.

  • Genetic Linkage: Occurs when genes are close on the same chromosome, violating independent assortment.
  • Recombination Fraction (θ): Proportion of gametes with recombination between two loci.

\[\theta = 0.5\]: No linkage (independent, >50 cM apart). \[\theta < 0.5\]: Linkage (e.g., θ=0.1 means 10% recombination). \[\theta = 0\]: Complete linkage (no recombination, syntenic loci).

Units: Measured in centimorgans (cM); 1 cM ≈ 1% recombination ≈ 1 Mb DNA.
  • Phases: Coupling (alleles on same chromosome) vs. repulsion (opposite).
  • Applications: Parametric (assumes model) for Mendelian diseases; non-parametric for complex traits. Historical in positional cloning (e.g., cystic fibrosis gene).

LOD Score

The LOD (Logarithm of Odds) score statistically tests for linkage by comparing likelihoods.

  • Formula\[Z(\theta) = \log_{10} \frac{L(\text{data} | \theta=\hat{\theta})}{L(\text{data} | \theta=0.5)}\], where \(\hat{\theta}\) is the maximum likelihood estimate of θ, and L is the likelihood function.
  • Calculation: Involves summing over possible phases and recombinants in pedigrees.
  • Interpretation: Positive Z favors linkage; threshold Z≥3 for significance (false positive rate 5%); Z≤-2 excludes linkage.
LOD Score Odds Ratio Interpretation
Z ≤ -2 ≤1:100 Strong evidence against linkage (exclusion)
-2 < Z < 3 Indeterminate Suggestive but not conclusive
Z ≥ 3 ≥1000:1 Strong evidence for linkage (genome-wide significance)
  • Example: In a family with a disease and marker, if data is 1000 times more likely under θ=0.1 than 0.5, Z=3.
  • Limitations: Requires large pedigrees; sensitive to model misspecification (e.g., wrong penetrance). Superseded by GWAS for complex diseases.

Linkage Disequilibrium (LD) and Association

LD refers to non-random association of alleles at different loci in a population, often due to shared ancestry rather than physical linkage. It decays over generations via recombination and is key in association studies like GWAS, where markers tag causal variants.

  • Causes of LD: Mutation, selection, genetic drift, population admixture, or bottlenecks. High LD in isolated populations (e.g., Ashkenazi Jews).
  • Decay: LD decreases with genetic distance and time; measured in haplotype blocks.
  • Vs. Linkage: Linkage is family-based (meiotic); LD is population-based (historical).
  • Applications: Imputation in GWAS; fine-mapping; inferring evolutionary history.

Measures of LD

Common metrics quantify deviation from independence. Assume two biallelic loci: A/a (frequencies p_A, 1-p_A) and B/b (p_B, 1-p_B); haplotype frequency p_AB.

  • D (Disequilibrium Coefficient):
Formula\[D_{AB} = p_{AB} - p_A p_B.\]
Interpretation: D>0: Excess AB haplotypes; D<0: Deficit. D=0: Equilibrium.
Limitations: Sensitive to allele frequencies; not comparable across loci.
  • D' (Normalized D):
Formula: if \(D<0\), \(D' = \frac{D}{\max(-p_A p_B, -(1-p_A)(1-p_B))}\)
if \(D>0\), \(D' = \frac{D}{\min((1-p_A)p_B, p_A(1-p_B))}\).
Interpretation: Ranges -1 to 1; |D'|=1: Complete LD (no recombination evidence); |D'|<1: Partial.
Use: Detects historical recombination.
  • r² (Correlation Coefficient):
Formula\[r^2 = \frac{D^2}{p_A (1-p_A) p_B (1-p_B)}\].
Interpretation: 0 to 1; r²=1: Perfect correlation (one SNP predicts another); r²>0.8: Strong proxy. Relates to power in association studies (effective sample size reduced by 1/r²).
Use: Preferred in GWAS for tag SNP selection; less biased by rare alleles.

Additional Considerations

  • Haplotypes: Combinations of alleles (e.g., estimated via EM algorithm in PHASE software).
  • Visualization: LD plots (triangular heatmaps) using tools like Haploview.
  • Best Practices: Account for population structure (e.g., via principal components) to avoid spurious LD. In epidemiology, LD informs polygenic risk scores.
  • Limitations: LD varies by ancestry (e.g., higher in Europeans than Africans); tagging fails for rare variants.

R Implementation for LD: This code computes D, D', and r² from example haplotype frequencies or counts. It uses base R for simplicity; for real data, consider packages like genetics or snpStats.

# Linkage Disequilibrium Measures
# Input: Haplotype counts (e.g., from population data)
# Assume biallelic loci A/a, B/b
count_AB <- 120  # AB haplotypes
count_Ab <- 80   # Ab
count_aB <- 60   # aB
count_ab <- 140  # ab
total <- sum(c(count_AB, count_Ab, count_aB, count_ab))

# Haplotype frequencies
p_AB <- count_AB / total
p_Ab <- count_Ab / total
p_aB <- count_aB / total
p_ab <- count_ab / total

# Allele frequencies
p_A <- p_AB + p_Ab
p_a <- 1 - p_A
p_B <- p_AB + p_aB
p_b <- 1 - p_B

# D
D <- p_AB - p_A * p_B

# D' (normalized)
D_max_pos <- min(p_A * p_b, p_a * p_B)
D_max_neg <- -min(p_A * p_B, p_a * p_b)
D_prime <- ifelse(D >= 0, D / D_max_pos, D / D_max_neg)

# r-squared
r_sq <- D^2 / (p_A * p_a * p_B * p_b)

# Output
cat("Allele Frequencies: p_A =", round(p_A, 3), "p_B =", round(p_B, 3), "\n")
cat("D =", round(D, 4), "\n")
cat("D' =", round(D_prime, 4), "\n")
cat("r-squared =", round(r_sq, 4), "\n")
if (abs(D_prime) == 1) cat("Complete LD detected.\n") else cat("Partial or no LD.\n")


Genome-Wide Association Studies (GWAS)

GWAS tests for correlation between genetic markers (SNPs) and phenotypes across the entire genome in unrelated individuals.

  • Statistical Model: Typically uses logistic regression for case-control studies.

\[\ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 \cdot \text{SNP} + \text{Covariates}.\]

  • Manhattan & QQ Plots: Used to visualize results. Because millions of tests are performed, strict significance thresholds (e.g., \(p < 5 \times 10^{-8}\)) are required to avoid false positives.

Gene-Environment Interactions

Disease risk is often modeled as a combination of genetics (\(G\)), environment (\(E\)), and their interaction (\(G \times E\)). \[Y = \beta_0 + \beta_1 G + \beta_2 E + \beta_3 (G \times E) + \epsilon.\]

Interaction Models:

  • Synergistic: Genotype exacerbates the risk factor (or vice versa).
  • Independent: Both factors influence risk but do not interact.

SMHS Epi Figure 11.png
Model: Genotype exacerbates the effect of the risk factor

Core Epidemiological Measures

In addition to genetic metrics, standard epidemiological measures provide essential tools for assessing disease risk, evaluating interventions, and guiding public health decisions. These metrics help quantify associations between exposures and outcomes, estimate treatment effects, and inform policy. Below, we outline key measures, including definitions, formulas, interpretations, and examples. Where relevant, we include R code implementations for practical computation.

Absolute Risk Reduction (ARR)

  • Definition: The difference in event rates (incidences) between a control (or unexposed) group and a treatment (or exposed) group. ARR measures the absolute effect of an intervention or exposure on outcome risk.
  • Formula\[ARR = I_{\text{control}} - I_{\text{treatment}}\], where \(I\) represents incidence (proportion of events).
  • Interpretation: A positive ARR indicates risk reduction (benefit); a negative ARR indicates increased risk (harm). It is straightforward but does not account for baseline risk.
  • Example: If the incidence of heart attacks is 10% in the control group and 7% in the treatment group, ARR = 0.10 - 0.07 = 0.03 (3% absolute reduction).
  • When to Use: Prospective studies like randomized controlled trials (RCTs); useful for communicating tangible benefits to patients.
  • Limitations: Sensitive to baseline risk; not ideal for comparing across populations with different event rates.

Relative Risk (RR)

  • Definition: The ratio of the incidence of an outcome in the exposed group to that in the unexposed group. RR assesses how much an exposure increases or decreases the probability of an event.
  • Formula\[RR = \frac{I_{\text{exposed}}}{I_{\text{unexposed}}}\].
  • Interpretation: RR > 1 indicates increased risk due to exposure; RR < 1 indicates protective effect; RR = 1 indicates no association. It is multiplicative and accounts for baseline risk.
  • Example: In a cohort study, smokers have a 20% lung cancer incidence, while non-smokers have 2%. RR = 0.20 / 0.02 = 10 (smokers are 10 times more likely to develop lung cancer).
  • When to Use: Cohort studies or RCTs; preferred for common outcomes.
  • Limitations: Can overestimate associations for rare events; not applicable in case-control studies.

Odds Ratio (OR)

A 2×2 contingency table with odds ratio formula.

2 × 2 Contingency Table for Odds Ratio
Exposed Unexposed Total
Cases (Diseased) a c a + c
Controls (Healthy) b d b + d
Total a + b c + d N

Odds Ratio (OR) Calculation: $$OR = \frac{a / c}{b / d} = \frac{ad}{bc}$$


Formula\[OR = \frac{ad}{bc}\]

Odds Ratio = [Exposed Cases × Unexposed Non-cases] / [Exposed Non-cases × Unexposed Cases].
  • Definition: The ratio of the odds of an outcome in the exposed group to the odds in the unexposed group. OR approximates RR when the outcome is rare.
  • Formula: From a 2x2 contingency table (a = exposed cases, b = exposed non-cases, c = unexposed cases, d = unexposed non-cases)\[OR = \frac{ad}{bc}\].
  • Interpretation: OR > 1 suggests positive association; OR < 1 suggests inverse association; OR = 1 suggests no association. It is often used in logistic regression.
  • Example: In a case-control study of diabetes and obesity: 80 obese diabetics (a), 20 obese non-diabetics (b), 30 non-obese diabetics (c), 70 non-obese non-diabetics (d). OR = (80*70) / (20*30) = 9.33 (obesity increases odds of diabetes by over 9 times).
  • When to Use: Case-control studies or when incidence data is unavailable; common in meta-analyses.
  • Limitations: Not directly interpretable as risk for common outcomes; can differ from RR if events are frequent.

Number Needed to Treat (NNT) or Harm (NNH)

  • Definition: The average number of patients who need to be treated (or exposed) to prevent (or cause) one additional outcome compared to the control. NNT is based on ARR and translates statistical effects into clinical relevance.
  • Formula\[NNT = \frac{1}{|ARR|}\] (use absolute value for magnitude; sign of ARR determines benefit vs. harm).
  • Interpretation: Lower NNT indicates greater treatment efficacy. If ARR > 0, it's NNT (benefit); if ARR < 0, it's NNH (harm). Infinite NNT means no effect.
  • Example (Benefit): ARR = 0.03 (as above), NNT = 1 / 0.03 ≈ 33.3 (treat 33 patients to prevent one heart attack).
  • Example (Harm): If treatment increases incidence from 50% to 80%, ARR = 0.50 - 0.80 = -0.30, NNH = 1 / 0.30 ≈ 3.3 (treat 3 patients to cause one additional bad outcome).
  • When to Use: RCTs or systematic reviews; helps in shared decision-making and cost-benefit analysis.
  • Limitations: Assumes constant ARR; sensitive to time frame and baseline risk. Confidence intervals should be reported for real-world application.

Additional Considerations

  • Confidence Intervals (CI): Always compute 95% CIs for these measures to assess precision (e.g., using bootstrap methods or formulas in R packages like epitools or epiR).
  • Attributable Risk (AR): Extends RR; AR = \(I_{\text{exposed}} - I_{\text{unexposed}}\) (absolute risk due to exposure).
  • Population Attributable Risk (PAR): PAR = \(I_{\text{population}} - I_{\text{unexposed}}\) (proportion of cases attributable to exposure in the population).
  • Best Practices: Adjust for confounders using multivariable models; interpret in context (e.g., RR may seem large for rare events but have small absolute impact).
  • Software Tools: R (with packages like epiR, Epi, or survival for advanced metrics like Hazard Ratios) or Python (with scipy or lifelines) are commonly used.

R Implementation for Key Measures: This code snippet computes ARR, RR, OR, NNT/NNH from example data. It includes error handling and supports both benefit and harm scenarios.

# Install if needed: install.packages("epiR")  # But assuming it's available or use base R

# Example data: 2x2 table for OR/RR (cohort study assumption)
# Rows: Exposed (1) vs Unexposed (0); Columns: Cases vs Non-cases
a <- 20  # Exposed cases
b <- 80  # Exposed non-cases
c <- 2   # Unexposed cases
d <- 98  # Unexposed non-cases

# Incidences
I_exposed <- a / (a + b)
I_unexposed <- c / (c + d)

# Absolute Risk Reduction (assuming unexposed = control, exposed = treatment)
ARR <- I_unexposed - I_exposed  # Positive if treatment reduces risk

# Relative Risk
RR <- I_exposed / I_unexposed

# Odds Ratio
OR <- (a * d) / (b * c)

# Number Needed to Treat/Harm
NNT <- ifelse(ARR != 0, 1 / abs(ARR), Inf)
type <- ifelse(ARR > 0, "NNT (Benefit)", ifelse(ARR < 0, "NNH (Harm)", "No Effect"))

# Output
cat("Incidence Exposed:", round(I_exposed, 3), "\n")
cat("Incidence Unexposed:", round(I_unexposed, 3), "\n")
cat("ARR:", round(ARR, 3), "\n")
cat("RR:", round(RR, 3), "\n")
cat("OR:", round(OR, 3), "\n")
cat(type, ":", ifelse(is.finite(NNT), round(NNT, 1), "Infinite"), "\n")

# Harm example (swap for treatment increasing risk)
I_control <- 0.50
I_treatment <- 0.80
ARR_harm <- I_control - I_treatment
NNT_harm <- 1 / abs(ARR_harm)
cat("\nHarm Example - ARR:", round(ARR_harm, 3), "\n")
cat("NNH:", round(NNT_harm, 1), "\n")


The following code uses a custom function to handle both benefit (reduction in risk) and harm (increase in risk) scenarios dynamically.

# Function to calculate Epidemiological Metrics
calculate_epi_stats <- function(a, b, c, d) {
# Input validation
if (any(c(a, b, c, d) < 0)) stop("Cell counts must be non-negative")

# Basic Incidences
i_exp <- a / (a + b)
i_unexp <- c / (c + d)

# Absolute Risk Reduction (ARR)
# Defined as Control Risk - Treated Risk
arr <- i_unexp - i_exp

# Relative Risk (RR) and Odds Ratio (OR)
# Adding small constant (0.5) if zero to avoid division by zero errors
rr <- i_exp / i_unexp
or <- (a * d) / (b * c)

# NNT / NNH Logic
nnt_val <- if (arr != 0) 1 / abs(arr) else Inf
nnt_label <- ifelse(arr > 0, "NNT (Benefit)",
ifelse(arr < 0, "NNH (Harm)", "No Effect"))

# Return formatted list
return(list(
Incidence_Exposed = round(i_exp, 4),
Incidence_Unexposed = round(i_unexp, 4),
ARR = round(arr, 4),
RR = round(rr, 2),
OR = round(or, 2),
Measure = nnt_label,
Value = round(nnt_val, 1)
))
}

# --- Example 1: Beneficial Treatment ---
# (e.g., Vaccine trial: Exposed = Vaccinated)

result_benefit <- calculate_epi_stats(a=20, b=80, c=40, d=60)
print(result_benefit)

# --- Example 2: Harmful Exposure ---
# (e.g., Smoking vs Lung Cancer)
result_harm <- calculate_epi_stats(a=50, b=50, c=10, d=90)
print(result_harm)


To understand the output and interpret the results, we use the following logic.

Metric Interpretation Calculation
ARR The actual difference in risk between groups. \(I_{unexposed} - I_{exposed}\)
RR How many times more likely the outcome is in the exposed group. \(I_{exposed} / I_{unexposed}\)
OR The ratio of the odds of the outcome occurring in the exposed group. \(\frac{a \times d}{b \times c}\)
NNT The number of patients needed to treat to prevent one bad outcome. \(1 / ARR\)


Note on NNT/NNH: If the ARR is positive (risk decreased), we report Number Needed to Treat (NNT). If the ARR is negative (risk increased), we report Number Needed to Harm (NNH).


Confidence Intervals: In a real-world study, we typically report 95% CIs. Using `prop.test()` in base R to get CIs for risk differences.

Log Transformation: For OR and RR, calculations for CIs are typically done on the scale to ensure symmetry.

Rmd/R Epidemiological simulation of a hypothetical viral epidemic

We'll try to illustrates genomics tracking, prediction, and spatio-temporal quantification of a viral epidemic. For this, we'll build a high-fidelity epidemiological simulation in R, integrating three distinct domains: compartmental modeling (SIR/SEIR), phylodynamics (genomic tracking), and spatial movement.

To keep this self-contained and realistic, we will use a Stochastic Agent-Based Model (ABM) approach. This allows us to track individual "viral sequences" as they mutate and spread across a coordinate system.

The R simulation requires the following libraries:

tidyverse: For data manipulation and plotting.
gganimate: For spatio-temporal visualization.
adegenet or ape: For handling genomic distances.
sf: for projecting this onto real-world maps.


The Core Simulation Engine

This script generates a synthetic population, simulates transmission based on proximity, and introduces mutations (modeled as bit-strings or distance matrices).

library(tidyverse)
library(gganimate)

# --- Setup Parameters ---
set.seed(42)
n_agents <- 200
map_size <- 100
initially_infected <- 3
mutation_rate <- 0.05  # Probability of a genomic shift per transmission
timesteps <- 50

# --- Initialize Population ---
pop <- tibble(
  id = 1:n_agents,
  x = runif(n_agents, 0, map_size),
  y = runif(n_agents, 0, map_size),
  state = "S",
  genome = 0,        # Base viral lineage
  parent_id = NA,
  infect_time = NA
)

# Infect patient zeros
pop$state[1:initially_infected] <- "I"
pop$infect_time[1:initially_infected] <- 0
pop$genome[1:initially_infected] <- sample(1:100, initially_infected)

# --- Run Simulation ---
history <- list()

for (t in 1:timesteps) {
  # 1. Movement (Random Walk)
  pop$x <- pmax(0, pmin(map_size, pop$x + rnorm(n_agents, 0, 2)))
  pop$y <- pmax(0, pmin(map_size, pop$y + rnorm(n_agents, 0, 2)))
  
  # 2. Transmission
  infected <- which(pop$state == "I")
  susceptible <- which(pop$state == "S")
  
  for (i in infected) {
    # Find neighbors in range
    dists <- sqrt((pop$x[susceptible] - pop$x[i])^2 + (pop$y[susceptible] - pop$y[i])^2)
    contacts <- susceptible[dists < 5] # Transmission radius
    
    if (length(contacts) > 0) {
      for (target in contacts) {
        if (runif(1) < 0.3) { # Prob of infection
          pop$state[target] <- "I"
          pop$infect_time[target] <- t
          pop$parent_id[target] <- pop$id[i]
          # Genomic Mutation: Slight drift from parent genome
          pop$genome[target] <- pop$genome[i] + 
            sample(c(-1, 0, 1), 1, prob = c(mutation_rate/2, 1-mutation_rate, mutation_rate/2))
        }
      }
    }
  }
  
  # 3. Recovery
  pop$state[pop$state == "I" & (t - pop$infect_time) > 10] <- "R"
  
  history[[t]] <- pop %>% mutate(time = t)
}

sim_data <- bind_rows(history)

Here is the structure of the simulated dataset.

> str(sim_data )
tibble [10,000 × 8] (S3: tbl_df/tbl/data.frame)
 $ id         : int [1:10000] 1 2 3 4 5 6 7 8 9 10 ...
 $ x          : num [1:10000] 89 92.5 28 83.5 65.4 ...
 $ y          : num [1:10000] 86.5 50.2 85.2 47.1 16.4 ...
 $ state      : chr [1:10000] "I" "I" "I" "S" ...
 $ genome     : num [1:10000] 80 53 24 0 0 0 0 0 0 0 ...
 $ parent_id  : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ infect_time: num [1:10000] 0 0 0 NA NA NA NA NA NA NA ...
 $ time       : int [1:10000] 1 1 1 1 1 1 1 1 1 1 ...

Genomic Tracking & Phylogeny

To visualize the "Genomic Tracking," we treat the genome value as a proxy for a sequence. In a real scenario, this would be a high-dimensional Hamming distance, but here we can visualize the Clade Divergence over time.

  • Visualizing the Spatio-Temporal Spread: Using gganimate, we can see how the virus clusters geographically and how the "Genomic Signature" (color) evolves.
anim <- ggplot(sim_data, aes(x = x, y = y, color = as.factor(genome), size = state)) +
  geom_point(alpha = 0.7) +
  scale_size_manual(values = c("S" = 1, "I" = 3, "R" = 1.5)) +
  theme_minimal() +
  labs(title = "Viral Spread & Genomic Drift: Time {frame}",
       color = "Genomic Lineage",
       size = "Infection Status") +
  transition_time(time)

# animate(anim) # Uncomment to render

Realistic Epidemiological Quantification

To make this "highly realistic," we should calculate the Effective Reproduction Number (\(R_t\)) and the Spatio-Temporal Kernel 3. Realistic Epidemiological Quantification

\(R_t​=\frac{\text{New Infections at time}\ t}{\text{Infectious Individuals at time}\ t−1}.\)​


  • Quantifying the Clusters: We can use a K-nearest neighbor (KNN) or DBSCAN algorithm on the sim_data to identify hotspots

where specific genomic lineages are dominating. This mimics real-world Genomic Surveillance reports.

# Example quantification of lineage prevalence
library(dplyr)
library(ggplot2)

lineage_summary_plot <- sim_data %>%
    filter(state == "I") %>%
    # Use group_by to organize the data by time and genome
    group_by(time, genome) %>% 
    summarise(count = n(), .groups = "drop") %>%
    ggplot(aes(x = time, y = count, fill = as.factor(genome))) +
    geom_area() +
    labs(title = "Genomic Lineage Succession", x = "Days", y = "Active Cases")

# Display the plot
lineage_summary_plot


To extend this code into an Rmd notebook, we can include

 1. Parameter Sweeping: Wrap the simulation in a function and run it with different  values.
 2. Network Analysis: Use the `parent_id` column to create a transmission tree using the igraph package.
 3. Kernel Density: Add a geom_density_2d() layer to the map to show spatial intensity.

Let's expand the simulation from a simple proxy to a high-fidelity Phylodynamic Model by simulating a founder DNA sequence and allow it to evolve through point mutations (substitutions) as it spreads across the population.

  • DNA Sequences & Mutation: Instead of a single number, each agent now carries a character string. When transmission occurs, there is a probability that one or more bases in the sequence will mutate.
library(tidyverse)
library(igraph)
library(ggraph)
library(stringdist) # For Hamming Distance

# --- Configuration ---
seq_length <- 50
bases <- c("A", "C", "G", "T")
mutation_rate <- 0.01 # Probability per base per transmission

# Create a starting sequence (Patient Zero)
founder_seq <- paste(sample(bases, seq_length, replace = TRUE), collapse = "")

# Helper function to mutate DNA
mutate_dna <- function(sequence) {
  seq_vec <- strsplit(sequence, "")[[1]]
  mutation_mask <- runif(seq_length) < mutation_rate
  if(any(mutation_mask)) {
    seq_vec[mutation_mask] <- sample(bases, sum(mutation_mask), replace = TRUE)
  }
  return(paste(seq_vec, collapse = ""))
}

  • Genetic Distance Matrix: To quantify the "genomic tracking" aspect, we use the Hamming Distance, which counts the number of positions at which the corresponding nucleotides are different.
# Extract all unique sequences found in the simulation
unique_genomes <- unique(pop$genome)

# Calculate Hamming Distance Matrix
dist_matrix <- stringdistmatrix(unique_genomes, unique_genomes, method = "hamming")
rownames(dist_matrix) <- unique_genomes
colnames(dist_matrix) <- unique_genomes

# This matrix allows us to see how far 'Variant B' has drifted from 'Variant A'
  • Integrated Spatio-Temporal & Transmission Model: We will update the simulation to record the transmission pairs to build the `igraph` tree and add the spatial density layer.
# --- Initialization Fixes ---
# Ensure genome column is a character vector to hold DNA strings
pop$genome <- as.character(pop$genome) 
edges <- data.frame(from = integer(), to = integer())

for (t in 1:timesteps) {
  # 1. Movement
  pop$x <- pmax(0, pmin(map_size, pop$x + rnorm(n_agents, 0, 2)))
  pop$y <- pmax(0, pmin(map_size, pop$y + rnorm(n_agents, 0, 2)))
  
  # 2. Transmission
  infected <- which(pop$state == "I")
  susceptible <- which(pop$state == "S")
  
  for (i in infected) {
    # Check distances against only the susceptible population
    dists <- sqrt((pop$x[susceptible] - pop$x[i])^2 + (pop$y[susceptible] - pop$y[i])^2)
    contacts <- susceptible[dists < 5] 
    
    for (target in contacts) {
      # Important: Check if target is still susceptible (in case they were infected
      # by someone else earlier in this same timestep loop)
      if (pop$state[target] == "S" && runif(1) < 0.3) { 
        
        # Update Status
        pop$state[target] <- "I"
        pop$infect_time[target] <- t
        pop$parent_id[target] <- pop$id[i]
        
        # 1. Expand Genomic Logic: DNA Mutation
        # We replace the numeric addition with our string mutation function
        new_sequence <- mutate_dna(pop$genome[i]) 
        pop$genome[target] <- new_sequence
        
        # 3. Track Transmission for igraph
        edges <- rbind(edges, data.frame(from = pop$id[i], to = pop$id[target]))
      }
    }
  }
  
  # 3. Recovery
  pop$state[pop$state == "I" & (t - pop$infect_time) > 10] <- "R"
  
  history[[t]] <- pop %>% mutate(time = t)
}

sim_data <- bind_rows(history)

# --- Visualization: Spatial Intensity ---
ggplot(sim_data %>% filter(state == "I"), aes(x = x, y = y)) +
  geom_density_2d_filled(
    aes(fill = after_stat(level)),
    alpha = 0.6,
    contour_var = "ndensity"   # optional: normalize for better contrast
  ) +
  geom_point(aes(color = genome), alpha = 0.9) +
  scale_fill_viridis_d() +      # discrete fill for the bands
  theme_minimal() +
  labs(title = "Viral Hotspots and Genomic Clustering") +
  theme(legend.position = "none")

  • The Transmission Tree (Phylogeny): Using igraph and ggraph, we can visualize the

Who-Infects-Whom network. This is the gold standard for contact tracing and understanding super-spreading events.

# Create graph object
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)

# Plot the tree
ggraph(transmission_graph, layout = 'tree') + 
  geom_edge_diagonal(alpha = 0.2, color = "grey") + 
  geom_node_point(color = "steelblue", size = 2) + 
  theme_void() +
  labs(title = "Reconstructed Transmission Tree")

# An alternative plot_ly() SVG infection graph can be created by
library(plotly)
library(igraph)
library(dplyr)  # For piping and joins

# Extract layout coordinates (tree structure)
layout <- layout_as_tree(transmission_graph)

# Prepare node data with safe column names
node_data <- data.frame(
  node_id = V(transmission_graph)$name,  # Renamed from 'id'
  x = layout[,1],
  y = layout[,2]
)

# Calculate in-degree (number of incoming edges) to identify root nodes
in_degree <- degree(transmission_graph, mode = "in")

# Create nodes dataframe with size based on root status
nodes <- node_data %>%
  mutate(
    size = ifelse(in_degree == 0, 10, 7),  # Root nodes (in-degree=0) are larger
    label = ifelse(in_degree == 0, "Root", "Descendant")
  )

# Prepare edge data with directional arrows
edges <- as_data_frame(transmission_graph, what = "edges") %>%
  left_join(nodes, by = c("from" = "node_id")) %>%
  left_join(nodes, by = c("to" = "node_id"), suffix = c("_from", "_to"))

# 1. Calculate children for each node
children_info <- edges %>%
  group_by(from) %>%
  summarise(children = paste(to, collapse = ", ")) %>%
  rename(node_id = from)

# 2. Calculate parents for each node
parent_info <- edges %>%
  group_by(to) %>%
  summarise(parents = paste(from, collapse = ", ")) %>%
  rename(node_id = to)

# 3. Join this info back to your nodes dataframe
nodes <- nodes %>%
  left_join(children_info, by = "node_id") %>%
  left_join(parent_info, by = "node_id") %>%
  mutate(
    children = ifelse(is.na(children), "None", children),
    parents = ifelse(is.na(parents), "None", parents)
  )

# Create interactive plot
transmission_plot <- plot_ly() %>%
  add_segments(
    data = edges,
    x = ~x_from, y = ~y_from,
    xend = ~x_to, yend = ~y_to,
    line = list(
      color = ~ifelse(x_from < x_to, "#1f77b4", "#ff7f0e"),
      width = 2
    ),
    hoverinfo = "none"
  ) %>%
  add_trace(
    data = nodes,
    x = ~x,
    y = ~y,
    type = "scatter",
    mode = "markers",
    marker = list(
      size = ~size,
      opacity = 0.8,
      line = list(width = 1, color = "black")
    ),
    # Added Parents and Children to the hover text
    text = ~paste0(
      "<b>Node ID:</b> ", node_id, 
      "<br><b>Type:</b> ", label,
      "<br><b>Parent(s):</b> ", parents,
      "<br><b>Children:</b> ", children
    ),
    hoverinfo = "text",
    showlegend = FALSE
  ) %>%
  layout(
    title = "Interactive Viral Transmission Tree (Graph)",
    xaxis = list(showgrid = FALSE, showticklabels = FALSE, zeroline = FALSE),
    yaxis = list(showgrid = FALSE, showticklabels = FALSE, zeroline = FALSE),
    hovermode = "closest"
  )

transmission_plot
  • Genetic Distance: By calculating (distance between sequence and ), we can identify if a new "cluster" appearing in a specific city is a local evolution or a new introduction from outside.
  • Spatial Kernel: The geom_density_2d visualizes the spatial transmission kernel, identifying where the force of infection is highest.
  • Tree Topology: A star-like phylogeny suggests rapid exponential growth, while a ladder-like

phylogeny suggests a stable, endemic transmission chain.

# Proper Genetic Distance Matrix
unique_strains <- unique(sim_data$genome[sim_data$state == "I"])
dist_mat <- stringdistmatrix(unique_strains, unique_strains, method = "hamming")

# Optional: Visualize distance as a heatmap
image(as.matrix(dist_mat), main = "Genetic Distance Heatmap")

# SVG plot using plot_ly()
library(plotly)

# 1. Get the unique viral strains
unique_strains <- unique(sim_data$genome[sim_data$state == "I"])

# 2. Generate clean Short IDs (e.g., "Strain 1", "Strain 2", etc.)
# Or use the first few characters if that's where the ID is hidden
short_ids <- paste0("S-", seq_along(unique_strains)) 

# 3. Calculate the matrix using the full sequences
dist_mat <- stringdistmatrix(unique_strains, unique_strains, method = "hamming")
m <- as.matrix(dist_mat)

# 4. Handle Infinities
m[is.infinite(m)] <- max(m[is.finite(m)], na.rm = TRUE)

# 5. Assign the CLEAN IDs to the matrix
rownames(m) <- short_ids
colnames(m) <- short_ids

# 6. Plot
fig <- plot_ly(
    z = m, 
    x = colnames(m), 
    y = rownames(m), 
    type = "heatmap",
    colorscale = "Viridis",
    # This customdata trick lets you see the FULL sequence when you hover!
    customdata = matrix(unique_strains, nrow=nrow(m), ncol=ncol(m)),
    hovertemplate = "Row: %{y}<br>Col: %{x}<br>Dist: %{z}<extra></extra>"
  ) %>%
  layout(
    xaxis = list(title = "Strain ID", type = "category"),
    yaxis = list(title = "Strain ID", type = "category", autorange = "reversed")
  )

fig

  • Finally, display the transmission tree graph.
inf_graph <- graph_from_data_frame(edges, directed = TRUE)

ggraph(inf_graph, layout = 'kk') + 
  geom_edge_link(alpha = 0.3, arrow = arrow(length = unit(2, 'mm'))) + 
  geom_node_point(color = "red", size = 1) + 
  theme_void() + 
  labs(title = "Epidemiological Transmission Network")

Applications and Software

Modern epidemiology relies heavily on computational tools.

  • Key R Packages: epiR (Epi measures), genetics (HWE, LD), survival (time-to-event), qqman (GWAS visualization).
  • Online Tools: SOCR Distribution Tables.

Problems

Problem 1: Linkage Mapping

Scenario: Analyze the pedigree below under a Dominant Inheritance model. We need to estimate the recombination fraction \(\theta\).

SMHS Epi Figure 8.png

  1. Calculate LOD Scores:

Using Maximum Likelihood Estimation (MLE), if the phase is known and the data include 4 non-recombinant and 1 recombinant meioses, the likelihood is: \[L(\theta) = (1-\theta)^4 \theta\]. The LOD score is defined as: \[Z(\theta) = \log_{10}\left(\frac{L(\theta)}{L(0.5)}\right)\], where \(L(0.5) = (0.5)^5 = 1/32\).

  1. Corrected Result Table:
\(\theta\) \(Z(\theta)\)
0.0 \(-\infty\)
0.10 0.322
0.20 0.418 (Max LOD)
0.50 0.0

The maximum LOD score occurs at \(\hat{\theta} = 0.20\), consistent with 1 recombinant out of 5 informative meioses. > Note: A LOD > 3 is conventionally required for significant linkage; this example is illustrative.

Problem 2: Number Needed to Treat (NNT) and Harm (NNH)

Scenario: A trial reports 800 events among 1000 patients in Treatment Group A and 600 events among 1200 patients in Control Group B.

  1. Event rates:

\[p_A = 800/1000 = 0.80\], \(p_B = 600/1200 = 0.50\).

  1. Absolute Risk Increase (ARI):

\[\text{ARI} = p_A - p_B = 0.30\].

  1. Since the treatment increases risk, we compute the Number Needed to Harm (NNH):

\[\text{NNH} = \frac{1}{\text{ARI}} = \frac{1}{0.30} \approx 3.33\].

Interpretation: For every 3–4 patients treated with intervention A (vs. control), one additional adverse event occurs. The negative NNT is conventionally reported as NNH.

Problem 3: GWAS Power Analysis (R)

Scenario: Simulate a case-control GWAS with 500 cases and 500 controls, Minor Allele Frequency (MAF) = 0.2, and Odds Ratio (OR) = 1.5. Estimate statistical power at \(\alpha = 0.05\).

The following R function correctly simulates genotypes separately for cases and controls based on genotype relative risks derived from the OR:

 gwas_power_case_control <- function(n_cases = 200, n_controls = 200, 
                                  maf = 0.2, OR = 1.5, 
                                  alpha = 0.05, n_sims = 100) {
 # Genotype coding: 0, 1, 2 copies of effect allele
 # Relative risks under additive log-odds model: 1, OR, OR^2
 rr <- c(1, OR, OR^2)
 
 # Genotype frequencies in controls (Hardy-Weinberg)
 g_ctrl <- dbinom(0:2, size = 2, prob = maf)
 
 # Genotype frequencies in cases (proportional to g_ctrl * rr)
 g_case_unscaled <- g_ctrl * rr
 g_case <- g_case_unscaled / sum(g_case_unscaled)
 
 significant <- numeric(n_sims)
 
 for (i in 1:n_sims) {
   # Simulate genotypes
   geno_cases <- sample(0:2, n_cases, replace = TRUE, prob = g_case)
   geno_ctrls <- sample(0:2, n_controls, replace = TRUE, prob = g_ctrl)
   
   geno <- c(geno_cases, geno_ctrls)
   status <- c(rep(1, n_cases), rep(0, n_controls))
   
   # Fit logistic regression
   model <- glm(status ~ geno, family = binomial)
   p_val <- summary(model)$coefficients[2, 4]
   significant[i] <- as.numeric(p_val < alpha)
 }
 
 return(mean(significant))  # Estimated power
 }
 # Example usage:
 set.seed(2026)
 power_est <- gwas_power_case_control()
 cat("Estimated power:", round(power_est, 3), "\n")

Problem 4: Hardy-Weinberg Equilibrium (HWE) Testing

Scenario: In a GWAS control group (\(n = 1000\)), genotype counts for a SNP are: AA = 240, Aa = 120, aa = 40. Test HWE using an exact test.


 # Install if needed: install.packages("HardyWeinberg")
 library(HardyWeinberg)
 # Observed counts: AA, Aa, aa
 obs <- c(240, 120, 40)
 # Exact HWE test
 hwe_test <- HWExact(obs, verbose = FALSE)
 p_val <- hwe_test$pval
 cat("HWE exact test p-value:", format.pval(p_val, digits = 4), "\n")
 # Interpretation: p < 0.001 suggests deviation from HWE (common QC threshold in GWAS)

Problem 5: Assessing Confounding in Observational Data

Scenario: Evaluate whether age and sex confound the association between an exposure and disease outcome.

We'll use the Logistic Distribution function plogis, which provides the standard Density, distribution function, quantile function and random generation for the logistic distribution with parameters location and scale, and rbinom.

  set.seed(123)
  n <- 5000
  # Simulate confounders and exposure

  age <- rnorm(n, mean = 50, sd = 10)
  sex <- rbinom(n, 1, 0.5)
  exposure <- rbinom(n, 1, plogis(-1 + 0.02 * age + 0.3 * sex))

  # Simulate disease influenced by exposure and confounders
  disease <- rbinom(n, 1, plogis(-2 + 0.9 * exposure + 0.03 * age + 0.4 * sex))


  # Crude model
  crude_or <- exp(coef(glm(disease ~ exposure, family = binomial))["exposure"])


  # Adjusted model
  adj_or <- exp(coef(glm(disease ~ exposure + age + sex, family = binomial))["exposure"])

  # Percent change due to confounding
  confounding_pct <- (crude_or - adj_or) / crude_or * 100
  cat("Crude OR:", round(crude_or, 2), "\n")
  cat("Adjusted OR:", round(adj_or, 2), "\n")
  cat("Percent change:", round(confounding_pct, 1), "%\n")

  if (abs(confounding_pct) > 10) {
   cat("→ Substantial confounding detected (change > 10%).\n")
  } else cat("→ No substantial confounding detected (change < 10%).\n")

Problem 6: Diagnostic Test Accuracy via ROC Analysis

Scenario: A continuous biomarker is measured in 100 diseased and 100 non-diseased individuals. Determine the optimal cutoff, sensitivity, and specificity using Youden’s index.

  # Install if needed: install.packages("pROC")
  library(pROC)

  # Simulate biomarker scores
  set.seed(42)
  score_healthy <- rnorm(100, mean = 0, sd = 1)
  score_diseased <- rnorm(100, mean = 1.5, sd = 1)

  score <- c(score_healthy, score_diseased)
  status <- c(rep(0, 100), rep(1, 100))

  # ROC analysis
  roc_curve <- roc(status, score)
  auc_val <- auc(roc_curve)

  # Optimal threshold (Youden's J statistic)
  opt <- coords(roc_curve, "best", ret = c("threshold", "sensitivity", "specificity"))

  cat("AUC:", round(auc_val, 3), "\n")
  cat("Optimal cutoff:", round(opt[["threshold"]], 2), "\n")
  cat("Sensitivity:", round(opt[["sensitivity"]], 3), "\n")
  cat("Specificity:", round(opt[["specificity"]], 3), "\n")

Problem 7

Try to integrate structural fixes, biological realism (fitness selection), and technical optimizations into the above genetics simulation of Phylodynamics and Genomic Surveillance in R. Try to demonstrates how viral mutations, spatial movement, and natural selection interact to shape an outbreak.

The Core Simulation Engine can initialize a population and simulates a Random Walk movement model, by introducing a Fitness Function; mutations at specific genomic loci now increase the transmission probability, simulating the emergence of a Variant of Concern.

library(tidyverse)
library(gganimate)
library(stringdist)
library(igraph)
library(ggraph)
library(plotly)

# --- Setup Parameters ---
set.seed(42)
n_agents <- 200
map_size <- 100
initially_infected <- 3
mutation_rate <- 0.02
timesteps <- 60
seq_length <- 50
bases <- c("A", "C", "G", "T")

# --- Helper Functions ---
founder_seq <- paste(sample(bases, seq_length, replace = TRUE), collapse = "")

mutate_dna <- function(sequence) {
  seq_vec <- strsplit(sequence, "")[[1]]
  mutation_mask <- runif(seq_length) < mutation_rate
  if(any(mutation_mask)) {
    seq_vec[mutation_mask] <- sample(bases, sum(mutation_mask), replace = TRUE)
  }
  return(paste(seq_vec, collapse = ""))
}

calc_fitness <- function(sequence) {
  gc_content <- nchar(gsub("[AT]", "", sequence)) / nchar(sequence)
  return(0.3 + (gc_content * 0.2)) 
}

# --- Initialize Population ---
pop <- tibble(
  id = 1:n_agents,
  x = runif(n_agents, 0, map_size),
  y = runif(n_agents, 0, map_size),
  state = "S",
  genome = founder_seq,
  parent_id = NA_integer_,
  infect_time = NA_real_
)

# Set state to "I" for initial cases
pop$state[1:initially_infected] <- "I"
pop$infect_time[1:initially_infected] <- 0

# --- Run Simulation ---
history <- list()
edges <- data.frame(from = integer(), to = integer())

for (t in 1:timesteps) {
  
  # Assignment operator (<-) to update movement
  # loaded cyclical harmonics
  # pop$x <- pop$x * rep(0.1 + cos(10*pi*t/timesteps), times=n_agents) + rnorm(n_agents, 0, 2)
  # pop$y <- pop$y * rep(0.1 + sin(5*pi*t/timesteps), times=n_agents) + rnorm(n_agents, 0, 2)
  
  # simple stochastic dynamics
  pop$x <- pop$x + rnorm(n_agents, 0, 2)
  pop$y <- pop$y + rnorm(n_agents, 0, 2)

  # Optional: Keep agents within map boundaries
  pop$x <- pmax(0, pmin(map_size, pop$x))
  pop$y <- pmax(0, pmin(map_size, pop$y))

  # 2. Transmission
  infected_idx <- which(pop$state == "I")
  
  for (i in infected_idx) {
    susceptible_idx <- which(pop$state == "S")
    if(length(susceptible_idx) == 0) break
    
    dists <- sqrt((pop$x[susceptible_idx] - pop$x[i])^2 + (pop$y[susceptible_idx] - pop$y[i])^2)
    contacts <- susceptible_idx[dists < 6] 
    
    for (target in contacts) {
      if (pop$state[target] == "S" && runif(1) < calc_fitness(pop$genome[i])) {
        pop$state[target] <- "I"
        pop$infect_time[target] <- t
        pop$parent_id[target] <- pop$id[i]
        pop$genome[target] <- mutate_dna(pop$genome[i])
        edges <- rbind(edges, data.frame(from = pop$id[i], to = pop$id[target]))
      }
    }
  }

  # subsetting and assignment for Recovery
  recovery_idx <- which(pop$state == "I" & (t - pop$infect_time) >= 10)
  if(length(recovery_idx) > 0) {
    pop$state[recovery_idx] <- "R"
  }
  
  history[[t]] <- pop %>% mutate(time = t)
}

sim_data <- bind_rows(history)
# str(sim_data)

2. Spatio-Temporal Visualization: By mapping the genomic strings to colors, we can visualize the "clade emergence" over time.

anim <- ggplot(sim_data, aes(x = x, y = y, color = genome, size = state)) +
geom_point(alpha = 0.6) +
scale_size_manual(values = c("S" = 1, "I" = 3, "R" = 1.5)) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Viral Spread & Selection Pressure: Day {frame}") +
transition_time(time)

# animate(anim)

3. Genomic Tracking & Distance Matrices: To understand how much the virus has drifted, we use the **Hamming Distance**. This calculates the number of base substitutions between any two viral samples.

# Extract final strains
final_strains <- sim_data %>%
filter(state %in% c("I", "R"), !is.na(infect_time)) %>%
distinct(genome) %>%
pull(genome)

dist_mat <- stringdistmatrix(final_strains, final_strains, method = "hamming")
rownames(dist_mat) <- paste0("Strain_", 1:length(final_strains))
colnames(dist_mat) <- paste0("Strain_", 1:length(final_strains))

# Visualize Genetic Distance
plot_ly(z = ~dist_mat, x = colnames(dist_mat), y = rownames(dist_mat), type = "heatmap", colorscale = "Viridis") %>%
layout(title = "Genetic Hamming Distance Matrix")

4. Phylogeny: The Transmission Tree: While a distance matrix shows "how much" change occurred, the Transmission Tree shows "who infected whom." A "star-like" tree indicates rapid expansion, while a "ladder-like" tree indicates a slow, constant chain of transmission.

# Reconstruct the tree using igraph
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)

ggraph(transmission_graph, layout = 'tree') +
geom_edge_diagonal(alpha = 0.3, color = "darkgrey") +
geom_node_point(aes(color = name), size = 3, show.legend = FALSE) +
theme_void() +
labs(title = "Reconstructed Epidemiological Transmission Tree")

5. Quantifying the Molecular Clock: A key concept in phylodynamics is the Molecular Clock, the idea that mutations accumulate at a relatively constant rate over time.

# Calculate divergence from founder
clock_data <- sim_data %>%
filter(!is.na(infect_time)) %>%
mutate(dist_from_founder = stringdist(genome, founder_seq, method = "hamming"))

ggplot(clock_data, aes(x = infect_time, y = dist_from_founder)) +
geom_jitter(alpha = 0.2, color = "steelblue") +
geom_smooth(method = "lm", color = "red") +
theme_minimal() +
labs(title = "The Molecular Clock: Mutation Accumulation",
x = "Time of Infection (Days)",
y = "Hamming Distance from Founder")

Here are a couple of alternative plots ...

library(ggraph)
library(ggplot2)
library(plotly)

# Create the ggraph plot
gg <- ggraph(transmission_graph, layout = 'tree') +
  geom_edge_diagonal(alpha = 0.3, color = "darkgrey") +
  geom_node_point(aes(color = name, 
                      text = paste("ID:", name,
                                   "<br>Click for details")), 
                  size = 5) +
  theme_void() +
  labs(title = "Reconstructed Epidemiological Transmission Tree")

# Convert to interactive plotly
ggplotly(gg, tooltip = "text") %>%
  layout(
    hoverlabel = list(bgcolor = "white", font = list(size = 12)),
    xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
    yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)
  )

and ...

library(plotly)
library(igraph)
library(networkD3)

# Convert igraph to networkD3 format
transmission_d3 <- igraph_to_networkD3(transmission_graph)

# Prepare nodes with tooltips
nodes_d3 <- transmission_d3$nodes
if(exists("pop")) {
  nodes_d3 <- nodes_d3 %>%
    left_join(
      pop %>% 
        mutate(name = as.character(id)) %>% 
        select(name, state, infect_time, genome),
      by = "name"
    )
} else {
  nodes_d3$state <- "Unknown"
  nodes_d3$infect_time <- NA
  nodes_d3$genome <- NA
}

# Create custom tooltips
nodes_d3$tooltip <- sprintf(
  "ID: %s<br>State: %s<br>Infection Time: %s<br>Genome: %s",
  nodes_d3$name,
  nodes_d3$state,
  ifelse(is.na(nodes_d3$infect_time), "N/A", nodes_d3$infect_time),
  ifelse(is.na(nodes_d3$genome), "N/A", substr(nodes_d3$genome, 1, 15))
)

# Create the force-directed network
forceNetwork(
  Links = transmission_d3$links,
  Nodes = nodes_d3,
  Source = 'source',
  Target = 'target',
  NodeID = 'name',
  Group = 'state',
  Nodesize = 5,
  opacity = 0.9,
  zoom = TRUE,
  fontSize = 14,
  linkDistance = 150,
  charge = -100,
  opacityNoHover = 0.5,
  legend = TRUE,
  bounded = TRUE,
  clickAction = 'alert("Node ID: " + d.name + "\\nGroup: " + d.state)'
)

and ...

library(visNetwork)
library(igraph)
library(tidyverse)

# 1. Create the graph object
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)

# 2. Prepare nodes data
# Get vertex names
node_names <- V(transmission_graph)$name

# Merge with pop data for additional information
nodes <- data.frame(
  id = node_names,
  label = node_names,
  title = NA,  # Will be filled with tooltip info
  stringsAsFactors = FALSE
)

# Join with pop data to get metadata for tooltips
if(exists("pop")) {
  nodes <- nodes %>%
    left_join(
      pop %>% 
        mutate(id = as.character(id)) %>% 
        select(id, state, infect_time, genome),
      by = "id"
    ) %>%
    mutate(
      title = paste0(
        "<b>Agent ID:</b> ", id, "<br>",
        "<b>State:</b> ", state, "<br>",
        "<b>Infection Time:</b> ", infect_time, "<br>",
        "<b>Genome:</b> ", ifelse(!is.na(genome), substr(genome, 1, 20), "N/A")
      )
    )
} else {
  # Fallback if pop doesn't exist
  nodes$title <- paste0("<b>Agent ID:</b> ", nodes$id)
}

# 3. Prepare edges data
edges_df <- igraph::as_data_frame(transmission_graph, what = "edges")
edges_vis <- data.frame(
  from = edges_df$from,
  to = edges_df$to,
  arrows = "to",
  color = list(color = "gray", opacity = 0.6),
  width = 1,
  smooth = TRUE
)

# 4. Optional: Add colors based on state or infection time
if("state" %in% colnames(nodes)) {
  # Create color mapping for different states
  unique_states <- unique(nodes$state)
  colors <- viridisLite::viridis(length(unique_states))
  color_map <- setNames(colors, unique_states)
  
  nodes$color <- sapply(nodes$state, function(s) {
    if(is.na(s)) "#808080" else color_map[s]
  })
} else if("infect_time" %in% colnames(nodes)) {
  # Color by infection time
  min_time <- min(nodes$infect_time, na.rm = TRUE)
  max_time <- max(nodes$infect_time, na.rm = TRUE)
  
  # Normalize infection time for color scale
  nodes$value <- nodes$infect_time
  nodes$color <- viridisLite::viridis(100)[
    cut(nodes$infect_time, 100, labels = FALSE)
  ]
} else {
  nodes$color <- "#2C3E50"  # Default color
}

# 5. Create the interactive network
visNetwork(nodes, edges_vis, 
           main = "Interactive Transmission Tree",
           submain = "Hover over nodes for details") %>%
  visNodes(
    shape = "dot",
    size = 15,
    borderWidth = 2,
    borderWidthSelected = 3,
    shadow = list(enabled = TRUE, size = 10)
  ) %>%
  visEdges(
    shadow = TRUE,
    smooth = list(type = "curvedCW", roundness = 0.1)
  ) %>%
  visOptions(
    highlightNearest = list(
      enabled = TRUE,
      degree = 1,
      hover = TRUE,
      algorithm = "hierarchical"
    ),
    nodesIdSelection = list(
      enabled = TRUE,
      values = nodes$id,
      style = 'width: 150px; height: 26px'
    )
  ) %>%
  visInteraction(
    hover = TRUE,
    hoverConnectedEdges = TRUE,
    tooltipDelay = 0,
    navigationButtons = TRUE,
    dragNodes = TRUE,
    dragView = TRUE,
    zoomView = TRUE
  ) %>%
  visPhysics(
    stabilization = TRUE,
    solver = "forceAtlas2Based",
    forceAtlas2Based = list(
      gravitationalConstant = -50,
      centralGravity = 0.01,
      springLength = 100,
      springConstant = 0.08,
      damping = 0.4,
      avoidOverlap = 1
    )
  ) %>%
  visLayout(randomSeed = 123)  # For reproducible layout

As well as ...

# Alternative using plotly's network graph
library(networkD3)

# Convert to D3 format
transmission_d3 <- igraph_to_networkD3(transmission_graph)

# Add node attributes
transmission_d3$nodes <- transmission_d3$nodes %>%
  left_join(
    pop %>% 
      mutate(name = as.character(id)) %>% 
      select(name, state, infect_time, genome),
    by = "name"
  ) %>%
  mutate(
    tooltip = paste0(
      "Agent ID: ", name, 
      "<br>State: ", state,
      "<br>Infection Time: ", infect_time,
      "<br>Genome: ", ifelse(!is.na(genome), substr(genome, 1, 10), "NA"), "..."
    )
  )

# Create force directed network
forceNetwork(
  Links = transmission_d3$links,
  Nodes = transmission_d3$nodes,
  Source = 'source',
  Target = 'target',
  NodeID = 'name',
  Group = 'state',
  Nodesize = 5,
  opacity = 0.8,
  zoom = TRUE,
  fontSize = 12,
  linkDistance = 100,
  charge = -30,
  opacityNoHover = 0.5,
  legend = TRUE
)

Also try to experiment with:

1. Examine Selection Pressure: Modify the calc_fitness() function to give a 50% transmission boost to any sequence containing the pattern GAA. How does this change the speed of the outbreak?
2. Are there Bottleneck Effects? Reduce the initially_infected' to 1 and observe how the genetic diversity of the final population changes compared to starting with 10 infected individuals.
3. Contact Tracing: Use the transmission_graph to identify Super-spreaders, i.e., nodes with high out-degree.

References

  1. DSPA2 Appendix 26: Eidemiology
  2. Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. 3rd ed. Lippincott, 2008.
  3. Clayton D, Hills M. Statistical Models in Epidemiology. Oxford, 2013.
  4. Ziegler A, König IR. A Statistical Approach to Genetic Epidemiology. Wiley, 2010.





Translate this page:

(default)
Uk flag.gif

Deutsch
De flag.gif

Español
Es flag.gif

Français
Fr flag.gif

Italiano
It flag.gif

Português
Pt flag.gif

日本語
Jp flag.gif

България
Bg flag.gif

الامارات العربية المتحدة
Ae flag.gif

Suomi
Fi flag.gif

इस भाषा में
In flag.gif

Norge
No flag.png

한국어
Kr flag.gif

中文
Cn flag.gif

繁体中文
Cn flag.gif

Русский
Ru flag.gif

Nederlands
Nl flag.gif

Ελληνικά
Gr flag.gif

Hrvatska
Hr flag.gif

Česká republika
Cz flag.gif

Danmark
Dk flag.gif

Polska
Pl flag.png

România
Ro flag.png

Sverige
Se flag.gif