SMHS Epidemiology

Scientific Methods for Health Sciences - Epidemiology

Overview

Epidemiology is the study of the distribution, determinants, and control of health and disease in populations. Early epidemiology focused on infectious agents; modern epidemiology also encompasses genetic factors, environmental exposures, and their interactions. This section describes how health-related risk factors and outcomes are distributed by person, place, and time, with an emphasis on genetic epidemiology.

Motivation

By the end of this module, learners should be able to:

Understand Genetic Foundations: Describe basic features of the human genome, the distribution of mutations, and principles of segregation and linkage.
Analyze Population Dynamics: Apply quantitative genetic concepts to study the relationship between genetic variation and disease variation, including Hardy-Weinberg equilibrium.
Evaluate Associations: Understand prototypical gene-disease relationships, interpret Genome-Wide Association Studies (GWAS), and recognize gene-environment interactions.
Apply Computational Methods: Perform basic genetic association analysis using R and interpret key epidemiological measures like NNT and OR.

Theory: The Human Genome and Mutation

Chromosomal Structure

Chromosomes consist of highly condensed DNA.

Banding: Chromosomes can be stained to reveal banding patterns. With Giemsa staining (G-banding), dark bands mark regions rich in Adenine (A) and Thymine (T); each band spans millions of nucleotides.
Karyotype: A normal human karyotype consists of 46 chromosomes: 22 pairs of autosomes and 1 pair of sex chromosomes (XX or XY).
Functional Elements:

*Centromeres:* Large arrays of repetitive DNA where spindle fibers attach during mitosis.
*Telomeres:* Repetitive sequences acting as a "cap" to provide stability; these shorten with each cell division in somatic cells.
*Chromatin:* Divided into Euchromatin (lightly condensed, gene-rich) and Heterochromatin (highly condensed, often repetitive).

Mutations and Abnormalities

Mutations are alterations in the DNA sequence that can occur in somatic cells or gametes.

Structural Abnormalities:

*Deletion:* Loss of genetic material (e.g., terminal deletion).
*Duplication:* Repetition of a chromosomal segment.
*Translocation:* Rearrangement of parts between non-homologous chromosomes.
*Inversion:* A segment of the chromosome is reversed.

Sequence-Level Mutations:

*Nucleotide Substitution:* Replacement of one base by another without changing the number of bases (e.g., missense, nonsense).
*Indels:* Insertions or deletions that alter the number of nucleotides, causing a frameshift in coding sequence when the length change is not a multiple of three.
*Splice Site Variation:* Alterations at or near exon-intron boundaries that disrupt RNA splicing.
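As a minimal sketch of how a single-nucleotide substitution is classified, the R snippet below compares a reference codon with three single-base variants. The four-entry codon lookup and the helper name `classify_substitution` are illustrative assumptions; this is not a complete genetic code table.

```r
# Small excerpt of the standard genetic code (illustrative, not complete)
codon_table <- c(GAG = "Glu", GAA = "Glu", GTG = "Val", TAG = "Stop")

# Hypothetical helper: classify a substitution by its effect on the encoded residue
classify_substitution <- function(ref, alt, table = codon_table) {
  stopifnot(ref %in% names(table), alt %in% names(table))
  if (table[[alt]] == "Stop") {
    "nonsense"      # codon now signals a premature stop
  } else if (table[[alt]] == table[[ref]]) {
    "synonymous"    # same amino acid, no change in the protein
  } else {
    "missense"      # different amino acid
  }
}

# Three single-base changes to the reference codon GAG
variants <- c("GAA", "GTG", "TAG")
effects <- vapply(variants, classify_substitution, character(1), ref = "GAG")
print(effects)
```

All three variants keep the codon length at three bases, which is why none of them shifts the reading frame.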

Population Genetics

Allele and Genotype Frequencies

The Gene Pool represents all available genetic variation in a population.

Allele Frequency: The prevalence of a particular allele in a population.

\[\text{Allele Frequency} = \frac{\text{Number of specific alleles}}{2 \times \text{Number of people}}\].

Genotype Frequency: The prevalence of a particular genotype (e.g., AA, Aa, aa).
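The two definitions above can be sketched in a few lines of R. The genotype counts are the same illustrative counts used in the HWE example in this section; the variable names are assumptions made for this sketch.

```r
# Illustrative genotype counts for a sample of 1000 people
counts <- c(AA = 400, Aa = 500, aa = 100)
n_people <- sum(counts)

# Genotype frequencies: share of people carrying each genotype
genotype_freq <- counts / n_people

# Allele frequencies: each person carries two alleles, so the denominator is 2n
freq_A <- (2 * counts[["AA"]] + counts[["Aa"]]) / (2 * n_people)
freq_a <- (2 * counts[["aa"]] + counts[["Aa"]]) / (2 * n_people)

print(genotype_freq)
cat("freq(A) =", freq_A, " freq(a) =", freq_a, "\n")
```

Homozygotes contribute two copies of an allele and heterozygotes one, which is why the heterozygote count enters both allele frequencies.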

Hardy-Weinberg Equilibrium (HWE)

In a large population with random mating and no selection, mutation, or migration, allele frequencies predict genotype frequencies. For a biallelic locus with alleles \(A\) (frequency \(p\)) and \(a\) (frequency \(q\)), where \(p+q=1\):

\(P(AA) = p^2\)
\(P(Aa) = 2pq\)
\(P(aa) = q^2\).

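Testing for HWE: To determine whether a population is in HWE, use a Chi-squared (\(\chi^2\)) goodness-of-fit test comparing Observed (\(O_i\)) and Expected (\(E_i\)) genotype counts: \[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\] For a biallelic locus the test has 1 degree of freedom: 3 genotype classes, minus 1, minus 1 for the allele frequency estimated from the data. If \(\chi^2 > 3.84\) (P < 0.05), the null hypothesis of HWE is rejected, which may reflect evolutionary forces (selection, migration, non-random mating), population stratification, or genotyping error.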

R Implementation for HWE:

# Hardy-Weinberg Equilibrium Test
# Method: Manual chi-squared calculation from genotype counts
obs <- c(AA = 400, Aa = 500, aa = 100)  # observed genotype counts
total <- sum(obs)
p <- (2 * obs[["AA"]] + obs[["Aa"]]) / (2 * total)  # frequency of allele A
q <- 1 - p                                          # frequency of allele a

# Expected counts under HWE, named to match the observed genotypes
expected <- c(AA = p^2, Aa = 2 * p * q, aa = q^2) * total

# Chi-squared test with 1 degree of freedom
chi2 <- sum((obs - expected)^2 / expected)
p_val <- pchisq(chi2, df = 1, lower.tail = FALSE)
cat("Chi-squared =", round(chi2, 4), "p-value =", format.pval(p_val), "\n")
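
For the counts used in the code above (400, 500, 100), the arithmetic can be followed by hand. The rounded values are given only as a check on the calculation.

\[p = \frac{2(400) + 500}{2(1000)} = 0.65, \qquad q = 1 - p = 0.35\]

\[E_{AA} = 0.65^2 \times 1000 = 422.5, \quad E_{Aa} = 2(0.65)(0.35) \times 1000 = 455, \quad E_{aa} = 0.35^2 \times 1000 = 122.5\]

\[\chi^2 = \frac{(400-422.5)^2}{422.5} + \frac{(500-455)^2}{455} + \frac{(100-122.5)^2}{122.5} \approx 1.20 + 4.45 + 4.13 = 9.78\]

Because 9.78 exceeds 3.84, HWE is rejected at the 0.05 level for these counts; the sample shows an excess of heterozygotes (500 observed versus 455 expected).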

Pedigree Analysis and Inheritance

Pedigrees trace the transmission of traits through generations.

Modes of Inheritance

Autosomal Dominant:

Individuals with the dominant allele (\(D\)) develop the disease.
Vertical transmission (every affected child has an affected parent).
Occurs in both males and females equally.

Autosomal Recessive:

Individuals must inherit two copies of the recessive allele (\(d\)) to be affected (\(dd\)).
Heterozygotes (\(Dd\)) are carriers.
Often appears in siblings of unaffected parents (horizontal pattern).

X-Linked Recessive:

Females carrying the mutation (\(X^C X^c\)) are usually unaffected carriers.
Males with the mutation (\(X^c Y\)) are affected.
No male-to-male transmission; mother-to-son transmission is characteristic.
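
The autosomal patterns above follow from each parent passing one of two alleles with equal probability. A minimal sketch in R, assuming a single autosomal locus and a hypothetical helper `offspring_probs` that takes two-letter genotypes such as "Dd":

```r
# Hypothetical helper: offspring genotype probabilities at one autosomal locus.
# Each parent is given as a two-character genotype, e.g. "Dd".
offspring_probs <- function(mother, father) {
  m <- strsplit(mother, "")[[1]]   # mother's two alleles
  f <- strsplit(father, "")[[1]]   # father's two alleles
  # The four equally likely pairings of one maternal and one paternal allele
  combos <- expand.grid(m = m, f = f, stringsAsFactors = FALSE)
  # Write each genotype with the uppercase allele first (radix sort uses C-locale order)
  geno <- apply(combos, 1, function(x) paste(sort(x, method = "radix"), collapse = ""))
  probs <- table(geno) / nrow(combos)
  setNames(as.numeric(probs), names(probs))
}

# Autosomal recessive: two unaffected carriers
carrier_cross <- offspring_probs("Dd", "Dd")
print(carrier_cross)    # dd (affected) has probability 0.25

# Autosomal dominant: affected heterozygote and unaffected parent
dominant_cross <- offspring_probs("Dd", "dd")
print(dominant_cross)   # Dd (affected) has probability 0.5
```

The sketch covers autosomal loci only; X-linked inheritance would need the father's single X allele handled separately.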

Probability in Pedigrees

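To estimate risk, we calculate the probability of the pedigree under a hypothesized inheritance model. For a given assignment of genotypes to the \(n\) pedigree members: \[P(\text{pedigree}) = \prod_{i=1}^{n} P(\text{genotype}_i) \times P(\text{phenotype}_i | \text{genotype}_i)\] Here \(P(\text{genotype}_i)\) is the population genotype frequency for founders and the Mendelian transmission probability, given the parents' genotypes, for non-founders. When genotypes are not observed, the likelihood sums this product over all genotype assignments consistent with the data. The second factor is the Penetrance: the probability of expressing a phenotype given a genotype. Incomplete penetrance occurs when an individual with a susceptible genotype does not exhibit the phenotype.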
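
As a minimal sketch of how penetrance enters these calculations, the R code below combines Hardy-Weinberg genotype frequencies with genotype-specific penetrances. The allele frequency (0.01) and the penetrance values (0.8 for carriers, 0 for non-carriers) are assumed purely for illustration.

```r
# Assumed values for illustration only
freq_D <- 0.01   # frequency of the disease allele D
geno_freq <- c(DD = freq_D^2,
               Dd = 2 * freq_D * (1 - freq_D),
               dd = (1 - freq_D)^2)              # HWE genotype frequencies
# P(affected | genotype): dominant model with incomplete penetrance
penetrance <- c(DD = 0.8, Dd = 0.8, dd = 0.0)

# Joint probability of each genotype together with the affected phenotype
joint <- geno_freq * penetrance
prevalence <- sum(joint)          # P(affected) in the population
posterior <- joint / prevalence   # P(genotype | affected), by Bayes' rule

# Product form for two unrelated individuals:
# an affected Dd person and an unaffected dd person
p_pair <- geno_freq[["Dd"]] * penetrance[["Dd"]] *
  geno_freq[["dd"]] * (1 - penetrance[["dd"]])

cat("Prevalence:", prevalence, "\n")
print(round(posterior, 4))
cat("P(pair):", p_pair, "\n")
```

With complete penetrance the value 0.8 would be 1; lowering it reduces the prevalence and makes unaffected carriers possible.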

Linkage Analysis

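Genetic linkage is the tendency of loci that lie close together on the same chromosome to be inherited together. Linkage analysis uses this co-inheritance to estimate how close two loci are.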

Recombination Fraction (\(\theta\)): The probability that two loci will recombine during gamete formation.

\(\theta = 0.5\): Independent assortment (unlinked).
\(\theta < 0.5\): Linkage exists.
\(\theta = 0\): Complete linkage.

LOD Score

The Logarithm of Odds (LOD) score compares the likelihood of the data at a given recombination fraction \(\theta\) with the likelihood under no linkage (\(\theta = 0.5\)): \[Z(\theta) = \log_{10} \frac{L(\theta)}{L(0.5)}\] The maximum LOD score is attained at the maximum likelihood estimate \(\hat{\theta}\).

Interpretation:

LOD Score	Interpretation
\(Z = -2\)	100:1 odds against linkage
\(Z = +3\)	1000:1 odds in favor of linkage (Threshold for significance)
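
A minimal sketch of the LOD calculation for the simplest case, phase-known meioses, where the likelihood is \(\theta^k (1-\theta)^{n-k}\) for \(k\) recombinants among \(n\) informative meioses. The counts (2 recombinants out of 20) and the function name `lod_score` are assumptions for illustration.

```r
# LOD score for phase-known data: k recombinants among n informative meioses
lod_score <- function(theta, k, n) {
  lik <- function(t) t^k * (1 - t)^(n - k)   # binomial likelihood kernel
  log10(lik(theta) / lik(0.5))
}

k <- 2    # assumed number of recombinants
n <- 20   # assumed number of informative meioses

theta_hat <- k / n   # maximum likelihood estimate in the phase-known case

theta_grid <- c(0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50)
lod_table <- data.frame(theta = theta_grid,
                        LOD = round(lod_score(theta_grid, k, n), 3))
print(lod_table)

max_lod <- lod_score(theta_hat, k, n)
cat("theta_hat =", theta_hat, " max LOD =", round(max_lod, 3), "\n")
```

With these assumed counts the maximum LOD is above 3, so the data would meet the conventional threshold for linkage; fewer meioses or more recombinants would lower it.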

Linkage Disequilibrium (LD) and Association

While linkage is observed in families, Linkage Disequilibrium (LD) is a population-based correlation between alleles at different loci.

Measures of LD

D (Disequilibrium Coefficient):

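\[D_{AB} = p_{AB} - p_A p_B\], where \(p_{AB}\) is the frequency of the haplotype carrying alleles \(A\) and \(B\), and \(p_A\) and \(p_B\) are the allele frequencies. \(D_{AB} = 0\) indicates linkage equilibrium.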

D' (Normalized D):

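\[D' = \frac{D}{D_{\max}}\], where \(D_{\max}\) is the largest value of \(|D|\) attainable for the given allele frequencies: \(\min\left(p_A(1-p_B), (1-p_A)p_B\right)\) when \(D > 0\) and \(\min\left(p_A p_B, (1-p_A)(1-p_B)\right)\) when \(D < 0\). \(D'\) ranges from -1 to +1. \(|D'| = 1\) means that at least one of the four possible haplotypes is absent, which implies no evidence of recombination.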

\(r^2\) (Correlation coefficient):

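\[r^2 = \frac{D^2}{p_A(1-p_A)p_B(1-p_B)}\] \(r^2\) is preferred for association studies because it directly measures how well one marker predicts another: the sample size needed to detect an association through a proxy marker increases by a factor of roughly \(1/r^2\). Unlike \(D'\), \(r^2\) depends strongly on allele frequencies and can reach 1 only when the two loci have matching allele frequencies. \(r^2=1\) implies perfect proxy markers.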

R Implementation for LD:

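# Calculating LD Measures
# Haplotype counts below are illustrative placeholders, not real data
hap <- c(AB = 50, Ab = 20, aB = 10, ab = 20)
p_AB <- hap[["AB"]] / sum(hap)                  # frequency of the AB haplotype
p_A <- (hap[["AB"]] + hap[["Ab"]]) / sum(hap)   # frequency of allele A
p_B <- (hap[["AB"]] + hap[["aB"]]) / sum(hap)   # frequency of allele B
D <- p_AB - p_A * p_B
r_sq <- D^2 / (p_A * (1-p_A) * p_B * (1-p_B))
cat("D =", D, "r-squared =", r_sq, "\n")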
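
The normalized coefficient D' can be computed along the same lines. The sketch below is self-contained and assumes hypothetical haplotype frequencies; it uses the standard normalization in which D is divided by the largest magnitude it could take given the allele frequencies.

```r
# Hypothetical haplotype frequencies (must sum to 1)
hap_freq <- c(AB = 0.50, Ab = 0.20, aB = 0.10, ab = 0.20)
stopifnot(isTRUE(all.equal(sum(hap_freq), 1)))

p_A <- hap_freq[["AB"]] + hap_freq[["Ab"]]   # 0.7
p_B <- hap_freq[["AB"]] + hap_freq[["aB"]]   # 0.6
D <- hap_freq[["AB"]] - p_A * p_B            # 0.50 - 0.42 = 0.08

# The bound on |D| depends on the sign of D
D_max <- if (D >= 0) {
  min(p_A * (1 - p_B), (1 - p_A) * p_B)
} else {
  min(p_A * p_B, (1 - p_A) * (1 - p_B))
}

D_prime <- D / D_max
r_sq <- D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))

cat("D =", D, " D' =", round(D_prime, 3), " r^2 =", round(r_sq, 3), "\n")
```

In this example D' (about 0.44) is much larger than r-squared (about 0.13), which illustrates that the two measures answer different questions and should not be used interchangeably.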

Genome-Wide Association Studies (GWAS)

GWAS tests for correlation between genetic markers (SNPs) and phenotypes across the entire genome in unrelated individuals.

Statistical Model: Typically uses logistic regression for case-control studies.

\[\ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 \cdot \text{SNP} + \text{Covariates}\].
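
A minimal sketch of fitting this model in R for a single SNP, using simulated data. The sample size, allele frequency, covariate (`age`), and coefficient values are all assumptions chosen for illustration.

```r
set.seed(42)                          # reproducible illustration
n <- 2000
snp <- rbinom(n, 2, 0.3)              # minor-allele count (0, 1, 2), assumed MAF 0.3
age <- rnorm(n, mean = 50, sd = 10)   # hypothetical covariate

# Simulate disease status from the logistic model with assumed coefficients
log_odds <- -1 + log(1.4) * snp + 0.02 * (age - 50)
status <- rbinom(n, 1, plogis(log_odds))

# Fit the logistic regression: beta_1 is the log odds ratio per allele copy
fit <- glm(status ~ snp + age, family = binomial)

or_snp <- exp(coef(fit)[["snp"]])               # per-allele odds ratio
ci_snp <- exp(confint.default(fit)["snp", ])    # Wald 95% CI on the OR scale

cat("Per-allele OR:", round(or_snp, 2),
    " 95% CI:", round(ci_snp[[1]], 2), "to", round(ci_snp[[2]], 2), "\n")
```

Coding the SNP as an allele count corresponds to an additive (per-allele) effect on the log-odds scale; in a real GWAS this fit is repeated for every marker.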

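Manhattan & QQ Plots: Used to visualize results. A Manhattan plot shows \(-\log_{10}(p)\) for each SNP by genomic position, and a QQ plot compares observed with expected p-values to reveal systematic inflation. Because millions of tests are performed, strict significance thresholds (e.g., \(p < 5 \times 10^{-8}\)) are required to avoid false positives.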
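
The genome-wide threshold is commonly motivated as a Bonferroni correction for about one million independent tests. A short sketch of the arithmetic, where the number of tests is an assumed round figure:

```r
alpha <- 0.05
n_tests <- 1e6   # assumed number of independent tests

# Bonferroni-corrected per-test threshold
bonferroni_threshold <- alpha / n_tests   # 5e-08

# Expected number of false positives if every SNP is truly null
expected_fp_nominal <- n_tests * alpha                  # using p < 0.05
expected_fp_strict  <- n_tests * bonferroni_threshold   # using p < 5e-08

cat("Threshold:", bonferroni_threshold, "\n")
cat("Expected false positives at 0.05:", expected_fp_nominal, "\n")
cat("Expected false positives at the corrected threshold:", expected_fp_strict, "\n")
```

At the nominal 0.05 level, tens of thousands of null SNPs would appear significant by chance alone, which is why the stricter threshold is used.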

Gene-Environment Interactions

Disease risk is often modeled as a combination of genetics (\(G\)), environment (\(E\)), and their interaction (\(G \times E\)). \[Y = \beta_0 + \beta_1 G + \beta_2 E + \beta_3 (G \times E) + \epsilon\].

Interaction Models:

Synergistic: Genotype exacerbates the risk factor (or vice versa).
Independent: Both factors influence risk but do not interact.
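
A minimal sketch of estimating the interaction term with simulated data. The sample size, the exposure and genotype frequencies, and the true coefficients are assumptions for illustration; \(G\) and \(E\) are coded as binary indicators.

```r
set.seed(1)
n <- 5000
G <- rbinom(n, 1, 0.3)   # carries the risk genotype (assumed 30%)
E <- rbinom(n, 1, 0.4)   # exposed to the risk factor (assumed 40%)

# Assumed true model: beta0 = 1, beta1 = 0.5, beta2 = 0.5, beta3 = 1
Y <- 1 + 0.5 * G + 0.5 * E + 1.0 * G * E + rnorm(n)

# In an R formula, G * E expands to G + E + G:E
fit_interaction <- lm(Y ~ G * E)
fit_independent <- lm(Y ~ G + E)   # model without the interaction term

beta3_hat <- coef(fit_interaction)[["G:E"]]
cat("Estimated interaction (beta3):", round(beta3_hat, 3), "\n")

# F-test comparing the two nested models
print(anova(fit_independent, fit_interaction))
```

A positive \(\beta_3\) corresponds to the synergistic case, and \(\beta_3 = 0\) to the independent case.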

Core Epidemiological Measures

In addition to genetic metrics, standard epidemiological measures are vital for public health.

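Relative Risk (RR): \[RR = \frac{I_{\text{exposed}}}{I_{\text{unexposed}}}\], where \(I\) is the incidence (risk) of the outcome in each group.
Odds Ratio (OR): \[OR = \frac{ad}{bc}\], where \(a\) and \(b\) are the numbers of exposed cases and exposed non-cases, and \(c\) and \(d\) are the numbers of unexposed cases and unexposed non-cases (used in Case-Control studies).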
Number Needed to Treat (NNT):

The number of patients that must be treated to prevent one additional bad outcome.

\[NNT = \frac{1}{I_{\text{control}} - I_{\text{treatment}}}\].

*Note:* If the risk difference is negative (treatment increases risk), this becomes Number Needed to Harm (NNH).

R Implementation for NNT:

control_risk <- 0.50
treatment_risk <- 0.80 # Example where treatment is actually harmful/higher risk
RD <- control_risk - treatment_risk
NNT <- ifelse(RD != 0, 1/abs(RD), Inf)
if (RD < 0) {
  cat("Result is NNH (Harm):", round(NNT, 1))
} else {
  cat("Result is NNT (Benefit):", round(NNT, 1))
}
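
The relative risk and odds ratio defined above can be computed from a 2x2 table in the same way. The counts below are hypothetical, and the cell names follow the usual layout (exposure in rows, outcome in columns).

```r
# Hypothetical 2x2 table: rows = exposure, columns = outcome
tab <- matrix(c(30, 70,    # exposed:   30 cases, 70 non-cases
                10, 90),   # unexposed: 10 cases, 90 non-cases
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("exposed", "unexposed"),
                              outcome = c("case", "non_case")))

n_a <- tab["exposed", "case"]         # exposed cases
n_b <- tab["exposed", "non_case"]     # exposed non-cases
n_c <- tab["unexposed", "case"]       # unexposed cases
n_d <- tab["unexposed", "non_case"]   # unexposed non-cases

risk_exposed   <- n_a / (n_a + n_b)   # 0.30
risk_unexposed <- n_c / (n_c + n_d)   # 0.10

RR <- risk_exposed / risk_unexposed   # ratio of risks
OR <- (n_a * n_d) / (n_b * n_c)       # cross-product ratio

cat("RR =", round(RR, 3), " OR =", round(OR, 3), "\n")
```

Here the OR (about 3.86) is larger than the RR (3) because the outcome is common; the two measures are close only when the outcome is rare.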

Applications and Software

Modern epidemiology relies heavily on computational tools.

Key R Packages: `epiR` (Epi measures), `genetics` (HWE, LD), `survival` (time-to-event), `qqman` (GWAS visualization).
Online Tools: SOCR Distribution Tables.

Problems

Problem 1: Linkage Mapping

Scenario: Analyze the pedigree below under a Dominant Inheritance model. We need to estimate the recombination fraction \(\theta\).

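1. Calculate LOD Scores: The recombination fraction is estimated by Maximum Likelihood Estimation (MLE). With 4 non-recombinants and 1 recombinant under one phase arrangement, the phase-known likelihood is \((1-\theta)^4 \theta\); under the opposite phase it is \(\theta^4 (1-\theta)\). If the phase is unknown, we average the two likelihoods: \[L(\theta) = \frac{1}{2}\left[(1-\theta)^4 \theta + \theta^4 (1-\theta)\right]\] \[Z(\theta) = \log_{10}\frac{L(\theta)}{L(0.5)}\] For example, at \(\theta = 0.1\): \(L(0.1) = \frac{1}{2}(0.06561 + 0.00009) = 0.03285\) and \(L(0.5) = 0.03125\), so \(Z(0.1) = \log_{10}(1.0512) \approx 0.022\).

2. Result Table:

\(\theta\)	\(Z(\theta)\)
0.0	\(-\infty\)
0.10	0.022
0.20	0.124 (highest tabulated LOD)
0.50	0.0

Among the tabulated values, the maximum LOD score occurs at \(\theta = 0.20\); a finer search of the phase-averaged likelihood places \(\hat{\theta}\) slightly higher, near 0.21. Because the maximum LOD is far below 3, this family alone does not provide significant evidence of linkage.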
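
The tabulated LOD scores can be reproduced with a short R sketch that averages the likelihood over the two possible phases, as the problem describes for the phase-unknown case. The function names are assumptions made for this sketch.

```r
# Phase-unknown likelihood: average over the two possible phases
# (4 non-recombinants and 1 recombinant under one phase, the reverse under the other)
lik_phase_unknown <- function(theta) {
  0.5 * ((1 - theta)^4 * theta + theta^4 * (1 - theta))
}

# LOD score relative to no linkage (theta = 0.5)
lod <- function(theta) {
  log10(lik_phase_unknown(theta) / lik_phase_unknown(0.5))
}

theta_grid <- c(0, 0.10, 0.20, 0.50)
lod_values <- round(lod(theta_grid), 3)
print(data.frame(theta = theta_grid, LOD = lod_values))

# Finer numerical search for the maximizing recombination fraction
opt <- optimize(lod, interval = c(0.001, 0.5), maximum = TRUE)
cat("theta_hat ~", round(opt$maximum, 3),
    " max LOD ~", round(opt$objective, 3), "\n")
```

The grid values match the table, and the numerical search shows that the maximum lies slightly above 0.20.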

Problem 2: NNT Calculation

Scenario: A trial shows 800/1000 events in Treatment Group A and 600/1200 events in Control Group B.

1. \(p_A = 0.80\), \(p_B = 0.50\).
2. \(NNT = \frac{1}{p_B - p_A} = \frac{1}{0.5 - 0.8} = -3.33\).

Interpretation: Since the value is negative, this represents a Number Needed to Harm (NNH) of 3.3. For every ~3-4 patients treated, one additional adverse event occurs compared to the control.

Problem 3: GWAS Power Analysis (R)

Scenario: Simulate a study with 500 cases/500 controls, Minor Allele Frequency (MAF) = 0.2, OR = 1.5.

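# Simulation-based power for a single-SNP case-control test.
# Assumptions: `maf` is the minor allele frequency among controls, genotypes in
# controls follow HWE, and each minor-allele copy multiplies the odds of disease by `OR`.
simulate_gwas_power <- function(n_cases, n_controls, maf, OR, alpha = 0.05, n_sims = 100) {
  significant <- logical(n_sims)

  # Genotype distribution (0, 1, 2 minor alleles) in controls under HWE
  p_ctrl <- dbinom(0:2, 2, maf)
  # Under the logistic model, genotype g is over-represented in cases by a factor OR^g
  p_case <- p_ctrl * OR^(0:2)
  p_case <- p_case / sum(p_case)

  # Fixed case-control design: the numbers of cases and controls are set by the study
  status <- c(rep(1, n_cases), rep(0, n_controls))

  for (i in seq_len(n_sims)) {
    # Generate genotypes separately for cases and controls
    geno <- c(sample(0:2, n_cases, replace = TRUE, prob = p_case),
              sample(0:2, n_controls, replace = TRUE, prob = p_ctrl))

    # Test: logistic regression of disease status on allele count
    model <- glm(status ~ geno, family = binomial)
    p_val <- summary(model)$coefficients[2, 4]
    significant[i] <- p_val < alpha
  }
  mean(significant)  # proportion of simulations reaching significance
}

# Scenario from the problem statement. The default alpha = 0.05 applies to a single
# test; a genome-wide analysis would use a threshold such as 5e-8.
set.seed(1)  # for reproducibility
simulate_gwas_power(n_cases = 500, n_controls = 500, maf = 0.2, OR = 1.5)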


SOCR Home page: http://www.socr.umich.edu
