SMHS AssociationCausality

Scientific Methods for Health Sciences - Association and Causality


An association is any relationship between two measured quantities that renders them statistically dependent, meaning that the occurrence of one changes the probability of the other. Causality is the relation between one event (the cause) and a second event (the effect), where the second event is understood as a consequence of the first. Association is a much broader relationship than causality: if two events are causally related, they must also be associated, but an association alone is not enough to establish a causal relationship. Many statistical measures of association can be used to infer the presence or absence of association in a sample of data, such as the odds ratio (OR), the risk ratio (RR), and the absolute risk reduction (ARR). Establishing causality is a much more demanding process.


Suppose we study lung cancer in the context of heavy smoking. The table below shows some (simulated) data. Two clear healthcare questions in this case study are: “Is heavy smoking associated with a higher incidence of lung cancer?” and “Is heavy smoking a cause of lung cancer?” To address the first question, we can look at the relative risk of lung cancer associated with heavy smoking.

| Heavy smokers (HS) | Lung cancer (LC): Yes (A) | Lung cancer (LC): No | Total |
|---|---|---|---|
| Yes | 18 | 80 | 98 (B) |
| No | 7 | 95 | 102 (C) |
| Total | 25 | 175 | 200 |

Computing the conditional probabilities (P) of lung cancer (LC) given heavy smoking, $P_1$, and given non-heavy smoking, $P_2$, we can form their ratio to determine whether the risk of lung cancer is higher in heavy smokers (HS) than in non-heavy smokers (NHS).

$P_{1}= P(LC|HS) = \frac {18} {98}= 0.184$

$P_{2}= P(LC|NHS) = \frac {7} {102} = 0.069$

Using the formulas for the odds ratio (OR) and relative risk (RR), the relative risk of lung cancer associated with heavy smoking is:

$RR=\frac {0.184}{0.069} = 2.67.$

The risk of having lung cancer is more than 2.5 times greater for heavy smokers than for non-heavy smokers. Hence RR can be used as evidence of association between heavy smoking and lung cancer. For the same example, the odds ratio (OR) of lung cancer for heavy smokers relative to non-heavy smokers is:

$$ OR = \frac{\frac{P \left( A \mid B \right)}{1 - P \left( A \mid B \right)}}{\frac{P \left( A \mid C \right)}{1 - P \left( A \mid C \right)}} = \frac{\frac{\frac{18}{98}}{1 - \frac{18}{98}}} {\frac{\frac{7}{102}}{1 - \frac{7}{102}}} =\frac{\frac{0.184}{0.816}}{\frac{0.069}{0.931}} = \frac{0.225}{0.074}= 3.04 $$

Thus, the odds of having lung cancer are about 3 times greater for heavy smokers than for non-heavy smokers. Hence OR can also be used as evidence of association between heavy smoking and lung cancer.
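The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch using the simulated counts from the table; the variable names are illustrative. Because it keeps full precision instead of rounding the intermediate probabilities, the results differ slightly in the last digit from the hand calculation.

```python
# Counts from the simulated 2x2 table (variable names are illustrative).
hs_lc, hs_no_lc = 18, 80      # heavy smokers: with / without lung cancer
nhs_lc, nhs_no_lc = 7, 95     # non-heavy smokers: with / without lung cancer

p1 = hs_lc / (hs_lc + hs_no_lc)       # P(LC | HS)
p2 = nhs_lc / (nhs_lc + nhs_no_lc)    # P(LC | NHS)

rr = p1 / p2                                      # relative risk
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))    # odds ratio

print(f"P1 = {p1:.3f}, P2 = {p2:.3f}")            # P1 = 0.184, P2 = 0.069
print(f"RR = {rr:.2f}, OR = {odds_ratio:.2f}")    # RR = 2.68, OR = 3.05
```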

However, this is not sufficient to establish causality between heavy smoking and lung cancer. To address causality, we need to refer to Hill’s criteria for causation, a group of minimal conditions necessary to provide adequate evidence of a causal relationship between an incidence and a consequence, established by the English epidemiologist Sir Austin Bradford Hill in 1965. The criteria are: strength; consistency; specificity; temporality; biological gradient; plausibility; coherence; experiment; analogy. (Each criterion is described in the theory part below.)



Statistical measures such as RR and OR can be calculated as evidence of association between events.

| Factor 2 | Factor 1: Yes | Factor 1: No | Total |
|---|---|---|---|
| Yes | $n_{1,1}$ | $n_{1,2}$ | $n_{1,1} + n_{1,2}$ |
| No | $n_{2,1}$ | $n_{2,2}$ | $n_{2,1} + n_{2,2}$ |
| Total | $n_{1,1} + n_{2,1}$ | $n_{1,2} + n_{2,2}$ | $N=n_{1,1} + n_{1,2} + n_{2,1} + n_{2,2}$ |

$$RR=\frac{\frac{n_{1,1}}{n_{1,1}+ n_{1,2}}}{\frac{n_{2,1}}{n_{2,1}+n_{2,2}}}.$$

$$OR = \frac{n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}.$$
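As a short check on the cross-product form, the odds within each row of the table reduce to a ratio of counts, so the ratio of the two odds simplifies as shown here. Applied to the simulated smoking data, the exact counts give a value slightly above the 3.04 obtained earlier from rounded intermediate values.

$$OR = \frac{\frac{n_{1,1}}{n_{1,1}+n_{1,2}} \Big/ \frac{n_{1,2}}{n_{1,1}+n_{1,2}}}{\frac{n_{2,1}}{n_{2,1}+n_{2,2}} \Big/ \frac{n_{2,2}}{n_{2,1}+n_{2,2}}} = \frac{n_{1,1}/n_{1,2}}{n_{2,1}/n_{2,2}} = \frac{n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}, \qquad OR_{\text{smoking}} = \frac{18 \times 95}{80 \times 7} = \frac{1710}{560} \approx 3.05.$$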

The odds of having factor 1 are OR times as great for people with factor 2 as for people without factor 2.

The risk of having factor 1 among people with factor 2 is RR times the risk of having factor 1 among people without factor 2.

When OR or RR is significantly different from 1, we can conclude that factor 1 and factor 2 are positively associated (>1) or negatively associated (<1).
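One common way to judge whether an estimate is significantly different from 1 is to check whether its confidence interval excludes 1. The sketch below is an illustration and not part of the original material: it assumes the standard large-sample (Wald) intervals on the log scale, and the function name `rr_or_with_ci` and its parameters are hypothetical.

```python
import math

def rr_or_with_ci(n11, n12, n21, n22, z=1.96):
    """Return (RR, RR interval) and (OR, OR interval) for a 2x2 table.

    Rows are factor 2 (yes/no) and columns are factor 1 (yes/no), as in
    the table above. z = 1.96 gives an approximate 95% Wald interval
    computed on the log scale (an assumption of this sketch).
    """
    rr = (n11 / (n11 + n12)) / (n21 / (n21 + n22))
    odds_ratio = (n11 * n22) / (n12 * n21)

    # Large-sample standard errors of log(RR) and log(OR).
    se_log_rr = math.sqrt(1 / n11 - 1 / (n11 + n12) + 1 / n21 - 1 / (n21 + n22))
    se_log_or = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

    def interval(estimate, se):
        log_est = math.log(estimate)
        return (math.exp(log_est - z * se), math.exp(log_est + z * se))

    return (rr, interval(rr, se_log_rr)), (odds_ratio, interval(odds_ratio, se_log_or))

# Simulated smoking data from the earlier example.
(rr, rr_ci), (odds_ratio, or_ci) = rr_or_with_ci(18, 80, 7, 95)
print(f"RR = {rr:.2f}, interval = ({rr_ci[0]:.2f}, {rr_ci[1]:.2f})")
print(f"OR = {odds_ratio:.2f}, interval = ({or_ci[0]:.2f}, {or_ci[1]:.2f})")
```

If the lower limit of an interval lies above 1, the data are consistent with a positive association; if the upper limit lies below 1, with a negative association. With small cell counts an exact method may be preferable to this approximation.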

Halpern-Pearl’s Definition of Causality

In Causes and Explanations: A Structural-Model Approach, Halpern and Pearl gave a definition of actual causality based on the language of structural equations.

Definition of Actual Cause (AC): $\overrightarrow {X} =\overrightarrow{x}$ is an actual cause of $\phi$ in $(M,\overrightarrow{u})$ if the following three conditions hold:

  • AC1. $(M,\overrightarrow{u}) \models (\overrightarrow{X}=\overrightarrow{x}) \wedge \phi$. That is, both $\overrightarrow{X}=\overrightarrow{x}$ and $\phi$ are true in the actual world.
  • AC2. There exists a partition $(\overrightarrow{Z},\overrightarrow{W})$ of $\mathcal{V}$ (the set of endogenous variables) with $\overrightarrow{X} \subseteq \overrightarrow{Z}$ and some setting $(\overrightarrow{x'},\overrightarrow{\omega'})$ of the variables in $(\overrightarrow{X},\overrightarrow{W})$ such that if $(M,\overrightarrow{u}) \models Z=z^*$ for all $Z \in \overrightarrow{Z}$, then both of the following hold:
(a) $(M,\overrightarrow{u}) \models [\overrightarrow{X} \leftarrow \overrightarrow{x'},\overrightarrow{W} \leftarrow \overrightarrow{\omega'}]\neg\phi$. In words, changing $(\overrightarrow{X},\overrightarrow{W})$ from $(\overrightarrow{x},\overrightarrow{\omega})$ to $(\overrightarrow{x'},\overrightarrow{\omega'})$ changes $\phi$ from true to false.
(b) $(M,\overrightarrow{u}) \models [\overrightarrow{X} \leftarrow \overrightarrow{x},\overrightarrow{W} \leftarrow \overrightarrow{\omega'},\overrightarrow{Z'} \leftarrow \overrightarrow{z^*}]\phi$ for all subsets $\overrightarrow{Z'}$ of $\overrightarrow{Z}$. In words, setting $\overrightarrow{W}$ to $\overrightarrow{\omega'}$ should have no effect on $\phi$ as long as $\overrightarrow{X}$ is kept at its current value $\overrightarrow{x}$, even if all the variables in an arbitrary subset of $\overrightarrow{Z}$ are set to their original values in the context $\overrightarrow{u}$.
  • AC3. $\overrightarrow{X}$ is minimal; no subset of $\overrightarrow{X}$ satisfies conditions AC1 and AC2. Minimality ensures that only those elements of the conjunction $\overrightarrow{X}=\overrightarrow{x}$ that are essential for changing $\phi$ in AC2(a) are considered part of a cause; inessential elements are pruned. The types of events that we allow as actual causes are ones of the form $X_1=x_1 \wedge \dots \wedge X_k = x_k$, that is, conjunctions of primitive events; this is abbreviated as $\overrightarrow{X}=\overrightarrow{x}$.
  • Example: Suppose that there was a heavy rain in April and electrical storms in the following two months, and that in June the lightning took hold. If it hadn’t been for the heavy rain in April, the forest would have caught fire in May. The question is whether the April rains caused the forest fire. According to a naive counterfactual analysis, they did, since if it hadn’t rained, there wouldn’t have been a forest fire in June.
This is unacceptable. A good enough story of events and of causation might give us reason to accept some things that seem intuitively to be false, but no theory should persuade us that delaying a forest’s burning for a month (or indeed a minute) is causing a forest fire.
In our framework, as we now show, it is indeed false to say that the April rains caused the fire, but they were a cause of there being a fire in June, as opposed to May. This seems to us intuitively right. To capture the situation, it suffices to use a simple model with three endogenous random variables:
  • AS for “April showers”, with two values—0 standing for did not rain heavily in April and 1 standing for rained heavily in April;
  • ES for “electric storms”, with four possible values: (0,0) (no electric storms in either May or June), (1,0) (electric storms in May but not June), (0,1) (storms in June but not May), and (1,1) (storms in both May and June);
  • F for “fire”, with three possible values: 0 (no fire at all), 1 (fire in May), or 2 (fire in June).
We do not describe the context explicitly, either here or in the other examples. Assume its value $\overrightarrow{u}$ is such that it ensures that there is a shower in April, there are electric storms in both May and June, there is sufficient oxygen, there are no other potential causes of fire (like dropped matches), no other inhibitors of fire (alert campers setting up a bucket brigade), and so on. That is, we choose $\overrightarrow{u}$ so as to allow us to focus on the issue at hand and to ensure that the right things happened (there was both fire and rain).
We do not write out the details of the structural equations, which should be obvious given the story (at least for the context $\overrightarrow{u}$); this is also the case for all the other examples in this section. The causal network is simple: there are edges from AS to F and from ES to F. It is easy to check that each of the following holds.
  • AS = 1 is a cause of the June fire ($F=2$) (taking $\overrightarrow{W}=\{ES\}$ and $\overrightarrow{Z}=\{AS,F\}$) but not of fire ($F=2 \vee F=1$).
  • ES = (1,1) is a cause of both $F=2$ and ($F=1 \vee F=2$). Having electric storms in both May and June caused there to be a fire.
  • $AS=1 \wedge ES=(1,1)$ is not a cause of $F=2$, because it violates the minimality requirement of AC3; each conjunct alone is a cause of $F=2$. Similarly, $AS=1 \wedge ES=(1,1)$ is not a cause of ($F=1 \vee F=2$).
The distinction between April showers being a cause of the fire (which they are not, according to our analysis) and April showers being a cause of a fire in June (which they are) is one that seems not to have been made in the discussion of this problem; nevertheless, it seems to us an important distinction.
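The first two claims in the list above can be checked mechanically for this small model. The sketch below is an illustration under stated assumptions, not the authors' code: the structural equation in `fire` is one plausible reading of the story (the text does not write it out), and `is_cause` is a hypothetical helper that tests only single-variable candidates, taking $\overrightarrow{W}$ to be the other variable. For a single variable, AC3 holds trivially, and AC2(b) reduces to checking that the effect still holds when $\overrightarrow{W}$ is set to its alternative value.

```python
from itertools import product

# Domains and actual values (context: April rain, storms in May and June).
DOMAINS = {
    "AS": (0, 1),
    "ES": tuple(product((0, 1), repeat=2)),  # (storm in May, storm in June)
}
ACTUAL = {"AS": 1, "ES": (1, 1)}

def fire(AS, ES):
    """Assumed structural equation for F: 0 = no fire, 1 = May, 2 = June."""
    may_storm, june_storm = ES
    if may_storm and not AS:   # May lightning takes hold only without April rain
        return 1
    if june_storm:             # otherwise a June storm starts the fire
        return 2
    return 0

def is_cause(x_name, phi):
    """Simplified AC1/AC2 check for the candidate cause x_name = actual value."""
    w_name = "ES" if x_name == "AS" else "AS"
    x = ACTUAL[x_name]
    if not phi(fire(**ACTUAL)):                      # AC1: phi holds in the actual world
        return False
    for w_alt in DOMAINS[w_name]:
        # AC2(b): with X kept at x and W set to w_alt, phi must still hold.
        if not phi(fire(**{x_name: x, w_name: w_alt})):
            continue
        # AC2(a): some other value of X, with W at w_alt, makes phi false.
        for x_alt in DOMAINS[x_name]:
            if x_alt != x and not phi(fire(**{x_name: x_alt, w_name: w_alt})):
                return True
    return False

def june_fire(f):
    return f == 2

def any_fire(f):
    return f in (1, 2)

print(is_cause("AS", june_fire))  # True: April showers are a cause of the June fire
print(is_cause("AS", any_fire))   # False: they are not a cause of there being a fire
print(is_cause("ES", june_fire))  # True
print(is_cause("ES", any_fire))   # True
```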

Hill’s criteria for causality

  • Strength: A small association does not mean that there is no causal effect, though the larger the association, the more likely it is to be causal.
  • Consistency: Consistent findings observed by different persons in different places with different samples strengthen the likelihood of an effect.
  • Specificity: Causation is likely if the association is limited to a very specific population at a specific site and a specific disease, with no other likely explanation. The more specific an association between a factor and an effect is, the greater the probability of a causal relationship.
  • Temporality: The effect has to occur after the cause (and if there is an expected delay between the cause and expected effect, then the effect must occur after that delay).
  • Biological gradient: Greater exposure should generally lead to greater incidence of the effect. However, in some cases, the mere presence of the factor can trigger the effect. In other cases, an inverse proportion is observed: greater exposure leads to lower incidence.
  • Plausibility: A plausible mechanism between cause and effect is helpful (but Hill noted that knowledge of mechanism is limited by current knowledge).
  • Coherence: Coherence between epidemiological and laboratory findings increases the likelihood of an effect. However, Hill noted that lack of such laboratory evidence cannot nullify the epidemiological effect on associations.
  • Experiment: Occasionally it is possible to appeal to experimental evidence.
  • Analogy: The effect of similar factors may be considered.


  • This article reviews, through some important examples, the classical methodological approach for discussing causality in epidemiology. Coronary heart disease (CHD) prevention has benefited greatly in the past from the development of epidemiological research; however, the opposition between association and causation is currently raised by observational data. The easy identification of DNA polymorphisms has prompted new CHD etiological research in the past 10 years. The causality of associations presents some special characteristics when genes are involved, such as the necessity of replication and Mendelian randomization, which might prove to be important in future research.
  • This article, by Austin Bradford Hill, discusses reasoning about causal evidence. It gives a thorough explanation of each of Hill’s criteria and of their implications, taken together and under real-life conditions, for passing from association to causation. It is insightful and worth reading.



While you are visiting your parents, your mother notes that you have been drinking a lot of black tea. She tells you that you should cut back to no more than one cup a day since caffeine is “bad for your health”. To prove her wrong (and justify your love of hot beverages), you do a literature review of the epidemiology and find evidence that tea drinkers have a lower risk of type 2 diabetes. Before you go back to your mother with your findings, you decide to think carefully about the various lines of evidence to see if the data really seem causal.

Compose a plausible biological argument as to why coffee and/or tea would be protective against type 2 diabetes. (You might want to use online resources or be creative.)

You first identify a published ecological study that compared prevalent diabetes and black tea consumption in 50 countries. Their Figure 3 (shown below) illustrates their key finding, which suggests that type 2 diabetes prevalence is lower with greater tea consumption. Comment on how strong an argument this study makes for causality. Be sure to explain your reasoning and note which, if any, of Hill’s criteria are met by this study.



Next, you find a study of type 2 diabetes and tea consumption that was conducted using information from the Women’s Health Study, a large prospective randomized controlled trial evaluating the impacts of low-dose aspirin and vitamin E on cardiovascular disease. In the Women’s Health Study, tea consumption was self-reported at baseline and type 2 diabetes was self-reported over the follow-up period. Personal information was also collected about other individual risk factors for type 2 diabetes. List three characteristics of this study that make it a stronger design to assess causality than the previous study.

The following table illustrates the study findings with respect to tea. Comment on their findings. Looking at the point estimates, comment on whether there is compelling evidence of a relationship between tea consumption and incident diabetes. If there is a relationship, could it be due to chance?

Table 3. Relative Risks (RRs) and 95% CIs of Type 2 Diabetes according to Categories of Various Flavonoid-Rich Food Groups among 38018 Women in the Women’s Health Study.

| Variable | Category of intake: 1st (lowest) | 2nd | 3rd | 4th (highest) |
|---|---|---|---|---|
| Tea | None | <1 cup/d | 1-3 cups/d | ≥4 cups/d |
| No. of cases/Total | 496/12279 | 686/15633 | 363/8344 | 48/1201 |
| Adjusted for age and energy | 1.00 | 1.08 (0.96-1.21) | 1.03 (0.90-1.19) | 0.92 (0.68-1.26) |
| Multivariate model (note 2) | 1.00 | 1.07 (0.95-1.20) | 1.04 (0.90-1.20) | 0.73 (0.52-1.01) |
| Multivariate model (note 3) | 1.00 | 1.07 (0.95-1.21) | 1.05 (0.91-1.21) | 0.72 (0.52-1.01) |

Note: 1. Test for trend based on ordinal variable containing median value for each quantile. 2. Multivariate model: adjusted for age (continuous), BMI (continuous), total energy intake (continuous), smoking (current, past and never), exercise (rarely/never, <1 time/wk, 1-3 times/wk, and ≥4 times/wk), alcohol use (rarely/never, 1-3 drinks/mo, 1-6 drinks/wk, and ≥1 drink/d), history of hypertension (yes/no), history of high cholesterol (yes/no), and family history of diabetes (yes/no). 3. Further adjustment for dietary intakes of fiber (quintiles), glycemic load (quintiles), magnesium (quintiles), and total fat (quintiles).
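As a starting point for reading the table, the unadjusted risk in each intake category can be computed directly from the "No. of cases/Total" row. The sketch below is only an illustration with hypothetical variable names. It treats cases/total as a simple cumulative incidence and ignores follow-up time and all covariates, so its ratios are not expected to match the model-based RRs reported by the authors.

```python
# Counts copied from the "No. of cases/Total" row of Table 3.
categories = ["None", "<1 cup/d", "1-3 cups/d", ">=4 cups/d"]
cases = [496, 686, 363, 48]
totals = [12279, 15633, 8344, 1201]

# Unadjusted risk of type 2 diabetes in each tea-intake category.
risks = [c / n for c, n in zip(cases, totals)]

# Crude risk ratio of each category relative to the "None" reference group.
crude_rr = [r / risks[0] for r in risks]

for name, risk, ratio in zip(categories, risks, crude_rr):
    print(f"{name:>12}: risk = {risk:.4f}, crude RR = {ratio:.2f}")
```

Comparing these crude ratios with the adjusted rows of the table shows how much the estimates move once other risk factors are taken into account.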

Looking more carefully at the table, you note that the authors report associations controlling for various factors in their statistical models. (This approach, like stratification, is used to control for confounding by including the factors in a multivariable regression model.) Reviewing the table, do you see evidence of confounding of the “crude” (age and energy adjusted model) relationship for persons drinking ≥4 cups of tea a day?

The association with ≥4 cups of tea shown in Table 3 becomes stronger (further away from the null) after adjustment for BMI, exercise, and fiber. Given that tea drinkers have lower BMI, better exercise patterns, and healthier diets, why does this finding seem surprising?

The authors also examined associations between an antioxidant contained in tea and two subclinical markers of diabetes (HbA1C & insulin) but found no relationship. How does this information influence your assessment of the causal relationship between tea and type 2 diabetes?

Another European study explored incident type 2 diabetes and tea consumption. The following table presents the associations for consuming >0 to <1, 1 to <4, and ≥4 cups of tea per day as compared to consuming no tea per day (categories in the column on the far left). What feature of their findings makes the association seem more likely to be causal?

| Tea (cups/day) | N total | Cases | Median | Crude HR (95% CI) | Model 1 HR (95% CI) | Model 2 HR (95% CI) | Model 3 HR (95% CI) | Model 4 HR (95% CI) |
|---|---|---|---|---|---|---|---|---|
| 0 | 9499 | 4389 | 0 | 1 | 1 | 1 | 1 | 1 |
| 0 – 1 | 7060 | 3197 | 0.23 | 0.89 (0.80, 0.99) | 0.96 (0.81, 1.07) | 0.96 (0.84, 1.01) | 0.97 (0.85, 1.01) | 1.03 (0.91, 1.16) |
| 1 – 4 | 5751 | 2437 | 2.00 | 0.771 (0.66, 0.90) | 0.85 (0.69, 0.99) | 0.85 (0.71, 1.01) | 0.84 (0.72, 0.98) | 0.93 (0.81, 1.05) |
| ≥4 | 3729 | 1518 | 6.84 | 0.63 (0.50, 0.80) | 0.72 (0.52, 0.90) | 0.72 (0.53, 0.96) | 0.70 (0.53, 0.90) | 0.84 (0.71, 1.00) |
| p-trend | | | | <0.01 | <0.01 | <0.01 | <0.01 | 0.04 |

Note: HR and 95% CI were derived from the modified Cox proportional hazards model by age at baseline and are based on pooled estimates from country-specific analyses using a random effects meta-analysis.

Model 1: sex, smoking status, physical activity level, and education level. Model 2: additional to model 1: intake of energy, protein, carbohydrates, saturated fatty acids, mono-unsaturated fatty acids, poly-unsaturated fatty acids, alcohol, and fiber. Model 3: additional to model 2: intake of coffee, juices, soft drinks, and milk. Model 4: additional to model 3: body mass index.

In both of the final two papers that you reviewed, individuals who failed to answer questions about their intake of tea and persons without BMI information were excluded. Discuss a situation in which this could introduce bias to your study.

Considering the data from the three papers that you reviewed, comment on whether you believe the relationship between tea consumption and type 2 diabetes to be causal.

What study might you propose to do next to establish causality based on Hill’s criteria?