# SMHS LongitudinalData


## Scientific Methods for Health Sciences - Longitudinal Data

### Overview

Longitudinal data are data collected from a large population over a given time period, where the same subjects are measured at multiple points in time. Such data are widely used in statistical and financial studies. This section introduces the concept of longitudinal data.

### Motivation

Data measured at multiple points in time on the same subject are commonly used in various studies. For example, consider a dataset containing students’ standardized test scores in four successive years (shown in the table below). How do we categorize data like this?

| Student Name | Grade 1 (2010) Raw Score | Grade 2 (2011) Raw Score | Grade 3 (2012) Raw Score | Grade 4 (2013) Raw Score |
|---|---|---|---|---|
| Tom | 339 | 350 | 361 | 366 |
| Mike | 332 | 343 | 350 | 351 |
| Vivian | 360 | 380 | 400 | 420 |

### Theory

1) Longitudinal data: Longitudinal data are data collected from a large population over a given time period, where the same subjects are measured at multiple points in time.

• The primary advantage of longitudinal datasets is that they can measure change, which lets us estimate effects over time. For example, for the dataset listed above, we can estimate the effect of various factors on improvement in students’ achievement, or we can estimate the overall effectiveness of individual teachers by examining the performance of successive classes of students they teach.
• Longitudinal data cover both the past and the present, which enables us to evaluate the effect of a specific policy by comparing student performance before and after the policy was introduced.
• Longitudinal data also allow us to use sophisticated analytic strategies to measure the impact of various policies with reasonable precision.
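
As a minimal sketch of how repeated measurements let us quantify change, the scores from the Motivation table can be entered in base R and differenced within each student. The object names (`scores`, `gains`, `total_change`) are illustrative choices, not part of the original text.

```r
# Test scores from the Motivation table, one row per student (wide format).
scores <- data.frame(
  student = c("Tom", "Mike", "Vivian"),
  y2010 = c(339, 332, 360),
  y2011 = c(350, 343, 380),
  y2012 = c(361, 350, 400),
  y2013 = c(366, 351, 420)
)

# Matrix of scores: rows are students, columns are years.
y <- as.matrix(scores[, -1])
rownames(y) <- scores$student

# Year-to-year change within each student (differences along each row).
gains <- t(apply(y, 1, diff))
colnames(gains) <- c("2010-11", "2011-12", "2012-13")
print(gains)

# Total change per student over the four years.
total_change <- y[, "y2013"] - y[, "y2010"]
print(total_change)
```

Vivian gains 20 points every year, whereas Mike's yearly gain shrinks from 11 points to 1. A single cross-sectional snapshot of the scores could not reveal these different trajectories.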

2) Longitudinal study: A longitudinal study is an investigation in which participant outcomes, and possibly treatments or exposures, are collected at multiple follow-up times. It generally yields multiple (repeated) measurements on each subject, which are correlated within subjects and thus require special statistical techniques for analysis and inference. In a longitudinal study, we may also be interested in the time until a key clinical event, such as time to death, in which case survival analysis is generally applied.

• Longitudinal studies play an important role in fields such as epidemiology, clinical research and biology. They are used to characterize normal growth and aging, to assess the effect of risk factors on human health, and to evaluate the effectiveness of treatments. They also require considerable effort to carry out.
• Benefits of longitudinal studies: (1) incident events are recorded: a prospective longitudinal study measures the new occurrence of disease; (2) prospective ascertainment of exposure: recording exposure at multiple follow-up visits can reduce recall bias; (3) measurement of individual change in outcomes; (4) separation of time effects: cohort, period, age; (5) control for cohort effects: in a longitudinal study, the cohort under study is fixed, and thus changes over time are not confounded by cohort differences.
• Challenges of longitudinal studies: (1) participant follow-up: there is a risk of bias due to incomplete follow-up of study participants; (2) analysis of correlated data: if the intra-subject correlation of response measurements is ignored, then inferences such as statistical tests or confidence intervals can be grossly invalid; (3) time-varying covariates: the direction of causality can be complicated by ‘feedback’ between the outcome and the exposure.
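
To see why ignoring intra-subject correlation can invalidate inference, consider a simplified case, assumed here for illustration only: a subject contributes $n$ measurements $Y_{1}, \cdots, Y_{n}$ that share a common variance $\sigma^{2}$ and a common pairwise correlation $\rho$. The variance of the subject's average $\bar{Y}$ is then

$$
Var(\bar{Y}) = \frac{1}{n^{2}}\left[n\sigma^{2} + n(n-1)\rho\sigma^{2}\right] = \frac{\sigma^{2}}{n}\left[1 + (n-1)\rho\right].
$$

Treating the measurements as independent would give $\sigma^{2}/n$, which understates the true variance by the factor $1 + (n-1)\rho$ whenever $\rho > 0$. For example, with $n = 4$ and $\rho = 0.5$ the factor is $2.5$, so naive standard errors would be too small and confidence intervals too narrow.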

3) Notation: Let $Y_{ij}$ denote the outcome measured on subject $i$ at time $t_{ij}$, where $i = 1, \cdots, N$ indexes subjects and $j = 1, \cdots, n$ indexes observations within a subject. We assume that the measurement times follow a common set of follow-up times, $t_{ij} = t_{j}$. Let $X_{ij}$ denote the covariates associated with observation $Y_{ij}$. Common covariates in a longitudinal study include the time, $t_{ij}$, and person-level characteristics such as age and treatment assignment. In many cases, scientific interest focuses on the mean response as a function of covariates such as treatment and time, but we can also make statistical inference about the within-person correlation of observations. Define $\rho_{jk} = corr(Y_{ij}, Y_{ik})$, the within-subject correlation between observations at times $t_{j}$ and $t_{k}$.
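
The notation can be made concrete with the test-score table from the Motivation section. The following base R sketch stores the outcomes as an $N \times n$ matrix and as a "long" table with one row per $(i, j)$ pair; the object and column names (`Y`, `t_j`, `long`, `id`, `time`, `y`) are illustrative assumptions.

```r
# Scores from the Motivation table: N = 3 subjects, n = 4 measurement times.
Y <- rbind(
  Tom    = c(339, 350, 361, 366),
  Mike   = c(332, 343, 350, 351),
  Vivian = c(360, 380, 400, 420)
)
t_j <- c(2010, 2011, 2012, 2013)  # common follow-up times, t_ij = t_j

N <- nrow(Y)  # number of subjects
n <- ncol(Y)  # number of observations per subject

# Long format: one row per (subject i, occasion j) pair.
long <- data.frame(
  i    = rep(seq_len(N), each = n),
  j    = rep(seq_len(n), times = N),
  id   = rep(rownames(Y), each = n),
  time = rep(t_j, times = N),
  y    = as.vector(t(Y))  # all occasions of subject 1 first, then subject 2, ...
)
print(long)

# Y[i, j] is the outcome Y_ij, e.g. subject 3 (Vivian) at occasion 2 (2011).
print(Y[3, 2])
```

The matrix form matches the $Y_{ij}$ notation directly, while the long form keeps the time covariate $t_{ij}$ next to each outcome, which is convenient when covariates vary across occasions.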

4) Exploratory data analysis: This is used to discover patterns of systematic variation across groups, as well as aspects of random variation among individual patients.

• Group means: if we want to measure the average response over time, summary statistics such as means and standard deviations at each time point can reveal whether different groups change in a similar or a different fashion.
• Variation among individuals: a single variance parameter can be used to summarize the uncertainty or variability in a response measurement. In longitudinal data, the ‘distance’ between measurements on different subjects is usually expected to be greater than the distance between repeated measurements taken on the same subject. The total variance can be written as $\sigma^{2} = \frac{1}{2} E[(Y_{ij}-Y_{i'j})^{2}]$ for two different, independent subjects $i \neq i'$, assuming $E(Y_{ij}) = E(Y_{i'j}) = \mu$. The expected variation between two measurements taken on the same person at times $t_{j}$ and $t_{k}$ may not equal the total variation $\sigma^{2}$, because the measurements are correlated: $\sigma^{2}(1-\rho_{jk}) = \frac{1}{2} E[(Y_{ij}-Y_{ik})^{2}]$, assuming $E(Y_{ij}) = E(Y_{ik}) = \mu$. When $\rho_{jk} > 0$, between-subject variation is greater than within-subject variation.
• Characterizing correlation and covariance: characterizing the correlation is useful for understanding components of variation and for identifying a variance or correlation model for regression methods such as GEE (generalized estimating equations) or mixed-effects models. The covariance matrix can be written in terms of the variances $\sigma_{j}^2$ and the correlations $\rho_{jk}$: $cov(Y_{i})=\begin{bmatrix} \sigma_{1}^2 & \sigma_{1}\sigma_{2}\rho_{12} &\cdots &\sigma_{1}\sigma_{n}\rho_{1n} \\ \sigma_{2}\sigma_{1}\rho_{21}&\sigma_{2}^2 &\cdots &\sigma_{2}\sigma_{n}\rho_{2n} \\ \vdots &\vdots & \ddots &\vdots \\ \sigma_{n}\sigma_{1}\rho_{n1}&\sigma_{n}\sigma_{2}\rho_{n2} &\cdots &\sigma_{n}^2 \end{bmatrix}$; the corresponding correlation matrix is $corr(Y_{i})=\begin{bmatrix} 1 & \rho_{12} &\cdots &\rho_{1n} \\ \rho_{21}&1 & \cdots&\rho_{2n} \\ \vdots &\vdots & \ddots &\vdots \\ \rho_{n1}&\rho_{n2} &\cdots &1 \end{bmatrix}$.
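
The exploratory summaries above can be tried out on simulated data. The sketch below is a hedged illustration in base R: it assumes a simple random-intercept model with a common variance $\sigma^{2}$ and a common correlation $\rho$ for every pair of occasions, and all parameter values and object names are arbitrary choices, not taken from the original text.

```r
set.seed(1)

N     <- 2000  # number of subjects (illustrative)
n     <- 4     # measurement occasions per subject
mu    <- 10    # common mean at every occasion
sigma <- 2     # common standard deviation
rho   <- 0.6   # common within-subject correlation

# Random-intercept construction: Y_ij = mu + b_i + e_ij, which gives
# Var(Y_ij) = sigma^2 and corr(Y_ij, Y_ik) = rho for j != k.
b <- rnorm(N, sd = sigma * sqrt(rho))
e <- matrix(rnorm(N * n, sd = sigma * sqrt(1 - rho)), nrow = N)
Y <- mu + b + e  # N x n matrix; b is recycled so that row i receives b[i]

# Mean and standard deviation at each occasion.
occasion_mean <- colMeans(Y)
occasion_sd   <- apply(Y, 2, sd)

# Empirical covariance and correlation matrices (both n x n).
S <- cov(Y)
R <- cor(Y)
print(round(R, 2))

# Half the mean squared difference within a subject (occasions 1 and 2) ...
within_half_sq <- mean((Y[, 1] - Y[, 2])^2) / 2
# ... and between two different subjects at the same occasion.
between_half_sq <- mean((Y[-N, 1] - Y[-1, 1])^2) / 2

print(round(c(within = within_half_sq, target_within = sigma^2 * (1 - rho),
              between = between_half_sq, target_between = sigma^2), 2))
```

Up to sampling error, the within-subject quantity should be close to $\sigma^{2}(1-\rho)$ and the between-subject quantity close to $\sigma^{2}$, and the off-diagonal entries of `R` should be close to $\rho$.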

### Applications

1) This article proposed an extension of generalized linear models to the analysis of longitudinal data. It introduced a class of estimating equations that give consistent estimates of the regression parameters and of their variance under mild assumptions about the time dependence. The estimating equations were derived without specifying the joint distribution of a subject's observations, yet they reduce to the score equations for multivariate Gaussian outcomes. Asymptotic theory was presented for the general class of estimators. Specific cases assuming independence, m-dependence and exchangeable correlation structures for the observations from each subject were discussed. The efficiency of the proposed estimators in two simple situations was considered.
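
For orientation, estimating equations of this kind are commonly written in the following form. The notation ($\mu_{i}$, $D_{i}$, $V_{i}$, $A_{i}$, $R(\alpha)$, $\phi$) is the usual textbook convention and is an assumption here, not a quotation from the article.

$$
\sum_{i=1}^{N} D_{i}^{T} V_{i}^{-1} \left(Y_{i} - \mu_{i}(\beta)\right) = 0, \qquad D_{i} = \frac{\partial \mu_{i}}{\partial \beta^{T}}, \qquad V_{i} = \phi A_{i}^{1/2} R(\alpha) A_{i}^{1/2}
$$

Here $Y_{i}$ is the vector of outcomes for subject $i$, $\mu_{i}(\beta)$ is its mean, $A_{i}$ is a diagonal matrix of variance functions, $\phi$ is a scale parameter, and $R(\alpha)$ is a "working" correlation matrix. The three structures mentioned above correspond to $R = I$ (independence), $R_{jk} = \alpha$ for all $j \neq k$ (exchangeable), and $R_{jk} = \alpha_{|j-k|}$ for $0 < |j-k| \leq m$ with $R_{jk} = 0$ for $|j-k| > m$ (m-dependence).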

2) This article reviewed statistical methods for the analysis of discrete and continuous longitudinal data. The relative merits of longitudinal and cross-sectional studies were discussed. Three approaches (marginal, transition and random effects models) were presented, with emphasis on the distinct interpretations of their coefficients in the discrete data case. The article also reviewed generalized estimating equations for inferences about marginal models. The ideas were illustrated with analyses of a 2 × 2 crossover trial with binary responses and a randomized longitudinal study with a count outcome.

### Software

In R: the ‘longitudinal’ package.


### Problems

    # Install (once) and load the package, then load the example data,
    # which provides the objects tcell.10 and tcell.34.
    install.packages('longitudinal')
    library(longitudinal)
    data(tcell)

    is.longitudinal(tcell.34)
    # [1] TRUE

    summary(tcell.34)
    # Longitudinal data:
    #  58 variables measured at 10 different time points
    #  Total number of measurements per variable: 340
    #  Repeated measurements: yes
    #
    #  To obtain the measurement design call 'get.time.repeats()'.

    plot(tcell.10, 1:9)   # plot the first 9 time series of the data

    dim(tcell.34)         # dataset with 34 repeats
    # [1] 340  58

    get.time.repeats(tcell.34)
    # $time
    # [1]  0  2  4  6  8 18 24 32 48 72
    #
    # $repeats
    # [1] 34 34 34 34 34 34 34 34 34 34

    is.equally.spaced(tcell.34)
    # [1] FALSE
    is.regularly.sampled(tcell.34)
    # [1] TRUE
    has.repeated.measurements(tcell.34)
    # [1] TRUE

    # Compute the mean value at each time point for the first two genes
    condense.longitudinal(tcell.34, 1:2, mean)
    #            RB1    CCNG1
    #  [1,] 17.41394 16.48101
    #  [2,] 17.62637 16.34122
    #  [3,] 17.89343 16.26661
    #  [4,] 17.37539 15.91950
    #  [5,] 17.57670 16.25621
    #  [6,] 17.85805 16.26411
    #  [7,] 17.76270 16.24127
    #  [8,] 17.22543 16.52049
    #  [9,] 16.86306 16.14295
    # [10,] 17.09348 16.58913
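
The design checks in the session above can be reproduced by hand in base R from the reported time points and repeats. This is a sketch for intuition only: it does not call the package, and it assumes that "equally spaced" means a constant gap between successive time points and that "regularly sampled" means the same number of repeats at every time point.

```r
# Time points and repeats as reported by get.time.repeats(tcell.34) above.
time_points <- c(0, 2, 4, 6, 8, 18, 24, 32, 48, 72)
repeats     <- rep(34, 10)

# Gaps between successive time points.
gaps <- diff(time_points)
print(gaps)

# Equally spaced would mean a single distinct gap (assumed definition).
equally_spaced <- length(unique(gaps)) == 1
print(equally_spaced)

# Regularly sampled: same number of repeats at every time point (assumed definition).
regularly_sampled <- length(unique(repeats)) == 1
print(regularly_sampled)

# Total number of measurements per variable.
total_measurements <- sum(repeats)
print(total_measurements)
```

The gaps grow from 2 to 24 time units, which is consistent with `is.equally.spaced` returning FALSE, while 34 repeats at each of the 10 time points account for the 340 rows reported by `dim(tcell.34)`.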

Sorry, I failed to find a SOCR dataset to fit here…