Revision as of 15:44, 16 May 2016
Scientific Methods for Health Sciences - Time Series Analysis
Questions
Time series analysis is a class of statistical methods for sequentially ordered data that aims to extract meaningful information, trends, and a characterization of the underlying process from observed longitudinal data. These trends may be used for time series forecasting, that is, for predicting future values from retrospective observations. Note that classical linear modeling (e.g., regression analysis) may also be employed for prediction and for testing associations, using the values of one or more independent variables and their effect on the value of another variable. However, time series analysis also allows dependencies between observations (e.g., seasonal effects) to be accounted for.
Time-series representation
There are 3 (distinct and complementary) types of time series patterns that most time-series analyses are trying to identify, model and analyze. These include:
to the additive model:
We can examine the seasonal trends by decomposing the time series into seasonal, trend and irregular components using loess (local polynomial regression fitting), via the stl function in the default “stats” package:
# using Monthly Males Deaths from Lung Diseases in UK from bronchitis, emphysema and asthma, 1974–1979
mdeaths
# is.ts(mdeaths)
fit <- stl(mdeaths, s.window=5)
plot(mdeaths, col="gray", main=" Lung Diseases in UK ", ylab=" Lung Diseases Deaths", xlab="")
lines(fit$time.series[,2], col="red", ylab="Trend")
plot(fit)  # data, seasonal, trend, residuals
<center><b>“stl” function parameters</b></center>
<center>
{| class="wikitable" style="text-align:center; " border="1"
|-
|x||Univariate time series to be decomposed. This should be an object of class "ts" with a frequency greater than one.
|-
|s.window||either the character string "periodic" or the span (in lags) of the loess window for seasonal extraction, which should be odd and at least 7, according to Cleveland et al. This has no default.
|-
|s.degree||degree of locally-fitted polynomial in seasonal extraction. Should be zero or one.
|-
|t.window||the span (in lags) of the loess window for trend extraction, which should be odd. If NULL, the default, nextodd(ceiling((1.5*period) / (1-(1.5/s.window)))), is taken.
|-
|t.degree||degree of locally-fitted polynomial in trend extraction. Should be zero or one.
|-
|l.window||the span (in lags) of the loess window of the low-pass filter used for each subseries. Defaults to the smallest odd integer greater than or equal to frequency(x) which is recommended since it prevents competition between the trend and seasonal components. If not an odd integer its given value is increased to the next odd one.
|-
|l.degree||degree of locally-fitted polynomial for the subseries low-pass filter. Must be 0 or 1.
|-
|s.jump, t.jump, l.jump||integers at least one to increase speed of the respective smoother. Linear interpolation happens between every *.jump<sup>th</sup> value.
|-
|robust||logical indicating if robust fitting be used in the loess procedure.
|-
|inner||integer; the number of ‘inner’ (backfitting) iterations; usually very few (2) iterations suffice.
|-
|outer||integer; the number of ‘outer’ robustness iterations.
|-
|na.action||action on missing values.
|}
</center>
[[Image:SMHS_TimeSeries3.png|400px]] [[Image:SMHS_TimeSeries4.png|400px]]
monthplot(fit$time.series[,"seasonal"], main="", ylab="Seasonal", lwd=5)
# As the “fit <- stl(mdeaths, s.window=5)” object has 3 time-series components (seasonal; trend; remainder)
# we can alternatively plot them separately:
# monthplot(fit, choice = "seasonal", cex.axis = 0.8)
# monthplot(fit, choice = "trend", cex.axis = 0.8)
# monthplot(fit, choice = "remainder", type = "h", cex.axis = 1.2)  # histogram-like
These are the seasonal plots and seasonal sub-series plots of the seasonal component illustrating the variation in the seasonal component over time (over the years).
Using historical weather (average daily temperature at the University of Michigan, Ann Arbor): [1] (See meta-data description and provenance online: [2]).
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2015 | 26.3 | 14.4 | 34.9 | 49 | 64.2 | 68 | 71.2 | 70.2 | 68.7 | 53.9 | NR | NR |
2014 | 24.4 | 19.4 | 29 | 48.9 | 60.7 | 69.7 | 68.8 | 70.8 | 63.2 | 52.1 | 35.4 | 33.3 |
2013 | 22.7 | 26.1 | 33.3 | 46 | 63.1 | 68.5 | 72.9 | 70.2 | 64.6 | 53.2 | 37.6 | 26.7 |
2012 | 22.4 | 32.8 | 50.7 | 49.2 | 65.2 | 71.4 | 78.9 | 72.2 | 63.9 | 51.7 | 39.6 | 34.8 |
... | ||||||||||||
... | 17 | 15.3 | 31.4 | 47.3 | 57 | 69 | 76.6 | 72 | 63.4 | 52.2 | 35.2 | 23.7 |
1900 | 21.4 | 19.2 | 24.7 | 47.8 | 60.2 | 66.3 | 72 | 75.4 | 67.2 | 59 | 37.6 | 29.2 |
# data: 07_UMich_AnnArbor_MI_TempPrecipitation_HistData_1900_2015.csv
# more complete data is available here: 07_UMich_AnnArbor_MI_TempPrecipitation_HistData_1900_2015.xls
umich_data <- read.csv("https://umich.instructure.com/files/702739/download?download_frd=1", header=TRUE)
head(umich_data)
# https://cran.r-project.org/web/packages/mgcv/mgcv.pdf
# install.packages("mgcv"); require(mgcv)
# install.packages("gamair"); require(gamair)
par(mfrow=c(1,1))
The data are in wide format – convert to long format for plotting
library("reshape2")   # provides melt()
long_data <- melt(umich_data, id.vars = c("Year"), value.name = "temperature")
l.sort <- long_data[order(long_data$Year),]
head(l.sort); tail(l.sort)
plot(l.sort$temperature, type = "l")
Fit the GAMM Model (Generalized Additive Mixed Model)
Fit a model with trend and seasonal components --- computation may be slow:
# define the parameters controlling the process of model-fitting/parameter-estimation
ctrl <- list(niterEM = 0, msVerbose = TRUE, optimMethod="L-BFGS-B")
# First try this model
mod <- gamm(as.numeric(temperature) ~ s(as.numeric(Year)) + s(as.numeric(variable)), data = l.sort,
            method = "REML", correlation=corAR1(form = ~ 1|Year), knots=list(Variable = c(1, 12)),
            na.action=na.omit, control = ctrl)
#Correlation: corStruct object defining correlation structures in lme. Grouping factors in the formula for this
#object are assumed to be nested within any random effect grouping factors, without the need to make this
#explicit in the formula (somewhat different from the behavior of lme).
#This is similar to the GEE approach to correlation in the generalized case.
#Knots: an optional list of user specified knot values to be used for basis construction --
#different terms can use different numbers of knots, unless they share a covariate.
#If you revise the model like this (below), it will compare nicely with 3 ARMA models (later)
mod <- gamm(as.numeric(temperature) ~ s(as.numeric(Year), k=116) + s(as.numeric(variable), k=12), data = l.sort, correlation = corAR1(form = ~ 1|Year), control = ctrl)
Summary of the fitted model:
summary(mod$gam)
<b>Visualize the model trend (year) and seasonal terms (months)</b>
plot(mod$gam, pages = 1)
t <- cbind(1: 1392)   # define the time
Plot the trend on the observed data -- with prediction:
pred2 <- predict(mod$gam, newdata = l.sort, type = "terms")
ptemp2 <- attr(pred2, "constant") + pred2[,1]   # pred2[,1] = trend; pred2[,2] = seasonal effects
# mod$gam is a GAM object containing information to use predict, summary and print methods,
# but not to use e.g. the anova method function to compare models
plot(temperature ~ t, data = l.sort, type = "l", xlab = "year", ylab = expression(Temperature ~ (degree*F)))
lines(ptemp2 ~ t, data = l.sort, col = "blue", lwd = 2)
Plot the seasonal model
pred <- predict(mod$gam, newdata = l.sort, type = "terms")
ptemp <- attr(pred, "constant") + pred[,2]   # pred[,2] = seasonal effects
plot(l.sort$temperature ~ t, data = l.sort, type = "l", xlab = "year", ylab = expression(Temperature ~ (degree*F)))
lines(ptemp, col = "red", lwd = 0.5)
Zoom in on the first 120 observations (xlim = 0 to 120)
plot(l.sort$temperature ~ t, data = l.sort, type = "l", xlim=c(0, 120), xlab = "year", ylab = expression(Temperature ~ (degree*F)))
lines(ptemp, col = "red", lwd = 0.5)
[[Image:SMHS_TimeSeries10.png|500px]]
To examine how much the estimated trend has changed over the 116-year period, we can use the data contained in <b>pred</b> to compute the difference between the start (Jan 1900) and the end (Dec 2015) of the series in the <i><u>trend</u></i> component only:
tail(pred[,1], 1) - head(pred[,1], 1)   # subtract the predicted temp [,1] in 1900 (head) from the temp in 2015 (tail)
# names(attributes(pred)); str(pred)   # to see the components of the GAM prediction model object (pred)
<b>Assess autocorrelation in residuals</b>
# head(umich_data); tail(umich_data)
acf(resid(mod$lme), lag.max = 36, main = "ACF")
# acf = Auto-correlation and Cross-Covariance Function computes and plots the estimates of the autocovariance or autocorrelation function.
# pacf is the function used for the partial autocorrelations.
# ccf computes the cross-correlation or cross-covariance of two univariate series.
pacf(resid(mod$lme), lag.max = 36, main = "pACF")
Looking at the residuals of this model, using the (partial) autocorrelation function, we see that there may be some residual autocorrelation in the data that the trend term didn’t account for. The shapes of the ACF and the pACF suggest an <b>AR(p)</b> model might be appropriate.
<b>Fit and compare 4 alternative autoregressive models (original mod, AR1, AR2 and AR3)</b>
## AR(1)
m1 <- gamm(as.numeric(temperature) ~ s(as.numeric(Year), k=116) + s(as.numeric(variable), k=12), data = l.sort, correlation = corARMA(form = ~ 1|Year, p = 1), control = ctrl)
## AR(2)
m2 <- gamm(as.numeric(temperature) ~ s(as.numeric(Year), k=116) + s(as.numeric(variable), k=12), data = l.sort, correlation = corARMA(form = ~ 1|Year, p = 2), control = ctrl)
## AR(3)
m3 <- gamm(as.numeric(temperature) ~ s(as.numeric(Year), k=116) + s(as.numeric(variable), k=12), data = l.sort, correlation = corARMA(form = ~ 1|Year, p = 3), control = ctrl)
Note that the correlation argument is specified by <b>corARMA(form = ~ 1|Year, p = x)</b>, which fits an ARMA (auto-regressive moving average) process to the residuals, where <b>p</b> indicates the order for the <b>AR</b> part of the ARMA model, and <b>form = ~ 1|Year</b> specifies that the ARMA is nested within each year. This may expedite the model fitting but may also hide potential residual variation from one year to another.
Let’s compare the candidate models by using the generalized likelihood ratio test via the <b>anova()</b> method for <b>lme</b> objects, see our previous mixed effects modeling notes <sup>1</sup>, <sup>2</sup>. This model selection is justified because the models are nested: we go from the AR(3) to the AR(1) by setting some of the AR coefficients to 0. The models also vary in terms of the coefficient estimates for the spline terms, which may require fixing some values while choosing the AR structure.
<b><center>anova(mod$lme, m1$lme, m2$lme, m3$lme)</center></b>
<center>
{| class="wikitable" style="text-align:center; " border="1"
|-
| ||Model||df||AIC||BIC||logLik||Test||L.Ratio||p-value
|-
|mod$lme||1||7||7455.609||7492.228|| -3720.805|| || ||
|-
|m1$lme||2||7||7455.609||7492.228|| -3720.805|| || ||
|-
|m2$lme||3||8||7453.982||7495.832|| -3718.991||2 vs 3||3.627409||0.0568
|-
|m3$lme||4||9||7455.966||7503.048|| -3718.983||3 vs 4||0.015687||0.9003
|}
</center>
====Interpretation====
The AR(1) model (m1) gives exactly the same fit as the initial model (mod), because mod already specifies AR(1) errors through corAR1, and the AR(2) model (m2) provides only a marginal improvement over the AR(1) fit (p = 0.0568). There is no improvement in moving from m2 to the AR(3) model (m3; p = 0.9003). Let’s plot the AR(2) model (m2) to inspect whether the initial model over-fitted the trend term; the trend shows similar smoothness compared to the initial (mod) model.
plot(m2$gam, scale = 0)   # “scale=0” ensures optimal y-axis cropping of plot
# plot(mod2$gam, scale = 0)
[[Image:SMHS_TimeSeries11.png|500px]]
<b>Investigation of residual patterns</b>
layout(matrix(1:2, ncol = 2))
# original (mod) model
acf(resid(mod$lme), lag.max = 36, main = "ACF"); pacf(resid(mod$lme), lag.max = 36, main = "pACF")
# pACF controls for the values of the time series at all shorter lags, in contrast to the ACF, which does not control for other lags.
[[Image:SMHS_TimeSeries12.png|500px]]
This illustrates that there is some lag-1 (one-month) autocorrelation (ACF) and partial autocorrelation in the residuals.
# AR(2) model (m2)
layout(matrix(1:2, ncol = 2))
res <- resid(m2$lme, type = "normalized")
acf(res, lag.max = 36, main = "ACF - AR(2) errors"); pacf(res, lag.max = 36, main = "pACF- AR(2) errors")
layout(1)
No residual auto-correlation remains in m2. The resulting fitted Generalized Additive Mixed Model (GAMM) object contains information about the trend and the contributions to the fitted values. The package mgcv<sup>3</sup> can extract this information using predict() for each of the 4 models.
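The L.Ratio, p-value and AIC columns of the anova() table in the model comparison above can be reproduced by hand. The following is a minimal sketch that only re-uses the rounded logLik and df values printed in that table; the variable names are illustrative and not part of the original notes.

```r
# Values copied from the anova() table (rounded as reported there)
logLik_m1 <- -3720.805; df_m1 <- 7   # AR(1) model
logLik_m2 <- -3718.991; df_m2 <- 8   # AR(2) model

# Likelihood-ratio statistic and its chi-square p-value
# (degrees of freedom = difference in the number of parameters)
L_ratio <- 2 * (logLik_m2 - logLik_m1)
p_value <- pchisq(L_ratio, df = df_m2 - df_m1, lower.tail = FALSE)

# AIC = -2 * logLik + 2 * (number of parameters)
aic_m1 <- -2 * logLik_m1 + 2 * df_m1
aic_m2 <- -2 * logLik_m2 + 2 * df_m2

round(c(L.Ratio = L_ratio, p.value = p_value, AIC.m1 = aic_m1, AIC.m2 = aic_m2), 4)
```

Up to the rounding of the tabulated log-likelihoods, these values should agree with the "2 vs 3" row and the AIC column of the table.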
# require(mgcv); require(gamair)
# m2 <- gamm(as.numeric(temperature) ~ s(as.numeric(Year), k=116) + s(as.numeric(variable), k=12), data = l.sort, correlation = corARMA(form = ~ 1|Year, p = 2), control = ctrl)
pred2 <- predict(m2$gam, newdata = l.sort, type = "terms")
pred_trend2 <- attr(pred2, "constant") + pred2[,1]    # trend
pred_season2 <- attr(pred2, "constant") + pred2[,2]   # seasonal effects
# plot(m2$gam, scale = 0)   # plot pure effects
# Convert the 2 columns (Year and Month/variable) to R Date object
df_time <- as.Date(paste(as.numeric(l.sort$Year), as.numeric(l.sort$variable), "1", sep="-")); df_time
plot(x=df_time, y=l.sort$temperature, type = "l", xlim=c(as.Date("1950-02-01"),as.Date("1960-01-01")), xlab = "year", ylab = expression(Temperature ~ (degree*F)))
lines(x=df_time, y=pred_trend2, col = "red", lwd = 2); lines(x=df_time, y=pred_season2, col = "blue", lwd = 2)
[[Image:SMHS_TimeSeries14.png|500px]]
===Moving average smoothing===
<li>A moving average of order $m=2k+1$ can be expressed as:</li>
<center> $T_{t}=\frac{1}{2k+1}\sum_{j=-k}^{k}Y_{t+j}$ . </center>
<li>The ''m''-MA represents an order ''m'' moving average, $T_t$, or the estimate of the trend-cycle at time ''t'', obtained by averaging values of the time series within ''k'' periods (left and right) of ''t''. This averaging process denoises the data (reduces the randomness in the data) and produces a smoother trend-cycle component.</li>
<li>The 5-MA contains the values of $T_t$ with ''k''=2. To see what the trend-cycle estimate looks like, we plot it along with the original data.</li>
# plot moving averages of order 12 and 36 on top of the raw data
library("forecast")   # provides ma()
plot(l.sort$temperature, type = "l", main=" UMich/AA Temp (1900-2015) ", ylab=" Temperature (F)", xlab="Year")
lines(ma(l.sort$temperature, 12), col="red", lwd=5)
lines(ma(l.sort$temperature, 36), col="blue", lwd=3)
legend(0, 80,   # places a legend at the appropriate place
  c("Raw", "12-MA (smoother)", "36-MA (smoothest)"),   # puts text in the legend
  lty=c(1,1,1),   # gives the legend appropriate symbols (lines)
  cex=1.0,   # label sizes
  lwd=c(2.5,2.5,2.5), col=c("black", "red", "blue"))   # gives the legend lines the correct color and width
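The ''m''-MA formula can also be checked directly in base R. The following is a minimal sketch and not part of the original notes: the helper name centered_ma and the toy vector y are illustrative assumptions, and stats::filter() is used instead of forecast::ma() so that no extra package is needed.

```r
# Centered moving average of odd order m = 2k + 1 (hypothetical helper)
centered_ma <- function(y, k) {
  m <- 2 * k + 1
  # Equal weights 1/m applied to y[t-k], ..., y[t+k];
  # the first and last k values cannot be computed and are NA
  as.numeric(stats::filter(y, rep(1 / m, m), sides = 2))
}

y <- c(3, 5, 4, 6, 8, 7, 9)          # toy series
trend_5ma <- centered_ma(y, k = 2)   # 5-MA, i.e. k = 2
print(trend_5ma)
```

For this toy series only positions 3 to 5 have a full window of five values, so the 5-MA is defined there and missing at the two ends; this is the usual price of a centered smoother.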
Simulation of a time-series analysis and prediction
(1) Simulate a time series
# the ts() function converts a numeric vector into an R time series object.
# format is ts(vector, start=, end=, frequency=) where start and end are the times of the first and last observation
# and frequency is the number of observations per unit time (1=annual, 4=quarterly, 12=monthly, etc.)
# Note that the sampling interval = 1/frequency
# save a numeric vector containing 16-years (192 monthly) observations
# from Jan 2000 to Dec 2015 as a time series object
sim_ts <- ts(as.integer(runif(192,0,10)), start=c(2000, 1), end=c(2015, 12), frequency=12)
sim_ts
# subset the time series (June 2014 to December 2015)
sim_ts2 <- window(sim_ts, start=c(2014, 6), end=c(2015, 12))
sim_ts2
# plot series
plot(sim_ts)
lines(sim_ts2, col="blue", lwd=3)
Seasonal Decomposition
# Seasonal decomposition
fit_stl <- stl(sim_ts, s.window="period")   # Seasonal Decomposition of Time Series by Loess
plot(fit_stl)
# inspect the distribution of the residuals
hist(fit_stl$time.series[,3])   # this contains the residuals: fit_stl$time.series[,"remainder"]; the other columns are "seasonal" and "trend"
# additional plots
monthplot(sim_ts)   # plots the seasonal subseries of a time series. For each season, a time series is plotted.
# library(forecast)
seasonplot(sim_ts)
Exponential Models
# simple exponential - models level
fit_HW <- HoltWinters(sim_ts, beta=FALSE, gamma=FALSE)
# double exponential - models level and trend
fit_HW2 <- HoltWinters(sim_ts, gamma=FALSE)
# triple exponential - models level, trend, and seasonal components
fit_HW3 <- HoltWinters(sim_ts)
plot(fit_HW, col='black')
par(new=TRUE)
plot(fit_HW2, ann=FALSE, axes=FALSE, col='blue')
par(new=TRUE)
plot(fit_HW3, axes=FALSE, col='red')
# clear plot:
# dev.off()
Auto-regressive Integrated Moving Average (ARIMA) Models4
\begin{equation} X_t=\mu+ \underbrace{\sum_{i=1}^{p}\varphi_i X_{t-i}}_{\text{auto-regressive (p) part}} + \underbrace{\sum_{j=1}^{q}\theta_j \varepsilon_{t-j}}_{\text{moving-average (q) part}} + \underbrace{\varepsilon_t}_{\text{error term}} \end{equation}
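To make the ARMA equation above concrete, the sketch below simulates a series with one auto-regressive and one moving-average term and then re-estimates the coefficients. It is an illustration under stated assumptions, not part of the original notes: the coefficient values (0.6 and 0.3), the series length, the seed and the object names are all arbitrary choices.

```r
set.seed(1234)   # arbitrary seed, for reproducibility only

# Simulate n = 500 values from an ARMA(1,1) process (p = 1, q = 1)
# with assumed coefficients phi_1 = 0.6 and theta_1 = 0.3 (zero mean)
x_sim <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 500)

# Fit the same ARMA(1,1) structure; d = 0 means no differencing
fit_arma11 <- arima(x_sim, order = c(1, 0, 1))

# Estimates of phi_1 (ar1) and theta_1 (ma1); the term that arima()
# labels "intercept" is the estimated mean of the series
coef(fit_arma11)
```

With a series of this length the estimates would be expected to land near the simulated coefficients, though the exact values depend on the seed.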
There are 2 types of ARIMA time-series models:
1) Non-seasonal ARIMA models, ARIMA(p, d, q), where parameters p, d, and q are non-negative integers,
p = order of the auto-regressive model, d = degree of differencing; for example, when d=2, the dth difference is $(X_t-X_{t-1})-(X_{t-1}-X_{t-2})= X_t-2X_{t-1}+X_{t-2}$
That is, the second difference of X (d=2) is not the difference between the current value and the value 2 periods ago. It is the first difference of the first difference, the discrete analog of a second derivative, representing the local acceleration of the series rather than its local trend (first derivative).
q = order of the moving-average model.
2) Seasonal ARIMA models, ARIMA(p, d, q)(P, D, Q)m,
m = number of periods in each season, uppercase P, D, Q represent the auto-regressive, differencing, and moving average terms for the seasonal part of the ARIMA model, and the lower case (p,d,q) are as with non-seasonal ARIMA.
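The distinction between the second difference (d=2) and a lag-2 difference, described for the non-seasonal model above, can be verified on a small example. This is a minimal sketch with an arbitrary toy vector; the object names are illustrative.

```r
x <- c(2, 5, 4, 8, 13, 11)   # toy series

# d = 2: the difference of the first differences
d2_builtin <- diff(x, differences = 2)

# The same quantity written out as X_t - 2*X_{t-1} + X_{t-2}
n <- length(x)
d2_manual <- x[3:n] - 2 * x[2:(n - 1)] + x[1:(n - 2)]

# Not the same thing: the lag-2 difference X_t - X_{t-2}
lag2 <- diff(x, lag = 2)

d2_builtin; d2_manual; lag2
```

The first two results coincide, while the lag-2 difference is a different series; a seasonal difference (the D term with period m) is of this second, lagged kind, using lag = m.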
If 2 of the 3 terms are trivial, the model is abbreviated using the non-zero parameter, skipping the "AR", "I" or "MA" from the acronym. For example,
For more complex models:
The arima() function (stats package) can be used to fit an auto-regressive integrated moving average model. Other useful functions include:
The forecast package has alternative versions of acf() and pacf() called Acf() and Pacf() respectively.
# fit an ARIMA(p, d, q) model of order (3, 1, 2):
fit_arima1 <- arima(sim_ts, order=c(3, 1, 2))
# predictive accuracy
library(forecast)
accuracy(fit_arima1)
# predict next 20 observations
forecast(fit_arima1, 20)
plot(forecast(fit_arima1, 20))
Automated Forecasting
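The forecast package provides functions for the automatic selection of exponential and ARIMA models. The ets() (exponential TS) function supports both additive and multiplicative models. The auto.arima() function handles seasonal and nonseasonal ARIMA models and selects among them according to an information criterion.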
# library(forecast)
# Automated forecasting using an exponential model
fit_ets <- ets(sim_ts)
# Automated forecasting using an ARIMA model
fit_arima2 <- auto.arima(sim_ts)
# Compare the AIC (model quality) for both models
fit_ets$aic; fit_arima2$aic
accuracy(fit_ets); accuracy(fit_arima2)
Akaike’s Information Criterion (AIC) = -2Log(Likelihood)+2p, where p is the number of estimated parameters.
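The AIC definition can be checked against the value that R stores in a fitted model. The sketch below is an assumption-laden illustration rather than part of the original analysis: it fits a small AR(1) model with stats::arima() to a simulated series (arbitrary seed and coefficient) and recomputes the AIC from the log-likelihood.

```r
set.seed(2016)   # arbitrary seed
x_demo <- arima.sim(model = list(ar = 0.5), n = 200)   # illustrative AR(1) series
fit_demo <- arima(x_demo, order = c(1, 0, 0))

ll <- logLik(fit_demo)
p  <- attr(ll, "df")   # number of estimated parameters: ar1, mean, innovation variance
aic_manual <- -2 * as.numeric(ll) + 2 * p

# The hand-computed value, the stored value and the generic AIC() should agree
c(manual = aic_manual, stored = fit_demo$aic, generic = AIC(fit_demo))
```

Note that for arima() fits the parameter count includes the innovation variance, so p here is one larger than the number of coefficients reported by coef().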
summary(fit_ets); summary(fit_arima2)
ACF plot of the residuals from the ARIMA(3,1,2) model shows all correlations within the threshold limits indicating that the residuals are behaving like white noise. A portmanteau test returns a large p-value, also suggesting the residuals are white noise.
# acf computes (and by default plots) estimates of the autocovariance or autocorrelation function
acf(residuals(fit_ets))
# Box–Pierce or Ljung–Box test statistic for examining the null hypothesis of independence in a given time series.
# These are sometimes known as ‘portmanteau’ tests.
Box.test(residuals(fit_ets), lag=24, fitdf=4, type="Ljung")
# plot forecast
plot(forecast(fit_arima2))
# more on ARIMA https://www.otexts.org/fpp/8/7
1 https://umich.instructure.com/files/689861/download?download_frd=1
2 https://umich.instructure.com/courses/38100/files
3 https://cran.r-project.org/web/packages/mgcv/mgcv.pdf
See also
- SOCR Home page: http://www.socr.ucla.edu