I am working through the Andy Field textbook Discovering Statistics Using R and on p.899 came to a point where my R code was not returning the same results as shown in the book.
Specifically, the example deals with fitting a multilevel linear model to repeated-measures data, and builds up to the final model gradually, adding one parameter at a time.
My issue stems from coercing the Time variable from a factor into a numeric. The coercion itself is straightforward, but I chose to code Time in months (the data file uses column names like "Satisfaction_6_Months" and "Satisfaction_12_Months"), so the four Time levels became c(0, 6, 12, 18).
The text (actually the errata online) takes a different approach, coding the four Time levels as c(0, 1, 2, 3). I would have expected this difference in Time coding to have little impact on the results, but apparently I am quite wrong.
The code below demonstrates how the final line in the anova (containing ARModel) is very different in these two situations.
I am trying to understand 1) why this difference is observed at all, 2) why it only shows up at this last step of the model-building exercise, and 3) what it means in practice if I have a good reason to adopt a month-wise coding of Time.
First, run the full script to replicate the results. Then, uncomment the indicated line and run again.
library(reshape2)
library(nlme)   # for gls(), lme(), and corAR1(), used below
data_url <- "https://studysites.sagepub.com/dsur/study/DSUR%20Data%20Files/Chapter%2019/Honeymoon%20Period.dat"
satisfactionData <- read.delim(data_url, header = TRUE)
# reshape from wide (one satisfaction column per timepoint) to long format
restructuredData <- melt(satisfactionData,
                         id = c("Person", "Gender"),
                         measured = c("Satisfaction_Base", "Satisfaction_6_Months",
                                      "Satisfaction_12_Months", "Satisfaction_18_Months"))
names(restructuredData) <- c("Person", "Gender", "Time", "Life_Satisfaction")
# convert the Time factor (Base, 6, 12, 18 months) to the numeric codes 0, 1, 2, 3
restructuredData$Time <- as.numeric(restructuredData$Time) - 1
# ***************************************************************
# On second pass, uncomment the line below and rerun whole script
# ***************************************************************
# restructuredData$Time <- restructuredData$Time * 6
# create models one parameter at a time
# baseline model: intercept only, no random effects
intercept <- gls(Life_Satisfaction ~ 1,
                 data = restructuredData,
                 method = "ML",
                 na.action = na.exclude)
# add a random intercept for each Person
randomIntercept <- lme(Life_Satisfaction ~ 1,
                       data = restructuredData,
                       random = ~ 1 | Person,
                       method = "ML",
                       na.action = na.exclude,
                       control = list(opt = "optim"))
timeRI  <- update(randomIntercept, . ~ . + Time)                            # add Time as a fixed effect
timeRS  <- update(timeRI, random = ~ Time | Person)                         # add a random slope for Time
ARModel <- update(timeRS, correlation = corAR1(0, form = ~ Time | Person))  # add AR(1) error correlation
anova(intercept, randomIntercept, timeRI, timeRS, ARModel)
# Pass #1 - with Time as 0,1,2,3
# Model df AIC BIC logLik Test L.Ratio p-value
# intercept 1 2 2064.053 2072.217 -1030.0263
# randomIntercept 2 3 1991.396 2003.642 -992.6978 1 vs 2 74.65704 <.0001
# timeRI 3 4 1871.728 1888.057 -931.8642 2 vs 3 121.66714 <.0001
# timeRS 4 6 1874.626 1899.120 -931.3131 3 vs 4 1.10224 0.5763
# ARModel 5 7 1872.891 1901.466 -929.4453 4 vs 5 3.73564 0.0533
# Pass #2 - with Time as 0,6,12,18
# Model df AIC BIC logLik Test L.Ratio p-value
# intercept 1 2 2064.053 2072.217 -1030.0263
# randomIntercept 2 3 1991.396 2003.642 -992.6978 1 vs 2 74.65704 <.0001
# timeRI 3 4 1871.728 1888.057 -931.8642 2 vs 3 121.66714 <.0001
# timeRS 4 6 1874.627 1899.120 -931.3135 3 vs 4 1.10151 0.5765
# ARModel 5 7 1876.627 1905.203 -931.3135 4 vs 5 0.00001 0.9978
This isn't really a programming question, but let's go.
The assumption behind a first-order autoregressive covariance is that each timepoint is correlated with the others via a single parameter (Phi in the output of nlme::lme), and this correlation decays geometrically with the distance in time. Specifically, if the correlation between t and t+1 is Phi, the correlation between t and t+2 is Phi^2 (smaller, because Phi is almost always < 1). This generalizes to cor(t, t+x) = Phi^x.

You can actually see this in action if you look at the (initial) correlation matrices of these two models. I'm not using the default Phi = 0 here, because then you'd just end up with a matrix full of zeroes:
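Here is a minimal sketch of that check (not verbatim output), assuming the restructuredData object from your script is still in memory with Time coded 0:3; the monthData copy, cs1 and cs6 are just illustrative names:

library(nlme)

# Time coded 0, 1, 2, 3
cs1 <- Initialize(corAR1(0.8, form = ~ Time | Person), data = restructuredData)
corMatrix(cs1)[[1]]   # one person's matrix: off-diagonals 0.8, 0.8^2 = 0.64, 0.8^3 = 0.512

# Time coded 0, 6, 12, 18 (monthData is an illustrative copy of the data)
monthData <- transform(restructuredData, Time = Time * 6)
cs6 <- Initialize(corAR1(0.8, form = ~ Time | Person), data = monthData)
corMatrix(cs6)[[1]]   # off-diagonals now 0.8^6 ~ 0.26, 0.8^12, 0.8^18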
Notice how, for the same Phi, increasing the time scale by a factor of 6 reduces the (assumed) correlation between otherwise identical observations to the sixth power: 0.80^6 ~ 0.26. This is different from the effects you added in previous steps, because there the fitted parameters and their standard errors simply scale with the change of unit (the Time slope becomes six times smaller); the t/F statistics remain identical.
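If you want to convince yourself of that, a quick sketch is to refit your timeRI model with Time in months and compare the fixed-effect tables (timeRI_months and the inline transform are just for illustration):

# the slope and its SE shrink by a factor of 6, but the t-value is unchanged
timeRI_months <- update(timeRI, data = transform(restructuredData, Time = Time * 6))
summary(timeRI)$tTable
summary(timeRI_months)$tTable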
The fitted Phi on the 0:3 scale is 0.215, so on the six-fold time scale that would correspond to 0.215^6 ~ 1e-4, very close to zero. Combined with the default initial value of exactly zero, the model on the six-fold time scale fits a correlation of zero. This is why only the last step differs, and why timeRS and ARModel come out as essentially identical in the ANOVA on the six-fold scale.
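For reference, you can read the fitted Phi off the first-pass model; a sketch, assuming ARModel from the 0:3 coding:

summary(ARModel)               # the "Correlation Structure: AR(1)" block reports the fitted Phi
ARModel$modelStruct$corStruct  # the corAR1 component on its own
0.215^6                        # roughly 1e-4: what that Phi implies per six-unit step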
I also want to highlight that this is quite sensitive to the initialization value in corAR1: making it non-zero allows the second model to fit a non-zero correlation, though the estimate seems quite sensitive to the exact value you specify. All of this might point towards the correlation actually being quite low and/or AR1 not being the most appropriate covariance structure.
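On the second pass of your script (Time in months), that would look something like the sketch below; ARModel_nonzero is an illustrative name and 0.2 is an arbitrary starting value, not a recommendation:

# same model-building step, but starting the AR(1) search at 0.2 instead of 0
ARModel_nonzero <- update(timeRS, correlation = corAR1(0.2, form = ~ Time | Person))
summary(ARModel_nonzero)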
Finally, note that this is also a consequence of how the correlation structure was specified; another way would be to let Phi be the correlation between two adjacent timepoints regardless of their distance in time (i.e., only assuming that they are equidistant, which is effectively what the original model did). You can do this by changing form = ~Time|Person to form = ~1|Person. The fitting routine then assumes that the data are supplied in time order within each group, and probably also that all groups contain an equal number of observations, or at least explicit rows for intermediate missing values. Each record within a group then takes the next row/column of the correlation matrix, without taking into account how far apart in time the observations are (other than the chronology you set by sorting). With this specification the two models are again identical.
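As a sketch, that variant of the last model-building step (assuming the rows are in chronological order within each Person, which they are after the melt() call in your script; ARModel_pos is an illustrative name) would be:

# position-based AR(1): within each Person, consecutive rows are treated as adjacent,
# regardless of how far apart in time they are
ARModel_pos <- update(timeRS, correlation = corAR1(0, form = ~ 1 | Person))
anova(timeRS, ARModel_pos)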