2 Using lavaan to Run Regression and ANOVA as a SEM
This chapter prepares the reader to learn about structural equation modeling (SEM) by reviewing the fundamentals of ordinary least-squares (OLS) regression in the general(ized) linear modeling (GLM) framework. Some key take-home messages will reinforce what was discussed about covariance structure analysis in the previous chapter:
- Analyzing different data formats (raw vs. summary data) yields equivalent results (under conditions of normality and complete data).
- Recall from the previous chapter that maximum-likelihood estimation (MLE) in SEM chooses parameters that minimize discrepancies between observed and expected summary statistics rather than casewise observations.
- Grouping variables can be represented as dummy, effects, or contrast codes, but there are some advantages of stratifying results by group. This will be demonstrated using multigroup SEM and comparing results to the single-group approach.
- Point estimates from SEM will match GLM because OLS estimates are ML estimates when the distributional assumptions are met.
- independent, identically distributed (normal with homogeneous variance)
- However, the SEs are typically smaller under ML with SEM than under OLS with GLM.
- Furthermore, different (but analogous) test statistics will be compared between GLM (t and F statistics) and SEM (z and \(\chi^2\) statistics).
- Model comparisons require fitting nested models to the same data, which in SEM means fitting them to the same summary statistics.
2.1 Prepare Data and Workspace
2.1.1 Import Example Data
We will use data from a tutorial published on the Social Change Lab’s web site.
dat <- foreign::read.spss("demoData/MODMED.sav", to.data.frame = TRUE)[c(2, 3, 5, 7)]
## print first and last 3 rows
dat[c(1:3, 98:100), ]
## MOOD NFC POS ATT
## 1 -1 3.07579 6.2500 18.5455
## 2 -1 2.59489 -5.0671 -16.9684
## 3 -1 -1.00952 0.9346 -5.9578
## 98 1 0.14496 -6.6079 -0.6811
## 99 1 0.28151 15.1838 30.5421
## 100 1 -0.06157 -6.0116 11.7879
Recall from Chapter 1 that SEM can equivalently be fitted to summary statistics rather than raw data. This chapter will demonstrate both approaches.
Also recall that it is possible to fit a “multigroup SEM” to obtain parameter estimates separately per group. This chapter will also demonstrate how to fit a multigroup SEM to raw data as well as to summary statistics. For the latter, we require group-specific summary statistics, which Chapter 1 discussed how to obtain. Here, we use slightly more efficient syntax to obtain lists of group-specific summary statistics, although the more sophisticated sapply()
function makes the syntax less immediately intuitive to read.
## Create factor from effects-coded conditions
dat$mood.f <- factor(dat$MOOD, levels = c(-1, 1),
labels = c("neutral","positive"))
## Save lists of group-specific summary statistics
CC <- c("ATT","NFC","POS") # when group = "mood.f"
gM <- sapply(c("neutral","positive"), simplify = FALSE,
FUN = function(g) colMeans(dat[dat$mood.f == g, CC]) )
gS <- sapply(c("neutral","positive"), simplify = FALSE,
FUN = function(g) cov( dat[dat$mood.f == g, CC]) )
gN <- table(dat$mood.f)
2.1.2 lavaan Syntax
Notice from the section header that lavaan should always be lower-case. It is a portmanteau of latent variable analysis, similar to analysis of variance (ANOVA).
Load lavaan into your workspace.
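For example:
library(lavaan)  # attach the package so sem(), cfa(), and related functions are available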
The standard syntax for regression in R is a formula object. To regress ATTitude on MOOD condition, Need For Cognition (NFC), and POSitive thoughts:
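A formula expressing that regression might look like this:
ATT ~ MOOD + NFC + POS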
SEM is a multivariate modeling framework, so a formula
object does not allow for sufficient complexity.
- There can be many outcome variables, each with its own formula (i.e., different predictors)
  - Multivariate GLMs (e.g., MANOVA) require the same predictors for all outcomes
    - e.g., cbind(ATT, POS) ~ MOOD + NFC
- Outcomes can even predict other outcomes, which GLM does not allow
  - Another name for an SEM is a simultaneous equations model.
Later chapters will introduce additional complexity that formula objects do not allow for:
- Residual variances can be specified explicitly using a “double tilde” operator (~~), which allows specifying equality constraints on variance estimates.
- There can be unobserved (latent) variables, which require a special operator (=~) to define in lavaan syntax.
- Details about these and other operators can be found in the Details section of the ?model.syntax help page, and the website also has a syntax tutorial for various short topics. Details about syntax features we will use in this chapter are summarized in the table below.
Operator | Purpose | Example |
---|---|---|
~ | Estimate regression slope(s) | 'outcome ~ predictor' |
~1 | Estimate intercept(s) | 'outcome ~ 1' |
value* | Fix (number) or free (missing value) parameters | 'posttest ~ 1*pretest + NA*x1' |
string* | Label parameters | 'y ~ myLabel*x' |
+ | Additional parameters or operators | 'y ~ NA*x1 + beta2*x2' |
c() | Operators for multiple groups | 'y ~ c(NA, 1)*x1 + c(beta1, beta2)*x2' |
:= | Define a function of parameters | 'slopeRatio := beta1 / beta2' |
lavaan requires collecting all of these model parameters in a character vector, even when there is only one outcome.
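For example, the three-predictor regression above could be specified as follows (the object name mod.reg is arbitrary):
mod.reg <- ' ATT ~ MOOD + NFC + POS '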
Note how the syntax above looks exactly like the formula
we would pass to lm()
, except that it is within quotation marks. lavaan
’s model syntax is flexible enough that we can equivalently specify each parameter on a different line, which can be useful in larger models or to make #comments
in syntax.
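For instance, an equivalent specification with each predictor on its own line and a comment on each might be:
mod.reg <- '
  ATT ~ MOOD  # mood condition
  ATT ~ NFC   # need for cognition
  ATT ~ POS   # positive thoughts
'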
Once we specify a model (typically saving the character string to an object), we can fit that model to the (raw or summary) data.
There are multiple model-fitting functions in the lavaan package:
- lavaan() is the main “engine”, but expects that models are fully specified in complete detail
- sem() is a “wrapper” that calls lavaan() with some sensible defaults that apply to most SEMs (e.g., automatically estimating residual variances, automatically estimating covariances among predictors)
- cfa() is meant for fitting confirmatory factor analysis (CFA) models, discussed in a later chapter. cfa() and sem() actually behave identically (i.e., they call lavaan() with the same defaults)
- growth() is meant for very simple latent growth curve models, also discussed in a later chapter
In this chapter, we will use the sem()
function similar to the way we use the lm()
function, including how we obtain results from summary()
.
2.2 Comparing 2 Group Means in GLM and SEM
This section compares SEM with GLM in the simple case of comparing 2 group means.
2.2.1 The GLM Approach
Recall that an independent-samples t test (assuming homoskedasticity across groups) is equivalent to testing the estimated slope of a dummy code in a simple regression model.
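The output below can be reproduced with a call along these lines (the var.equal=TRUE argument is referenced later in this chapter):
t.test(ATT ~ mood.f, data = dat, var.equal = TRUE)  # assume equal group variances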
##
## Two Sample t-test
##
## data: ATT by mood.f
## t = -4.3889, df = 98, p-value = 2.876e-05
## alternative hypothesis: true difference in means between group neutral and group positive is not equal to 0
## 95 percent confidence interval:
## -19.670213 -7.420887
## sample estimates:
## mean in group neutral mean in group positive
## -4.792052 8.753498
## as a regression model
mod.dummy <- lm(ATT ~ mood.f, data = dat)
summary(mod.dummy) # compare t test for slope to t.test() output
##
## Call:
## lm(formula = ATT ~ mood.f, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.818 -9.015 1.084 10.423 31.769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.792 2.182 -2.196 0.0305 *
## mood.fpositive 13.546 3.086 4.389 2.88e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.43 on 98 degrees of freedom
## Multiple R-squared: 0.1643, Adjusted R-squared: 0.1557
## F-statistic: 19.26 on 1 and 98 DF, p-value: 2.876e-05
Recall also that the F test at the bottom of the summary.lm()
output is an ANOVA comparing the fitted model to an intercept-only model.
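One way to reproduce that comparison explicitly (the object name mod.int is illustrative):
mod.int <- lm(ATT ~ 1, data = dat)  # intercept-only model
anova(mod.int, mod.dummy)           # compare nested GLMs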
## Analysis of Variance Table
##
## Model 1: ATT ~ 1
## Model 2: ATT ~ mood.f
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 99 27924
## 2 98 23337 1 4587 19.263 2.876e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the case of an F test with 1-df in the numerator, the test result is equivalent to a squared t statistic with the same df as the denominator of the F test.
\[ F_{1, df} = t^2_{df} \]
This makes it simpler to test individual parameters (single-df tests) without fitting a separate model that fixes the single parameter to 0 (by dropping it from the GLM). We will see the same relationship between statistics available from SEM.
2.2.2 The SEM Approach (single-group SEM)
Fitting the same simple-regression model as an SEM is relatively simple: Replace the lm()
function with the sem()
function, and embed the formula in quotation marks. However, lavaan
will not automatically create dummy codes for factor
variables as lm()
will, so we must first make a dummy code for positive mood.
- Recall:
mod.dummy <- lm(ATT ~ mood.f, data = dat)
dat$pos.mood <- ifelse(dat$mood.f == "positive", yes = 1, no = 0)
## as an SEM
fit.dummy <- sem(' ATT ~ pos.mood ', data = dat)
summary(fit.dummy) # compare SE and z test for slope to lm() result
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## ATT ~
## pos.mood 13.546 3.055 4.433 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .ATT 233.369 33.003 7.071 0.000
The SEM’s ML estimate of the group mean difference is 13.546, which is identical to the lm()
result. However, we will focus on some differences.
2.2.2.1 Why Does SEM Have a Smaller SE?
The SEM has a smaller SE estimate, which in turn makes its Wald z test statistic larger than the t statistic. Note that both statistics are calculated identically as the ratio of the estimate to its SE (i.e., a Wald test). More generally, any \(H_0\) can be tested about the parameter \(\beta\), where the null-hypothesized value is represented by \(\beta_0\):
\[t \text{ or } z = \frac{\hat{\beta} - \beta_0}{SE}\]
The summary()
methods simply set each \(\beta_0=0\) by default, so testing a different \(H_0\) requires calculating the test and its p value manually.
So, if they are calculated the same way, why are they z rather than t tests? SEM estimators are based on asymptotic (infinite-N) theory, which means MLE assumes \(N\) is sufficiently large that the sample covariance matrix \(S\) is approximately equivalent to the population covariance matrix \(\Sigma\). The asymptotic assumption yields more power, but it inflates Type I error rates when the sample size is small (relative to the number of estimated parameters). It would therefore be more robust to interpret the Wald z statistic as a t statistic in small samples; unfortunately, there is no straightforward way to derive an appropriate df parameter for the t distribution.
2.2.2.2 Why Do Residual Variance Estimates Differ?
The “Residual standard error” in the lower summary.lm()
output is the SD of the residuals, so its square is the estimated variance of the residuals (238.131). The SEM estimate is smaller (233.369), which also follows from using asymptotic theory. The formula for a variance divides the sum of squared residuals by N when the population mean is known, but GLM divides by the model’s df (in this case, \(N-2\)). If we rescaled the GLM estimate by \(\frac{N-2}{N}\), we would obtain the same asymptotic estimate that MLE provides from our SEM result:
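For example, using the residual standard error from the lm() output above:
sigma(mod.dummy)^2 * (100 - 2) / 100  # rescale the OLS residual variance by (N-2)/N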
## [1] 233.3687
The same rescaling by \(\frac{N-2}{N}\) is responsible for the smaller SEs in the SEM result, although it is not the SE itself that is rescaled (see previous section).
2.2.2.3 Where Is the Intercept in SEM?
The SEM result had no intercept. This is because we did not specify one in our syntax. Whereas a formula
object for lm()
automatically adds an intercept (implicitly regressing on a constant: ATT ~ 1 + mood.f
), the sem()
function will not do so in simple models like this. The previous chapter (section Linear Regression Models) explains why we can omit mean-structure parameters when data are mean-centered (i.e., the intercept would be 0 anyway). But SEM doesn’t actually require us to mean-center the data to obtain such results, which is convenient when we have no hypotheses about the intercepts.
To add intercepts to the model, we can simply add the argument meanstructure=TRUE
, which would be set automatically whenever we explicitly add any intercept parameter to the model syntax.
fit.dummy <- sem(' ATT ~ 1 + pos.mood ', data = dat)
parameterEstimates(fit.dummy, output = "pretty", ci = FALSE)
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## ATT ~
## pos.mood 13.546 3.055 4.433 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .ATT -4.792 2.160 -2.218 0.027
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .ATT 233.369 33.003 7.071 0.000
As with the estimated slope, the estimated intercept is identical between GLM and SEM, but SEM has a smaller SE, resulting in a larger Wald z statistic.
2.2.2.4 Model-Comparison Approach (Analogous to F Test)
Comparing nested GLMs provides an \(F\) statistic. Analogously, comparing nested SEMs provides a \(\chi^2\) statistic, which also follows from using asymptotic theory. As the denominator df of the F statistic approaches \(\infty\), the F statistic (multiplied by its numerator df) more closely follows a \(\chi^2\) distribution with the same (numerator) df.
However, an important caveat follows from the criterion used to find the “best” parameters in SEM. Whereas GLM minimizes casewise residuals, SEM minimizes residuals of summary statistics (review Types of Data Used in SEM for details). In a GLM, it does not matter if we remove a predictor \(x\) from a model because the only data we try to reproduce is the outcome \(y\) (i.e., both nested models are fitted to the same data). But in SEM, we are fitting models to reproduce the sample means \(\bar{y}\) and covariance matrix \(\mathbf{S}\), which include both ATT
itude and the positive-MOOD
dummy code:
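These sample statistics (as lavaan computes them) can be inspected with, for example:
lavInspect(fit.dummy, "sampstat")  # sample means and covariance matrix used by the model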
## $cov
## ATT pos.md
## ATT 279.239
## pos.mood 3.386 0.250
##
## $mean
## ATT pos.mood
## 1.981 0.500
If we were to “remove” the positive-MOOD
dummy code from the SEM, the sample covariance matrix would no longer be a 2 \(\times\) 2 matrix, but a 1 \(\times\) 1 matrix containing only the variance of the outcome ATT
itude.
## $cov
## ATT
## ATT 279.239
##
## $mean
## ATT
## 1.981
Thus, we would not be fitting both nested SEMs to (all of) the same data. The anova()
method would warn us about this if we tried it:
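A sketch of such an attempt, assuming fit.int is fit by simply dropping the predictor:
fit.int <- sem(' ATT ~ 1 ', data = dat)  # pos.mood dropped from the model (and the modeled data)
anova(fit.int, fit.dummy)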
## Warning: lavaan->lavTestLRT():
## some models are based on a different set of observed variables
## Warning: lavaan->lavTestLRT():
## some models have the same degrees of freedom
##
## Chi-Squared Difference Test
##
## Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
## fit.int 0 850.99 856.20 0
## fit.dummy 0 835.05 842.87 0 -8.8818e-14 0 0
Instead, we need to keep both variables in the model: any models we want to compare must include all the same variables. To represent the \(H_0: \beta = 0\) for the slope, we must fix it to 0 instead of freely estimating it (which represents the \(H_A: \beta \ne 0\)). This is accomplished in the model syntax by placing a zero in front of the predictor, with an asterisk between the value and the variable (0*pos.mood):
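For example (again naming the constrained model fit.int):
fit.int <- sem(' ATT ~ 1 + 0*pos.mood ', data = dat)  # slope fixed to 0 under H0
anova(fit.int, fit.dummy)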
##
## Chi-Squared Difference Test
##
## Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
## fit.dummy 0 835.05 842.87 0.000
## fit.int 1 850.99 856.20 17.945 17.945 0.41164 1 2.274e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Although we can use the generic model-comparing anova()
method in R, this is not an analysis of variance (ANOVA), but rather an analysis of deviance, more commonly called a likelihood ratio test (LRT).
2.2.3 Multigroup SEM Approach
The previous section demonstrated how to test the \(H_0\) that 2 groups have identical means, using either GLM (t test) or an analogous simple-regression in single-group SEM. Like a GLM, this makes the simplifying assumption of homoskedasticity (same residual variance within each group), represented earlier by setting the t.test()
argument var.equal=TRUE
.
An alternative SEM approach is to fit models separately in each group, using the group=
argument. In this case, we would fit an intercept-only model for ATT
itude, estimating the parameters separately in each MOOD
group: not only the intercept, but also the variance!
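A sketch of that model (named mg.int, to which the model comparison below refers):
mg.int <- sem(' ATT ~ 1 ', data = dat, group = "mood.f")  # intercept-only, fit per group
summary(mg.int, header = FALSE)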
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
##
## Group 1 [neutral]:
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## ATT -4.792 2.191 -2.187 0.029
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## ATT 240.120 48.024 5.000 0.000
##
##
## Group 2 [positive]:
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## ATT 8.753 2.129 4.112 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## ATT 226.617 45.323 5.000 0.000
There is an analogous t test that does not assume equal variances (which differ in the output above):
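This is R's default behavior when var.equal=TRUE is not specified:
t.test(ATT ~ mood.f, data = dat)  # Welch test (var.equal = FALSE by default)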
##
## Welch Two Sample t-test
##
## data: ATT by mood.f
## t = -4.3889, df = 97.918, p-value = 2.878e-05
## alternative hypothesis: true difference in means between group neutral and group positive is not equal to 0
## 95 percent confidence interval:
## -19.670277 -7.420823
## sample estimates:
## mean in group neutral mean in group positive
## -4.792052 8.753498
But now that the group-mean-difference is not a distinct SEM parameter (like the slope was in the single-group approach), how do we test the \(H_0\) of equivalent means? In SEM, there are two approaches.
2.2.3.1 Compare Nested MG-SEMs
To fit a MG-SEM in which the \(H_0\) must be true, we must constrain the 2 group means to be the same value. We can do this by giving them the same label in the model syntax. The label is arbitrary (below, we use mu
for \(\mu\)), as long as the same label is used for the parameter in both groups. As shown in the table above (the lavaan
Syntax section), labeling a single parameter works like fixing (or freeing) a parameter to a particular value, but using a character string instead of a number (or NA).
## [1] " ATT ~ mu*1 "
In multiple groups, we use the c()
function to make a vector of labels, where the first label applies to Group 1’s parameter, the second applies to Group 2, etc.
mg.eq <- sem(' ATT ~ c(mu, mu)*1 ', data = dat,
group = "mood.f") # grouping variable here, not in model
summary(mg.eq, header = FALSE)
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
##
## Group 1 [neutral]:
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## ATT (mu) 2.225 1.670 1.332 0.183
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## ATT 289.353 57.871 5.000 0.000
##
##
## Group 2 [positive]:
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## ATT (mu) 2.225 1.670 1.332 0.183
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## ATT 269.244 53.849 5.000 0.000
Notice that the estimated mean is the same in both groups. We can compare these nested models because they were fit to the same data (ATT
is the only variable in both cases).
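Again using the anova() method:
anova(mg.int, mg.eq)  # LRT comparing freely estimated vs. constrained means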
##
## Chi-Squared Difference Test
##
## Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
## mg.int 0 837.01 847.43 0.000
## mg.eq 1 852.95 860.77 17.943 17.943 0.58212 1 2.276e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.2.3.2 Define Parameters to Obtain Wald Test
Another way to test the same \(H_0\) would be to define a new parameter that is a function of estimated parameters. In our case, we want to freely estimate the mean in each group (i.e., no equality constraint), but calculate the difference between means as a user-defined parameter. This is possible using the :=
operator in lavaan
syntax, which requires labeling the parameter. In this case, the labels must differ so that they are not equal. Because we will have more than one line of model syntax, the example below saves the syntax as an object first, which is the recommended way of specifying a model in lavaan
.
mod <- ' ATT ~ c(mu1, mu2)*1 # freely estimate 2 means
mean_difference := mu2 - mu1 # calculate difference
'
fit <- sem(mod, data = dat, group = "mood.f")
summary(fit, header = FALSE)
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
##
## Group 1 [neutral]:
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## ATT (mu1) -4.792 2.191 -2.187 0.029
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## ATT 240.120 48.024 5.000 0.000
##
##
## Group 2 [positive]:
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## ATT (mu2) 8.753 2.129 4.112 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## ATT 226.617 45.323 5.000 0.000
##
## Defined Parameters:
## Estimate Std.Err z-value P(>|z|)
## mean_differenc 13.546 3.055 4.433 0.000
This model is equivalent to mg.int
in the previous section, but the means are labeled and a mean_difference
parameter is defined. lavaan
uses the delta method to estimate the SE of a function of parameters (here, the difference between 2 intercepts) from the SEs of the original parameter estimates. Thus, we can obtain a Wald z statistic in the summary()
output. We could also obtain an equivalent Wald \(\chi^2\) statistic, which \(=z^2\) when testing a single parameter (i.e., 1 df).
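One way to obtain it is lavaan's lavTestWald() function, for example by constraining the labeled means to be equal:
lavTestWald(fit, constraints = 'mu1 == mu2')  # Wald chi-squared test of equal means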
## $stat
## [1] 19.6558
##
## $df
## [1] 1
##
## $p.value
## [1] 9.272142e-06
##
## $se
## [1] "standard"
But the Wald \(\chi^2\) test generalizes to the situation where multiple parameters are constrained. So it is useful for the same reason model-comparison with a LRT statistic is useful, but it does not require actually fitting the \(H_0\) model.
2.2.3.3 Summary of Advantages of Multigroup SEM
- Less restrictive assumptions
  - parameters differ by default
  - could be disadvantageous in small groups (asymptotic/large-N assumption applies to each group)
- More interpretable parameters
  - each group has its own intercept and slopes
- Intuitive to specify \(H_0\) as a user-defined parameter
  - e.g., the difference between the intercepts in the treatment group (b2) and control group (b1) is zero
- Intuitive to represent assumptions about equality constraints by using the same labels across groups
  - testable by comparing models with(out) constraints
- Can easily test \(H_0\) about parameters other than means
  - Are variances or correlations equal across groups?
2.3 Fit an SEM to Summary Statistics
All of the models above were fitted to raw data=
. When the data are complete (no missing observations: NA
), lavaan
automatically calculates summary statistics from data=
. Alternatively, summary statistics can be passed directly via the sample.cov=
and sample.mean=
arguments, which also requires specifying the sample size via the sample.nobs=
argument. When assuming normality for complete data, results are identical across data formats:
## Raw Data
sem('ATT ~ MOOD + NFC + POS', data = dat,
meanstructure = TRUE) # explicitly request intercept
## Summary Data
sem('ATT ~ MOOD + NFC + POS', sample.nobs = N,
sample.cov = S, sample.mean = M)
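Here, N, S, and M are assumed to hold the total sample size, covariance matrix, and mean vector of the modeled variables, computed along these lines:
N <- nrow(dat)                              # assumed definitions of the summary objects above
S <- cov(dat[c("ATT","MOOD","NFC","POS")])
M <- colMeans(dat[c("ATT","MOOD","NFC","POS")])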
When omitting the mean structure (meanstructure=FALSE
), also omit the sample.mean=
argument. When fitting a multigroup SEM, you must pass lists of group-specific (mean vectors and) covariance matrices and a vector of sample sizes:
gN # vector of sample sizes
gM # mean vectors
gS # list of covariance matrices
sem('ATT ~ NFC + POS', sample.nobs = gN, # "group=" is implied by lists
sample.mean = gM, sample.cov = gS)
2.3.1 Advantages of Analyzing Summary Statistics
- Maintain privacy: They do not compromise the privacy of the research participants.
- Promote open-science practices: Summary statistics are easily reported in published empirical research without risking privacy, facilitating replication of results.
- This enables us to use many published results for instruction in this book.
- Facilitate replication: Research consumers can fit different models (e.g., that represent competing theories) to the same summary statistics.
However, it is unrealistic to assume data are perfectly normally distributed, and data are frequently incomplete in practice. Raw data=
are required to:
- analyze incomplete data (using full information MLE rather than listwise deletion)
- obtain robust statistics (e.g., to account for nonnormality of continuous outcomes)
- model discrete (binary or ordinal) outcomes (which require a threshold model, explained later in a chapter about categorical data)
2.4 Moderation in SEM
When evaluating moderation using a product term, that product term becomes a new variable in \(\bar{y}\) and \(S\). Thus, adding an interaction to an SEM typically requires raw data (see Boker et al., 2023, for an exception).
Consider an example analogous to ANCOVA: We want to test MOOD
’s effect on POS
itive thoughts, controlling for NFC
. Before comparing adjusted group means, we should evaluate the homogeneity-of-slopes assumption.
As in formula objects, lavaan syntax recognizes the colon (:) operator; however, there are caveats:
- It only applies to a product between 2 numeric variables (which may be dummy codes, but not factor variables), so no 3-way interactions.
- It is not possible to specify how the product term covaries with other variables, so it is actually safer to manually calculate the product term as a new variable in your data= so that you can specify parameters flexibly in syntax.
- The asterisk (*) is reserved for assigning labels or values to parameters, so we cannot use a formula object’s shortcut to automatically include the interaction and main effects (i.e., POS ~ mood.f * NFC).
Instead, we must specify each term explicitly in the lavaan
model syntax.
Recall from the simple-regression example above that we could not test the \(H_0\) that positive MOOD
’s slope = 0 by dropping that dummy code from the SEM. If we did, we would not be fitting both SEMs to the same summary statistics. The same principle applies here to the product-term, which must stay in the nested SEM. The \(H_0\) can be represented in the nested model by fixing the parameter to zero.
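A sketch of those two models, using the colon operator described above (the object names fit.het and fit.hom match the comparison below):
## heterogeneous slopes: product-term slope freely estimated
fit.het <- sem(' POS ~ 1 + pos.mood + NFC + pos.mood:NFC ', data = dat)
## homogeneous slopes: product-term slope fixed to 0 under H0
fit.hom <- sem(' POS ~ 1 + pos.mood + NFC + 0*pos.mood:NFC ', data = dat)
anova(fit.hom, fit.het)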
##
## Chi-Squared Difference Test
##
## Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
## fit.het 0 678.01 691.04 0.0000
## fit.hom 1 682.08 692.50 6.0629 6.0629 0.22501 1 0.0138 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can reject the \(H_0\) of homogeneous NFC slopes across MOOD groups.
2.4.1 Moderation by Groups Using MG-SEM
Earlier examples used MG-SEM to estimate means in each group, then test equivalence with model-comparison (LRT) or a Wald test. That was a test of the effect of the grouping variable (MOOD
) on the outcome (POS
).
When a MG-SEM freely estimates slopes (the effect of one variable on another), that represents moderation by the grouping variable. The \(H_0\) of no moderation would be represented by equality constraints (e.g., same label) on NFC
’s slope across groups.
Even when the focal predictor is the grouping variable, we can capitalize on the “moderation by group” principle in MG-SEM to test the homogeneity-of-slopes assumption. The syntax below uses model comparison (LRT) to test the assumption.
## Heterogeneous slopes
mg.het <- sem(' POS ~ 1 + NFC ', data = dat, group = "mood.f")
## Homogeneous slopes
mod.hom <- ' POS ~ c(b1, b2)*1 # different intercepts
POS ~ c(b3, b3)*NFC # same covariate slopes
'
mg.hom <- sem(mod.hom, data = dat, group = "mood.f")
## Compare nested models
anova(mg.hom, mg.het)
##
## Chi-Squared Difference Test
##
## Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
## mg.het 0 680.01 695.64 0.0000
## mg.hom 1 684.05 697.07 6.0369 6.0369 0.31739 1 0.01401 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, because MG-SEM does not assume residual homoskedasticity, this test may be more robust, as long as there is no small-sample bias to cancel out that advantage. In this case, the residual variances are nearly identical, so we can expect tests of group-mean differences to be quite similar between single- and multigroup SEM approaches.
##
##
## Group 1 []:
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## POS 46.134 9.227 5.000 0.000
##
##
## Group 2 []:
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## POS 47.128 9.426 5.000 0.000
2.4.2 The emmeans Package
The emmeans
package greatly simplifies conducting complex pairwise comparisons on lm()
output (also glm()
or lmer()
, etc.), both controlling for covariates and stratifying across moderators to probe interactions, with many options to adjust for multiple testing.
Rather than defining new parameters in model syntax, the same tools in emmeans
are made available for lavaan
results by the semTools
package.
install.packages(c("emmeans","semTools"))
library(emmeans) # load first, so semTools knows what to do with it
library(semTools)
The functionality is available for single-group SEMs that are specified the same way as regressions in lm()
.
em1 <- emmeans(fit.het,
specs = ~ pos.mood | NFC, # focal predictor | moderator
lavaan.DV = "POS", # name of outcome
## probe at the M +/- 1 SD of the moderator
at = list(NFC = c(-1.4, 0, 1.4)))
## Warning: lavaan->lavPredict():
## fitted model does not contain regular (i.e., measured) latent variables; the matrix of
## factor scores may contain no columns
## Warning: lavaan->lavPredict():
## fitted model does not contain regular (i.e., measured) latent variables; the matrix of
## factor scores may contain no columns
## Warning: lavaan->lavPredict():
## fitted model does not contain regular (i.e., measured) latent variables; the matrix of
## factor scores may contain no columns
probe1 <- pairs(em1, rev = TRUE) # so group "1" minus group "0"
rbind(probe1, adjust = "sidak") # Sidak-adjusted p values
## NFC contrast estimate SE df z.ratio p.value
## -1.4 pos.mood1 - pos.mood0 5.15 1.96 Inf 2.635 0.0250
## 0 pos.mood1 - pos.mood0 8.67 1.37 Inf 6.348 <.0001
## 1.4 pos.mood1 - pos.mood0 12.19 1.97 Inf 6.198 <.0001
##
## P value adjustment: sidak method for 3 tests
The functionality is also available for multigroup SEMs, in which case the grouping variable is automatically treated as another predictor that can be specified in the specs=
argument. Note that the grouping variable will be the name we passed to the group=
argument when we fitted the MG-SEM (i.e., the factor variable mood.f
rather than the dummy-coded integer variable pos.mood
). When using emmeans()
on a multigroup lavaan
object, the argument nesting=NULL
must be specified.
em2 <- emmeans(mg.het,
specs = ~ mood.f | NFC, # focal predictor | moderator
lavaan.DV = "POS", # name of outcome
nesting = NULL, # necessary for MG-SEMs
## probe at the M +/- 1 SD of the moderator
at = list(NFC = c(-1.4, 0, 1.4)))
probe2 <- pairs(em2, rev = TRUE) # so group "1" minus group "0"
rbind(probe2, adjust = "sidak") # Sidak-adjusted p values
## NFC contrast estimate SE df z.ratio p.value
## -1.4 positive - neutral 5.15 1.96 Inf 2.633 0.0251
## 0 positive - neutral 8.67 1.37 Inf 6.348 <.0001
## 1.4 positive - neutral 12.19 1.97 Inf 6.193 <.0001
##
## P value adjustment: sidak method for 3 tests
As expected, the test statistics are quite similar between single- and multigroup SEM approaches, differing only in the 3rd or 4th decimal place.
2.5 Summary
This chapter compared GLM and SEM approaches to testing a mean difference between 2 groups, using their output to reveal how the statistics from the two frameworks differ. Both single- and multigroup SEM approaches were used, the latter providing the advantage of robustness to heteroskedasticity across groups. Adding a covariate facilitated demonstrating how to include an interaction term in an SEM, and why it is necessary to fix a variable’s parameter to 0 rather than drop it from the model when testing that \(H_0\). Models were fitted to both raw data= and summary statistics; the latter offers some data-sharing advantages but yields equivalent results only under restrictive conditions (normality and complete data). The review and comparison both serve as a foundation for upcoming chapters that show how path analysis generalizes normal regression models.
References
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. http://dx.doi.org/10.18637/jss.v048.i02