2 Using lavaan to Run Regression and ANOVA as a SEM

This chapter prepares the reader to learn about structural equation modeling (SEM) by reviewing the fundamentals of ordinary least-squares (OLS) regression in the general(ized) linear modeling (GLM) framework. Some key take-home messages will reinforce what was discussed about covariance structure analysis in the previous chapter:

  • Analyzing different data formats (raw vs. summary data) yields equivalent results (under conditions of normality and complete data).
    • Recall from the previous chapter that maximum-likelihood estimation (MLE) in SEM chooses parameters that minimize discrepancies between observed and expected summary statistics rather than casewise observations.
  • Grouping variables can be represented as dummy, effects, or contrast codes, but there are some advantages of stratifying results by group. This will be demonstrated using multigroup SEM and comparing results to the single-group approach.
  • Point estimates from SEM will match GLM because OLS estimates are ML estimates when the distributional assumptions are met.
    • independent, identically distributed (normal with homogeneous variance)
  • However, the SEs are typically smaller under ML with SEM than under OLS with GLM.
  • Furthermore, different (but analogous) test statistics will be compared between GLM (t and F statistics) and SEM (z and \(\chi^2\) statistics).
  • Model comparisons require fitting nested models to the same data, which in SEM means fitting them to the same summary statistics.

2.1 Prepare Data and Workspace

2.1.1 Import Example Data

We will use data from a tutorial published on the Social Change Lab’s web site.

dat <- foreign::read.spss("demoData/MODMED.sav", to.data.frame = TRUE)[c(2, 3, 5, 7)]
## print first and last 3 rows
dat[c(1:3, 98:100), ]
##     MOOD      NFC     POS      ATT
## 1     -1  3.07579  6.2500  18.5455
## 2     -1  2.59489 -5.0671 -16.9684
## 3     -1 -1.00952  0.9346  -5.9578
## 98     1  0.14496 -6.6079  -0.6811
## 99     1  0.28151 15.1838  30.5421
## 100    1 -0.06157 -6.0116  11.7879

Recall from Chapter 1 that SEM can equivalently be fitted to summary statistics rather than raw data. This chapter will demonstrate this both approaches.

## Save summary statistics
M <- colMeans(dat)
S <- cov(dat)
N <- nrow(dat)

Also recall that it is possible to fit a “multigroup SEM” to obtain parameter estimates separately per group. This chapter will also demonstrate how to fit a multigroup SEM to raw data as well as to summary statistics. For the latter, we require group-specific summary statistics, which Chapter 1 discussed how to obtain. Here, we use slightly more efficient syntax to obtain lists of group-specific summary statistics, although the more sophisticated sapply() function makes the syntax less immediately intuitive to read.

## Create factor from effects-coded conditions
dat$mood.f <- factor(dat$MOOD, levels = c(-1, 1),
                     labels = c("neutral","positive"))
## Save lists of group-specific summary statistics
CC <- c("ATT","NFC","POS")   # when group = "mood.f"
gM <- sapply(c("neutral","positive"), simplify = FALSE, 
             FUN = function(g) colMeans(dat[dat$mood.f == g, CC]) )
gS <- sapply(c("neutral","positive"), simplify = FALSE,
             FUN = function(g) cov(     dat[dat$mood.f == g, CC]) )
gN <- table(dat$mood.f)

2.1.2 lavaan Syntax

Notice from the section-header that lavaan should always be lower-case. It is a portmanteau of tent riable alysis, similar to alysis f riance (ANOVA).

Load lavaan into your workspace.

library(lavaan)

The standard syntax for regression in R is a formula object. To regress ATTitude on MOOD condition, Need For Cognition (NFC), and POSitive thoughts:

ATT ~ MOOD + NFC + POS

SEM is a multivariate modeling framework, so a formula object does not allow for sufficient complexity.

  • There can be many outcome variables, each with its own formula (i.e., different predictors)
    • Multivariate GLMs (e.g., MANOVA) require the same predictors for all outcomes
    • e.g., cbind(ATT, POS) ~ MOOD + NFC
  • Outcomes can even predict other outcomes, which GLM does not allow
    • Another name for an SEM is a simultaneous equations model.

Later chapters will introduce additional complexity that formula objects do not allow for:

  • Residual variances can be specified explicitly using a “double tilde” operator (~~), which allows specifying equality constraints on variance estimates.
  • There can be unobserved (latent) variables, which require a special operator (=~) to define in lavaan syntax.
  • Details about these and other operators can be found in the Details section of the ?model.syntax help page, and the website also has a syntax tutorial for various short topics. Details about syntax features we will use in this chapter are summarized in the table below.
Operator Purpose Example
~ Estimate regression slope(s) 'outcome ~ predictor'
~1 Estimate intercept(s) 'outcome ~ 1'
value* Fix (number) or free (missing value) parameters 'posttest ~ 1*pretest + NA*x1'
string* Label parameters 'y ~ myLabel*x'
+ Additional parameters or operators 'y ~ NA*x1 + beta2*x2'
c() Operators for multiple groups 'y ~ c(NA, 1)*x1 + c(beta1, beta2)*x2'
:= Define a function of parameters 'slopeRatio := beta1 / beta2'

lavaan requires collecting all of these model parameters in a character vector, even when there is only one outcome.

' ATT ~ MOOD + NFC + POS '

Note how the syntax above looks exactly like the formula we would pass to lm(), except that it is within quotation marks. lavaan’s model syntax is flexible enough that we can equivalently specify each parameter on a different line, which can be useful in larger models or to make #comments in syntax.

' ATT ~ 1     # intercept
  ATT ~ MOOD  # slopes
  ATT ~ NFC
  ATT ~ POS '

Once we specify a model (typically saving the character string to an object), we can fit that model to the (raw or summary) data. There are multiple model-fitting functions in the lavaan package:

  • lavaan() is the main “engine”, but expects that models are fully specified in complete detail
  • sem() is a “wrapper” that calls lavaan() with some sensible defaults that apply to most SEMs (e.g., automatically estimating residual variances, automatically estimating covariances among predictors)
  • cfa() is meant for fitting confirmatory factor analysis (CFA) models, discussed in a later chapter. cfa() and sem() actually behave identically (i.e., they call lavaan() with the same defaults)
  • growth() is meant for very simple latent growth curve models, also discussed in a later chapter

In this chapter, we will use the sem() function similar to the way we use the lm() function, including how we obtain results from summary().

2.2 Comparing 2 Group Means in GLM and SEM

This section compares SEM with GLM in the simple case of comparing 2 group means.

2.2.1 The GLM Approach

Recall that an independent-samples t test (assuming homoskedasticity across groups) is equivalent to testing the estimated slope of a dummy code in a simple regression model.

t.test(ATT ~ mood.f, data = dat, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  ATT by mood.f
## t = -4.3889, df = 98, p-value = 2.876e-05
## alternative hypothesis: true difference in means between group neutral and group positive is not equal to 0
## 95 percent confidence interval:
##  -19.670213  -7.420887
## sample estimates:
##  mean in group neutral mean in group positive 
##              -4.792052               8.753498
## as a regression model
mod.dummy <- lm(ATT ~ mood.f, data = dat)
summary(mod.dummy) # compare t test for slope to t.test() output
## 
## Call:
## lm(formula = ATT ~ mood.f, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.818  -9.015   1.084  10.423  31.769 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -4.792      2.182  -2.196   0.0305 *  
## mood.fpositive   13.546      3.086   4.389 2.88e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.43 on 98 degrees of freedom
## Multiple R-squared:  0.1643, Adjusted R-squared:  0.1557 
## F-statistic: 19.26 on 1 and 98 DF,  p-value: 2.876e-05

Recall also that the F test at the bottom of the summary.lm() output is an ANOVA comparing the fitted model to an intercept-only model.

mod.int <- lm(ATT ~ 1, data = dat)
anova(mod.int, mod.dummy)
## Analysis of Variance Table
## 
## Model 1: ATT ~ 1
## Model 2: ATT ~ mood.f
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1     99 27924                                  
## 2     98 23337  1      4587 19.263 2.876e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the case of an F test with 1-df in the numerator, the test result is equivalent to a squared t statistic with the same df as the denominator of the F test.

\[ F_{1, df} = t^2_{df} \]

This makes it simpler to test individual parameters (single-df tests) without fitting a separate model that fixes the single parameter to 0 (by dropping it from the GLM). We will see the same relationship between statistics available from SEM.

2.2.2 The SEM Approach (single-group SEM)

Fitting the same simple-regression model as an SEM is relatively simple: Replace the lm() function with the sem() function, and embed the formula in quotation marks. However, lavaan will not automatically create dummy codes for factor variables as lm() will, so we must first make a dummy code for positive mood.

  • Recall: mod.dummy <- lm(ATT ~ mood.f, data = dat)
dat$pos.mood <- ifelse(dat$mood.f == "positive", yes = 1, no = 0)
## as an SEM
fit.dummy <- sem(' ATT ~ pos.mood ', data = dat)
summary(fit.dummy) # compare SE and z test for slope to lm() result
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ATT ~                                               
##     pos.mood         13.546    3.055    4.433    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .ATT             233.369   33.003    7.071    0.000

The SEM’s ML estimate of the group mean difference is 13.546, which is identical to the lm() result. However, we will focus on some differences.

2.2.2.1 Why Does SEM Have a Smaller SE

The SEM has a smaller SE estimate, which in turn makes its Wald z test statistic larger than the t statistic. Note that both statistics are calculated identically as the ratio of the estimate to its SE (i.e., a Wald test). More generally, any \(H_0\) can be tested about the parameter \(\beta\), where the null-hypothesized value is represented by \(\beta_0\):

\[t \text{ or } z = \frac{\hat{\beta} - \beta_0}{SE}\]

The summary() methods simply set each \(\beta_0=0\) by default, so testing a different \(H_0\) requires calculating the test and its p value manually.

So, if they are calculated the same way, why are they z instead of t tests? SEM estimators are based on asymptotic (infinite-N) theory, which means MLE assumes \(N\) is sufficiently large that the sample covariance matrix \(S\) is approximately equivalent to the population covariance matrix \(\Sigma\). The asymptotic assumption yields more power, but inflated Type I error rates when sample size is low (relative to number of estimated parameters). So it would be more robust to interpret the Wald z statistic as a t statistic in small samples; unfortunately, there is no straight-forward way to derive an appropriate df parameter for the t distribution.

2.2.2.2 Why Do Residual Variance Estimates Differ?

The “Residual standard error” in the lower summary.lm() output is the SD of the residuals, so its square is the estimated variance of the residuals (238.131). The SEM estimate is smaller (233.369), which also follows from using asymptotic theory. The formula for a variance divides the sum of squared residuals by N when the population mean is known, but GLM divides by the model’s df (in this case, \(N-2\)). If we rescaled the GLM estimate by \(\frac{N-2}{N}\), we would obtain the same asymptotic estimate that MLE provides from our SEM result:

sigma(mod.dummy)^2 * (N - 2) / N
## [1] 233.3687

The same rescaling by \(\frac{N-2}{N}\) is responsible for the smaller SEs in the SEM result, although it is not the SE itself that is rescaled (see previous section).

2.2.2.3 Where Is the Intercept in SEM?

The SEM result had no intercept. This is because we did not specify one in our syntax. Whereas a formula object for lm() automatically adds an intercept (implicitly regressing on a constant: ATT ~ 1 + mood.f), the sem() function will not do so in simple models like this. The previous chapter (section Linear Regression Models) explains why we can omit mean-structure parameters when data are mean-centered (i.e., intercept would be 0 anyway). But SEM doesn’t actually require us mean-center the data to obtain such results, which is convenient when we have no hypotheses about the intercepts.

To add intercepts to the model, we can simply add the argument meanstructure=TRUE, which would be set automatically whenever we explicitly add any intercept parameter to the model syntax.

fit.dummy <- sem(' ATT ~ 1 + pos.mood ', data = dat)
parameterEstimates(fit.dummy, output = "pretty", ci = FALSE)
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ATT ~                                               
##     pos.mood         13.546    3.055    4.433    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .ATT              -4.792    2.160   -2.218    0.027
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .ATT             233.369   33.003    7.071    0.000

As with the estimated slope, the estimated intercept is identical between GLM and SEM, but SEM has a smaller SE, resulting in a larger Wald z statistic.

2.2.2.4 Model-Comparison Approach (Analogous to F Test)

Comparing nested GLMs provides an \(F\) statistic. Analogously, comparing nested SEMs provides a \(\chi^2\) statistic, which is also follows from using asymptotic theory. As the denominator df of the F statistic approaches \(\infty\), the F statistic (multiplied by its numerator df) more closely follows a \(\chi^2\) distribution with the same (numerator) df.

However, an important caveat follows from the criterion used to find the “best” parameters in SEM. Whereas GLM minimizes casewise residuals, SEM minimizes residuals of summary statistics (review Types of Data Used in SEM for details). In a GLM, it does not matter if we remove a predictor \(x\) from a model because the only data we try to reproduce is the outcome \(y\) (i.e., both nested models are fitted to the same data). But in SEM, we are fitting models to reproduce the sample means \(\bar{y}\) and covariance matrix \(\mathbf{S}\), which include both ATTitude and the positive-MOOD dummy code:

lavInspect(fit.dummy, "sampstat") # observed sample statistics
## $cov
##              ATT  pos.md
## ATT      279.239        
## pos.mood   3.386   0.250
## 
## $mean
##      ATT pos.mood 
##    1.981    0.500

If we were to “remove” the positive-MOOD dummy code from the SEM, the sample covariance matrix would no longer be a 2 \(\times\) 2 matrix, but a 1 \(\times\) 1 matrix containing only the variance of the outcome ATTitude.

fit.int <- sem(' ATT ~ 1 ', data = dat)
lavInspect(fit.int, "sampstat")
## $cov
##         ATT
## ATT 279.239
## 
## $mean
##   ATT 
## 1.981

Thus, we would not be fitting both nested SEMs to (all of) the same data. The anova() method would warn us about this if we tried it:

anova(fit.int, fit.dummy)
## Warning in lavTestLRT(object = object, ..., model.names = NAMES): lavaan WARNING: some models are
## based on a different set of observed variables
## Warning in lavTestLRT(object = object, ..., model.names = NAMES): lavaan WARNING: some models have
## the same degrees of freedom
## 
## Chi-Squared Difference Test
## 
##           Df    AIC    BIC Chisq  Chisq diff RMSEA Df diff Pr(>Chisq)
## fit.int    0 850.99 856.20     0                                     
## fit.dummy  0 835.05 842.87     0 -8.8818e-14     0       0

Instead, we need to keep both variables in the model—any models we want to compare must include all the same variables. To represent the \(H_0: \beta_0=0\), we must fix it to 0 instead of freely estimating it (which represents the \(H_A: \beta_0 \ne 0\)). This is accomplished in the model syntax by placing a zero in front of the predictor, with an asterisk between the value and the variable (0*pos.mood):

fit.int <- sem(' ATT ~ 1 + 0*pos.mood ', data = dat)
anova(fit.int, fit.dummy)
## 
## Chi-Squared Difference Test
## 
##           Df    AIC    BIC  Chisq Chisq diff   RMSEA Df diff Pr(>Chisq)    
## fit.dummy  0 835.05 842.87  0.000                                          
## fit.int    1 850.99 856.20 17.945     17.945 0.41164       1  2.274e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Although we can use the generic model-comparing anova() method in R, this is not an analysis of variance (ANOVA), but rather an analysis of deviance, more commonly called a likelihood ratio test (LRT).

2.2.3 Multigroup SEM Approach

The previous section demonstrated how to test the \(H_0\) that 2 groups have identical means, using either GLM (t test) or an analogous simple-regression in single-group SEM. Like a GLM, this makes the simplifying assumption of homoskedasticity (same residual variance within each group), represented earlier by setting the t.test() argument var.equal=TRUE.

An alternative SEM approach is to fit models separately in each group, using the group= argument. In this case, we would fit an intercept-only model for ATTitude, estimating the parameters separately in each MOOD group: not only the intercept, but also the variance!

mg.int <- sem(' ATT ~ 1 ', data = dat, group = "mood.f")
summary(mg.int, header = FALSE)
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## 
## Group 1 [neutral]:
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ATT              -4.792    2.191   -2.187    0.029
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ATT             240.120   48.024    5.000    0.000
## 
## 
## Group 2 [positive]:
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ATT               8.753    2.129    4.112    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ATT             226.617   45.323    5.000    0.000

There is an analogous t test that does not assume equal variances (which differ in the output above):

t.test(ATT ~ mood.f, data = dat) # var.equal = FALSE by default
## 
##  Welch Two Sample t-test
## 
## data:  ATT by mood.f
## t = -4.3889, df = 97.918, p-value = 2.878e-05
## alternative hypothesis: true difference in means between group neutral and group positive is not equal to 0
## 95 percent confidence interval:
##  -19.670277  -7.420823
## sample estimates:
##  mean in group neutral mean in group positive 
##              -4.792052               8.753498

But now that the group-mean-difference is not a distinct SEM parameter (like the slope was in the single-group approach), how do we test the \(H_0\) of equivalent means? In SEM, there are two approaches.

2.2.3.1 Compare Nested MG-SEMs

To fit a MG-SEM in which the \(H_0\) must be true, we must constrain the 2 group means to be the same value. We can do this by giving them the same label in the model syntax. The label is arbitrary (below, we use mu for \(\mu\)), as long as the same label is used for the parameter in both groups. As shown in the table above (the lavaan Syntax section), labeling a single parameter works like fixing (or freeing) a parameter to a particular value, but using a character string instead of a number (or NA).

' ATT ~ mu*1 '
## [1] " ATT ~ mu*1 "

In multiple groups, we use the c() function to make a vector of labels, where the first label applies to Group 1’s parameter, the second applies to Group 2, etc.

mg.eq <- sem(' ATT ~ c(mu, mu)*1 ', data = dat,
             group = "mood.f") # grouping variable here, not in model
summary(mg.eq, header = FALSE)
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## 
## Group 1 [neutral]:
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ATT       (mu)    2.225    1.670    1.332    0.183
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ATT             289.353   57.871    5.000    0.000
## 
## 
## Group 2 [positive]:
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ATT       (mu)    2.225    1.670    1.332    0.183
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ATT             269.244   53.849    5.000    0.000

Notice that the estimated mean is the same in both groups. We can compare these nested models because they were fit to the same data (ATT is the only variable in both cases).

anova(mg.eq, mg.int) # LRT statistic
## 
## Chi-Squared Difference Test
## 
##        Df    AIC    BIC  Chisq Chisq diff   RMSEA Df diff Pr(>Chisq)    
## mg.int  0 837.01 847.43  0.000                                          
## mg.eq   1 852.95 860.77 17.943     17.943 0.58212       1  2.276e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.2.3.2 Define Parameters to Obtain Wald Test

Another way to test the same \(H_0\) would be to define a new parameter that is a function of estimated parameters. In our case, we want to freely estimate the mean in each group (i.e., no equality constraint), but calculate the difference between means as a user-defined parameter. This is possible using the := operator in lavaan syntax, which requires labeling the parameter. In this case, the labels must differ so that they are not equal. Because we will have more than one line of model syntax, the example below saves the syntax as an object first, which is the recommended way of specifying a model in lavaan.

mod <- ' ATT ~ c(mu1, mu2)*1    # freely estimate 2 means
  mean_difference := mu2 - mu1  # calculate difference
'
fit <- sem(mod, data = dat, group = "mood.f")
summary(fit, header = FALSE)
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## 
## Group 1 [neutral]:
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ATT      (mu1)   -4.792    2.191   -2.187    0.029
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ATT             240.120   48.024    5.000    0.000
## 
## 
## Group 2 [positive]:
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ATT      (mu2)    8.753    2.129    4.112    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ATT             226.617   45.323    5.000    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     mean_differenc   13.546    3.055    4.433    0.000

This model is equivalent to mg.int in the previous section, but the means are labeled and a mean_difference parameter is defined. lavaan uses the to estimate the SE of a function of parameters (here, difference between 2 intercepts) from the SEs of the original parameter estimates. Thus, we can obtain a Wald z statistic in the summary() output. We could also obtain an equivalent Wald \(\chi^2\) statistic, which \(=z^2\) when testing a single parameter (i.e., 1 df).

lavTestWald(fit, constraints = 'mu1 == mu2')
## $stat
## [1] 19.6558
## 
## $df
## [1] 1
## 
## $p.value
## [1] 9.272142e-06
## 
## $se
## [1] "standard"

But the Wald \(\chi^2\) test generalizes to the situation where multiple parameters are constrained. So it is useful for the same reason model-comparison with a LRT statistic is useful, but it does not require actually fitting the \(H_0\) model.

2.2.3.3 Summary of Advantages of Multigroup SEM

  • Less restrictive assumptions
    • parameters differ by default
    • could be disadvantageous in small groups (asymptotic/large-N assumption applies to each group)
  • More interpretable parameters
    • each group has their own intercept and slopes
  • Intuitive to specify \(H_0\) as user-defined parameter
    • e.g., the difference between the intercepts in the treatment group (b2) and control group (b1) is zero
  • Intuitive to represent assumptions about equality constraints by using the same labels across groups
    • testable by comparing models with(out) constraints
  • Can easily test \(H_0\) about parameters other than means
    • Are variances or correlations equal across groups?

2.3 Fit an SEM to Summary Statistics

All of the models above were fitted to raw data=. When the data are complete (no missing observations: NA), lavaan automatically calculates summary statistics from data=. Alternatively, summary statistics can be passed directly via the sample.cov= and sample.mean= arguments, which also requires specifying the sample size via the sample.nobs= argument. When assuming normality for complete data, results are identical across data formats:

## Raw Data
sem('ATT ~ MOOD + NFC + POS', data = dat,
    meanstructure = TRUE) # explicitly request intercept
## Summary Data
sem('ATT ~ MOOD + NFC + POS', sample.nobs = N,
    sample.cov = S, sample.mean = M)

When omitting the mean structure (meanstructure=FALSE), also omit the sample.mean= argument. When fitting a multigroup SEM, you must pass a list of (means and) covariance matrices and a vector of sample sizes

gN # vector of sample sizes
gM # mean vectors
gS # list of covariance matrices
sem('ATT ~ NFC + POS', sample.nobs = gN, # "group=" is implied by lists
    sample.mean = gM, sample.cov = gS)

2.3.1 Advantages of Analyzing Summary Statistics

  • Maintain privacy: They do not compromise the privacy of the research participants.
  • Promote open-science practices: Summary statistics are easily reported in published empirical research without risking privacy, facilitating replication of results.
    • This enables us to use many published results for instruction in this book.
  • Facilitate replication: Research consumers can fit different models (e.g., that represent a competing theories) to the same summary statistics.

However, it is unrealistic to assume data are perfectly normally distributed, and data are frequently incomplete in practice. Raw data= are required to:

  • analyze incomplete data (using full information MLE rather than listwise deletion)
  • obtain robust statistics (e.g., to account for nonnormality of continuous outcomes)
  • model discrete (binary or ordinal) outcomes (which require a threshold model, explained later in a chapter about categorical data)

2.4 Moderation in SEM

When evaluating moderation using a product term, that product term becomes a new variable in \(\bar{y}\) and \(S\). Thus, adding an interaction to an SEM typically requires raw data (see Boker et al., 2023, for an exception).

Consider an example analogous to ANCOVA: We want to test MOOD’s effect on POSitive thoughts, controlling for NFC. Before comparing adjusted group means, we should evaluate the homogeneity-of-slopes assumption.

As in formula objects, lavaan syntax recognizes the colon (:) operator; however, there are caveats:

  • It only applies to a product between 2 numeric variables (which may be dummy codes, but not factor variables), so no 3-way interactions.
  • It is not possible to specify how the product term covaries with other variables, so it is actually safer to manually calculate the product term as a new variable in your data= so that you can specify parameters flexibly in syntax.
  • The asterisk (*) is reserved for assigning labels or values to parameters, so we cannot use a formula object’s shortcut to automatically include the interaction and main effects (i.e., POS ~ mood.f * NFC).

Instead, we must specify each term explicitly in the lavaan model syntax.

fit.het <- sem('POS ~ 1 + pos.mood + NFC + pos.mood:NFC', data = dat)

Recall from the simple-regression example above that we could not test the \(H_0\) that positive MOOD’s slope = 0 by dropping that dummy code from the SEM. If we did, we would not be fitting both SEMs to the same summary statistics. The same principle applies here to the product-term, which must stay in the nested SEM. The \(H_0\) can be represented in the nested model by fixing the parameter to zero.

fit.hom <- sem('POS ~ 1 + pos.mood + NFC + 0*pos.mood:NFC', data = dat)
anova(fit.hom, fit.het)
## 
## Chi-Squared Difference Test
## 
##         Df    AIC    BIC  Chisq Chisq diff   RMSEA Df diff Pr(>Chisq)  
## fit.het  0 678.01 691.04 0.0000                                        
## fit.hom  1 682.08 692.50 6.0629     6.0629 0.22501       1     0.0138 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reject \(H_0\) of homogeneous NFC slopes across MOOD groups

2.4.1 Moderation by Groups Using MG-SEM

Earlier examples used MG-SEM to estimate means in each group, then test equivalence with model-comparison (LRT) or a Wald test. That was a test of the effect of the grouping variable (MOOD) on the outcome (POS).

When a MG-SEM freely estimates slopes (the effect of one variable on another), that represents moderation by the grouping variable. The \(H_0\) of no moderation would be represented by equality constraints (e.g., same label) on NFC’s slope across groups.

Even when the focal predictor is the grouping variable, we can capitalize on the “moderation by group” principle in MG-SEM to test the homogeneity-of-slopes assumption. The syntax below uses model comparison (LRT) to test the assumption.

## Heterogeneous slopes
mg.het <- sem(' POS ~ 1 + NFC ', data = dat, group = "mood.f")
## Homogenous slopes
mod.hom <- ' POS ~ c(b1, b2)*1    # different intercepts
             POS ~ c(b3, b3)*NFC  # same covariate slopes
'
mg.hom <- sem(mod.hom, data = dat, group = "mood.f")
## Compare nested models
anova(mg.hom, mg.het)
## 
## Chi-Squared Difference Test
## 
##        Df    AIC    BIC  Chisq Chisq diff   RMSEA Df diff Pr(>Chisq)  
## mg.het  0 680.01 695.64 0.0000                                        
## mg.hom  1 684.05 697.07 6.0369     6.0369 0.31739       1    0.01401 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, because MG-SEM does not assume residual homoskedasticity, this test may be more robust, as long as there is no small-sample bias to cancel out that advantage. In this case, the residual variances are nearly identical, so we can expect tests of group-mean differences to be quite similar between single- and multigroup SEM approaches.

## 
## 
## Group 1 []:
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     POS              46.134    9.227    5.000    0.000
## 
## 
## Group 2 []:
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     POS              47.128    9.426    5.000    0.000

2.4.2 The emmeans Package

The emmeans package greatly simplifies conducting complex pairwise comparisons on lm() output (also glm() or lmer(), etc.), both controlling for covariates and stratifying across moderators to probe interactions, with many options to adjust for multiple testing. Rather than defining new parameters in model syntax, the same tools in emmeans are made available for lavaan results by the semTools package.

install.packages(c("emmeans","semTools"))
library(emmeans)  # load first, so semTools knows what to do with it
library(semTools)

The functionality is available for single-group SEMs that are specified the same way as regressions in lm().

em1 <- emmeans(fit.het,
               specs = ~ pos.mood | NFC, # focal predictor | moderator
               lavaan.DV = "POS",        # name of outcome
               ## probe at the M +/- 1 SD of the moderator
               at = list(NFC = c(-1.4, 0, 1.4)))
probe1 <- pairs(em1, rev = TRUE) # so group "1" minus group "0"
rbind(probe1, adjust = "sidak")  # Sidak-adjusted p values
##  NFC  contrast              estimate   SE  df z.ratio p.value
##  -1.4 pos.mood1 - pos.mood0     5.15 1.96 Inf   2.635  0.0250
##  0    pos.mood1 - pos.mood0     8.67 1.37 Inf   6.348  <.0001
##  1.4  pos.mood1 - pos.mood0    12.19 1.97 Inf   6.198  <.0001
## 
## P value adjustment: sidak method for 3 tests

The functionality is also available for multigroup SEMs, in which case the grouping variable is automatically treated as another predictor that can be specified in the specs= argument. Note that the grouping variable will be the name we passed to the group= argument when we fitted the MG-SEM (i.e., the factor variable mood.f rather than the dummy-coded integer variable pos.mood). When using emmeans() on a multigroup lavaan object, the argument nesting=NULL must be specified.

em2 <- emmeans(mg.het,
               specs = ~ mood.f | NFC, # focal predictor | moderator
               lavaan.DV = "POS",      # name of outcome
               nesting = NULL,         # necessary for MG-SEMs
               ## probe at the M +/- 1 SD of the moderator
               at = list(NFC = c(-1.4, 0, 1.4)))
probe2 <- pairs(em2, rev = TRUE) # so group "1" minus group "0"
rbind(probe2, adjust = "sidak")  # Sidak-adjusted p values
##  NFC  contrast           estimate   SE  df z.ratio p.value
##  -1.4 positive - neutral     5.15 1.96 Inf   2.633  0.0251
##  0    positive - neutral     8.67 1.37 Inf   6.348  <.0001
##  1.4  positive - neutral    12.19 1.97 Inf   6.193  <.0001
## 
## P value adjustment: sidak method for 3 tests

As expected, the test statistics are quite similar between single- and multigroup SEM approaches, differing only in the 3rd or 4th decimal place.

2.5 Summary

This chapter compared GLM and SEM approaches to testing a mean-difference between 2 groups, comparing output to reveal how statistics can differ between GLM and SEM. Both single- and multigroup SEM approaches were used, the latter providing the advantage of robustness to heteroskedasticity across groups. Adding a covariance facilitated demonstrating how to include an interaction term in a SEM, and why it is necessary to fix a variable’s parameter to 0 rather than drop it from the model when testing that \(H_0\). Models were fitted to both raw data= and summary statistics, which can provide some data-sharing advantages but is equivalent under restrictive conditions. The review and comparison both serve as a foundation for upcoming chapters that show how path analysis generalizes normal regression models.

References

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. http://dx.doi.org/10.18637/jss.v048.i02