lcacov {lcca}R Documentation

Latent-class analysis with covariates

Description

Fits a latent-class model to a set of categorical items and user-specified covariates using an EM algorithm.

Usage

lcacov(formula, data, freq, groups, nclass = 2, reference = 1,
   constrain.rhos = F, constrain.alphas = F, iseeds = NULL, 
   iter.max = 5000, tol = 1e-06, starting.values = NULL,
   flatten.rhos = 0, stabilize.alphas = 0, flatten.gammas = 0,
   se.method = "STANDARD", weights, clusters, strata, subpop) 

Arguments

formula

an object of class "formula" that determines the model to be fit; see DETAILS.

data

an optional data frame, list or environment containing the variables in the model. If data is not given, or if variables are not found in data, the variables are taken from environment(formula), typically the environment from which lca is called.

freq

optional frequency or count variable, to be used when data are aggregated.

groups

optional categorical grouping variable for multi-group analysis; see DETAILS.

nclass

number of latent classes to be fit. Response items are assumed to be conditionally independent within classes. There must be at least two latent classes.

reference

latent class to be used as the reference group in the baseline-category logistic model for class membership.

constrain.rhos

if T, then rho-parameters (item-response probabilities) are assumed to be equal across groups. If no groups variable is given, this argument is ignored.

constrain.alphas

if T, then alpha-parameters (logistic coefficients relating the covariates to probabilities of class membership) are assumed to be equal across groups. If no groups variable is given, this argument is ignored.

iseeds

two integers to initialize the random number generator; see DETAILS.

iter.max

maximum number of iterations to be performed. Each iteration consists of an Expectation or E-step followed by a Maximization or M-step. The procedure halts if it has not converged by this many iterations.

tol

convergence criterion. The procedure halts if the maximum absolute change in all parameters (gammas and rhos) from one iteration to the next falls below this value.

starting.values

optional starting values for the model parameters. This must be a list with two named components, alpha and rho, which are numeric arrays with correct dimensions; see DETAILS.

flatten.rhos

optional flattening constant for estimation of rho-paramaters (item-response probabilities); see DETAILS.

stabilize.alphas

optional constant for stabilizing the estimated logistic coefficients; see DETAILS.

flatten.gammas

optional flattening constant for estimation of marginal gamma-parameters (class prevalences); see DETAILS.

se.method

method to use for computing standard errors; see DETAILS.

weights

optional numeric variable containing sampling weights for survey data; see DETAILS.

clusters

optional integer or factor variable containing sampling cluster identifiers; see DETAILS.

strata

optional integer or factor variable containing sampling stratum identifiers; see DETAILS.

subpop

optional logical variable indicating a subpopulation to which the model is to be fit; see DETAILS.

Details

This function fits a latent-class model with covariates. Under this model, polytomous response variables are assumed to be conditionally independent given a latent classification variable. Individuals' probabilities of class membership are related to covariates by a baseline-category logistic model. Models of this type were described by Dayton and Macready (1988), Bandeen-Roche et al. (1997) and others.

The formula argument should have the form cbind(Y1,Y2,...)~X1+X2+.... The term on the left-hand side of ~ is a matrix (not a data frame) of variables to be used as responses. Each response variable should consist of integer codes 1,2,.... Response variables may also be factors, in which case they will be converted to integer codes (as in the function unclass), and the levels of the factors will be ignored. The terms on the right-hand side of formula are covariates that predict class membership.

If the right-hand side of formula is 1, then lcacov will fit a model that assumes individuals' class-membership probabilities are constant. The results will be equivalent to those from lca, except that the prevalences will be expressed as coefficients from an intercept-only logistic model, whereas lca expresses them as class-membership probabilities.

Missing values in response variables are allowed and should be conveyed by the R missing value code NA. Cases with missing responses are retained in the fitting procedure, and the missing values are assumed to be ignorably missing or missing at random.

The number of latent classes to be fit is determined by nclass. There must be at least two latent classes. To fit a one-class model, use lca.

By default, each case (row) of data or the model environment is assumed to represent one observational unit or individual. Data may also be aggegated, with individuals bearing identical responses to all variables (including NA's, if present) collapsed into a single case, with frequencies conveyed through the numeric variable freq.

The unknown parameters to be estimated are the logistic coefficients, which we call alpha's, and the item-response probabilities, which we call rho's. Estimation proceeds by an EM algorithm which, by default, computes maximum-likelihood estimates of these parameters. EM may converge to a global or local maximum, possibly on the boundary of the parameter space where one or more rho-parameters are zero. Sparseness in the data may cause one or more logistic coefficients to diverge to plus or minus infinity. These conditions may be remedied through the use of flattening and stabilizing constants, which are described below.

The function lcacov also estimates population-average class prevalences, which we call marginal gamma's; when covariates are present, these should be interpreted as overall rates of class membership, averaged over the distribution of the covariates.

This function supports multi-group analyses in which the alpha- or rho-parameters are allowed to vary across levels of a categorical variable groups. This feature is useful for testing invariance of measurement (i.e., equality of rho-paraneters) across groups. The groups variable, if present, should be integers coded as 1,2,...,ngroups, where ngroups is the number of distinct groups. The groups variable may also be a factor, and the levels of this factor will appear in the printed output summaries. If groups is not present, then ngroups is taken to be 1. Constraints on the group-specific parameters are conveyed by constrain.rhos and constrain.alphas.

Optional starting values for parameters are provided through starting.values. This argument should be a list with two named components, rho and alpha. The component rho should be an array of dimension c(nitems,maxlevs,nclass,ngroups), where nitems is the number of response variables on the left-hand side of formula, maxlevs is the maximum number of levels (distinct response categories) among the response variables, nclass is the number of latent classes, and ngroups is the number of groups (equal to 1 if groups is not supplied). The element starting.values$rho[j,k,c,g] is the probability that an individual in group g and class c supplies a response of k to item j. The component alpha should be an array of dimension c(ncovs,nclass,ngroups), where ncovs is the number of predictors in the logistic model (including a constant term for the intercept, if present). The elements of starting.values$alpha[,c,g] are the coefficients determining the log-odds of membership in class c, versus the reference class, for individuals in group g. If c is the reference class, then all elements of starting.values$alpha[,c,g] must be zero.

If starting.values is not supplied, or if starting.values$rho=NULL, then starting values for the rho-parameters will be randomly generated. This function uses its own internal random number generator which is seeded by two integers, for example, seeds=c(123,456), which allows results to be reproduced in the future. If seeds=NULL then the function will seed itself with two random integers from R. Therefore, results can also be made reproducible by calling set.seed beforehand and taking seeds=NULL. Different starting values for the rhos's may lead to solutions in which the classes have different orderings. To reorder the classes in the printed output summaries, use the function permute.class. If starting.values is not supplied, or if starting.values$alpha=NULL, then starting values for the alpha's will be set to zero, which implies uniform class-membership probabilities in all classes for all individuals.

Rho-parameters at or near a boundary, which are commonplace in latent-class analysis, create difficulty when computing derivative-based standard errors, because the likelihood function at the solution may not be log-concave. The argument flatten.rhos allows the user to supply a positive flattening constant to smooth the estimated rho's toward the interior of the parameter space. A value of flatten.rhos=1, which should be adequate in most cases, supplies information equivalent to one prior observation for each response item in each class and each group, distributed fractionally across the response categories in equal amounts.

During the estimation procedure, one or more estimated alpha-parameters may diverge toward plus or minus infinity, a condition that is sometimes called quasi-separation (Agresti, 2002). It suggests that all the individuals in a given class have identical values for one or more covariates, or that a group of covariates is collinear within a class. This tends to happen in sparse-data situations where the number of covariates is large or one or more classes are rare. If the logistic estimation procedure fails to converge, one possible remedy is to eliminate some covariates from the model. Another possibility is to introduce a small amount of prior information to stabilize the coefficients.. Clogg et al. (1991) described a procedure for sparse logistic regression that smooths the estimated slopes toward zero. It can be viewed as an application of a penalty function or a data-dependent prior distribution. In effect, this prior adds a fictitious fractional number of observations to each case in the dataset, distributed across classes in proportions determined by the class prevalences from a model without covariates. The total number of fictitious observations is equal to ncovs*ngroups multiplied by stabilize.alphas. Setting stabilize.alphas=1 should, in most cases, be adequate to solve the problem. If the value of stabilize.alphas is positive, then lcacov will first run the EM algorithm without covariates to estimate the class prevalences. Poor starting values may cause one or more class prevalences from this model to go to zero. Prevalences for the model without covariates may be flattened by the argument flatten.gammas, and a value of 1 should usually be adequate to solve the problem. If stabilize.alphas=0, then flatten.gammas will be ignored.

In summary, any problems due to sparseness, boundary estimates or poor starting values can usually be overcome by setting flatten.rhos=1, stabilize.alphas=1 and flatten.gammas=1.

Acceptable values for the argument se.method are "STANDARD","FAST", "SANDWICH" or "NONE". If "STANDARD", then standard errors are obtained by inverting the matrix of (minus one times) the second derivatives of the loglikelihood function (plus penalty terms for flattening constants, if present) at the solution. If "FAST", then the matrix of second derivatives is approximated by the sum of the outer products of individuals' contributions to the score functions (first derivatives). If "SANDWICH", then the sum of the score outer-products is pre- and post-multiplied by the inverse of the second derivatives. If "NONE", then computation of first- and second-derivatives and standard errors is suppressed.

Population average class prevalences, which we call marginal gamma's, are estmated by averaging the estimated posterior probabilities of class membership across individuals. This can be viewed as an application of “expected estimating equations” described by Wang et al. (2008). Standard errors for these parameters are computed by a sandwich procedure, regardless of the value of se.method.

Survey weights may be supplied via the argument weights. Weights should not be confused with frequencies as supplied by freq. A frequency of 10 indicates that ten individuals in the sample exhibited the given pattern of responses, but a survey weight of 10 indicates that one sampled individual is representing ten individuals in the population. The same variable supplied as freq or weights will lead to identical estimates, but the standard errors may be drastically different. You cannot supply both freq and weights; data from a survey with unequal probabilities of selection must be supplied in disaggregated form. If a weights variable is present, then se.method is automatically set to "SANDWICH", and the inner matrix of the sandich formula (i.e., the meat of the sandwich) is an estimated covariance matrix for the total quasi-score.

Weights provided with large, nationally-representative survey datasets are often very large, because their sum estimates the size of the population. Many data analysts are accustomed to rescaling survey weights to have a mean of one, so that they sum to the sample size rather than the population size. Rescaling a weights variable will have no effect on estimates or standard errors, because the scale factor cancels out when se.method="SANDWICH".

Additional information about the sample design may be supplied via clusters and strata. If weights is supplied, then the sampling plan is assumed to fall within the general class of with-replacement (WR) designs. At the first stage, clusters are drawn with replacement. Then individuals are selected within clusters, possibly with unequal probabilities, possibly in multiple stages. The clusters and strata variables should be integers or factors. The integers serve merely as identifiers; the actual values are unimportant. Cluster identifiers are assumed to be unique within strata, so that cluster 1 in stratum 1 and cluster 1 in stratum 2 are assumed to be different. If clusters is not supplied, then each sampled individual is assumed to be a cluster. If strata is not supplied, then one stratum is assumed for the whole population. Note a weight variable is required for complex survey data; if weights=NULL, then clusters and strata will be ignored.

It is often useful to fit a model that describes only a part of the full population (e.g., females). With a simple random sample, it is acceptable to remove the sampled individuals who are not in this subpopulation (e.g, males) from the data frame or model environent, because those who remain will be then a simple random sample of the subpopulation of interest. With a complex survey design, however, discarding individuals who do not belong to the subpopulation may lead to incorrect standard errors, because the overall design may not scale down to the subpopulation. To fit a model to a subpopulation when weights is present, supply a logical variable subpop whose elements are TRUE for members of the subpopulation. If subpop is supplied and weights=NULL, individuals outside of the subpopulation will be ignored when computing estimates and standard errors.

Value

A list whose class attribute has been set to "lcacov". A nicely formatted summary of this object may be seen by applying the summary method, but its components may also be accessed directly. Components which may be of interest include:

ncases

number of ncases (rows) from the data frame or model environment used in the procedure.

nitems

number of response variables appearing in the model.

nlevs

integer vector of length nitems indicating the number of levels or response categories for each item.

ngroups

number of groups present in a multi-group analysis, or 1 if groups=NULL.

iter

number of EM iterations performed.

converged

logical value indicating whether the algorithm converged by iter iterations.

loglik

vector of length iter reporting the value of the loglikelihood function at the beginning of each EM iteration. If weights is present, then this will be a weighted pseudo-loglikelihood.

loglik.final

value of the loglikelihood or pseudo-loglikelihood function after the final iteration.

logpost

vector of length iter reporting the value of the log-posterior density (loglikelihood or pseudo-loglikelihood plus penalty terms, if flattening or stabilizing constants are used) at the beginning of each EM iteration.

logpost.final

value of the log-posterior density after the final iteration.

AIC

Akaike's information criterion (smaller is better). Will be NA if weights is present, because AIC for pseudo-likelihood is not defined.

BIC

Bayesian information criterion (smaller is better). Will be NA if weights is present, because BIC for pseudo-likelihood is not defined.

param

estimated parameters after the final iteration. This is a list with two named components, rho and alpha, with same format as starting.values; see DETAILS.

post.probs

matrix of estimated posterior probabilities of class membership given the observed items. This is a matrix with rows corresponding to cases or rows of the dataset and columns corresponding to classes.

se.fail

logical value equal to TRUE if the computation of standard errors failed for any reason. Failure usually indicates that the model is under-identified or that one or more estimated parameters lie on or near a boundary, and the problem can often be resolved by increasing the value of flatten.rhos or flatten.gammas.

se.rho

array with same dimensions as param$rho containing the standard errors for param$rho.

se.alpha

array with same dimensions as param$alpha containing the standard errors for param$alpha.

dim.theta

number of free parameters estimated in the model. The free parameters correspond to the elements of param$alpha and param$rho, eliminating the alpha-parameters for the reference class which are constrained to be zero, and eliminating rho-parameters that are redundant due to the usual sum-to-one probability constraints.

theta

vector of length dim.theta containing the final estimates of the free parameters. Examine names(theta) to see the correspondence between the elements of theta and the contents of param.

cov.theta

estimated covariance matrix for the parameters in theta. How this is computed depends on se.method.

score

vector of first derivatives of the loglikelihood of pseudo-loglikelihood (plus penalty terms for flattening constants, if present) with respect to the free parameters in theta.

hessian

matrix of (minus one times) the second derivatives of the loglikelihood or pseudo-loglikehood (plus penalty terms for flattening constants, if present) with respect to the free parameters in theta.

sandwich.meat

inner matrix of the sandwich variance formula. This is an empirical estimated covariance matrix for score.

deff.trace

the design-effect trace, if weights is present. This is the sum of the diagonal elements of solve(hessian) multiplied by sandwich.meat. This quantity is used by compare.fit when comparing the pseudo-loglikelihood values for two models.

marg.gamma

estimated marginal class prevalences. This is a matrix of dimension c(nclass,ngroups), and marg.gamma[c,g] contains the estimated prevalence of class c within group g. In a multi-group analyses, the prevalences across groups may vary even if constrain.alphas=0 and constrain.rhos=0, because the distribution of covariates within the groups may vary.

se.marg.gamma

matrix with the same dimensions as marg.gamma containing the standard errors for marg.gamma.

cov.marg.gamma

covariance matrix for the estimated marginal class prevalences. To see the correspondence between this matrix and the elements of marg.gamma and se.marg.gamma, examine the dimnames of this matrix.

wald.stat.alpha

Wald-type statistics for testing the significance of each term on the right-hand side of formula. These Type III chi-square statistics, which are based on cov.theta, test the null hypothesis that the given covariate term can be removed from the logistic model within a group (if ngroups>1 and constrain.alphas=F) or from the entire model (if ngroups=1 or constrain.alphas=T). The degrees of freedom associated with each test are nclass-1.

wald.pval.alpha

significance values for the test statistics in wald.stat.alpha.

Author(s)

Joe Schafer

Send questions to mchelpdesk@psu.edu

References

Agresti, A. (2002) Categorical Data Analysis (2nd Ed.). New York: Wiley.

Bandeen-Roche, K., Miglioretti, D.L., Zeger, S.L., and Rathouz, P.R. (1997) Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92, 1375-1386.

Clogg, C.C., Rubin, D.B., Schenker, N., Schultz, B., and Weidman, L. (1991) Multiple imputation of industry and occupation codes in census public-use samples using Bayesian logistic regression. Journal of the American Statistical Association, 86, 68-78.

Dayton, C.M. and Macready, G.B. (1988) Concomitant-variable latent-class models. Journal of the American Statistical Association, 83, 173-178.

Wang, C.Y., Huang, Y., Chao, E.C. and Jeffcoat, M.K. (2008) Expected estimating equations for missing data, measurement error, and misclassification, with application to longitudinal nonignorable missing data. Biometrics, 64, 85-95.

For more information on using this function and other functions in the LCCA package, see the manual LCCA Package for R, Version 1 in the subdirectory doc.

See Also

compare.fit, lca, lcacov.datasim, permute.class, summary.lcacov

Examples

## recode variables in abortion attitudes dataset,
## setting NAP and DK to missing
data(abortion)
abortion$ABANY    <- factor(abortion$ABANY,    levels=c("YES","NO"))
abortion$ABDEFECT <- factor(abortion$ABDEFECT, levels=c("YES","NO"))
abortion$ABHLTH   <- factor(abortion$ABHLTH,   levels=c("YES","NO"))
abortion$ABNOMORE <- factor(abortion$ABNOMORE, levels=c("YES","NO"))
abortion$ABPOOR   <- factor(abortion$ABPOOR,   levels=c("YES","NO"))
abortion$ABRAPE   <- factor(abortion$ABRAPE,   levels=c("YES","NO"))

## fit a three-class model with sex as a covariate
set.seed(456)
fit <- lcacov(
   cbind(ABANY,ABDEFECT,ABHLTH,ABNOMORE,ABPOOR,ABRAPE) ~ SEX,
   data=abortion, nclass=3,
   weights=WTSSNR, flatten.rhos=1 )
summary(fit, show.wald.alpha=TRUE)


[Package lcca version 2.0.0 Index]