lcacov {lcca} | R Documentation |
Fits a latent-class model to a set of categorical items and user-specified covariates using an EM algorithm.
lcacov(formula, data, freq, groups, nclass = 2, reference = 1, constrain.rhos = F, constrain.alphas = F, iseeds = NULL, iter.max = 5000, tol = 1e-06, starting.values = NULL, flatten.rhos = 0, stabilize.alphas = 0, flatten.gammas = 0, se.method = "STANDARD", weights, clusters, strata, subpop)
formula |
an object of class |
data |
an optional data frame, list or environment containing
the variables in the model. If |
freq |
optional frequency or count variable, to be used when data are aggregated. |
groups |
optional categorical grouping variable for multi-group analysis; see DETAILS. |
nclass |
number of latent classes to be fit. Response items are assumed to be conditionally independent within classes. There must be at least two latent classes. |
reference |
latent class to be used as the reference group in the baseline-category logistic model for class membership. |
constrain.rhos |
if |
constrain.alphas |
if |
iseeds |
two integers to initialize the random number generator; see DETAILS. |
iter.max |
maximum number of iterations to be performed. Each iteration consists of an Expectation or E-step followed by a Maximization or M-step. The procedure halts if it has not converged by this many iterations. |
tol |
convergence criterion. The procedure halts if the maximum absolute change in all parameters (gammas and rhos) from one iteration to the next falls below this value. |
starting.values |
optional starting values for the model
parameters. This must be a list with two named components,
|
flatten.rhos |
optional flattening constant for estimation of rho-paramaters (item-response probabilities); see DETAILS. |
stabilize.alphas |
optional constant for stabilizing the estimated logistic coefficients; see DETAILS. |
flatten.gammas |
optional flattening constant for estimation of marginal gamma-parameters (class prevalences); see DETAILS. |
se.method |
method to use for computing standard errors; see DETAILS. |
weights |
optional numeric variable containing sampling weights for survey data; see DETAILS. |
clusters |
optional integer or factor variable containing sampling cluster identifiers; see DETAILS. |
strata |
optional integer or factor variable containing sampling stratum identifiers; see DETAILS. |
subpop |
optional logical variable indicating a subpopulation to which the model is to be fit; see DETAILS. |
This function fits a latent-class model with covariates. Under this model, polytomous response variables are assumed to be conditionally independent given a latent classification variable. Individuals' probabilities of class membership are related to covariates by a baseline-category logistic model. Models of this type were described by Dayton and Macready (1988), Bandeen-Roche et al. (1997) and others.
The formula
argument should have the form
cbind(Y1,Y2,...)~X1+X2+...
. The term on the left-hand
side of ~
is a matrix (not a data frame) of variables to be
used as responses. Each response variable should consist of integer
codes 1,2,...
. Response variables may
also be factors, in which case they will be converted to integer codes
(as in the function unclass
), and the levels
of the
factors will be ignored. The terms on the right-hand side of
formula
are covariates that predict class membership.
If the right-hand side of formula
is
1
, then lcacov
will fit a model that assumes
individuals' class-membership probabilities are constant. The results
will be equivalent to those from lca
, except that the
prevalences will be expressed as coefficients
from an intercept-only logistic model, whereas lca
expresses
them as class-membership probabilities.
Missing values in response variables are allowed and should be
conveyed by the R missing value code NA
. Cases with missing
responses are retained in the fitting procedure, and the missing
values are assumed to be ignorably missing or missing at random.
The number of latent classes to be fit is determined by
nclass
. There must be at least two latent classes. To fit a
one-class model, use lca
.
By default, each case (row) of data
or the model environment is
assumed to represent one observational unit or individual. Data may
also be aggegated, with individuals bearing identical responses to all
variables (including NA
's, if present) collapsed into a single
case, with frequencies conveyed through the numeric variable
freq
.
The unknown parameters to be estimated are the logistic coefficients, which we call alpha's, and the item-response probabilities, which we call rho's. Estimation proceeds by an EM algorithm which, by default, computes maximum-likelihood estimates of these parameters. EM may converge to a global or local maximum, possibly on the boundary of the parameter space where one or more rho-parameters are zero. Sparseness in the data may cause one or more logistic coefficients to diverge to plus or minus infinity. These conditions may be remedied through the use of flattening and stabilizing constants, which are described below.
The function lcacov
also estimates population-average class
prevalences, which we call marginal gamma's; when covariates are
present, these should be interpreted as overall rates of class
membership, averaged over the distribution of the covariates.
This function supports multi-group analyses in which the
alpha- or rho-parameters are allowed to vary across levels of a
categorical variable groups
. This feature is useful
for testing invariance of measurement (i.e., equality of
rho-paraneters) across groups. The groups
variable, if present,
should be integers coded as 1,2,...,ngroups
, where
ngroups
is the number of distinct groups. The groups
variable may also be a factor, and the levels
of this factor
will appear in the printed output summaries. If groups
is not
present, then ngroups
is taken to be 1
. Constraints on
the group-specific parameters are conveyed by constrain.rhos
and constrain.alphas
.
Optional starting values for parameters are provided through
starting.values
. This argument should be a list with two named
components, rho
and alpha
. The component rho
should be an array of dimension
c(nitems,maxlevs,nclass,ngroups)
, where nitems
is the
number of response variables on the left-hand side of formula
,
maxlevs
is the maximum number of levels (distinct response
categories) among the response variables, nclass
is the number
of latent classes, and ngroups
is the number of groups (equal
to 1
if groups
is not supplied). The element
starting.values$rho[j,k,c,g]
is the probability that an
individual in group g
and class c
supplies a response of
k
to item j
. The component alpha
should be an
array of dimension c(ncovs,nclass,ngroups)
, where ncovs
is the number of predictors in the logistic model (including a
constant term for the intercept, if present). The elements of
starting.values$alpha[,c,g]
are the coefficients determining
the log-odds of membership in class c
, versus the reference
class, for individuals in group g
. If c
is the
reference class, then all elements of
starting.values$alpha[,c,g]
must be zero.
If starting.values
is not supplied, or if
starting.values$rho=NULL
, then starting values for the
rho-parameters will be randomly
generated. This function uses its own internal random number generator
which is seeded by two integers, for example, seeds=c(123,456)
,
which allows results to be reproduced in the future. If
seeds=NULL
then the function will seed itself with two random
integers from R. Therefore, results can also be made reproducible by
calling set.seed
beforehand and taking
seeds=NULL
. Different starting values for the rhos's may lead
to solutions in which the classes have different orderings. To
reorder the classes in the printed output summaries, use the function
permute.class
. If starting.values
is not
supplied, or if starting.values$alpha=NULL
, then starting
values for the alpha's will be set to zero, which implies uniform
class-membership probabilities in all classes for all individuals.
Rho-parameters at or near a boundary, which are commonplace in
latent-class analysis, create difficulty when computing
derivative-based standard errors, because the likelihood function at
the solution may not be log-concave. The argument flatten.rhos
allows the user to supply a positive flattening constant to smooth the
estimated rho's toward the interior of the parameter space. A value
of flatten.rhos=1
, which should be adequate in most cases,
supplies information equivalent to one prior observation for each
response item in each class and each group, distributed fractionally
across the response categories in equal amounts.
During the estimation procedure, one or more estimated
alpha-parameters may diverge toward plus or minus infinity, a
condition that is sometimes called quasi-separation (Agresti, 2002).
It suggests that all the individuals in a given class have identical
values for one or more covariates, or that a group of covariates is
collinear within a class. This tends to happen in sparse-data
situations where the number of covariates is large or one or more
classes are rare. If the logistic estimation procedure fails to
converge, one possible remedy is to eliminate some covariates from the
model. Another possibility is to introduce a small amount of prior
information to stabilize the coefficients.. Clogg et al. (1991)
described a procedure for sparse logistic regression that smooths the
estimated slopes toward zero. It can be viewed as an application of a
penalty function or a data-dependent prior distribution.
In effect, this prior adds
a fictitious fractional number of observations to each case in the
dataset, distributed across classes in proportions determined by the
class prevalences from a model without covariates. The total number of
fictitious observations is equal to ncovs*ngroups
multiplied by
stabilize.alphas
. Setting stabilize.alphas=1
should, in
most cases, be adequate to solve the problem.
If the value of stabilize.alphas
is positive, then lcacov
will first run the EM algorithm without covariates to estimate the
class prevalences. Poor starting values may cause one or more class
prevalences from this model to go to zero. Prevalences for the
model without covariates may be flattened by the argument
flatten.gammas
, and a value of 1
should usually be
adequate to solve the problem. If stabilize.alphas=0
, then
flatten.gammas
will be ignored.
In summary, any problems due to
sparseness, boundary estimates or poor starting values can usually be
overcome by setting flatten.rhos=1
, stabilize.alphas=1
and flatten.gammas=1
.
Acceptable values for the argument se.method
are
"STANDARD"
,"FAST"
, "SANDWICH"
or "NONE"
.
If "STANDARD"
, then standard errors are obtained by inverting the
matrix of (minus one times) the second derivatives of the loglikelihood
function (plus penalty terms for flattening constants, if present) at
the solution. If "FAST"
, then the matrix of second derivatives
is approximated by the sum of the outer products of individuals'
contributions to the score functions (first derivatives). If
"SANDWICH"
, then the sum of the score outer-products is pre-
and post-multiplied by the inverse of the second derivatives. If
"NONE"
, then computation of first- and second-derivatives and
standard errors is suppressed.
Population average class prevalences, which we call marginal gamma's,
are estmated by averaging the estimated posterior probabilities of
class membership across individuals. This can be viewed as an
application of “expected estimating equations” described by Wang et
al. (2008). Standard errors for these parameters are computed by a
sandwich procedure, regardless of the value of se.method
.
Survey weights may be supplied via the argument weights
.
Weights should not be confused with frequencies as supplied by
freq
. A frequency of 10
indicates that ten individuals
in the sample exhibited the given pattern of responses, but a survey
weight of 10
indicates that one sampled individual is
representing ten individuals in the population. The same variable
supplied as freq
or weights
will lead to identical
estimates, but the standard errors may be drastically different. You
cannot supply both freq
and weights
; data from a survey
with unequal probabilities of selection must be supplied in
disaggregated form. If a weights
variable is present, then
se.method
is automatically set to "SANDWICH"
, and the
inner matrix of the sandich formula (i.e., the meat of the sandwich)
is an estimated covariance matrix for the total quasi-score.
Weights provided with large, nationally-representative survey datasets
are often very large, because their sum estimates the size of the
population. Many data analysts are accustomed to rescaling survey
weights to have a mean of one, so that they sum to the sample size
rather than the population size. Rescaling a weights
variable
will have no effect on estimates or standard errors, because the scale
factor cancels out when se.method="SANDWICH"
.
Additional information about the sample design may be supplied via
clusters
and strata
. If weights
is supplied,
then the sampling plan is assumed to fall within the general class of
with-replacement (WR) designs. At the first stage, clusters are drawn
with replacement. Then individuals are selected within clusters,
possibly with unequal probabilities, possibly in multiple stages. The
clusters
and strata
variables should be integers or
factors. The integers serve merely as identifiers; the actual values
are unimportant. Cluster identifiers are assumed to be unique within
strata, so that cluster 1
in stratum 1
and cluster
1
in stratum 2
are assumed to be different. If
clusters
is not supplied, then each sampled individual is
assumed to be a cluster. If strata
is not supplied, then one
stratum is assumed for the whole population. Note a weight variable is
required for complex survey data; if weights=NULL
, then
clusters
and strata
will be ignored.
It is often useful to fit a model that describes only a part of the
full population (e.g., females). With a simple random sample, it is
acceptable to remove the sampled individuals who are not in this
subpopulation (e.g, males) from the data frame or model environent,
because those who remain will be then a simple random sample of the
subpopulation of interest. With a complex survey design, however,
discarding individuals who do not belong to the subpopulation may lead
to incorrect standard errors, because the overall design may not scale
down to the subpopulation. To fit a model to a subpopulation when
weights
is present, supply a logical variable subpop
whose elements are TRUE
for members of the subpopulation. If
subpop
is supplied and weights=NULL
, individuals outside
of the subpopulation will be ignored when computing estimates and
standard errors.
A list whose class
attribute has been set to "lcacov"
.
A nicely formatted summary of this object may be seen by applying
the summary
method, but its components may also be accessed
directly. Components which may be of interest include:
ncases |
number of ncases (rows) from the data frame or model environment used in the procedure. |
nitems |
number of response variables appearing in the model. |
nlevs |
integer vector of length |
ngroups |
number of groups present in a multi-group analysis,
or |
iter |
number of EM iterations performed. |
converged |
logical value indicating whether the algorithm
converged by |
loglik |
vector of length |
loglik.final |
value of the loglikelihood or pseudo-loglikelihood function after the final iteration. |
logpost |
vector of length |
logpost.final |
value of the log-posterior density after the final iteration. |
AIC |
Akaike's information criterion (smaller is better). Will
be |
BIC |
Bayesian information criterion (smaller is better). Will
be |
param |
estimated parameters after the final
iteration. This is a list with two named components, |
post.probs |
matrix of estimated posterior probabilities of class membership given the observed items. This is a matrix with rows corresponding to cases or rows of the dataset and columns corresponding to classes. |
se.fail |
logical value equal to |
se.rho |
array with same dimensions as |
se.alpha |
array with same dimensions as |
dim.theta |
number of free parameters estimated in the
model. The free parameters correspond to the elements of
|
theta |
vector of length |
cov.theta |
estimated covariance matrix for the
parameters in |
score |
vector of first derivatives of the loglikelihood of
pseudo-loglikelihood (plus penalty terms for flattening constants,
if present)
with respect to the free parameters in |
hessian |
matrix of (minus one times) the second derivatives of
the loglikelihood or pseudo-loglikehood (plus penalty terms for
flattening constants, if present) with respect to the free parameters in
|
sandwich.meat |
inner matrix of the sandwich variance formula.
This is an empirical estimated covariance matrix for |
deff.trace |
the design-effect trace, if |
marg.gamma |
estimated marginal class prevalences. This is a
matrix of dimension |
se.marg.gamma |
matrix with the same dimensions as
|
cov.marg.gamma |
covariance matrix for the estimated marginal
class prevalences. To see the correspondence between this matrix
and the elements of |
wald.stat.alpha |
Wald-type statistics for testing the
significance of each term on the right-hand side of
|
wald.pval.alpha |
significance values for the test statistics
in |
Joe Schafer
Send questions to mchelpdesk@psu.edu
Agresti, A. (2002) Categorical Data Analysis (2nd Ed.). New York: Wiley.
Bandeen-Roche, K., Miglioretti, D.L., Zeger, S.L., and Rathouz, P.R. (1997) Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92, 1375-1386.
Clogg, C.C., Rubin, D.B., Schenker, N., Schultz, B., and Weidman, L. (1991) Multiple imputation of industry and occupation codes in census public-use samples using Bayesian logistic regression. Journal of the American Statistical Association, 86, 68-78.
Dayton, C.M. and Macready, G.B. (1988) Concomitant-variable latent-class models. Journal of the American Statistical Association, 83, 173-178.
Wang, C.Y., Huang, Y., Chao, E.C. and Jeffcoat, M.K. (2008) Expected
estimating equations for missing data, measurement error, and
misclassification, with application to longitudinal nonignorable
missing data. Biometrics, 64, 85-95.
For more information on using this function and other functions in
the LCCA package, see the manual LCCA Package for R, Version 1
in the subdirectory doc
.
compare.fit
,
lca
,
lcacov.datasim
,
permute.class
,
summary.lcacov
## recode variables in abortion attitudes dataset, ## setting NAP and DK to missing data(abortion) abortion$ABANY <- factor(abortion$ABANY, levels=c("YES","NO")) abortion$ABDEFECT <- factor(abortion$ABDEFECT, levels=c("YES","NO")) abortion$ABHLTH <- factor(abortion$ABHLTH, levels=c("YES","NO")) abortion$ABNOMORE <- factor(abortion$ABNOMORE, levels=c("YES","NO")) abortion$ABPOOR <- factor(abortion$ABPOOR, levels=c("YES","NO")) abortion$ABRAPE <- factor(abortion$ABRAPE, levels=c("YES","NO")) ## fit a three-class model with sex as a covariate set.seed(456) fit <- lcacov( cbind(ABANY,ABDEFECT,ABHLTH,ABNOMORE,ABPOOR,ABRAPE) ~ SEX, data=abortion, nclass=3, weights=WTSSNR, flatten.rhos=1 ) summary(fit, show.wald.alpha=TRUE)