LCA and LTA Modeling FAQs

This page addresses FAQs about latent class and latent transition modeling. For questions about PROC LCA and PROC LTA in SAS software, see the LCA and LTA Software FAQ.

Overview of LCA and LTA

What are latent class analysis and latent transition analysis?

Latent class analysis (LCA) is a modeling technique based on the idea that individuals can be divided into subgroups based on an unobservable construct. The construct of interest is the latent variable and the subgroups are called latent classes. True latent class membership is unknown for each individual due to measurement error, but we infer an individual’s membership by measuring the construct with multiple indicators. The indicators are typically categorical; when the indicators are continuous, the technique is typically called latent profile analysis (LPA). The latent classes are assumed to be mutually exclusive and exhaustive: each individual belongs to one and only one latent class, even though measurement error leaves us uncertain which one.

LCA typically uses cross-sectional data to identify subgroups at a single time point; in this sense we think of class membership as being static. Latent transition analysis (LTA) is an extension of LCA used with longitudinal data where individuals transition between latent classes over time; in this sense we think of class membership as being dynamic and class membership represents a developmental stage. In LTA, development is represented as movement through the stages over time and the technique is particularly well-suited to testing stage-sequential developmental theories (e.g., the transtheoretical model); different individuals may take different paths through the stages.

What types of variables do I need for LCA or LTA?

Latent class variables can be measured with categorical items (this model is referred to as latent class analysis) or continuous items (this model is referred to as latent profile analysis). PROC LCA and PROC LTA require categorical manifest variables as indicators of the latent variables. Indicators need not be binary (such as yes/no); they can have three or more unordered categories (such as Democrat, Republican, Independent).

What is the difference between latent variables and manifest variables?

Latent variables are unobserved variables that are measured by multiple observed variables (also called manifest variables, items, or indicators of the latent variables). Most often, the manifest variables correspond to questionnaire items.

Continuous latent variables may be more familiar. This is what factor analysis is designed to measure; the factor model assumes that an individual’s true score along a continuum is not known. Categorical latent variables, also called latent class variables, can be measured with categorical items (this is LCA) or continuous items (this is latent profile analysis). LCA posits that an individual’s true class membership is not known but must be inferred from a set of manifest variables.
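
To make the idea of inferring class membership from manifest variables concrete, here is a minimal Python sketch. It assumes the standard LCA setup in which items are independent within a class, and it uses made-up class prevalences (gammas) and item-response probabilities (rhos); the function name and all numbers are hypothetical and are not output from PROC LCA.

```python
from math import prod

def posterior_class_probs(responses, gammas, rhos):
    """Posterior probability of each latent class given one response pattern.

    responses: list of 0-based response categories, one per item
    gammas:    list of class prevalences (should sum to 1)
    rhos:      rhos[c][j][r] = P(item j = category r | class c)
    """
    # Joint probability of class c and the observed pattern, assuming the
    # items are independent within a class (local independence).
    joint = [
        gamma * prod(rhos[c][j][r] for j, r in enumerate(responses))
        for c, gamma in enumerate(gammas)
    ]
    total = sum(joint)  # marginal probability of the response pattern
    return [p / total for p in joint]

# Hypothetical two-class model with three yes/no items (0 = no, 1 = yes).
gammas = [0.7, 0.3]
rhos = [
    [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1]],  # class 0: mostly "no"
    [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]],  # class 1: mostly "yes"
]

# Someone who answers "yes" to all three items.
posterior = posterior_class_probs([1, 1, 1], gammas, rhos)
print(posterior)
```

Even though class 1 is the smaller class in this toy example, a "yes, yes, yes" pattern points strongly toward it. Membership is still only probable, not certain.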

Do I need cross-sectional or longitudinal data?

The answer depends on the kind of model you wish to fit. If you are interested in determining the number of latent classes at a single measurement occasion, and in identifying and describing those classes, you are interested in LCA, and cross-sectional data are all that you need. You can fit LCAs using a variety of software packages, including PROC LCA, Mplus, and Latent GOLD. If instead you are interested in estimating transitions between latent classes over time, you are interested in LTA and you require longitudinal data (i.e., two or more measurement occasions). You can fit LTAs using PROC LTA and Mplus.

In a special type of LCA, called a repeated measures LCA, longitudinal data may be used with LCA so that the latent classes represent trajectories across multiple measurement occasions. For more information about this special type of LCA, see Lanza and Collins (2006).

Lanza, S. T., & Collins, L. M. (2006). A mixture model of discontinuous development in heavy drinking from ages 18 to 30: The role of college enrollment. Journal of Studies on Alcohol, 67, 552-561.

How is LCA different from cluster analysis?

For a concise discussion comparing LCA to K-means cluster analysis, see the following article:

Magidson, J., & Vermunt, J. K. (2002). Latent class models for clustering: A comparison with K-means. Canadian Journal of Marketing Research, 20, 37-44.

Available from Statistical Innovations at www.statisticalinnovations.com.

How are longitudinal latent class analysis (LLCA) and LTA different?

Longitudinal latent class analysis (LLCA) and latent transition analysis (LTA) are two different approaches to modeling change over time in a construct that is discrete, as opposed to continuous. (Very often, continuous change over time is modeled using growth curve analysis, such that the population mean level is estimated as a smooth function of time.) Discrete change may be quantified using a single indicator of the outcome at each assessment time point, or using multiple indicators at each time point. Multiple indicators would be used to measure latent class membership at each time point.

LTA estimates latent class membership at time t+1 conditional on membership at time t; in other words, individuals’ probabilities of transitioning from a particular latent class at time t to another latent class at time t+1 are estimated. LTA is a Markov model, estimating transitions from Time 1 to Time 2, Time 2 to Time 3, and so on. This allows one to estimate, for example, the probability of membership in a Heavy Drinking latent class at Time 2 given that one belonged to the Non-User latent class at Time 1.
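
The Markov structure described above can be illustrated with a short Python sketch. The transition probabilities and Time 1 prevalences below are invented for illustration, as is the "Light Drinking" status; only the Non-User and Heavy Drinking labels come from the example in the text.

```python
def next_status_prevalences(deltas, taus):
    """Prevalence of each latent status at time t+1.

    deltas: deltas[a] = P(status a at time t)
    taus:   taus[a][b] = P(status b at time t+1 | status a at time t)
    """
    n = len(deltas)
    return [sum(deltas[a] * taus[a][b] for a in range(n)) for b in range(n)]

statuses = ["Non-User", "Light Drinking", "Heavy Drinking"]

# Hypothetical Time 1 prevalences and Time 1 -> Time 2 transition matrix.
deltas_t1 = [0.6, 0.3, 0.1]
taus = [
    [0.80, 0.15, 0.05],  # from Non-User
    [0.10, 0.70, 0.20],  # from Light Drinking
    [0.05, 0.15, 0.80],  # from Heavy Drinking
]

# P(Heavy Drinking at Time 2 | Non-User at Time 1) is a single matrix entry.
p_heavy_given_nonuser = taus[0][2]

# Marginal prevalences at Time 2 follow from the Time 1 prevalences and taus.
deltas_t2 = next_status_prevalences(deltas_t1, taus)
for name, prevalence in zip(statuses, deltas_t2):
    print(f"{name}: {prevalence:.3f}")
```

Each row of the transition matrix sums to one because everyone in a given status at time t must be in some status at time t+1.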

In contrast, LLCA (also referred to as repeated-measures latent class analysis, or RMLCA) is a latent class model where the indicators of the latent class include one or more variables assessed at multiple time points. In concept, this approach is analogous to growth curve modeling in that patterns of responses across all time points are characterized, except that in LLCA change over time is discrete. Lanza & Collins (2006) present an introduction to LLCA where patterns of heavy drinking across six time points are examined, and membership in those developmental patterns is predicted from college enrollment.

The book by Collins & Lanza (2010) describes the differences between LLCA and LTA in greater detail (see Chapter 7).

Collins, L. M., & Lanza, S. T. (2010). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. New York: Wiley.

Lanza, S. T., & Collins, L. M. (2006). A mixture model of discontinuous development in heavy drinking from ages 18 to 30: The role of college enrollment. Journal of Studies on Alcohol, 67, 552-561.

Fitting Your Model

How big should my sample size be in order to conduct LCA or LTA?

This depends on a number of factors, such as the overall size of the contingency table and how saturated the model is. In our experience, LTA works best with sample sizes of at least 300. Larger samples may be needed for some problems, particularly those where many indicators are involved or the model is complex. Remember that in many cases it is reasonable (and often desirable) to restrict measurement error (i.e., constrain rho parameters to be equal) across times and/or groups.

What should I do if I have many variables and I want to include them all as indicators?

We recommend creating composite variables for use as indicators. This is often a great way to obtain good manifest indicators.

What does it mean that the model has an identification problem?

In order for parameter estimation to proceed properly, there must be enough independent information available from data to produce the parameter estimates. Identification problems tend to occur under the following conditions: a lot of parameters are being estimated; the sample size is small in relation to the maximum possible number of response patterns; or the rho parameters are close to 1/number of response categories rather than closer to zero or one. Often, adding reasonable parameter restrictions in order to reduce the number of parameters being estimated will help to achieve an identified model.

What does a negative G-squared value mean?

A model should never have a negative G2. A negative G2 value signals that something is wrong, often with the input data. For example, a negative G2 value may be caused by a mistake in reading in the data, such as inputting data as response pattern proportions rather than integer frequencies.

Why is the number of degrees of freedom in my model negative?

A model should never have negative degrees of freedom. Degrees of freedom (df) are equal to the number of possible cells (k) minus the number of parameters estimated (p) minus one (df=k-p-1). A model will have negative degrees of freedom when the model is trying to estimate more parameters than it is possible to estimate. If you have negative degrees of freedom, reduce the number of latent classes or latent statuses, or add parameter restrictions to reduce the number of parameters being estimated.
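
As a worked illustration of df = k - p - 1, the following Python sketch counts parameters for an unrestricted LCA with no groups or covariates: one prevalence per class minus one, plus one item-response probability per class for each response category of each item minus one. The function name is hypothetical, and models with restrictions, groups, or covariates would have a different parameter count.

```python
from math import prod

def lca_degrees_of_freedom(levels_per_item, n_classes):
    """df = k - p - 1 for an unrestricted LCA without groups or covariates."""
    # k: number of possible response patterns (cells in the contingency table)
    k = prod(levels_per_item)
    # p: (C - 1) class prevalences plus C * sum(r_j - 1) item-response
    # probabilities, where r_j is the number of categories of item j
    p = (n_classes - 1) + n_classes * sum(r - 1 for r in levels_per_item)
    return k - p - 1

five_binary_items = [2, 2, 2, 2, 2]

df_three_classes = lca_degrees_of_freedom(five_binary_items, 3)  # 32 - 17 - 1
df_six_classes = lca_degrees_of_freedom(five_binary_items, 6)    # 32 - 35 - 1

print(df_three_classes, df_six_classes)
```

With five binary items there are only 32 cells, so a six-class model asks for more parameters than the table can support, and the degrees of freedom come out negative.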

Do I have to impose equality constraints on measurement error (i.e., rho) parameters across time?

No, you do not have to impose equality constraints. However, it is often a good idea to do so, because this keeps the meaning of the latent statuses the same over time. This corresponds to the idea of factor invariance in factor analysis. Sometimes, however, you may want to explore the latent class structure separately for each time to get a sense of what underlying groups there may be in your population. You also may want to model multiple times together without restricting measurement to be equal over time. In this case, the number of classes generally has to be equal unless you use a flexible structural equation modeling package that allows you to condition class membership at time 2 on class membership at time 1. One thing to keep in mind is that if a latent class does not exist at time 1 but does at time 2, it is okay for the class membership probability for that class to be (essentially) zero at time 1, with people transitioning into it at time 2. This could be substantively interesting. Also keep in mind that if you allow measurement error to vary across time, it is a good idea to run multiple sets of starting values because identification may be difficult with these larger models.

Why do some models with multiple time points take so long to run?

Run times increase exponentially as you add time points, especially when there are missing data and/or when there are many indicator variables with many levels. When there are missing data, run times can get very long for large numbers of indicators and/or levels per indicator. If your indicators have many levels, it may help to re-code your indicators so that they have 2 levels. Alternatively, if there are not very many individuals with missing data, you could try removing those cases, if it is appropriate.
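
A quick calculation shows why time points and response levels matter so much. This Python sketch simply counts the possible response patterns when the same items are measured at every time point; the function name is hypothetical, and actual run times also depend on the software and on missing data.

```python
from math import prod

def n_response_patterns(levels_per_item, n_times):
    """Number of possible response patterns across all time points."""
    per_time = prod(levels_per_item)  # patterns at a single time point
    return per_time ** n_times        # same items repeated at each time

four_level_items = [4, 4, 4, 4, 4]  # five items, four levels each
binary_items = [2, 2, 2, 2, 2]      # the same items re-coded to two levels

for n_times in (1, 2, 3):
    print(
        n_times,
        n_response_patterns(four_level_items, n_times),
        n_response_patterns(binary_items, n_times),
    )
```

At three time points, the four-level items imply more than a billion possible patterns, while the re-coded binary items imply 32,768.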

Selecting Your Model

How do I assess the fit of my model?

PROC LCA and PROC LTA provide the likelihood-ratio chi-square statistic, denoted G2, as well as the AIC and BIC information criteria. Goodness-of-fit, or absolute model fit, can be assessed by comparing the observed response pattern proportions to the response pattern proportions predicted by the model. If the model as estimated is a good representation of the data, then it will predict the response pattern proportions with a high degree of accuracy. A poor model will not be able to reproduce the observed response pattern proportions very well. The G2 statistic expresses the correspondence between the observed and predicted response patterns. For ordinary contingency table models, G2 follows a chi-square distribution; unfortunately, for the large contingency tables common in LTA, the chi-square distribution becomes an inaccurate approximation of the G2 distribution. A very rough rule of thumb is that a good model has a G2 value lower than its degrees of freedom. Relative model fit (that is, deciding which of several models is optimal in terms of balancing fit and parsimony) can be assessed with the AIC and BIC. Models with lower AIC and BIC are preferred.
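
For readers who want to see how G2 compares observed and predicted response patterns, here is a minimal Python sketch of the likelihood-ratio statistic, G2 = 2 * sum(observed * ln(observed / expected)). The frequencies and predicted proportions are invented, and the function name is hypothetical.

```python
from math import log

def g_squared(observed, expected):
    """Likelihood-ratio statistic comparing observed and expected frequencies.

    Cells with an observed frequency of zero contribute nothing to the sum.
    """
    return 2.0 * sum(
        o * log(o / e) for o, e in zip(observed, expected) if o > 0
    )

# Hypothetical observed frequencies for four response patterns (N = 100)
# and the proportions a fitted model predicts for the same patterns.
observed = [30, 20, 25, 25]
predicted_proportions = [0.25, 0.25, 0.25, 0.25]

n = sum(observed)
expected = [n * p for p in predicted_proportions]

g2 = g_squared(observed, expected)
print(round(g2, 3))
```

Note that the observed values are integer frequencies and the expected values are on the same frequency scale. Mixing proportions with frequencies is one way to end up with a nonsensical G2.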

What do I do if the AIC and BIC do not agree?

AIC and BIC are both penalized-likelihood criteria. AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so that a lower AIC means a model is considered to be closer to the truth. BIC is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model. Both criteria are based on various assumptions and asymptotic approximations. Each, despite its heuristic usefulness, has therefore been criticized as having questionable validity for real-world data. But despite various subtle theoretical differences, their only difference in practice is the size of the penalty; BIC penalizes model complexity more heavily. The only way they should disagree is when AIC chooses a larger model than BIC. In general, it might be best to use AIC and BIC together in model selection. For example, in selecting the number of latent classes in a model, if BIC points to a three-class model and AIC points to a five-class model, it makes sense to select from models with 3, 4 and 5 latent classes. AIC is better in situations when a false negative finding would be considered more misleading than a false positive, and BIC is better in situations where a false positive is as misleading as, or more misleading than, a false negative.
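
The strategy of using AIC and BIC together can be sketched in Python as follows. We assume here one common formulation for contingency-table models, AIC = G2 + 2p and BIC = G2 + p ln(N); check your software's documentation for the exact definitions it uses. The G2 values, parameter counts, and sample size are invented for illustration.

```python
from math import log

def aic(g2, n_params):
    # Assumed formulation: AIC = G2 + 2p
    return g2 + 2 * n_params

def bic(g2, n_params, n_obs):
    # Assumed formulation: BIC = G2 + p * ln(N)
    return g2 + n_params * log(n_obs)

n_obs = 1000

# Hypothetical results for 2- to 6-class models with eight binary items:
# number of classes -> (G2, number of parameters estimated)
fits = {
    2: (400.0, 17),
    3: (300.0, 26),
    4: (270.0, 35),
    5: (245.0, 44),
    6: (240.0, 53),
}

best_by_aic = min(fits, key=lambda c: aic(*fits[c]))
best_by_bic = min(fits, key=lambda c: bic(*fits[c], n_obs))

# When the criteria disagree, consider every model between the two choices.
candidates = list(range(min(best_by_aic, best_by_bic),
                        max(best_by_aic, best_by_bic) + 1))
print(best_by_aic, best_by_bic, candidates)
```

In this toy example BIC points to three classes and AIC to five, so the three-, four-, and five-class models would all be examined for interpretability.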

Interpreting Your Model

How do I interpret the measurement error (i.e., rho) parameters? Is the latent variable being measured well by the indicators in the model?

You are probably familiar with another latent variable model, factor analysis. In factor analysis, a manifest variable’s loading on a factor represents the relation between the variable and the factor. Because factor loadings are regression coefficients, a factor loading of zero represents no relation between the manifest variable and the factor, whereas larger factor loadings reflect a stronger relation. In other words, all else being equal, when factor loadings are large the latent variable is being measured better. In latent class and latent transition models, rho parameters play the same conceptual role as factor loadings; however, they are NOT regression coefficients, so they are scaled differently and their interpretation is somewhat different. Rho parameters are probabilities. The closer the rho parameters are to zero and one, the more the responses to the manifest variable are determined by the latent class or latent status. The closer the rho parameters are to 1/(number of response alternatives) – this is .5 for binary variables – the weaker the relation between the manifest variable and the latent class/status. In other words, all else being equal, when rho parameters are close to zero and one, the latent variable is being measured better. Another consideration is the overall pattern of rho parameters. Ideally, the pattern of rho parameters clearly identifies latent classes/statuses with distinguishable interpretations. This is similar to the concept of simple structure in factor analysis.

Advanced Questions

Why might I want to use cross-validation?

Cross-validation can be an alternative to traditional goodness-of-fit testing when there are several plausible models to be compared and there are problems associated with the distribution of the G2 fit statistic, such as when the sample size is small. Cross-validation involves splitting a sample into two (or more) subsamples, for example, Sample A and Sample B, and fitting a series of plausible models to each sample. Each model is fitted to Sample A (the calibration sample), the predicted response frequencies for each model are compared to the observed response frequencies in Sample B (the cross-validation sample), and G2 is computed. Then the reverse is done: each model is fitted to Sample B (now the calibration sample), the predicted response frequencies for each model are compared to the observed response frequencies in Sample A (now the cross-validation sample), and another G2 is computed. A model cross-validates well if the G2 is relatively small when the estimated model is applied to a cross-validation sample. When a series of models is tested, the model or models that cross-validate best are considered best-fitting.
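
The two-direction procedure can be sketched in Python as shown here. To keep the example self-contained, the "model" is a toy stand-in: a one-class LCA for binary items, which predicts each response pattern as a product of item marginals. In practice the fitting step would be done in your LCA software for each candidate model; the function names and data are hypothetical.

```python
from collections import Counter
from math import log, prod

def fit_one_class(sample):
    """Toy stand-in for model fitting: a one-class LCA for binary items.

    Returns a function giving the predicted proportion of a response pattern.
    """
    n = len(sample)
    n_items = len(sample[0])
    p_yes = [sum(row[j] for row in sample) / n for j in range(n_items)]

    def predicted_proportion(pattern):
        return prod(p if y == 1 else 1 - p for p, y in zip(p_yes, pattern))

    return predicted_proportion

def cross_validated_g2(calibration, validation, fit):
    """Fit on the calibration sample, then compute G2 on the validation sample.

    Assumes every pattern observed in the validation sample has a nonzero
    predicted proportion under the fitted model.
    """
    predict = fit(calibration)
    observed = Counter(validation)  # observed response pattern frequencies
    n = len(validation)
    return 2.0 * sum(
        f * log(f / (n * predict(pattern))) for pattern, f in observed.items()
    )

# Hypothetical subsamples; each tuple is one person's responses to two items.
sample_a = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (0, 0)]
sample_b = [(1, 1), (0, 0), (1, 1), (0, 0), (1, 0), (1, 1)]

g2_a_to_b = cross_validated_g2(sample_a, sample_b, fit_one_class)
g2_b_to_a = cross_validated_g2(sample_b, sample_a, fit_one_class)
print(g2_a_to_b, g2_b_to_a)
```

Repeating this for each candidate model, with the appropriate fitting routine passed in place of fit_one_class, gives a pair of cross-validation G2 values per model to compare.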

Can violations of the assumption of local independence among manifest variables be assessed?

One straightforward approach is to assign individuals to the latent class in which they have the highest posterior probability of membership (these probabilities can be saved in a SAS data file using the OUTPOST option). Then, relationships among all indicators of your latent class variable can be explored separately for each group of individuals (i.e., for each class).
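
The assignment step can be sketched in Python as follows, assuming the posterior probabilities have already been exported and read in as one row per individual. The data and function names are hypothetical and do not reflect the layout of any particular output file.

```python
def modal_assignment(posteriors):
    """Assign each individual to the class with the highest posterior.

    posteriors[i][c] = posterior probability that individual i is in class c
    """
    return [max(range(len(row)), key=row.__getitem__) for row in posteriors]

def split_by_class(responses, assignments):
    """Group individuals' item responses by their assigned class."""
    groups = {}
    for row, assigned_class in zip(responses, assignments):
        groups.setdefault(assigned_class, []).append(row)
    return groups

# Hypothetical posterior probabilities and item responses for four people.
posteriors = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.55, 0.45],
    [0.05, 0.95],
]
responses = [(0, 0, 0), (1, 1, 0), (0, 1, 0), (1, 1, 1)]

assignments = modal_assignment(posteriors)
groups = split_by_class(responses, assignments)
print(assignments)
print(groups)
```

Associations among the indicators can then be examined within each group, for example with cross-tabulations of item pairs. Keep in mind that the third person above is assigned to class 0 with a posterior of only .55, so this kind of assignment ignores uncertainty in class membership.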

More sophisticated (and statistically sound) ways to explore local dependence have been developed. One of these procedures is outlined by Bandeen-Roche, Miglioretti, Zeger, and Rathouz (1997), who multiply impute latent class membership and look for violations of this assumption within each imputation.

Can I assign individuals to latent classes or latent statuses based on their posterior probabilities?

We do not recommend assigning individuals to latent classes or latent statuses based on their posterior probabilities unless there is no viable alternative. By assigning individuals to latent classes or latent statuses, you introduce error into your results. There are many different types of analyses that can be performed within the latent class modeling framework (e.g., predicting latent class membership) without having to assign individuals to latent classes or latent statuses. When possible, we recommend working within the latent class modeling framework because it incorporates measurement error into the model, which is ignored by class/status assignment. If you are planning to assign individuals based on posterior probabilities, one article of interest might be:

Goodman, L. A. (2007). On the assignment of individuals to latent classes. Sociological Methodology, 37(1), 1-22. doi: 10.1111/j.1467-9531.2007.00184.x

In this paper, Goodman describes two ways to assign individuals and two criteria that can be used to assess when class assignment is satisfactory and when it is not. If you assign individuals to classes/statuses, we recommend evaluating the amount of measurement error introduced by doing so.

Update: Recent work on measurement error weighting with modal class assignment based on posterior probabilities has proposed a high-quality way to reduce attenuation when assigning individuals to latent classes and conducting a follow-up analysis. In simulation studies, this approach has been shown to work quite well and is robust to violations of homoscedasticity across classes (e.g., in an outcome). This approach, commonly referred to as the “BCH approach”, is the currently recommended approach to latent class analysis with a continuous or binary outcome. You can read more about this approach in the following articles:

Bakk, Z., & Vermunt, J. K. (2016). Robustness of stepwise latent class modeling with continuous distal outcomes. Structural Equation Modeling, 23, 20-31. doi: 10.1080/10705511.2014.955104

Dziak, J. J., Bray, B. C., Zhang, J.-T., Zhang, M., & Lanza, S. T. (2016). Comparing the performance of improved classify-analyze approaches for distal outcomes in latent profile analysis. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 12, 107-116. doi: 10.1027/1614-2241/a000114
