Proper multiple imputation of clustered or panel data
Jul 26, 11:00
Allison (2001) states that the best solution to the missing data problem is prevention. This is especially true for complex data sets such as clustered or panel data. Panel data are a subclass of clustered data, and both can be analyzed with multilevel models. Missingness may occur at various levels: in the outcome variable(s), in level-1 predictors, in level-2 predictors, at even higher levels, and even in the group identifier(s). Many researchers still handle missingness (e.g. in level-1 and level-2 predictors of multilevel data) by excluding the incomplete cases from the analysis, a wasteful practice that may lead to biased inferences. However, none of the currently existing multiple imputation solutions for complex data can be described as optimal either. On the one hand, many of them rely rather heavily on strong distributional assumptions, often including homoscedasticity, which are frequently violated in “real life” situations. On the other hand, non- or semiparametric imputation methods often lack justification. Recent papers that contrast and review various strategies for imputing complex clustered or panel data are Kleinke, Stemmler, Reinecke, and Lösel (2011), Drechsler (2015), Enders, Mistler, and Keller (2016), Grund, Lüdtke, and Robitzsch (2016), and Lüdtke, Robitzsch, and Grund (2017). Shortcomings of some imputation techniques and consequences of misspecification even in simple data sets are considered, e.g., in de Jong, van Buuren, and Spiess (2016) and He and Raghunathan (2009). All in all, missing data in complex data structures, and specifically in panel data sets, is a field in which a lot of research remains to be done. Feasible and robust software solutions need to be developed that allow valid inferences even when empirical data do not exactly follow the convenient statistical distributions assumed by the respective procedures (e.g. de Jong, van Buuren, and Spiess, 2016).
The purpose of this paper is (a) to give an overview of recent research on multiple imputation of incomplete clustered or panel data, (b) to discuss advantages and disadvantages of the respective approaches, and (c) to provide practical guidelines on which imputation technique can be expected to work best in a given scenario. To this end, we present results of various Monte Carlo simulations, in which we investigate the consequences of misspecified imputation models for inferences in multilevel models. In particular, we consider covariate distributions that differ in skewness and kurtosis, and ignorable missing-data mechanisms that differ in their selectivity.
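To make the type of Monte Carlo setup described above more concrete, the following is a minimal sketch in Python; it is not the simulation design used in the paper. All names, parameter values, and the data-generating model (a random-intercept model with an exponentially distributed, hence skewed, level-1 covariate) are assumptions chosen for illustration. The sketch only contrasts the full-data estimate with complete-case analysis under an ignorable (MAR) mechanism whose selectivity can be varied; no imputation method is implemented, and the pooled least-squares slope is used as a simple point estimate in place of a fitted multilevel model.

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_once(rng, n_clusters=100, cluster_size=10, beta=0.5, selectivity=0.9):
    """One replication; returns (full-data bias, complete-case bias) of the slope.

    Hypothetical data-generating model (an assumption, not taken from the paper):
    y_ij = 1 + beta * x_ij + u_j + e_ij with u_j, e_ij ~ N(0, 1).
    """
    xs, ys = [], []
    for _ in range(n_clusters):
        u = rng.gauss(0.0, 1.0)            # cluster-level random intercept
        for _ in range(cluster_size):
            x = rng.expovariate(0.5)       # skewed level-1 covariate (mean 2)
            y = 1.0 + beta * x + u + rng.gauss(0.0, 1.0)
            xs.append(x)
            ys.append(y)

    # MAR mechanism: x is missing with probability `selectivity` when the fully
    # observed outcome lies above its median, and with 1 - selectivity otherwise.
    cutoff = sorted(ys)[len(ys) // 2]
    observed = [
        rng.random() >= (selectivity if y > cutoff else 1.0 - selectivity)
        for y in ys
    ]
    x_cc = [x for x, keep in zip(xs, observed) if keep]
    y_cc = [y for y, keep in zip(ys, observed) if keep]

    return ols_slope(xs, ys) - beta, ols_slope(x_cc, y_cc) - beta


def run(n_reps=100, seed=1, **kwargs):
    """Average both biases over n_reps Monte Carlo replications."""
    rng = random.Random(seed)
    results = [simulate_once(rng, **kwargs) for _ in range(n_reps)]
    mean_full = sum(r[0] for r in results) / n_reps
    mean_cc = sum(r[1] for r in results) / n_reps
    return mean_full, mean_cc


if __name__ == "__main__":
    full_bias, cc_bias = run()
    print(f"mean bias, full data:      {full_bias:+.3f}")
    print(f"mean bias, complete cases: {cc_bias:+.3f}")
```

Setting `selectivity` closer to 0.5 makes the mechanism less selective, so the complete-case bias should shrink. An imputation method under study would be applied to the incomplete data inside `simulate_once`, and its estimate compared against the same benchmarks.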