Predicting data collection characteristics using partial data versus a Bayesian approach
Jul 26, 13:45
Understanding the characteristics of the data collection process is critical for managing survey operations and costs, and for improving future survey rounds. Information about how successful a particular data collection operation is at yielding response, or about how the historical response behavior of sample persons predicts their future behavior, can be used to assign cases to data collection operations and to better predict survey outcomes such as expected costs, response rates, and standard errors of estimates.
Often, expectations concerning data collection characteristics are developed from observations of past rounds of the specific survey in question, or from other similar surveys. Because sample cases in longitudinal surveys are interviewed multiple times, these surveys also have covariates and paradata on the past behavior of those cases. Observations from these past survey implementations, whether at the case level, the subgroup level, or a more general level, may be combined to develop a priori expectations, such as the expected response rate per dollar of incentive in an upcoming survey round. During the current survey implementation, progress and cost data may be used to extrapolate end-of-survey characteristics and compare them to the expected characteristics based on the prior round. Inferences may include statements like, “given the current progress rate, incentives are not as effective as last round.”
This extrapolation assumes that partial data collected during the early part of a follow-up wave of a longitudinal survey are representative of the data that will be collected later in that wave, and that they can reasonably be compared to expectations based on historical waves. However, complicating factors in longitudinal surveys, such as attrition, make those assumptions questionable. As a result, investigators may make common-sense adjustments to account for these difficulties, such as reducing the expected response rate by some percentage to reflect attrition.
Using data from the National Survey of College Graduates (NSCG), a longitudinal survey with a rotating panel design, this paper demonstrates a Bayesian method of combining historical and current data to obtain more accurate predictions of data collection characteristics. We compare the “true” end-of-wave parameter estimates to (a) current-wave parameter estimates as each day’s data are aggregated; (b) historical priors; and (c) posteriors that combine the historical priors with each day’s likelihood. We also incorporate uncertainty into these models to allow newer data, including current-wave data, to contribute more to parameter estimation. This Bayesian approach yields predictions that are closer to the true end-of-wave results than predictions based on partial data alone.
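The comparison of (a), (b), and (c) can be illustrated with a minimal sketch. The abstract does not specify the model, so everything here is an assumption made for illustration: a conjugate beta-binomial model for a single response propensity, a hypothetical `discount` factor that down-weights historical counts so that current-wave data contribute more, and made-up counts that are not NSCG data.

```python
from dataclasses import dataclass


@dataclass
class Beta:
    """Beta(alpha, beta) distribution over a response propensity."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def historical_prior(responses: int, cases: int, discount: float = 1.0) -> Beta:
    """Build a prior from a past wave (assumed form, not the paper's model).

    Starts from a flat Beta(1, 1) and adds the historical counts scaled by
    `discount` (between 0 and 1). A smaller discount means a weaker prior,
    so current-wave data move the posterior more.
    """
    return Beta(1.0 + discount * responses,
                1.0 + discount * (cases - responses))


def update(prior: Beta, responses: int, cases: int) -> Beta:
    """Conjugate update of the prior with cumulative current-wave counts."""
    return Beta(prior.alpha + responses, prior.beta + (cases - responses))


# Illustrative numbers only.
hist_responses, hist_cases = 600, 1000   # previous wave
cur_responses, cur_cases = 30, 40        # current wave, cumulative to date

partial_estimate = cur_responses / cur_cases                          # (a)
prior = historical_prior(hist_responses, hist_cases, discount=0.1)    # (b)
posterior = update(prior, cur_responses, cur_cases)                   # (c)

# Same update with the historical wave at full weight, for comparison.
full_weight_posterior = update(
    historical_prior(hist_responses, hist_cases, discount=1.0),
    cur_responses, cur_cases,
)

print(f"partial-data estimate: {partial_estimate:.3f}")
print(f"prior mean:            {prior.mean:.3f}")
print(f"posterior mean:        {posterior.mean:.3f}")
print(f"full-weight posterior: {full_weight_posterior.mean:.3f}")
```

In this sketch the posterior mean always lies between the prior mean and the partial-data estimate, and discounting the historical wave pulls it toward the current data. Rerunning `update` with each day's cumulative counts gives a daily posterior that can be compared with the end-of-wave value.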