Diagnose missing data mechanisms (MCAR, MAR, MNAR) and design appropriate imputation strategies. Expert in missingness visualization, Little's MCAR test, and multiple imputation methods.
Missing data is not a uniform problem — how data is missing matters as much as how much is missing. A dataset where values are missing completely at random can be handled very differently from one where missingness is systematically related to the missing values themselves. Choosing the wrong imputation strategy can introduce bias that quietly invalidates your entire analysis or model. This AI role specializes in diagnosing missing data mechanisms and designing statistically appropriate responses.
The assistant begins with a thorough missingness characterization: computing per-column null rates, visualizing missingness patterns using matrix and heatmap plots (via missingno or equivalent), and identifying co-occurrence patterns — columns that tend to be missing together — that reveal structural missingness. It then guides you through the formal classification of missing data mechanisms: Missing Completely At Random (MCAR), where missingness is unrelated to any variable; Missing At Random (MAR), where missingness depends on observed variables; and Missing Not At Random (MNAR), where missingness is related to the unobserved missing value itself.
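As a rough illustration of the characterization step described above, the following sketch uses only pandas and NumPy. The toy DataFrame and its column names (`age`, `income`, `savings`) are hypothetical, and the indicator correlation is just one simple way to surface columns that tend to be missing together, not necessarily the method the assistant applies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'income' and 'savings' are always missing together.
df = pd.DataFrame({
    "age":     [25, 32, np.nan, 41, 58, 36],
    "income":  [40.0, np.nan, 52.0, np.nan, 75.0, 61.0],
    "savings": [5.0, np.nan, 8.0, np.nan, 20.0, 12.0],
})

# Boolean mask: True wherever a value is missing.
is_missing = df.isna()

# Fraction of missing values in each column.
null_rates = is_missing.mean()

# Number of rows sharing each distinct missingness pattern.
pattern_counts = is_missing.value_counts()

# Correlation between missingness indicators; values near 1 flag
# columns that tend to be missing in the same rows.
co_missing = is_missing.astype(int).corr()

print(null_rates)
print(pattern_counts)
print(co_missing.round(2))
```

A matrix or heatmap plot from a library such as missingno presents the same information visually; the tabular version here is only meant to show what those plots summarize.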
For MCAR assessment, the assistant applies Little's MCAR test and interprets the result in the context of your dataset: a significant result is evidence against MCAR, while a non-significant one does not prove MCAR holds. For MAR diagnosis, it helps you build missingness indicator variables and test their association with observed variables using logistic regression or chi-squared tests. Because MAR and MNAR cannot be told apart from the observed data alone, MNAR patterns are identified through domain reasoning and sensitivity analysis design.
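The indicator-based check for MAR can be sketched as follows, assuming pandas, NumPy, and SciPy are available. The data is simulated, and the column names (`group`, `income`, `income_missing`) and missingness probabilities are made up for illustration; the chi-squared test shown is one of the two options mentioned above.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 500

# Hypothetical survey data: an observed group label and an income value.
group = rng.choice(["A", "B"], size=n)
income = rng.normal(50, 10, size=n)

# Simulate MAR: income is more likely to be missing for group B,
# i.e. missingness depends on an observed variable.
p_missing = np.where(group == "B", 0.4, 0.1)
income[rng.random(n) < p_missing] = np.nan

df = pd.DataFrame({"group": group, "income": income})

# Missingness indicator: 1 where income is missing, 0 otherwise.
df["income_missing"] = df["income"].isna().astype(int)

# Contingency table of group vs. missingness, then a chi-squared test
# of independence between the two.
table = pd.crosstab(df["group"], df["income_missing"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4g}")
```

A small p-value here indicates that missingness in `income` is associated with `group`, which argues against MCAR and is consistent with MAR. It does not rule out MNAR, since that would depend on the unobserved income values themselves.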
Once the mechanism is characterized, the assistant recommends and implements the appropriate imputation strategy: complete case analysis for MCAR with low missingness rates; single imputation methods (mean, median, mode, forward fill, regression imputation) for MAR, with the understood limitation that they understate uncertainty and can distort distributions and relationships between variables; and multiple imputation using MICE (Multivariate Imputation by Chained Equations) for MAR data where valid inference is required. For MNAR data, it helps design sensitivity analyses to bound the potential bias.
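A minimal sketch of multiple imputation follows, assuming scikit-learn is installed. It uses scikit-learn's experimental `IterativeImputer`, which is inspired by MICE, as a stand-in for a dedicated MICE implementation, and pools the estimated mean of a simulated variable with Rubin's rules. The simulated data, the number of imputations `m = 5`, and the choice of the mean as the target estimate are all assumptions made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 300

# Simulated data: y depends on x.
x = rng.normal(0, 1, size=n)
y = 2.0 * x + rng.normal(0, 1, size=n)

# Simulate MAR: y is more likely to be missing when the observed x is large.
y_obs = y.copy()
y_obs[rng.random(n) < 1 / (1 + np.exp(-x))] = np.nan
data = np.column_stack([x, y_obs])

m = 5  # number of imputed datasets (illustrative choice)
estimates, variances = [], []
for seed in range(m):
    # sample_posterior=True draws imputations from a predictive
    # distribution, so each seed yields a different completed dataset.
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed = imputer.fit_transform(data)
    y_completed = completed[:, 1]
    estimates.append(y_completed.mean())           # per-dataset estimate of mean(y)
    variances.append(y_completed.var(ddof=1) / n)  # its squared standard error

# Rubin's rules: combine within- and between-imputation variance.
pooled_mean = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1 + 1 / m) * between

print(f"pooled mean of y: {pooled_mean:.3f}")
print(f"pooled standard error: {np.sqrt(total_var):.3f}")
```

The between-imputation term is what single imputation leaves out: it reflects the uncertainty about the missing values themselves, so the pooled standard error is never smaller than the average per-dataset one.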
Ideal for statisticians, data scientists, clinical researchers, survey analysts, and anyone working with real-world datasets where missing data threatens the validity of their conclusions.