How do you handle missing data in research?

Start by classifying why the data is missing, whether completely at random, at random, or not at random, because the reason determines which method is safe. Then choose an approach that fits, typically multiple imputation or maximum likelihood estimation rather than simply deleting cases. Finally, document the extent of missingness, your assumed mechanism, and the method, and report a sensitivity check.

What are the two common techniques to handle missing data?

The two most common techniques are deletion and imputation. Deletion, such as listwise deletion, removes cases with missing values and is safe only when data are missing completely at random. Imputation fills the gaps with estimated values, with multiple imputation being the stronger and more defensible form because it reflects the uncertainty of the missing values.

What is a useful strategy to use when you are missing data?

A useful strategy is multiple imputation, which generates several plausible complete versions of the dataset, analyses each, and pools the results so the final estimates carry the uncertainty introduced by the gaps. It generally outperforms single-value methods such as mean substitution. Pairing it with a complete-case sensitivity analysis shows whether your conclusions depend on the handling choice.

How do you deal with missing values in the collected data?

First tabulate how much is missing per variable and per case, and check whether the gaps cluster in a meaningful pattern. Decide on the missingness mechanism, then apply a method that suits it, preferring multiple imputation or maximum likelihood over crude deletion or mean substitution. Re-run distributional checks after imputation and keep a full record of every decision for reproducibility.

Handling missing data in your dissertation

Handling missing data means deciding, in a principled way, what to do about the gaps in your dataset so they do not bias your results. The defensible approach starts by classifying why values are missing, then choosing a method that fits, usually some form of imputation rather than simply deleting cases. Done carelessly, missing values quietly distort every analysis in your dissertation; done well, the handling becomes a strength your committee respects.

Why the reason for missingness decides your method

Before you fix anything, you have to name the mechanism behind the gaps. Data can be missing completely at random, where the gaps are unrelated to anything in the study; missing at random, where missingness depends only on other variables you did measure; or missing not at random, where it depends on the missing value itself, such as high earners declining to report income. The mechanism matters because a method that is safe under one assumption is biased under another. Stating which mechanism you believe applies, and why, is the part your assessors look for first.
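To see why the reason for missingness matters, the minimal Python sketch below simulates a hypothetical income variable and compares two of the three cases. The income distribution, the missingness probabilities, and the function name `simulate` are assumptions chosen for illustration, not values from any real study.

```python
import random
import statistics


def simulate(mechanism, n=20_000, seed=0):
    """Return (true_mean, observed_mean) of a hypothetical income variable."""
    rng = random.Random(seed)
    # Hypothetical incomes in thousands: mean 50, standard deviation 10.
    incomes = [rng.gauss(50, 10) for _ in range(n)]
    observed = []
    for income in incomes:
        if mechanism == "MCAR":
            # Everyone has the same chance of a missing value.
            p_missing = 0.3
        elif mechanism == "MNAR":
            # High earners decline to report far more often (assumed rates).
            p_missing = 0.6 if income > 55 else 0.1
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        if rng.random() >= p_missing:
            observed.append(income)
    return statistics.fmean(incomes), statistics.fmean(observed)


for mechanism in ("MCAR", "MNAR"):
    true_mean, observed_mean = simulate(mechanism)
    print(mechanism, round(true_mean, 2), round(observed_mean, 2))
```

In this sketch the observed mean stays close to the true mean when values are missing completely at random, but drifts downward when high earners drop out, which is the bias that simple deletion cannot repair.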

Your collected sample is what you use to make claims about the wider population, so gaps in that sample threaten every inference unless they are handled with care.

The diagram above is a reminder of what is at stake. Your sample is the bridge to the population, and missing values weaken that bridge by shrinking the sample and potentially skewing who remains in it. The same logic underpins descriptive and inferential statistics: if the sample is no longer representative, the inference built on it cannot be trusted.

The methods, from simplest to most defensible

The crudest option is listwise deletion, dropping any case with a missing value; it is acceptable only when data are missing completely at random and losses are small, because otherwise it discards statistical power and can introduce bias. Mean substitution, replacing gaps with the variable average, is widely discouraged because it understates variance and distorts relationships. The stronger families are multiple imputation, which generates several plausible complete datasets and pools the results to reflect the uncertainty of the gaps, and maximum likelihood estimation, which uses all available data directly. These two are the methods most likely to satisfy a methodologist.
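The claim that mean substitution understates variance can be checked with a short simulation. This is a minimal sketch using made-up test scores; the distribution, the 30% missingness rate, and the variable names are all assumptions for illustration.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical scores: mean 100, standard deviation 15.
scores = [rng.gauss(100, 15) for _ in range(1000)]

# Remove roughly 30% of values completely at random (None marks a gap).
with_gaps = [None if rng.random() < 0.3 else s for s in scores]

# Mean substitution: fill every gap with the mean of the observed values.
observed = [s for s in with_gaps if s is not None]
mean_observed = statistics.fmean(observed)
mean_filled = [mean_observed if s is None else s for s in with_gaps]

sd_observed = statistics.stdev(observed)
sd_filled = statistics.stdev(mean_filled)

print("SD of observed values:      ", round(sd_observed, 2))
print("SD after mean substitution: ", round(sd_filled, 2))
```

The filled-in values all sit exactly at the mean, so they add cases without adding spread. The standard deviation shrinks, and any standard error or correlation computed from the filled data inherits that false precision.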

Listwise vs pairwise deletion in practice

Deletion comes in two flavours that behave very differently, and confusing them is a common slip. Listwise deletion, also called complete-case analysis, drops a case entirely if it is missing any value used in the analysis, so every test runs on the same reduced sample. Pairwise deletion keeps a case for the specific calculations its available values allow, which means a correlation matrix can end up with each cell computed from a different subset of people and a different sample size. Pairwise preserves more data, but the shifting denominators can produce a matrix that is internally inconsistent and, in some cases, not positive semi-definite, which breaks procedures such as factor analysis or structural equation modelling that need a single coherent sample.
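The difference in sample sizes is easy to see on a toy dataset. The sketch below uses seven invented cases and three invented variables (`age`, `stress`, `sleep`); the data and names are hypothetical and serve only to show how the two rules count cases.

```python
from itertools import combinations

# Hypothetical survey rows; None marks a missing value.
rows = [
    {"age": 23, "stress": 4.0, "sleep": 7.5},
    {"age": 31, "stress": None, "sleep": 6.0},
    {"age": 27, "stress": 3.0, "sleep": None},
    {"age": None, "stress": 5.0, "sleep": 5.5},
    {"age": 45, "stress": 2.0, "sleep": 8.0},
    {"age": 38, "stress": 3.5, "sleep": 6.5},
    {"age": 52, "stress": 2.5, "sleep": None},
]
variables = ["age", "stress", "sleep"]

# Listwise: keep only rows with no gap in any analysis variable.
listwise = [r for r in rows if all(r[v] is not None for v in variables)]

# Pairwise: for each pair of variables, count rows where both are present.
pairwise_n = {
    (a, b): sum(1 for r in rows if r[a] is not None and r[b] is not None)
    for a, b in combinations(variables, 2)
}

print("listwise n:", len(listwise))
for pair, n in pairwise_n.items():
    print("pairwise n for", pair, ":", n)
```

Here listwise deletion leaves three cases for every calculation, while pairwise deletion uses five cases for one correlation and four for the others, so the cells of the resulting matrix describe slightly different groups of people.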

As a rule, neither deletion method is safe unless the data are missing completely at random and the losses are modest. When more than a small fraction is missing, both throw away statistical power your study cannot spare and can bias estimates. That is usually the point at which an imputation approach becomes the better choice for a defensible thesis.

Multiple imputation for missing data, step by step

Multiple imputation is the method most committees now expect, and it works in three clear stages. First, the imputation stage creates several complete copies of the dataset, filling each gap with a plausible value drawn from a model of the observed data, so the copies differ in a way that reflects genuine uncertainty. Second, the analysis stage runs your intended analysis, such as a regression model, separately on every copy. Third, the pooling stage combines the estimates using Rubin's rules: the pooled estimate is the average across copies, and its standard error combines the variance within each copy with the variance between copies, so it widens to acknowledge that the missing values were never truly known.
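The pooling stage is the easiest part to show in code. The sketch below implements Rubin's rules for a single coefficient, assuming the imputation and analysis stages have already produced one estimate and one squared standard error per imputed dataset. The function name `pool_rubin` and the example numbers are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the point estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    pooled = statistics.fmean(estimates)       # average of the estimates
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance
    total = within + (1 + 1 / m) * between     # total variance
    return pooled, math.sqrt(total)


# Hypothetical regression slope from five imputed datasets, each with SE 0.10.
slopes = [0.50, 0.54, 0.46, 0.52, 0.48]
squared_ses = [0.10 ** 2] * 5
pooled_slope, pooled_se = pool_rubin(slopes, squared_ses)
print(round(pooled_slope, 3), round(pooled_se, 4))
```

The pooled standard error is larger than the 0.10 reported by any single imputed dataset, because the disagreement between the copies is added to the total variance. Statistical packages with built-in multiple imputation do this pooling for you, along with the degrees-of-freedom adjustment that this sketch leaves out.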

Five to twenty imputed datasets are typical, with more needed when a larger share is missing. Include in your imputation model every variable that will appear in the analysis, plus auxiliary variables that predict missingness, since omitting them makes the missing-at-random assumption the method relies on less plausible. Reporting the number of imputations and the variables used signals that you applied the technique deliberately rather than as a black box.

Diagnosing and documenting the gaps

Practical handling starts in your software. Tabulate how much is missing per variable and per case, and look for whether the gaps cluster; this is a check to make while cleaning your data in SPSS and again when reading the SPSS output. Watch the interaction with your distributional checks too: imputing values changes the shape of a variable, so re-run any test for normality after imputation rather than relying on the result from before. Keep a clear record of how many values were missing, the mechanism you assumed, the method you chose, and the justification, because that audit trail is what makes the analysis reproducible.
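If you work outside SPSS, the same tabulation takes a few lines of Python. This is a minimal sketch on five invented cases; the variable names, the data, and the helper names are hypothetical, and a real project would load the rows from a file.

```python
from collections import Counter

# Hypothetical dataset; None marks a missing value.
rows = [
    {"id": 1, "age": 23, "income": 41, "stress": 4.0},
    {"id": 2, "age": 31, "income": None, "stress": None},
    {"id": 3, "age": 27, "income": None, "stress": 3.0},
    {"id": 4, "age": None, "income": 52, "stress": 5.0},
    {"id": 5, "age": 45, "income": None, "stress": None},
]
variables = ["age", "income", "stress"]

# Share of missing values per variable.
missing_by_variable = {
    v: sum(r[v] is None for r in rows) / len(rows) for v in variables
}

# Number of missing values per case.
missing_by_case = {
    r["id"]: sum(r[v] is None for v in variables) for r in rows
}

# Which combinations of variables are missing together, and how often.
patterns = Counter(
    tuple(v for v in variables if r[v] is None) for r in rows
)

print(missing_by_variable)
print(missing_by_case)
for pattern, count in patterns.most_common():
    print(pattern or "(complete)", count)
```

The pattern counts are the part worth reading closely. In this toy data, `income` and `stress` go missing together twice, which is the kind of clustering that should make you question an assumption of missing completely at random.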

Reporting your approach with confidence

In the write-up, state the extent of missingness, the assumed mechanism, the method you applied, and any sensitivity check you ran to confirm the conclusions do not hinge on it. A short, honest paragraph that says you used multiple imputation under a missing-at-random assumption and compared it against a complete-case analysis reads far better than silence. Missing data is not a flaw to hide; treated transparently, it is evidence that you understood your dataset well enough to defend it.