What are the 4 assumptions of regression?

The four core assumptions are linearity, independence of errors, homoscedasticity (constant variance of errors), and normality of the residuals. Together they ensure the coefficients, standard errors, and p-values from the model can be trusted. You check each with a specific diagnostic, such as a residual plot or a Q-Q plot.

What are the 5 assumptions of linear regression?

The five assumptions are the four core ones (linearity, independence of errors, homoscedasticity, and normality of residuals) plus the absence of severe multicollinearity among the predictors. The fifth is checked with the variance inflation factor. Meeting all five lets you interpret the model with confidence.

What are the 4 assumptions of Gauss Markov?

The Gauss-Markov assumptions are a linear model, an error term with a mean of zero, constant error variance (homoscedasticity), and errors that are uncorrelated with one another. When they hold, ordinary least squares gives the best linear unbiased estimates. Notably, Gauss-Markov does not require the errors to be normally distributed.

What are the 5 assumptions of OLS?

For ordinary least squares the usual five are linearity in the parameters, a random or independent sample, no perfect multicollinearity, a zero conditional mean of the errors, and homoscedasticity. Adding normality of the errors lets you justify the p-values and confidence intervals. These conditions underpin valid inference from the model.

Linear regression assumptions, checked

Linear regression assumptions are the conditions your data must meet for the model's coefficients, standard errors, and p-values to be trustworthy. The core set is linearity, independence of errors, homoscedasticity, and normality of residuals, with no severe multicollinearity often added as a fifth. Check them before you interpret a single coefficient, because a violated assumption can quietly invalidate your whole results chapter.

Why your committee asks you to check assumptions first

A regression will always print numbers, whether or not the assumptions hold, so the output looks convincing even when it is misleading. Your supervisor knows this, which is why a credible results chapter shows the diagnostic checks before the interpretation. Reporting a tidy table of coefficients without evidence that the residuals behave is a common way to lose marks. Treat the checks as part of the analysis, not an optional appendix.

The assumptions, one diagnostic at a time

Linearity means the relationship between each predictor and the outcome is a straight line; you check it with a plot of residuals against fitted values, looking for a flat band with no curve. Independence of errors means one observation's error does not predict the next, which matters most for time-ordered or clustered data and is tested with a Durbin-Watson statistic. Homoscedasticity means the spread of residuals stays constant across fitted values; a funnel shape in that same residual plot signals the heteroscedasticity you want to avoid. Normality of residuals means the errors follow a normal distribution, which you judge with a Q-Q plot of the residuals rather than of the raw outcome variable.
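As a rough illustration of the independence check, the Durbin-Watson statistic can be computed directly from the residuals in their time order. The sketch below is plain Python; the function name `durbin_watson` and both toy residual series are assumptions made for illustration, not output from a real model. Values near 2 suggest little first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals (residuals in time order)."""
    e = [float(r) for r in residuals]
    if len(e) < 2:
        raise ValueError("need at least two residuals")
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(r ** 2 for r in e)
    return num / den


# Toy residual series (hypothetical, for illustration only).
drifting = [1.0, 0.9, 0.8, 0.6, 0.3, -0.2, -0.5, -0.8, -0.9, -1.0]      # neighbours alike
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbours opposite

dw_drifting = durbin_watson(drifting)        # well below 2: positive autocorrelation
dw_alternating = durbin_watson(alternating)  # well above 2: negative autocorrelation
print(round(dw_drifting, 2), round(dw_alternating, 2))
```

The statistic only picks up correlation between neighbouring observations, so the residuals must be sorted in the order the data were collected before it means anything.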

A Q-Q plot of the residuals: points hugging the diagonal indicate the normality assumption is met, while systematic departure at the tails flags a problem.

Read the Q-Q plot above as the assumption-check for normality: when the points track the dashed reference line, the residuals are close to normal. Bends away from the line, especially at the tails, point to skew or heavy tails. This is the same logic you use when testing a variable for normality, except here you apply it to the model's residuals, not to the raw data.
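To make the reading of a Q-Q plot concrete, the sketch below builds the two sets of quantiles that such a plot compares and summarises how closely they follow a straight line. It uses only the Python standard library. The function names `qq_points` and `qq_correlation`, the plotting positions (i - 0.5) / n, and both sets of example residuals are assumptions chosen for illustration; statistical packages may use slightly different conventions.

```python
from statistics import NormalDist, mean, pstdev
import math


def qq_points(residuals):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    obs = sorted(float(r) for r in residuals)
    n = len(obs)
    m, s = mean(obs), pstdev(obs)
    observed = [(r - m) / s for r in obs]  # standardised, sorted residuals
    # Plotting positions (i - 0.5) / n: one common convention, assumed here.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed


def qq_correlation(residuals):
    """Correlation between theoretical and observed quantiles;
    close to 1 means the points lie near a straight line."""
    t, o = qq_points(residuals)
    mt, mo = mean(t), mean(o)
    cov = sum((a - mt) * (b - mo) for a, b in zip(t, o))
    var_t = sum((a - mt) ** 2 for a in t)
    var_o = sum((b - mo) ** 2 for b in o)
    return cov / math.sqrt(var_t * var_o)


# Hypothetical residuals: an idealised normal set and a right-skewed set.
grid = [NormalDist().inv_cdf((i - 0.5) / 100) for i in range(1, 101)]
normal_like = grid
skewed = [math.exp(z) for z in grid]

r_normal = qq_correlation(normal_like)
r_skewed = qq_correlation(skewed)
print(round(r_normal, 3), round(r_skewed, 3))
```

The correlation is only a summary; plotting `theoretical` against `observed` still shows where the departure happens, which a single number cannot.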

Checking linearity and homoscedasticity in one plot

The single most useful diagnostic is the scatter of residuals against fitted values, because it carries the evidence for two assumptions at once. To judge linearity, look at the central tendency of the cloud: a roughly flat, horizontal band says the straight-line relationship holds, while a bowed or U-shaped pattern says a predictor enters the model non-linearly and probably needs a squared term or a transformation. To judge homoscedasticity, look at the vertical spread of the points from left to right. Equal scatter across the range of fitted values is what you want; a fan or funnel that widens as fitted values grow is the classic sign of heteroscedasticity. Many students confuse this spread with simple noise, so it helps to overlay a smoothed reference line. The same eye for pattern serves you when reading a regression table in SPSS, where the plots sit beside the coefficients.
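The two readings of the residual plot can also be checked numerically. The sketch below assumes NumPy is available, fits a straight line to two made-up data sets, and summarises the residuals in bands of increasing fitted value: band means that swing away from zero point to curvature, and band standard deviations that grow point to a funnel. The helper names `fit_line` and `band_summary`, the choice of three bands, and the data are all illustrative assumptions.

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least squares fit of y on x with an intercept.
    Returns fitted values and residuals."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted


def band_summary(fitted, resid, n_bands=3):
    """Split observations into bands of increasing fitted value and return
    (mean, standard deviation) of the residuals in each band."""
    order = np.argsort(fitted)
    bands = np.array_split(resid[order], n_bands)
    return [(float(b.mean()), float(b.std())) for b in bands]


x = np.linspace(0, 10, 120)
signs = np.where(np.arange(x.size) % 2 == 0, 1.0, -1.0)  # deterministic toy "noise"

# Case 1 (hypothetical): a squared term the straight line misses.
y_curved = 1.0 + 0.5 * x + 0.3 * (x - 5) ** 2
curved = band_summary(*fit_line(x, y_curved))

# Case 2 (hypothetical): error size grows with x, giving a funnel.
y_fan = 2.0 + 1.5 * x + 0.2 * x * signs
fan = band_summary(*fit_line(x, y_fan))

print("curved, band means:", [round(m, 2) for m, _ in curved])
print("funnel, band SDs:  ", [round(s, 2) for _, s in fan])
```

With matplotlib installed, `plt.scatter(fitted, resid)` followed by `plt.axhline(0)` draws the plot itself; the band summary is just a way to put numbers on what the eye sees.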

Multicollinearity and the variance inflation factor

Multicollinearity occurs when two or more predictors carry overlapping information, which inflates standard errors and makes individual coefficients unstable and hard to interpret. The standard diagnostic is the variance inflation factor, the amount by which the variance of a coefficient is enlarged because that predictor is correlated with the others. A common rule of thumb treats a value above 5 as a warning and above 10 as a serious problem, though some fields use a stricter cut of 2.5. When you spot inflated values, options include dropping one of a redundant pair, combining them into a single index, or centring the component variables before you form an interaction term. Inspecting the predictor correlations first, much as you would when reading a correlation matrix from the correlation calculator, often reveals the culprit before you ever run the model.
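The definition above translates into a short calculation: regress each predictor on all the others and take 1 / (1 - R²). The sketch below assumes NumPy; the function name `variance_inflation_factors`, the simulated predictors `x1`, `x2`, `x3`, and the random seed are illustrative assumptions, and the warning labels simply reuse the 5 and 10 rules of thumb from the text.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X (predictors only, no intercept column).
    Each predictor is regressed on the others; VIF = 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r_squared = 1.0 - resid.var() / target.var()
        vifs.append(float(1.0 / (1.0 - r_squared)))
    return vifs


# Hypothetical predictors: x3 is almost a copy of x1, x2 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
for name, v in zip(["x1", "x2", "x3"], vifs):
    flag = "serious" if v > 10 else "warning" if v > 5 else "ok"
    print(f"{name}: VIF = {v:.1f} ({flag})")
```

In this toy setup the redundant pair `x1` and `x3` both show inflated values while `x2` stays close to 1, which is the pattern that suggests dropping or combining one of the pair.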

The fifth assumption and the Gauss-Markov conditions

The fifth item people add is no severe multicollinearity: your predictors should not be so highly correlated that the model cannot separate their effects, which you check with the variance inflation factor. The narrower Gauss-Markov conditions, which guarantee that ordinary least squares gives the best linear unbiased estimates, are a related but distinct list: a linear model, a zero-mean error term, constant error variance, and uncorrelated errors. Gauss-Markov does not require normality at all; normality is what you add to justify the p-values and confidence intervals. If the outcome itself is binary, none of this applies as stated, and you are drifting toward interpreting a logistic regression instead.
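For readers who prefer notation, the four Gauss-Markov conditions listed above can be written compactly as follows. The symbols are conventional choices rather than anything fixed by this guide: y is the outcome, X the matrix of predictors, β the coefficients, ε the errors, and σ² the common error variance.

```latex
\begin{aligned}
y &= X\beta + \varepsilon && \text{(linear model)} \\
\mathbb{E}[\varepsilon_i \mid X] &= 0 && \text{(zero-mean errors)} \\
\operatorname{Var}(\varepsilon_i \mid X) &= \sigma^2 && \text{(constant error variance)} \\
\operatorname{Cov}(\varepsilon_i, \varepsilon_j \mid X) &= 0 \quad (i \neq j) && \text{(uncorrelated errors)}
\end{aligned}
```

No line in this list says anything about the shape of the error distribution, which is why normality counts as a separate, additional assumption.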

What to do when an assumption fails

A failed check is a finding, not a dead end. Curvature in the residuals often responds to a transformation or a polynomial term; heteroscedasticity can be handled with robust standard errors; influential points show up in leverage diagnostics, which you will meet when reading the SPSS output, and may need removal with a stated justification. Remember too that a strong correlation between predictors is about association, and association is not the same thing as the causal story you tell, the distinction at the heart of correlation versus causation. Document every check, every fix, and the reasoning behind it, and your regression will read as careful rather than convenient.
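To show what robust standard errors change, the sketch below fits a one-predictor model with NumPy and computes both the classical standard errors and a heteroscedasticity-robust "sandwich" version. The HC1 variant is used here as one common choice, not as the only option; the function name `ols_with_robust_se` and the toy data, whose error size grows with x, are assumptions for illustration.

```python
import numpy as np


def ols_with_robust_se(x, y):
    """Fit y = b0 + b1*x by OLS and return the coefficients with classical
    and heteroscedasticity-robust (HC1) standard errors."""
    X = np.column_stack([np.ones_like(x), x])
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta

    # Classical: one pooled error variance shared by every observation.
    sigma2 = resid @ resid / (n - k)
    classical_se = np.sqrt(np.diag(sigma2 * xtx_inv))

    # Robust sandwich: each observation contributes its own squared residual.
    meat = X.T @ (X * resid[:, None] ** 2)
    robust_cov = xtx_inv @ meat @ xtx_inv * n / (n - k)  # HC1 scaling
    robust_se = np.sqrt(np.diag(robust_cov))
    return beta, classical_se, robust_se


# Hypothetical data: the error grows in proportion to x (a funnel).
x = np.linspace(0, 10, 200)
signs = np.where(np.arange(x.size) % 2 == 0, 1.0, -1.0)  # deterministic toy "noise"
y = 2.0 + 3.0 * x + 0.5 * x * signs

beta, classical_se, robust_se = ols_with_robust_se(x, y)
print("slope:", round(float(beta[1]), 3))
print("classical SE:", round(float(classical_se[1]), 4))
print("robust SE:   ", round(float(robust_se[1]), 4))
```

The coefficients are identical under both approaches; only the standard errors, and therefore the p-values and confidence intervals, change.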

Normality of residuals: what it does and does not require

A persistent misconception is that your outcome variable must be normally distributed. The assumption is narrower: it is the residuals, the leftover errors after the model is fitted, that should be approximately normal, and only so that the p-values and confidence intervals are valid. You assess it visually with a Q-Q plot or a histogram of the residuals, and formally with a Shapiro-Wilk test, applying the same logic covered in how to test a variable for normality. There is good news for larger studies. Because of the central limit theorem, the sampling distribution of the coefficients approaches normality as the sample grows, so mild departures matter little once you have a few hundred cases. Reserve your concern for small samples or residuals with heavy skew, where a transformation of the outcome usually brings the distribution back into line.
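The central limit theorem argument can be checked with a small simulation. The sketch below assumes NumPy and uses deliberately skewed errors, then looks at the distribution of the fitted slope across many simulated samples. The sample size of 300, the 2,000 repetitions, the true slope of 3, the seed, and the exponential error distribution are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_sims, true_slope = 300, 2000, 3.0

x = np.linspace(0, 10, n)  # fixed predictor values
w = x - x.mean()           # centred predictor

# Strongly right-skewed errors with mean zero (exponential shifted by 1).
errors = rng.exponential(scale=1.0, size=(n_sims, n)) - 1.0

# OLS slope for each simulated sample: true slope + sum(w * e) / sum(w^2).
slopes = true_slope + errors @ w / (w @ w)


def skewness(values):
    """Sample skewness: the third standardised moment."""
    z = (values - values.mean()) / values.std()
    return float((z ** 3).mean())


error_skew = skewness(errors.ravel())
slope_skew = skewness(slopes)
print("skewness of the errors:         ", round(error_skew, 2))
print("skewness of the slope estimates:", round(slope_skew, 2))
```

Under these assumptions the errors are heavily skewed while the slope estimates are close to symmetric, which is the sense in which mild non-normality matters little in samples of a few hundred. The simulation says nothing about small samples, where the concern in the text still applies.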