- Learn about the assumptions of linear regression
- Learn how to use diagnostic plots
- Learn about the limitations of statistical tests for assumption checking
Last updated in March 2021
Precision
Efficiency
All models are wrong, but some are useful.
- George Box
Helpfully sorted by Gelman et al. (2020) in order of (general) importance:
Validity
Representativeness
Linearity and additivity
Independence of errors
Homoscedasticity of errors
Normality of errors
If our goal is either prediction or inference, our data need to be a representative sample of the population of interest.
More specifically, we assume that the distribution of the dependent variable is representative of the population, given the predictors.
Consider a model predicting `income` based on `age`. What happens if our data are not representative?
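A small simulation can make the question concrete. The sketch below assumes a made-up population in which income rises linearly with age (all numbers are hypothetical) and compares the fitted slope in the full population, in a subset selected on the predictor, and in a subset selected on the outcome.

```r
set.seed(1)

# Simulated population (all values are made up for illustration)
n_pop  <- 100000
age    <- runif(n_pop, min = 20, max = 65)
income <- 10000 + 800 * age + rnorm(n_pop, sd = 10000)
pop    <- data.frame(age, income)

# Helper: slope of income on age in a given data set
slope <- function(d) unname(coef(lm(income ~ age, data = d))["age"])

slope_full  <- slope(pop)                        # whole population
slope_young <- slope(pop[pop$age < 40, ])        # selected on the predictor
slope_rich  <- slope(pop[pop$income > 50000, ])  # selected on the outcome

round(c(full = slope_full, young = slope_young, rich = slope_rich))
```

Under these assumptions, selecting on age leaves the slope close to the simulated value of 800, while selecting on income pulls it well below. In the second case the distribution of the outcome, given the predictor, is no longer representative.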
\[ y = \beta_0 + \beta_1*x_1 + \beta_2*x_2 + \dots + \beta_p*x_p \]
\[ y = \beta_0 + (\beta_1*x_1) * (\beta_2*x_2) \]
\[
\beta_0 + log[(\beta_1*x_1) * (\beta_2*x_2)] = \beta_0 + log(\beta_1*x_1) + log(\beta_2*x_2)
\]
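The log transformation above can be tried on simulated data. This is a minimal sketch that assumes a multiplicative process with multiplicative error (the constants and the error distribution are chosen here for illustration): the additive model on the raw scale is misspecified, while the model on the log scale is additive and linear.

```r
set.seed(2)
n  <- 500
x1 <- runif(n, min = 1, max = 10)
x2 <- runif(n, min = 1, max = 10)

# Multiplicative process: the effect of x1 on y depends on the level of x2
y <- 2 * x1 * x2 * exp(rnorm(n, sd = 0.2))

# Additive model on the raw scale (misspecified for this process)
mod_raw <- lm(y ~ x1 + x2)

# After taking logs: log(y) = log(2) + log(x1) + log(x2) + error
mod_log <- lm(log(y) ~ log(x1) + log(x2))

coef(mod_log)
```

In this setup the slopes of `mod_log` should land near 1 and the intercept near `log(2)`, which are the values used to simulate the data.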
Linear regression assumes that the errors are normally distributed.
Especially important for the prediction of individual observations.
Also important for inference on small samples.
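Normality of errors is usually inspected with a Q-Q plot of the residuals. The sketch below uses simulated data (an assumption, not data from this material) to contrast a model with normal errors and one with right-skewed errors.

```r
set.seed(3)
n <- 200
x <- rnorm(n)

y_normal <- 1 + 2 * x + rnorm(n)        # normal errors
y_skewed <- 1 + 2 * x + (rexp(n) - 1)   # right-skewed errors with mean zero

# Standardised residuals of both models
res_normal <- rstandard(lm(y_normal ~ x))
res_skewed <- rstandard(lm(y_skewed ~ x))

# Points close to the line suggest approximately normal residuals
op <- par(mfrow = c(1, 2))
qqnorm(res_normal, main = "Normal errors"); qqline(res_normal)
qqnorm(res_skewed, main = "Skewed errors"); qqline(res_skewed)
par(op)
```

With skewed errors, the points typically bend away from the line in the upper tail.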
\[ leverage_i = \frac{\partial \hat{y_i}}{\partial y_i} = \frac{partial\:change\:in\:expected\:y_i}{partial\:change\:in\:observed\:y_i} \]
\[ Cook's\:distance_i=\frac{residual_i^2}{number\:of\:parameters*MSE}*\left[\frac{leverage_i}{(1-leverage_i)^2}\right] \]
- An observation will generally have a high Cook’s distance if both its residual and its leverage are high
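Both quantities are available in base R through `hatvalues()` and `cooks.distance()`. The sketch below uses simulated data with one deliberately unusual observation (an assumption for illustration) and checks the two formulas above by hand.

```r
set.seed(4)
n <- 50
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
x[n] <- 6; y[n] <- -4   # one unusual observation: extreme x, far off the trend

mod <- lm(y ~ x)
lev <- hatvalues(mod)                  # leverage of each observation
res <- residuals(mod)                  # raw residuals
p   <- length(coef(mod))               # number of parameters (incl. intercept)
mse <- sum(res^2) / df.residual(mod)   # mean squared error

# Cook's distance computed from the formula
cook_manual <- res^2 / (p * mse) * lev / (1 - lev)^2
all.equal(cook_manual, cooks.distance(mod))

# Leverage as a derivative: shift y_n by 1 and see how much its fitted value moves
y_shift    <- y
y_shift[n] <- y[n] + 1
fit_change <- fitted(lm(y_shift ~ x))[n] - fitted(mod)[n]
c(fit_change = unname(fit_change), leverage = unname(lev[n]))

which.max(cooks.distance(mod))         # most influential observation
```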
Model infant mortality (`infantMortality`) by total fertility rate (`tfr`) and GDP per capita (`GDPperCapita`):
`mod_res = lm(infantMortality ~ tfr + GDPperCapita, data = un)`
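The `un` data are not included here, so the sketch below builds a simulated stand-in with the same column names. All values are hypothetical, and the assumed non-linearity in `GDPperCapita` is a choice made for illustration, not a finding about the real data. It draws a residual plot and, using base R only, partial residuals for each predictor.

```r
set.seed(6)
n <- 200

# Simulated stand-in for the `un` data (hypothetical values, same column names)
un <- data.frame(
  tfr          = runif(n, min = 1.2, max = 7),
  GDPperCapita = exp(rnorm(n, mean = log(3000), sd = 1.3))
)
# Assumption for illustration: linear in tfr, linear in log(GDPperCapita)
un$infantMortality <- 100 + 10 * un$tfr - 10 * log(un$GDPperCapita) +
  rnorm(n, sd = 5)

mod_res <- lm(infantMortality ~ tfr + GDPperCapita, data = un)

# Residuals vs fitted values: a curved pattern suggests non-linearity
plot(fitted(mod_res), resid(mod_res),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Partial residuals: one column per predictor
pr <- residuals(mod_res, type = "partial")
op <- par(mfrow = c(1, 2))
for (v in colnames(pr)) {
  plot(un[[v]], pr[, v], xlab = v, ylab = "Partial residual")
  lines(lowess(un[[v]], pr[, v]), col = "red")  # smoothed trend
}
par(op)
```

A roughly straight smoothed trend in a panel is consistent with linearity in that predictor; a clearly bent trend points to the predictor that needs attention.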
`tfr`? `GDPperCapita`? Both?
Component-plus-residual plots (`crPlots` from the `car` package)
Some researchers prefer testing assumptions formally, using statistical tests
This is universally a bad idea and should be avoided
Two reasons:
“We should use normality tests for small samples, in large samples they are too sensitive!”
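The claim quoted above can be probed with a small simulation. The sketch below is one illustration under assumptions chosen here (the Shapiro-Wilk test, the two distributions, and the sample sizes are not taken from this material): it estimates how often the test rejects normality in a small, strongly skewed sample and in a large, only mildly heavy-tailed one.

```r
set.seed(5)
n_sim <- 200

# Share of simulated samples in which Shapiro-Wilk rejects at the 5% level
reject_rate <- function(draw) {
  mean(replicate(n_sim, shapiro.test(draw())$p.value < 0.05))
}

# Small sample, strongly skewed distribution
rate_small <- reject_rate(function() rexp(10))

# Large sample, only mildly heavy-tailed distribution
rate_large <- reject_rate(function() rt(5000, df = 10))

c(small_skewed = rate_small, large_mild = rate_large)
```

If the test misses many of the small skewed samples yet flags nearly all of the large mild ones, its verdict tracks sample size more than the practical severity of the violation.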
Assumption | When violated | How to check |
---|---|---|
Validity | biased estimates | good study design |
Representativeness | biased estimates | good study design |
Linearity and additivity | biased estimates | residual plot |
Independence of errors | biased inference | good study design |
Homoscedasticity of errors | biased inference, inefficient estimates | (scaled) residual plot |
Normality of errors | biased individual predictions, biased inference (small samples) | Q-Q plot, histogram |
(no influential obs.) | (biased estimates) | Cook’s distance, leverage |
Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press. https://doi.org/10.1017/9781139161879