Last updated in March 2021

Goals for today

  • Learn about the assumptions of linear regression
  • Learn how to use diagnostic plots
  • Learn about the limitations of statistical tests for assumption checking

Bias and random error (recap)

  • Total error composed of two parts: bias and random error


  • Bias - systematic error, i.e. systematic under- or overestimation of the expected values
  • Random error - random deviations from the true value (due to sampling and measurement error)


  • Violations of assumptions can increase either or both

Precision and efficiency

  • Precision

    • reciprocal to the random error (high precision = low random error)


  • Efficiency

    • how many observations we need to achieve a specific level of performance
    • e.g. a model based on data that meet its assumptions requires fewer observations to achieve 80% statistical power, i.e. it is more efficient

How to think about assumptions

All models are wrong, but some are useful.
- George Box

  • No model is a perfect picture of reality, and that’s OK.
  • The question is: how good is the picture?

How to think about assumptions

  • Small violations lead to only small errors; big ones…


  • Don’t think about assumptions in binary terms (“ok” vs “bad”)
  • Instead, think about how much we deviate from them


  • Not all assumptions are equally important for all tasks

What are the assumptions?

Assumptions of linear regression

  • Helpfully sorted by Gelman et al. (2020) in order of (general) importance:

    1. Validity

    2. Representativeness

    3. Linearity and additivity

    4. Independence of errors

    5. Homoscedasticity of errors

    6. Normality of errors

Validity

  1. Are the data we use sufficient to answer the research question at hand?
  2. Are all traits measured with sufficient quality?
  3. Are the data gathered so that inference is possible?
  4. Is the model correctly specified?


  • Violation of this assumption will lead to bias

Representativeness

  • If our goal is either prediction or inference, our data need to be a representative sample of the population of interest

  • More specifically, we assume that the distribution of the dependent variable is representative of the population, given the predictors

Representativeness

  • Do the data need to be representative in all aspects?


  • Consider a model predicting income based on age. What happens if our data are not representative?

    • More specifically, what if only people with above average age/income participate in our survey?

Representativeness

Representativeness

  • Estimates are still unbiased even when the sample is not representative in age (the independent variable).

Linearity and additivity

  • Two ways people talk about linearity:
    • Linearity of relationship between variables
    • Linearity of regression terms


  • What is the difference?

Linearity of relationships

  • Simply put: the relationship between the variables is a straight line.

Linearity of forms

  • Linear models are linear because they assume a linear form:

\[ y = \beta_0 + \beta_1*x_1 + \beta_2*x_2 + ... + \beta_p*x_p \]


  • Not all relationships are necessarily linear though:

\[ y = \beta_0 + (\beta_1*x_1) * (\beta_2*x_2) \]


  • A model is linear if its terms (each coefficient multiplied by a predictor) are only added to or subtracted from each other
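For example, a model that is curved in x can still be linear in this sense, because the coefficients only enter additively. A minimal sketch with simulated data (x, y and mod_quad are made-up names):

set.seed(1)
x <- runif(100, -2, 2)
y <- 1 + 2 * x + 3 * x^2 + rnorm(100)   # curved relationship between x and y
mod_quad <- lm(y ~ x + I(x^2))          # still a linear model: linear in beta_0, beta_1, beta_2
summary(mod_quad)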

Linearity and additivity

  • Some nonlinear models can be translated into linear form using an appropriate transformation (a small R sketch follows this list):

\[ \beta_0 + log[(\beta_1*x_1) * (\beta_2*x_2)] = \beta_0 + log(\beta_1*x_1) + log(\beta_2*x_2) \]

  • Linear models can only capture relationships that fulfill this assumption
  • Violation of this assumption leads to bias
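As a small illustration of such a transformation (a hypothetical multiplicative model with simulated data, not the exact equation above): a relationship that is multiplicative on the original scale becomes additive, and hence fittable with lm, after taking logarithms.

set.seed(1)
x1 <- runif(200, 1, 10)
x2 <- runif(200, 1, 10)
y  <- 2 * x1^0.5 * x2^1.5 * exp(rnorm(200, sd = 0.1))   # multiplicative on the original scale

mod_log <- lm(log(y) ~ log(x1) + log(x2))   # additive and linear after the log transformation
coef(mod_log)                               # the slopes should be close to 0.5 and 1.5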

Linearity and additivity

  • We assume that the effects of independent variables can be simply added together.
  • Sometimes this may not be true, e.g. when an interaction is needed.


  • Violation of this assumption leads to bias.

Independence of errors

  • Linear regression assumes that errors are independent of each other.


  • Often violated due to sampling design, or in time series and spatial analyses.


  • Violation of this assumption leads to incorrect estimation of standard errors, and therefore presents a risk to inference

Homoscedasticity (constant variance of errors)

  • Homoscedasticity = the variance of the residuals is the same for all predicted values of \(Y\)
  • If the variance of the errors is not constant, the errors are heteroscedastic.

Homoscedasticity (constant variance of errors)

Example of A) homoscedastic data B) heteroscedastic data


Homoscedasticity (constant variance of errors)

  • The violation of this assumption has two consequences.


  • Firstly, the estimation of the regression coefficients becomes inefficient.
  • Secondly, the standard errors of the coefficients will be biased

Normality of errors

  • Linear regression assumes that the errors are normally distributed.

  • Especially important for the prediction of individual observations.

  • Also important for inference on small samples.


  • Violation of this assumption leads to incorrect predictions for individual observations and to incorrect p values and confidence intervals in small samples.

Diagnostic plots

Diagnostic plots

  • The main, and arguably best, tools for checking assumptions are diagnostic plots.

Diagnostic plots - Linearity

  • The assumption of linearity can be checked by plotting predicted values against residuals (usually both are Z transformed)

Diagnostic plots - Linearity

  • Ideally, there should be no pattern in the plot; a nonlinear pattern suggests a nonlinear relationship that hasn’t been accounted for
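In R, this plot can be produced directly from a fitted model; a minimal sketch (mod is a hypothetical lm object):

plot(mod, which = 1)                # residuals vs fitted values
# or by hand:
plot(fitted(mod), rstandard(mod))   # standardized residuals against fitted values
abline(h = 0, lty = 2)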

Diagnostic plots - Homoscedasticity

  • Can be checked using the same plot as for linearity, or using the scale-location plot.
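A minimal sketch (again assuming a fitted lm object called mod):

plot(mod, which = 3)   # scale-location plot: sqrt(|standardized residuals|) vs fitted values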

Diagnostic plots - Normality of residuals

  • Can be checked either with a histogram or with a quantile-quantile (Q-Q) plot
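Both can be produced from a fitted lm object; a minimal sketch (mod is hypothetical):

plot(mod, which = 2)   # normal Q-Q plot of the standardized residuals
hist(residuals(mod))   # histogram of the raw residuals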

R practice!

Diagnostic plots - influential observations

  • Not strictly an assumption, but still important.
  • Influential observations can distort the relationship between variables

Diagnostic plots - influential observations

  • A basic measure of influence is leverage
  • leverage = how big a role an observation plays when fitting a regression line
  • More precisely, how far the observation lies from the centre of the data

Diagnostic plots - influential observations

  • The further the observation from the centre, the higher its leverage

Diagnostic plots - influential observations

  • Leverage formally:

\[ leverage_i = \frac{\partial \hat{y_i}}{\partial y_i} = \frac{partial\:change\:in\:expected\:y_i}{partial\:change\:in\:observed\:y_i} \]

  • The bigger the effect that changing the observed value would have on the expected value, the bigger that observation’s leverage


  • However, just because an observation has high leverage does not necessarily mean it will distort our model
  • Only observations with both high leverage and high residuals distort the model
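In R, leverage (hat) values can be extracted from a fitted model; a minimal sketch (mod is hypothetical, and the cut-off below is just one common rule of thumb):

h <- hatvalues(mod)      # leverage of each observation
which(h > 2 * mean(h))   # flag observations with more than twice the average leverage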

Diagnostic plots - influential observations

Plots showing the effect of A) high leverage, low residual B) low leverage, high residual C) high leverage, high residual


Diagnostic plots - influential observations

  • The information about leverage and residuals can be summarized using Cook’s distance:

\[ Cook's\:distance_i=\frac{residual_i^2}{number\:of\:parameters*MSE}*\left[\frac{leverage_i}{(1-leverage_i)^2}\right] \]

  • An observation will generally have a high Cook’s distance if both its residual and its leverage are high
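A minimal R sketch (mod is again a hypothetical fitted lm object):

cooks.distance(mod)    # Cook's distance for every observation
plot(mod, which = 4)   # Cook's distance plot
plot(mod, which = 5)   # standardized residuals vs leverage, with Cook's distance contours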

Partial residual plots

Partial residual plots

  • Classic diagnostic plots allow us to check the model as a whole
  • But if something looks wrong, how can we tell which variable is the source of the problem?
  • Consider a model predicting infant mortality rate (infantMortality) by total fertility rate (tfr) and GDP per capita (GDPperCapita):
mod_res = lm(infantMortality ~ tfr + GDPperCapita, data = un)
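Assuming the un data frame is already loaded, the standard diagnostic plots for this model can then be drawn with:

par(mfrow = c(2, 2))   # arrange the four default diagnostic plots in a 2x2 grid
plot(mod_res)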

Partial residual plots

  • From the residual plot, there seems to be a problem with heteroscedasticity and linearity, but which of the predictors is problematic? tfr? GDPperCapita? Both?

Partial residual plots

  • We can produce a partial residual plot (also known as a component+residual plot) for each predictor (e.g. the crPlots function from the car package)
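For the model above, a minimal sketch (the car package needs to be installed):

library(car)       # provides crPlots()
crPlots(mod_res)   # one partial (component + residual) plot per predictor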

But what about tests?

Diagnostic tests and why not to use them

  • Some researchers prefer testing assumptions formally, using statistical tests

    • Shapiro-Wilk test for normality, Levene’s test for homoscedasticity, etc.


  • This is universally a bad idea and should be avoided

  • Two reasons:

    • In real data, most (all?) assumptions literally cannot be absolutely true
    • It is not necessary to fulfill the assumptions to the letter

Diagnostic tests and why not to use them

  • Consider the normal distribution
  • In real life, the normal distribution doesn’t exist -> we know that our residuals don’t follow a normal distribution before we even run a test
  • Consequently, all observed nonsignificant results are false negatives
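A small simulated sketch of this point (the t distribution with 10 degrees of freedom stands in for a distribution that is close to, but not exactly, normal; the deviation is identical in both samples, only the sample size changes):

set.seed(1)
x_small <- rt(50,   df = 10)    # close to normal, small sample
x_large <- rt(5000, df = 10)    # same distribution, large sample
shapiro.test(x_small)$p.value   # typically nonsignificant - a false negative
shapiro.test(x_large)$p.value   # typically significant - the same deviation, now detected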

Diagnostic tests and why not to use them

  • Some say:

“We should use normality tests for small samples, in large samples they are too sensitive!”

  • This misunderstands the problem
  • The problem is the hypothesis, not the test


  • Tests of equal variance are somewhat more plausible, but still highly unrealistic

Diagnostic tests and why not to use them

  • Luckily for us, we don’t need all the assumptions to be fulfilled perfectly
  • Small deviations from the assumptions lead to only small errors -> for practical purposes, our model will still work fine

Summary

Summary

Assumption                  | When violated                                             | How to check
Validity                    | biased estimates                                          | good study design
Representativeness          | biased estimates                                          | good study design
Linearity and additivity    | biased estimates                                          | residual plot
Independence of errors      | biased inference                                          | good study design
Homoscedasticity of errors  | biased inference, inefficient estimates                   | (scaled) residual plot
Normality of errors         | biased ind. prediction, biased inference (small samples)  | Q-Q plot, histogram
(no influential obs.)       | (biased estimates)                                        | Cook’s distance, leverage

References

Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge University Press.