Last updated in April 2022

Goals for Today

  • Learn how to deal with heteroscedasticity
  • Learn how to deal with dependent errors

Heteroskedasticity

Heteroskedasticity

  • Heteroskedasticity is relatively common, e.g. in bounded data.
mod1 = lm(infantMortality ~ tfr, data = un)
plot(mod1, which = 1)

Effects of Heteroskedasticity

  • When the assumption of homoscedasticity is broken, two things happen:


  1. Regression coefficients’ estimates become less efficient.
  2. Standard errors will be biased (typically producing too-narrow confidence intervals and too-low p-values).


  • To solve this, let’s remind ourselves how variance works.

When Homoscedasticity holds

When Homoscedasticity holds

  • If we assume homoskedasticity, we can estimate the variance of residuals for every value of x by computing the variance of all residuals at once:

\[ SE_\beta^2 = \frac{\sigma^2 \sum(x_i - \bar{x})^2}{[\sum(x_i - \bar{x})^2]^2} = \frac{\sigma^2}{\sum(x_i - \bar{x})^2} \]


  • \(SE_\beta\) = Standard error of regression coefficient
  • \(x_i\) = Observation i of the variable x
  • \(\bar{x}\) = Mean value of x
  • \(\sigma^2\) = Variance of residuals
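The pooled-variance formula can be checked by hand. A minimal sketch, assuming `mod1` and the `un` data from the earlier slide are loaded and that `tfr` and `infantMortality` have no missing values (otherwise subset to complete cases first):

```r
# Homoskedastic SE of the slope computed by hand
sigma2  <- sum(resid(mod1)^2) / df.residual(mod1)       # pooled residual variance
se_beta <- sqrt(sigma2 / sum((un$tfr - mean(un$tfr))^2))
se_beta  # should match the SE for tfr reported by summary(mod1)
```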

Wild Heteroskedasticity Appears!

  • What if the variances are not identical?

Wild Heteroskedasticity Appears!

  • The overall variance doesn’t match any single group.

Wild Heteroskedasticity Appears!

  • If the assumption of homoscedasticity is violated, we cannot use the total residual variance as an estimate for individual levels.

Robust Standard Errors

  • We simply go through the work of estimating residual variance for each level.

  • From the classic formula:

\[ SE_\beta^2 = \frac{\sigma^2 \sum(x_i - \bar{x})^2}{[\sum(x_i - \bar{x})^2]^2} \]

  • To a new one not assuming homoscedasticity:

\[ SE_\beta^2 = \frac{\sum(x_i - \bar{x})^2 \sigma_i^2}{[\sum(x_i - \bar{x})^2]^2} \]

Robust Standard Errors

\[ SE_\beta^2 = \frac{\sum(x_i - \bar{x})^2 \sigma_i^2}{[\sum(x_i - \bar{x})^2]^2} \]


  • (Heteroscedastic) Robust Standard Error
  • White’s Standard Error
  • Sandwich Error
  • HC0
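To make the formula concrete, here is a minimal sketch of HC0 on simulated heteroskedastic data. The squared residuals \(e_i^2\) act as plug-in estimates of \(\sigma_i^2\); all variable names are illustrative:

```r
# HC0 robust SE of the slope, computed directly from the formula above
set.seed(1)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200, sd = 1 + abs(x))   # error variance grows with |x|
fit <- lm(y ~ x)
e   <- resid(fit)
xc  <- x - mean(x)
se_hc0 <- sqrt(sum(xc^2 * e^2) / sum(xc^2)^2)  # e_i^2 stands in for sigma_i^2
```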

Variants of Robust Errors

  • The previous formula is the original approach (HC0).

  • It works well in large samples but is biased in small ones.

  • Various corrections proposed:

    • HC1 (original correction, probably the worst one)
    • HC2
    • HC3


  • In R, all of these are implemented in the estimatr package
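Continuing with the model from the first slide, robust errors take one extra argument in estimatr’s `lm_robust()` (this assumes the `un` data from earlier is loaded):

```r
library(estimatr)
# Same model as before, now with robust standard errors;
# se_type accepts "HC0", "HC1", "HC2" (the default), "HC3", ...
mod_robust <- lm_robust(infantMortality ~ tfr, data = un, se_type = "HC2")
summary(mod_robust)
```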

InteRmezzo!

What Type of SE to use?

  • For moderately large sample sizes (hundreds and more), little difference between HC1-HC3
  • For very small samples (e.g. 30 observations), HC3 seems best

Why Not Use Robust SE All the Time?

Why Not Use Robust SE All the Time?

  • Many economists will tell you to do so. But!


  • For small samples, robust SE and tests based on them may be biased (especially if homoscedasticity actually holds).
  • Robust SE are less efficient.
  • Violations of homoscedasticity may be a sign of a missing predictor or model misspecification.


  • In practice, robust SE are very useful (and very underutilized in sociology), but not a panacea.

Bonus: Clustered Standard Errors

Clustered Standard Errors

  • Linear regression assumes independence of errors.
  • This can be violated because of repeated measurements, cluster-based sampling, etc.


  • Fortunately, robust standard errors can help here (sometimes).
  • Clustered standard errors estimate the residual variance for each cluster separately.
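In estimatr, clustering only requires naming the grouping variable via the `clusters` argument of `lm_robust()`. A sketch with hypothetical names (`df`, `y`, `x`, and `person_id` are illustrative, e.g. repeated measurements on the same people):

```r
library(estimatr)
# Errors are allowed to be correlated within each person;
# person_id identifies the clusters
mod_cl <- lm_robust(y ~ x, data = df, clusters = person_id)
summary(mod_cl)
```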