- Learn how to evaluate model fit using R2 and ANOVA
- Learn how to compute them in R
- Learn about their limitations
Last updated in February 2021
(Almost) any model can be fitted to our data, but not all models will fit equally well
Three ways to evaluate model fit:
The proportion of variance of the dependent variable that can be predicted using the independent variables
Nothing about R2 is causal!
\[ R^{2} =1 - \frac{Sum \: of \: Squares_{residual} }{Sum \: of \: Squares_{total}} \]
Or perhaps in a more interpretable way:
\[ R^{2} =1 - \frac{Sum \: of \: Squares_{our\:model} }{Sum \: of \: Squares_{intercept\:only\:model}} \]
R2 tells us how much we reduced the prediction error by adding our predictors
if R2 = 0, then our model is as “good” as if we had no predictor at all
if R2 = 1, then we predict our data perfectly
There is no universal cut-off for when R2 is good or bad
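To see the formula in action, here is a minimal sketch using the built-in `mtcars` data (the slides use the `countries` data, which is not bundled with R):

```r
# R^2 computed by hand from the two sums of squares, then compared
# with the value reported by lm()
mod <- lm(mpg ~ wt, data = mtcars)

ss_residual <- sum(residuals(mod)^2)                    # prediction error of our model
ss_total    <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # error of the intercept-only model

r2_by_hand <- 1 - ss_residual / ss_total
r2_by_hand
summary(mod)$r.squared   # the same value
```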
We can also compare two models formally, using ANOVA/F test
Similar to the classic ANOVA
Null hypothesis: All regression coefficients (except for intercept) are 0.
mod1 = lm(life_exp ~ hdi, data = countries)
anova(mod1)
## Analysis of Variance Table
## 
## Response: life_exp
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## hdi        1 184.65 184.654  63.047 2.441e-09 ***
## Residuals 35 102.51   2.929                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod1 = lm(life_exp ~ hdi, data = countries)
mod2 = lm(life_exp ~ hdi + postsoviet, data = countries)
anova(mod1, mod2)
## Analysis of Variance Table
## 
## Model 1: life_exp ~ hdi
## Model 2: life_exp ~ hdi + postsoviet
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     35 102.509                                  
## 2     34  59.186  1    43.322 24.887 1.777e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Nested models only
All the classic limitations of null hypothesis testing
We want to analyze the relationship between intelligence (IQ) and work diligence (diligence). We also know if our respondents have a university degree (degree).
degree is related to both IQ and diligence - only those who are among the top 20% most intelligent or the top 20% most diligent people will obtain a university degree
Should we control for degree or not? For prediction? For explanation?
(Truth: intelligence = 0.1*diligence, but let’s pretend we don’t know)
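A minimal simulation of this setup (the data-generating process below is a hypothetical one matching the description above) shows what controlling for degree does to the estimate:

```r
set.seed(42)
n <- 1e4
diligence <- rnorm(n)
iq <- 0.1 * diligence + rnorm(n)   # true coefficient of diligence is 0.1
# degree: awarded to the top 20% in IQ or the top 20% in diligence
degree <- as.numeric(iq > quantile(iq, 0.8) | diligence > quantile(diligence, 0.8))

coef(lm(iq ~ diligence))["diligence"]            # close to the true 0.1
coef(lm(iq ~ diligence + degree))["diligence"]   # biased by conditioning on degree
```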
Controlling for degree leads to a higher R2, but an incorrect coefficient estimate!
The model without degree actually provides a better estimate of the relationship (remember, the true value = 0.1)
Consider variable \(x\) and 3 variables \(y_1\), \(y_2\), \(y_3\)
The relationship between \(x\) and all \(y_i\) is the same:
Each of \(y_i\) has a different standard deviation:
Even a perfectly specified model (i.e. all relevant variables present, relationships set up correctly) can have a low R2 due to random error
Low R2 does not necessarily mean the estimates are incorrect (biased)
Low R2 can simply mean that we cannot explain a social phenomenon in its entirety, but that is almost never our goal.
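This can be checked with a quick simulation (hypothetical data; two of the outcomes above, with the true slope fixed at 2 for both):

```r
set.seed(7)
n <- 1000
x  <- rnorm(n)
y1 <- 2 * x + rnorm(n, sd = 1)    # small error variance
y3 <- 2 * x + rnorm(n, sd = 10)   # large error variance, same true slope

coef(lm(y1 ~ x))["x"]             # close to 2
coef(lm(y3 ~ x))["x"]             # also close to 2, despite a much lower R^2
summary(lm(y1 ~ x))$r.squared
summary(lm(y3 ~ x))$r.squared
```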
Conclusion: If our goal is substantive interpretation of coefficients, R2 is not a good measure of model’s quality
\[ R^{2}_{adj} = 1 - (1 - R^{2}) * \frac{no. \: of \: observations - 1}{no. \: of \: observations - no. \: of \: parameters - 1} \]
Adjusted R2 only increases when the contribution of a new predictor is bigger than what we would expect by chance
Conclusion: Use Adjusted R2 when you are comparing models with different number of predictors
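The formula can be verified by hand against the value R reports (again sketched on the built-in `mtcars` data, since the `countries` data is not bundled with R):

```r
mod <- lm(mpg ~ wt + hp, data = mtcars)

r2 <- summary(mod)$r.squared
n  <- nrow(mtcars)
p  <- 2   # number of parameters, i.e. predictors excluding the intercept

adj_by_hand <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_by_hand
summary(mod)$adj.r.squared   # the same value
```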
R2 will naturally get lower as we restrict the range of independent variables
This does not mean the model is any less valid, just that predictive power is lower
Conclusion: Trimming data, either by filtering out subpopulations or removing outliers, will lower R2
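A quick sketch on simulated data (hypothetical x and y) illustrates the effect of range restriction:

```r
set.seed(1)
x <- runif(500, 0, 10)
y <- x + rnorm(500)   # same true relationship everywhere

summary(lm(y ~ x))$r.squared                   # full range of x
summary(lm(y ~ x, subset = x > 8))$r.squared   # same model, restricted range: lower R^2
```

The model is identical in both fits; only the variance of the predictor shrinks, which mechanically lowers R2.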