Lesson 6: Modeling nonlinearity

Last updated in April 2021

Goals for today

Learn how to model nonlinearity using:
- Categorization
- Simple polynomials
- Linear and natural splines

Nonlinear relationships

Nonlinear relationships are common in practice

Age vs voter turnout in the 2017 parliamentary election (ESS)

Nonlinear relationships

Modeling the relationship between voter turnout and age as linear is not sufficient

# Baseline: turnout modeled as a straight-line function of age
mod1 <- lm(vote ~ agea, data = vote)

Nonlinear models

How can we change our model to capture the nonlinear relationship?
Three popular options:
- Categorizations
- Simple polynomials
- Splines

Categorization

Most basic way of dealing with nonlinearity
There are many different ways a numerical variable can be cut into categories:
- based on quantiles
- to produce equally wide intervals
- to produce a prespecified number of categories
- based on theory
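
As a rough illustration of the options above, the following sketch uses base R's cut(). The ages are simulated stand-ins for the ESS data, and the variable names and the theory-based cutpoints are assumptions made for this example only.

```r
# Simulated ages stand in for the ESS data (hypothetical values).
set.seed(1)
agea <- round(runif(500, min = 18, max = 90))

# Based on quantiles: four groups with roughly the same number of observations.
age_quant <- cut(agea, breaks = quantile(agea, probs = seq(0, 1, by = 0.25)),
                 include.lowest = TRUE)

# Prespecified number of categories: cut() splits the range into 4 equal-width bins.
age_equal <- cut(agea, breaks = 4)

# Prespecified width: one interval per 20 years of age.
age_width <- cut(agea, breaks = seq(10, 90, by = 20))

# Based on theory: hand-picked cutpoints (illustrative only).
age_theory <- cut(agea, breaks = c(17, 29, 64, 90),
                  labels = c("18-29", "30-64", "65+"))

table(age_quant)
table(age_theory)
```

Any of these factors can then be used as a predictor in lm() in place of the numeric age variable.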

Categorization - categories of equal range

Categorization - same number of observations

Categorization - groups with specified width

Categorization - based on theory

R Intermezzo!

Categorization - pros and cons

The main advantage of categorizations is that the output is easily interpretable
However, there are many technical drawbacks (Harrell, 2001):
- Estimated values will have reduced precision, and associated tests will have reduced power
- Categorization assumes that the relationship between the predictor and the response is flat within intervals
- Categorization assumes that there is a discontinuity in response as interval boundaries are crossed.
- Cutpoints are arbitrary and manipulatable; cutpoints can be found that can result in both positive and negative associations

Categorization - often shown like this…

Categorization - … but actually this

Simple polynomials

Polynomials

Some of the problems of categorization can be solved by using simple polynomials (e.g. quadratic effect)

\[ vote = \beta_0 + \beta_1*age + \beta_2*age^2 \]

We are essentially trading the assumption that the relationship is linear for the assumption that it has the form of a polynomial (e.g. a parabola)
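
A minimal sketch of the quadratic model, assuming simulated data in place of the ESS file (the data frame d and the coefficients used to generate it are hypothetical):

```r
# Simulated stand-in for the ESS data (hypothetical values, not real results).
set.seed(1)
n <- 500
agea <- round(runif(n, min = 18, max = 90))
vote <- rbinom(n, size = 1, prob = plogis(-3 + 0.15 * agea - 0.0012 * agea^2))
d <- data.frame(vote = vote, agea = agea)

# Linear model and the same model with an added quadratic term.
mod_lin  <- lm(vote ~ agea, data = d)
mod_quad <- lm(vote ~ agea + I(agea^2), data = d)

coef(mod_quad)            # beta_0, beta_1 (age), beta_2 (age squared)
anova(mod_lin, mod_quad)  # F-test for the added quadratic term
```

I() is needed so that ^ is read as arithmetic rather than as formula syntax.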

Polynomials

Polynomials - raw and orthogonal types

There are two types of polynomials: raw and orthogonal ones
- A raw polynomial is simply a variable raised to the power of k (e.g. age²)
- Orthogonal polynomials are computed so that each term is uncorrelated with the lower-order terms (e.g. age will be uncorrelated with age²)

The advantage of raw polynomials is that regression coefficients have the classic interpretation
The advantage of orthogonal polynomials is that it’s easier to compute how much variance each of the forms predicts
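
The difference between the two types can be checked with poly() from base R. This is a sketch on simulated data (the data frame d is hypothetical): both versions give the same fitted values, only the coefficients differ.

```r
# Simulated stand-in for the ESS data (hypothetical values, not real results).
set.seed(1)
n <- 500
agea <- round(runif(n, min = 18, max = 90))
vote <- rbinom(n, size = 1, prob = plogis(-3 + 0.15 * agea - 0.0012 * agea^2))
d <- data.frame(vote = vote, agea = agea)

raw  <- poly(d$agea, degree = 2, raw = TRUE)  # columns: age, age^2
orth <- poly(d$agea, degree = 2)              # orthogonal version (the default)

cor(raw[, 1], raw[, 2])    # raw terms are strongly correlated
cor(orth[, 1], orth[, 2])  # orthogonal terms are uncorrelated (up to rounding)

mod_raw  <- lm(vote ~ poly(agea, 2, raw = TRUE), data = d)
mod_orth <- lm(vote ~ poly(agea, 2), data = d)

# Same predictions, different parameterization.
all.equal(fitted(mod_raw), fitted(mod_orth))
```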

R Intermezzo!

Polynomials - Pros and cons

Simple polynomials alleviate some of the problems of categorization (arbitrary cutpoints, assumptions of flat intervals)
However, two problems:
- Polynomials can only capture polynomial relationships
- Polynomials are extremely unstable at the ends

Polynomials - Pros and cons

Splines

Also known as piecewise regression
The basic idea is simple: instead of trying to fit a single line or curve through the entire data, cut the data into smaller bins

Splines

The values dividing individual bins are called knots
General form of the model from previous slide:

\[ vote = \beta_0 + \beta_1*age_{<25} + \beta_2*age_{25-50} + \beta_3*age_{50-75} + \beta_4*age_{>75} \]

We will learn about two types of splines: linear splines and natural splines (although many other types exist)

Linear splines

Linear splines divide the data into bins and then fit a (continuous) line through each of them
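
One way to fit a linear spline in base R is with "hinge" terms, sketched below on simulated data. The knots at 25, 50 and 75 are taken from the model formula on the previous slide; the data frame d and the use of pmax() and splines::bs() are assumptions of this example, not necessarily what the course code uses.

```r
library(splines)  # ships with R; used here only to cross-check the fit

# Simulated stand-in for the ESS data (hypothetical values, not real results).
set.seed(1)
n <- 500
agea <- round(runif(n, min = 18, max = 90))
vote <- rbinom(n, size = 1, prob = plogis(-3 + 0.15 * agea - 0.0012 * agea^2))
d <- data.frame(vote = vote, agea = agea)

# Hinge terms: pmax(agea - k, 0) is zero below knot k and grows linearly above it.
mod_ls <- lm(vote ~ agea + pmax(agea - 25, 0) + pmax(agea - 50, 0) +
               pmax(agea - 75, 0), data = d)
coef(mod_ls)

# The same piecewise linear fit via a degree-1 B-spline basis.
mod_bs <- lm(vote ~ bs(agea, degree = 1, knots = c(25, 50, 75)), data = d)
all.equal(unname(fitted(mod_ls)), unname(fitted(mod_bs)))
```

In the hinge version the coefficient on agea is the slope below the first knot, and each hinge coefficient is the change in slope at its knot, so the slope within a bin is the sum of the coefficients up to that bin.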

Linear splines as a more flexible categorization

Linear splines

We can think of linear splines as a more flexible version of categorization
- Still suffers from the problem of arbitrary cutpoints, same as categorization
- BUT we no longer have the unrealistic assumption that all observations inside a bin have the same value of the dependent variable

Still, there is the problem of the arbitrary knot positions and the sudden changes in slope
Can we do better?

Natural splines

Also known as restricted cubic splines
Instead of fitting lines, we fit polynomial (cubic) terms for the inner bins and lines for the outer ones
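
A minimal sketch of a natural spline fit, assuming the ns() function from the splines package that ships with R and simulated data in place of the ESS file (the data frame d is hypothetical):

```r
library(splines)

# Simulated stand-in for the ESS data (hypothetical values, not real results).
set.seed(1)
n <- 500
agea <- round(runif(n, min = 18, max = 90))
vote <- rbinom(n, size = 1, prob = plogis(-3 + 0.15 * agea - 0.0012 * agea^2))
d <- data.frame(vote = vote, agea = agea)

# df = 3 places two inner knots at quantiles of agea; the outer (boundary)
# knots default to the minimum and maximum of the observed ages.
mod_ns <- lm(vote ~ ns(agea, df = 3), data = d)

attr(ns(d$agea, df = 3), "knots")  # positions of the inner knots
summary(mod_ns)$adj.r.squared
```

Other packages offer equivalent bases under the name restricted cubic splines; ns() is used here only because it needs no extra installation.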

Natural splines

Natural splines solve both the problems of categorization/linear splines (sudden changes in slope, arbitrary cutpoints) and simple polynomials (only capturing polynomial relationships, unstable at the ends)
Natural splines are one of the most flexible ways for modeling nonlinearity in the context of linear models

Natural splines as more flexible polynomials

R Intermezzo!

Splines - Choosing cutpoints

How to choose the number and position of cutpoints?
Linear splines:
- Very sensitive to the position of cutpoints (same as categorization) -> best chosen based on theory
Natural splines:
- As long as the cutpoints are evenly spaced, their position matters less; what's important is their number

Splines - Choosing cutpoints

Typical position, based on Harrell (2001, p. 27):

Knots   Quantiles
3       0.1     0.5     0.9
4       0.05    0.35    0.65    0.95
5       0.05    0.275   0.5     0.725   0.95
6       0.05    0.23    0.41    0.59    0.77    0.95
7       0.025   0.1833  0.3417  0.5     0.6583  0.8167  0.975
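
The quantiles from the table can be turned into knot positions with quantile(). The sketch below uses the 4-knot row on simulated data; the data frame d is hypothetical, and passing the two outer quantiles as Boundary.knots to ns() is an assumption about how to mirror the table, since ns() treats inner and outer knots as separate arguments.

```r
library(splines)

# Simulated stand-in for the ESS data (hypothetical values, not real results).
set.seed(1)
n <- 500
agea <- round(runif(n, min = 18, max = 90))
vote <- rbinom(n, size = 1, prob = plogis(-3 + 0.15 * agea - 0.0012 * agea^2))
d <- data.frame(vote = vote, agea = agea)

# Four knots at the 5th, 35th, 65th and 95th percentiles of age.
k <- unname(quantile(d$agea, probs = c(0.05, 0.35, 0.65, 0.95)))
k

# The inner two are the interior knots, the outer two the boundary knots;
# the fit is linear beyond the boundary knots.
mod_ns4 <- lm(vote ~ ns(agea, knots = k[2:3], Boundary.knots = k[c(1, 4)]),
              data = d)
coef(mod_ns4)
```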

Splines - Choosing cutpoints

However, as long as the knots are evenly spaced and symmetrical, their exact position doesn't matter much for natural splines
It is unlikely you will need more than 3-4 knots for most data (including the outer ones)
Every knot is an additional parameter in the model -> you can use adjusted R² to test how many knots you need
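
A sketch of that comparison on simulated data (the data frame d is hypothetical). With ns(), df = 1 is a straight line and each additional degree of freedom adds one inner knot, so df + 1 is the total number of knots including the outer ones.

```r
library(splines)

# Simulated stand-in for the ESS data (hypothetical values, not real results).
set.seed(1)
n <- 500
agea <- round(runif(n, min = 18, max = 90))
vote <- rbinom(n, size = 1, prob = plogis(-3 + 0.15 * agea - 0.0012 * agea^2))
d <- data.frame(vote = vote, agea = agea)

# Adjusted R-squared for natural splines with 1 to 4 degrees of freedom.
adj_r2 <- sapply(1:4, function(n_df) {
  summary(lm(vote ~ ns(agea, df = n_df), data = d))$adj.r.squared
})
names(adj_r2) <- paste0("df", 1:4)
adj_r2
```

If adjusted R² stops improving when another knot is added, the extra knot is probably not needed.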

Splines - Pros and cons

Linear splines:
- Essentially a better version of categorization, but still suffers from the arbitrary cutpoint/knot positions and sudden changes in slope
- The advantage over natural splines is that they can still be interpreted from the coefficients table
Natural splines:
- More flexible than the other methods, robust to exact positions of the knots
- Can only be interpreted through marginal effects plots
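
A marginal effects plot can be sketched in base R by predicting over a grid of ages. The data are simulated and the object names are hypothetical; dedicated packages produce similar plots with less code.

```r
library(splines)

# Simulated stand-in for the ESS data (hypothetical values, not real results).
set.seed(1)
n <- 500
agea <- round(runif(n, min = 18, max = 90))
vote <- rbinom(n, size = 1, prob = plogis(-3 + 0.15 * agea - 0.0012 * agea^2))
d <- data.frame(vote = vote, agea = agea)

mod_ns <- lm(vote ~ ns(agea, df = 3), data = d)

# Predicted turnout with 95% confidence bounds for every age from 18 to 90.
grid <- data.frame(agea = 18:90)
pred <- predict(mod_ns, newdata = grid, interval = "confidence")

plot(grid$agea, pred[, "fit"], type = "l", ylim = range(pred),
     xlab = "Age", ylab = "Predicted turnout")
lines(grid$agea, pred[, "lwr"], lty = 2)
lines(grid$agea, pred[, "upr"], lty = 2)
```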

Nonlinearity - what to use?

Use categorization when presenting your analysis to a lay audience.
Use linear splines if you present to professionals, but still want interpretable regression coefficients.
Use natural/restricted cubic splines if you are analyzing potentially complex relationships and are comfortable with using marginal effect plots.
Avoid simple polynomials altogether.

References

Harrell, F. (2001). Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer-Verlag. https://doi.org/10.1007/978-1-4757-3462-1