Last updated in April 2021

Goals for today

  • Learn how to model nonlinearity using:

    • Categorization
    • Simple polynomials
    • Linear and natural splines

Nonlinear relationships

  • Nonlinear relationships are common in practice
Age vs voter turnout in the 2017 parliamentary election, ESS

Nonlinear relationships

  • Modeling the relationship between voter turnout and age as linear is not sufficient
mod1 = lm(vote ~ agea, data = vote)

Nonlinear models

  • How can we change our model to capture the nonlinear relationship?

  • Three popular options:

    • Categorizations
    • Simple polynomials
    • Splines

Categorization

  • Most basic way of dealing with nonlinearity

  • There are many different ways a numerical variable can be cut into categories:

    • based on quantiles
    • to produce equally wide intervals
    • to produce a prespecified number of categories
    • based on theory
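
The four approaches above can be sketched with base R's cut(). This is only an illustration on simulated data: the data frame `dat`, the simulated turnout curve, and the theory-based cutpoints are assumptions, not the ESS data used in the slides.

```r
# Simulated stand-in for the ESS data (assumption); names mimic the slides
set.seed(1)
n <- 500
agea <- round(runif(n, 18, 90))
vote <- rbinom(n, 1, plogis(-3 + 0.12 * agea - 0.001 * agea^2))
dat <- data.frame(agea, vote)

# 1. Prespecified number of categories, each covering an equal range
dat$age_range <- cut(dat$agea, breaks = 4)

# 2. Quantile-based categories (roughly the same number of observations each)
q <- quantile(dat$agea, probs = seq(0, 1, by = 0.25))
dat$age_quant <- cut(dat$agea, breaks = q, include.lowest = TRUE)

# 3. Intervals of a prespecified width (here 20 years)
dat$age_width <- cut(dat$agea, breaks = seq(10, 90, by = 20))

# 4. Theory-based cutpoints (hypothetical life-stage boundaries)
dat$age_theory <- cut(dat$agea, breaks = c(17, 29, 64, 90),
                      labels = c("18-29", "30-64", "65+"))

# The categorized predictor enters the model as a set of dummy variables
table(dat$age_quant)
mod_cat <- lm(vote ~ age_quant, data = dat)
coef(mod_cat)
```

Each coefficient of mod_cat is the difference in mean turnout between one age group and the reference group, which is why the output is easy to read.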

Categorization - categories of equal range

Categorization - same number of observations

Categorization - groups with specified width

Categorization - based on theory

R Intermezzo!

Categorization - pros and cons

  • The main advantage of categorizations is that the output is easily interpretable

  • However, there are many technical drawbacks (Harrell, 2001):

    • Estimated values will have reduced precision, and associated tests will have reduced power

    • Categorization assumes that the relationship between the predictor and the response is flat within intervals

    • Categorization assumes that there is a discontinuity in response as interval boundaries are crossed.

    • Cutpoints are arbitrary and manipulatable; cutpoints can be found that can result in both positive and negative associations

Categorization - often shown like this…

Categorization - … but actually this

Simple polynomials

Polynomials

  • Some of the problems of categorization can be solved by using simple polynomials (e.g. quadratic effect)

\[ vote = \beta_0 + \beta_1*age + \beta_2*age^2 \]

  • We are essentially trading the assumption that the relationship is linear for the assumption that it has the form of a polynomial (e.g. a parabola)
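
A minimal sketch of the quadratic model above, fitted with lm(). The data are simulated (an assumption, not the ESS data); only the variable names `vote` and `agea` follow the slides.

```r
# Simulated stand-in for the ESS data (assumption)
set.seed(2)
n <- 500
agea <- round(runif(n, 18, 90))
vote <- rbinom(n, 1, plogis(-3 + 0.12 * agea - 0.001 * agea^2))
dat <- data.frame(agea, vote)

# Linear model vs model with an added quadratic term
mod_lin  <- lm(vote ~ agea, data = dat)
mod_quad <- lm(vote ~ agea + I(agea^2), data = dat)  # I() protects ^ inside a formula

# F-test of whether the quadratic term improves the fit
anova(mod_lin, mod_quad)
```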

Polynomials - raw and orthogonal types

  • There are two types of polynomials: raw and orthogonal ones

    • A raw polynomial is simply the variable raised to the power of k (e.g. age²)
    • Orthogonal polynomials are computed so that each term is uncorrelated with the lower-order terms (e.g. age will be uncorrelated with age²)


  • The advantage of raw polynomials is that regression coefficients have the classic interpretation
  • The advantage of orthogonal polynomials is that it is easier to compute how much variance each term explains
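
The difference between the two types can be checked with base R's poly(), which returns orthogonal polynomials by default and raw ones with raw = TRUE. The data below are simulated for illustration only.

```r
# Simulated stand-in for the ESS data (assumption)
set.seed(3)
n <- 500
agea <- round(runif(n, 18, 90))
vote <- rbinom(n, 1, plogis(-3 + 0.12 * agea - 0.001 * agea^2))
dat <- data.frame(agea, vote)

raw  <- poly(dat$agea, 2, raw = TRUE)  # columns: age, age^2
orth <- poly(dat$agea, 2)              # orthogonal version

cor(raw[, 1], raw[, 2])    # raw terms are strongly correlated
cor(orth[, 1], orth[, 2])  # orthogonal terms are uncorrelated (up to rounding)

# Both parameterizations describe the same curve: fitted values are identical
mod_raw  <- lm(vote ~ poly(agea, 2, raw = TRUE), data = dat)
mod_orth <- lm(vote ~ poly(agea, 2), data = dat)
all.equal(fitted(mod_raw), fitted(mod_orth))
```

Only the coefficients and their interpretation differ between the two models, not the predictions.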

R Intermezzo!

Polynomials - Pros and cons

  • Simple polynomials alleviate some of the problems of categorization (arbitrary cutpoints, the assumption of flat intervals)

  • However, two problems:

    • Polynomials can only capture polynomial relationships
    • Polynomials are extremely unstable at the ends of the data range

Splines

  • Also known as piecewise regression
  • The basic idea is simple: instead of trying to fit a single line/curve through the entire data, cut the data into smaller bins and fit a line/curve within each

Splines

  • The values dividing individual bins are called knots
  • General form of the model from previous slide:

\[ vote = \beta_0 + \beta_1*age_{<25} + \beta_2*age_{25-50} + \beta_3*age_{50-75} + \beta_4*age_{>75} \]

  • We will learn about two types of splines: linear splines and natural splines (although many other types exist)

Linear splines

  • Linear splines divide the data into bins and then fit a (continuous) line through each of them
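
One way to sketch a linear spline in base R is with "hinge" terms built from pmax(). The helper `hinge()`, the knots at 25, 50 and 75, and the simulated data are assumptions for illustration. Note that this parameterization estimates the change in slope at each knot, so the per-bin slopes are recovered by summing coefficients.

```r
# Simulated stand-in for the ESS data (assumption)
set.seed(4)
n <- 500
agea <- round(runif(n, 18, 90))
vote <- rbinom(n, 1, plogis(-3 + 0.12 * agea - 0.001 * agea^2))
dat <- data.frame(agea, vote)

# Hypothetical helper: 0 below the knot, (x - knot) above it
hinge <- function(x, knot) pmax(x - knot, 0)

# Continuous piecewise-linear fit with knots at 25, 50 and 75
mod_ls <- lm(vote ~ agea + hinge(agea, 25) + hinge(agea, 50) + hinge(agea, 75),
             data = dat)

# The agea coefficient is the slope below 25; each hinge coefficient is the
# change in slope at its knot, so cumulative sums give the slope in each bin
slopes <- cumsum(coef(mod_ls)[-1])
names(slopes) <- c("<25", "25-50", "50-75", ">75")
slopes
```

Because every hinge term is zero at its own knot, the fitted line is continuous across bins, unlike the step function produced by categorization.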

Linear splines as a more flexible categorization

Linear splines

  • We can think of linear splines as a more flexible version of categorization

    • Still suffers from the problem of arbitrary cutpoints, same as categorization
    • BUT we no longer have the unrealistic assumption that all observations inside a bin have the same value of the dependent variable


  • Still, there is the problem of arbitrary knot positions and sudden changes in slope
  • Can we do better?

Natural splines

  • Also known as restricted cubic splines
  • Instead of fitting lines, we fit polynomial (cubic) terms for the inner bins and lines for the outer ones

Natural splines

  • Natural splines solve both the problems of categorization/linear splines (sudden changes in slope, arbitrary cutpoints) and simple polynomials (only capturing polynomial relationships, unstable at the ends)
  • Natural splines are one of the most flexible ways for modeling nonlinearity in the context of linear models
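
A minimal sketch of a natural spline fit using ns() from the splines package, which ships with R. The data are simulated and df = 3 is an arbitrary choice for illustration; by default ns() places the boundary knots at the range of the data and the inner knots at quantiles.

```r
library(splines)

# Simulated stand-in for the ESS data (assumption)
set.seed(5)
n <- 500
agea <- round(runif(n, 18, 90))
vote <- rbinom(n, 1, plogis(-3 + 0.12 * agea - 0.001 * agea^2))
dat <- data.frame(agea, vote)

mod_lin <- lm(vote ~ agea, data = dat)

# Natural cubic spline with 3 degrees of freedom:
# 2 inner knots plus 2 boundary knots, linear beyond the boundary knots
mod_ns <- lm(vote ~ ns(agea, df = 3), data = dat)

# The linear model is nested in the spline model, so an F-test applies
anova(mod_lin, mod_ns)
```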

Natural splines as more flexible polynomials

R Intermezzo!

Splines - Choosing cutpoints

  • How to choose the number and position of cutpoints?

  • Linear splines

    • Very sensitive to cutpoint positions (same as categorization) -> best chosen based on theory
  • Natural splines

    • As long as the cutpoints are evenly spaced, their exact position matters less; what is important is their number

Splines - choosing cutpoints

  • Typical knot positions (as quantiles of the predictor), based on Harrell (2001, p. 27):

Knots   Quantiles
3       0.1     0.5     0.9
4       0.05    0.35    0.65    0.95
5       0.05    0.275   0.5     0.725   0.95
6       0.05    0.23    0.41    0.59    0.77    0.95
7       0.025   0.1833  0.3417  0.5     0.6583  0.8167  0.975
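
As a sketch, the four-knot row of the table can be passed to ns() by using the two inner quantiles as `knots` and the two outer ones as `Boundary.knots`. The simulated data are an assumption; other spline functions may place knots differently by default.

```r
library(splines)

# Simulated stand-in for the ESS data (assumption)
set.seed(6)
n <- 500
agea <- round(runif(n, 18, 90))
vote <- rbinom(n, 1, plogis(-3 + 0.12 * agea - 0.001 * agea^2))
dat <- data.frame(agea, vote)

# Four knots at the 5th, 35th, 65th and 95th percentiles of age
k <- unname(quantile(dat$agea, probs = c(0.05, 0.35, 0.65, 0.95)))

# Inner knots go to `knots`, outer knots to `Boundary.knots`;
# the fit is linear outside the boundary knots
mod_ns4 <- lm(vote ~ ns(agea, knots = k[2:3], Boundary.knots = k[c(1, 4)]),
              data = dat)
coef(mod_ns4)
```

With four knots the spline contributes three coefficients, one fewer than the number of knots.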

Splines - choosing cutpoints

  • But, as long as they are evenly spaced and symmetrical, the exact position of the knots matters little for natural splines
  • It is unlikely you will need more than 3-4 knots for most data (including the outer ones)
  • Every knot is an additional parameter in the model -> you can use adjusted R² to check how many knots you need
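
A rough sketch of comparing the number of knots by adjusted R², again on simulated data (an assumption). In ns(), df = d corresponds to d + 1 knots in total, counting the two boundary knots.

```r
library(splines)

# Simulated stand-in for the ESS data (assumption)
set.seed(7)
n <- 500
agea <- round(runif(n, 18, 90))
vote <- rbinom(n, 1, plogis(-3 + 0.12 * agea - 0.001 * agea^2))
dat <- data.frame(agea, vote)

# Fit natural splines with 3, 4 and 5 knots (df = 2, 3, 4)
dfs <- 2:4
adj_r2 <- sapply(dfs, function(d) {
  fit <- lm(vote ~ ns(agea, df = d), data = dat)
  summary(fit)$adj.r.squared
})
names(adj_r2) <- paste(dfs + 1, "knots")

# Prefer the smallest number of knots after which adjusted R2 stops improving
adj_r2
```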

Splines - Pros and cons

  • Linear splines:

    • Essentially a better version of categorization, but still suffers from arbitrary cutpoint/knot positions and sudden changes in slope
    • The advantage over natural splines is that they can still be interpreted from the coefficient table
  • Natural splines

    • More flexible than the other methods, robust to exact positions of the knots
    • Can only be interpreted through marginal effects plots
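
A marginal effects plot for a natural spline can be sketched in base R by predicting over a grid of ages. The data are simulated and the object names are illustrative assumptions; dedicated packages offer more polished versions of the same idea.

```r
library(splines)

# Simulated stand-in for the ESS data (assumption)
set.seed(8)
n <- 500
agea <- round(runif(n, 18, 90))
vote <- rbinom(n, 1, plogis(-3 + 0.12 * agea - 0.001 * agea^2))
dat <- data.frame(agea, vote)

mod_ns <- lm(vote ~ ns(agea, df = 3), data = dat)

# Predicted turnout with 95% confidence intervals across the observed age range
grid <- data.frame(agea = seq(min(dat$agea), max(dat$agea), by = 1))
pred <- predict(mod_ns, newdata = grid, interval = "confidence")

plot(grid$agea, pred[, "fit"], type = "l", ylim = range(pred),
     xlab = "Age", ylab = "Predicted turnout")
lines(grid$agea, pred[, "lwr"], lty = 2)  # lower confidence bound
lines(grid$agea, pred[, "upr"], lty = 2)  # upper confidence bound
```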

Nonlinearity - what to use?

  • Use categorization when presenting your analysis to a lay audience.
  • Use linear splines if you present to professionals, but still want interpretable regression coefficients.
  • Use natural/restricted cubic splines if you are analyzing potentially complex relationships and are comfortable with using marginal effect plots.
  • Avoid simple polynomials altogether.

References