Goals for today

  • Understand what regression can be used for
  • Learn how to select variables for explanative models

What is regression?

What is Regression good for?

If you wish to make an apple pie from scratch, you must first invent the universe.

  • Carl Sagan

Backpedalling Mars?

Mars, what are you doing?

Ptolemaic model

Universe according to Ptolemy

Copernicus

Universe according to Copernicus

To Explain or To Predict?

Predictive models

  • Predicting the future (or the past!)
  • Interpretation less important.
  • Goal: estimate (unseen) observations as best as possible.

Explanative (causal) models

  • Explaining workings of the universe.
  • Predictive power less important.
  • Goal: Estimate model parameters as best as possible.

To Explain or To Predict?

  • Other differences include choosing variables, evaluating model fit, choosing sample sizes, and more.

  • For more details, see: Shmueli, G. (2010). To Explain or To Predict? (SSRN Scholarly Paper ID 1351252). Social Science Research Network. https://doi.org/10.2139/ssrn.1351252

Other uses of regression models

Descriptive Models

  • Basically just a mathematical summary
  • Goal: Summarize structure of the data

Inferential models

  • Sample to population inference
  • Goal: Describe population as best as possible

What can regression be used for?

  • Predictive models: Which people are going to vote?
  • Explanative models: What is the effect of age on voter turnout?
  • Descriptive models: What is the relationship between age and voter turnout?
  • Inferential models: How many people are going to vote?


  • Which model are you aiming for?

  • We are going to be mainly interested in explanative models.

Variable selection

  • A researcher is interested in the relationship between intelligence and work self-discipline among adults, but is short on funding.

  • Their colleague suggests using university students as their sample.


  • Is this a valid design decision?

Variable selection

  • Goal of analysis matters

Predictive models

  • Goal: Estimate unseen observations as best as possible.

  • Training vs. testing set, cross-validation
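
The training versus testing idea can be sketched in base R. This is only an illustration on simulated data: the variable names, the sample size, and the 80/20 split are assumptions, not part of the original material.

```r
# Simulated data; the "true" model y = 2 + 0.5 * x + noise is an assumption
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Hold out 20 % of the rows as a testing set
test_rows <- sample(n, size = 0.2 * n)
train <- dat[-test_rows, ]
test  <- dat[test_rows, ]

# Fit the model on the training set only
fit <- lm(y ~ x, data = train)

# Judge the model by its error on observations it has not seen
pred <- predict(fit, newdata = test)
rmse_test <- sqrt(mean((test$y - pred)^2))
rmse_test
```

Cross-validation repeats this kind of split several times and averages the resulting errors.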

Explanative models

  • Goal: Estimate model parameters as best as possible.

  • Adjusting for interfering variables, randomization, DAGs

Directed acyclic graphs

  • Some fields can rely on randomization of treatment (e.g. drug testing). Social sciences generally can’t.

  • Instead, they rely heavily on theory, with the help of directed acyclic graphs (DAGs).

DAG example

Knowledge and vaccination rates

Does increasing knowledge about Covid raise the probability a person gets vaccinated?

Knowledge leads to behavior?

Interfering variables

  • What if we only have cross-sectional data?


  • Should we control for:
    • Socio-economic status?
    • Past hospitalization?
    • Perceived threat?

Types of interfering variables

  • 4 types of interfering variables:
    • Confounders (common parent)
    • Colliders (common child)
    • Mediators
    • Moderators

Confounders

  • Assume socio-econ. status raises knowledge about Covid and also raises vaccination probability.
  • What does it mean for the estimate of knowledge -> vaccination?
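
A small simulation can illustrate the answer. It is a sketch under assumed effect sizes (all coefficients below are made up), and it uses a continuous stand-in for vaccination probability to keep the model a plain linear regression.

```r
set.seed(1)
n <- 10000

ses       <- rnorm(n)                                # confounder: socio-economic status
knowledge <- 0.8 * ses + rnorm(n)                    # SES raises Covid knowledge
vacc      <- 0.3 * knowledge + 0.7 * ses + rnorm(n)  # both raise vaccination (true effect of knowledge: 0.3)

# Ignoring the confounder mixes the effect of SES into the knowledge coefficient
naive <- coef(lm(vacc ~ knowledge))["knowledge"]

# Adjusting for the confounder recovers a value close to the assumed 0.3
adjusted <- coef(lm(vacc ~ knowledge + ses))["knowledge"]

c(naive = unname(naive), adjusted = unname(adjusted))
```

Under these assumptions the unadjusted estimate is biased upwards, because part of the SES effect is attributed to knowledge.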

Colliders

  • Assume Covid knowledge lowers probability of hospitalization and at the same time, vaccination lowers probability of hospitalization.
  • What does it mean for the estimate of knowledge -> vaccination?

Colliders - dice example

  • Two six-sided dice, fair and independent.
  • Can you tell the outcome of the second die from the outcome of the first one?
  • Now suppose we also know the total. Given the total, the first die fully determines the second.
  • Conditioning on (controlling for) a collider creates artificial relationships.
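
The dice example can be checked with a short simulation. The number of rolls and the chosen total of 7 are arbitrary choices for illustration.

```r
set.seed(2024)
n <- 100000

die1  <- sample(1:6, n, replace = TRUE)
die2  <- sample(1:6, n, replace = TRUE)  # rolled independently of die1
total <- die1 + die2                     # collider: a common child of both dice

# Unconditionally, the dice are (nearly) uncorrelated
cor_all <- cor(die1, die2)

# Conditioning on the collider: look only at rolls where the total is 7
keep <- total == 7
cor_given_total <- cor(die1[keep], die2[keep])

# Controlling for the collider in a regression has the same effect
coef_die1 <- coef(lm(die2 ~ die1 + total))["die1"]

c(cor_all = cor_all, cor_given_total = cor_given_total, coef_die1 = unname(coef_die1))
```

Once the total is fixed or adjusted for, the second die equals the total minus the first, so a perfect negative relationship appears between two variables that are independent by construction.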

Colliders

  • We should never condition on colliders.

Mediators

  • How could Covid knowledge influence vaccination probability?
  • Perhaps by raising the perceived threat of health problems?
  • Presence of mediators changes interpretation of regression coefficients!
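
How a mediator changes the interpretation of a coefficient can be sketched with simulated data. The effect sizes are assumptions, and vaccination is again treated as a continuous stand-in.

```r
set.seed(3)
n <- 10000

knowledge <- rnorm(n)
threat    <- 0.6 * knowledge + rnorm(n)                 # mediator: perceived threat
vacc      <- 0.2 * knowledge + 0.5 * threat + rnorm(n)  # direct path plus path through threat

# Without the mediator: total effect, 0.2 + 0.6 * 0.5 = 0.5 under these assumptions
total_effect <- coef(lm(vacc ~ knowledge))["knowledge"]

# With the mediator: only the direct effect (0.2) remains
direct_effect <- coef(lm(vacc ~ knowledge + threat))["knowledge"]

c(total = unname(total_effect), direct = unname(direct_effect))
```

Neither estimate is wrong; they answer different questions, so whether to include the mediator depends on which effect is of interest.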

Moderators

  • What if the relationship between Covid knowledge and vaccination probability changes based on where people live?
  • Moderators are interactions. Without them, we get an average effect.
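
The difference between an average effect and a moderated effect can be sketched as follows. The binary `urban` variable and all coefficients are hypothetical and chosen only for illustration.

```r
set.seed(4)
n <- 10000

knowledge <- rnorm(n)
urban     <- rbinom(n, 1, 0.5)  # hypothetical moderator: 1 = city, 0 = countryside
# Assumed effect of knowledge: 0.1 in the countryside, 0.1 + 0.4 = 0.5 in cities
vacc <- 0.1 * knowledge + 0.4 * knowledge * urban + rnorm(n)

# Without the interaction we get one averaged slope (about 0.3 here)
avg_slope <- coef(lm(vacc ~ knowledge))["knowledge"]

# With the interaction the slope is allowed to differ between the two groups
int_fit  <- lm(vacc ~ knowledge * urban)
slope_0  <- coef(int_fit)["knowledge"]        # slope when urban = 0
slope_diff <- coef(int_fit)["knowledge:urban"] # extra slope when urban = 1

c(average = unname(avg_slope), rural = unname(slope_0), difference = unname(slope_diff))
```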

The final DAG

Variable selection

  • A researcher is interested in the relationship between intelligence and work self-discipline among adults, but is short on funding.

  • Their colleague suggests using university students as their sample.


  • Is this a valid design decision?

DAGs in R

  • While drawing DAGs can be easily done by hand, R can make them pretty
  • It can also tell you what variables to control for, assuming your DAG is correct.


  • We need two packages to work with DAGs in the ggplot2 framework:
install.packages("dagitty") # for defining and analysing DAGs in R
install.packages("ggdag") # for plotting them with ggplot2

Basic DAG in R

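library(ggdag)   # dagify(), ggdag() and the geom_dag_*() layers
library(ggplot2) # ggplot(), aes(), theme_void()

dagify(y ~ x + z + q ,
       x ~ z,
       q ~ x,
       w ~ y + x) %>% 
  ggdag() +
  theme_void() # to get rid of the background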

Basic DAG in ggplot2 syntax

  • Same result as before, with more control over the appearance
dagify(y ~ x + z + q ,
       x ~ z,
       q ~ x,
       w ~ y + x) %>% 
  ggplot(aes(x = x, xend = xend, y = y, yend = yend)) +
  geom_dag_edges() +
  geom_dag_point() +
  geom_text(aes(label = name), color = "white") +
  theme_void()

Custom labels and position

dagify(y ~ x + z + q ,
       x ~ z,
       q ~ x,
       w ~ y + x,
       labels = c(y = "Vaccination\nProbability", x = "Covid\nKnowledge",
                  z = "Soc-econ. status", q = "Perceived threat", w = "Hospitalization"),
       coords = list(x = c(y = 2, x = 1, z = 1.5, w = 1.5, q = 1.5),
                     y = c(y = 1, x = 1, z = 1.15, w = 0.85, q = 0.94))) %>% 
  ggplot(aes(x = x, xend = xend, y = y, yend = yend)) +
  geom_dag_edges() +
  geom_dag_point() +
  geom_text(aes(label = label), color = "red") +
  theme_void()

Check for confounders

our_dag <- dagify(y ~ x + z + q ,
                  x ~ z,
                  q ~ x,
                  w ~ y + x,
                  exposure = "x",
                  outcome = "y")

ggdag_adjustment_set(our_dag, shadow = TRUE) + theme_void() # nodes marked as adjusted are the confounders to control for
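
The adjustment set can also be printed instead of plotted. The sketch below rebuilds the same graph directly with the dagitty package, so that it runs on its own, and asks for the variables to control for. The graph definition is a restatement of the DAG above, not new material.

```r
library(dagitty)

# Same structure as the DAG above: z confounder, q mediator, w collider
g <- dagitty("dag {
  z -> x
  z -> y
  x -> q
  q -> y
  x -> y
  x -> w
  y -> w
}")

# Minimal sets of variables to adjust for when estimating the total effect of x on y
sets <- adjustmentSets(g, exposure = "x", outcome = "y")
sets
```

For this graph the only backdoor path from x to y runs through z, so z is the only variable that needs adjusting; the mediator q and the collider w are left out.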

Check for colliders graphically

our_dag <- dagify(y ~ x + z + q ,
                  x ~ z,
                  q ~ x,
                  w ~ y + x,
                  exposure = "x",
                  outcome = "y")

ggdag_collider(our_dag) + theme_void()