Last updated in February 2022

Goals for today

  • Regression as a tool: Basic concepts and building blocks for regression analysis
  • Centering variables for better interpretation
  • Dummy variables for inclusion of categorical predictors
  • Building and interpreting simple linear regression in R (yay, we are gonna work in R!)

Linear regression: basic building blocks

The line

The simple (i.e., bivariate) regression line has the following formula:

\[ Y = \alpha + \beta*X \]

  • where \(\alpha\) is the intercept, and \(\beta\) is the slope. (What does it mean?)
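
A minimal R sketch of what the two parameters do. The values of `alpha` and `beta` below are made up for illustration and do not come from any data in this lecture:

```r
# Hypothetical values, chosen only for illustration
alpha <- 2    # intercept: predicted Y when X = 0
beta  <- 0.5  # slope: change in predicted Y per one-unit increase in X

x     <- 0:4
y_hat <- alpha + beta * x

y_hat        # predicted values along the line
diff(y_hat)  # each one-unit step in X shifts the prediction by beta
```

The first element of `y_hat` equals the intercept, and every consecutive difference equals the slope.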

The line and the points

Error vs. Residual

  • Error (theoretical) = difference between the actual value of Y and the true (population) mean of Y for a given X
  • Residual (empirical) = difference between the observed value of Y and the value our model predicts for it
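
The distinction can be illustrated with simulated data, where (unlike in real research) the true line is known. This is only a sketch: the true intercept 1, true slope 2, noise level, and seed are arbitrary assumptions.

```r
set.seed(1)                        # arbitrary seed, for reproducibility
n <- 100
x <- runif(n, 0, 10)
true_mean <- 1 + 2 * x             # assumed true conditional mean of Y given X
y <- true_mean + rnorm(n, sd = 3)  # observed Y = true mean + random error

fit <- lm(y ~ x)

errors <- y - true_mean            # theoretical: known here only because we simulated
res    <- resid(fit)               # empirical: observed Y minus fitted Y

head(cbind(errors, res))           # similar, but not identical
```

The two columns differ because the estimated line is not exactly the true line.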

Ordinary least squares (OLS)

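  • Often used as another name for linear regression (strictly speaking, OLS is the method used to estimate it)

  • The model is estimated by minimizing the sum of squared residuals. (As a consequence, when the model includes an intercept, the residuals sum to 0.)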

  • \(\small \text{Residual sum of squares} = (-2)^2 + 1.1^2 + 2.8^2 + (-4)^2 + 1.6^2 + 1.8^2 + (-0.3)^2 + (-0.2)^2 + (-0.1)^2 + (-0.7)^2 \approx 35.3\)

  • \(\small \text{Sum of residuals} = -2 + 1.1 + 2.8 - 4 + 1.6 + 1.8 - 0.3 - 0.2 - 0.1 - 0.7 = 0\)

For details on how the OLS estimates are calculated, see Fox (2015, p. 83).
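
For simple regression, the values that minimize the sum of squared residuals have a well-known closed form: the slope is cov(X, Y) / var(X) and the intercept is mean(Y) minus the slope times mean(X). The sketch below checks this against `lm()` on toy data invented for illustration:

```r
# Toy data, invented for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.8)

# Closed-form OLS estimates for simple regression
beta_hat  <- cov(x, y) / var(x)
alpha_hat <- mean(y) - beta_hat * mean(x)

fit <- lm(y ~ x)
res <- resid(fit)

c(alpha_hat, beta_hat)  # should match coef(fit)
coef(fit)
sum(res^2)              # the quantity OLS minimizes
sum(res)                # zero up to floating-point error (model has an intercept)
```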

Regression and t-test

The regression line can be thought of as the conditional mean of Y at each value of X (as in the picture below). Simple regression with one binary predictor is equivalent to an independent-samples t-test with equal variances assumed.
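
A short sketch of this equivalence on simulated data (group sizes, effect size, and seed are arbitrary assumptions). Note `var.equal = TRUE`: the match holds for the classic equal-variance t-test, not for R's default Welch test.

```r
set.seed(2)                        # arbitrary seed
group <- rep(c(0, 1), each = 20)   # binary predictor (two hypothetical groups)
y <- 5 + 1.5 * group + rnorm(40)

fit <- lm(y ~ group)
tt  <- t.test(y ~ group, var.equal = TRUE)

t_lm <- coef(summary(fit))["group", "t value"]
p_lm <- coef(summary(fit))["group", "Pr(>|t|)"]

c(lm = t_lm, t_test = unname(tt$statistic))  # same size, opposite sign
c(lm = p_lm, t_test = tt$p.value)            # same p-value
coef(fit)  # intercept = mean of group 0; slope = difference in group means
```

The sign of the t statistic differs only because `t.test()` subtracts the second group's mean from the first, while the regression slope is the second group's mean minus the first.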

Centering predictors for better interpretation

  • Centering in OLS serves interpretation: it changes the intercept but not the slope
  • Centering a predictor by subtracting its mean: the intercept is then the expected value of Y when X is at its mean
  • Alternatively, centering at a conventional value (such as subtracting 100 from IQ scores)
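
The effect of centering can be sketched with simulated data. The variables `iq` and `score`, and all numbers used to generate them, are hypothetical:

```r
set.seed(3)                                  # arbitrary seed
iq    <- rnorm(50, mean = 100, sd = 15)      # simulated IQ scores
score <- 20 + 0.3 * iq + rnorm(50, sd = 5)   # hypothetical outcome

fit_raw  <- lm(score ~ iq)
fit_mean <- lm(score ~ I(iq - mean(iq)))     # centered at the sample mean
fit_100  <- lm(score ~ I(iq - 100))          # centered at the conventional value 100

coef(fit_raw)   # intercept: predicted score at IQ = 0 (not a meaningful value)
coef(fit_mean)  # intercept: predicted score at mean IQ, which equals mean(score)
coef(fit_100)   # intercept: predicted score at IQ = 100
```

All three models have the same slope and the same fit; only the meaning of the intercept changes.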

Dummy variables

Categorical variables among predictors are usually treated as dummies:

  • Binary factors: transformed to one dummy variable
  • Factors with more than two categories: transformed into a set of k - 1 binary dummy variables (k = number of categories)
  • Regression coefficients identify differences in group means compared to one reference group

Sets of binary dummy variables

Example of a section of a data matrix. Each row represents one observation.

di_cat           | di_cat_Flawed democracy | di_cat_Full democracy | di_cat_Hybrid regime
Hybrid regime    | 0                       | 0                     | 1
Full democracy   | 0                       | 1                     | 0
Full democracy   | 0                       | 1                     | 0
Flawed democracy | 1                       | 0                     | 0
Full democracy   | 0                       | 1                     | 0

  • Choose a reference category and leave its dummy out of the model
  • The mean for the reference category will be captured by the intercept
  • The coefficient of each dummy = difference between that category's conditional mean and the reference category's mean
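
A minimal sketch of dummy coding in R. The category labels of `di_cat` follow the example data matrix, but the data frame `d` and the outcome `y` are invented for illustration. R creates the dummies automatically for a factor and uses its first level as the reference category.

```r
# Hypothetical data: category labels follow the example, y is invented
d <- data.frame(
  di_cat = factor(c("Hybrid regime", "Full democracy", "Full democracy",
                    "Flawed democracy", "Full democracy", "Hybrid regime",
                    "Flawed democracy")),
  y = c(4.0, 8.0, 7.5, 6.0, 8.5, 5.0, 6.4)
)

levels(d$di_cat)                       # first level is the reference category
X <- model.matrix(~ di_cat, data = d)  # intercept column plus k - 1 dummies
X

fit <- lm(y ~ di_cat, data = d)
group_means <- tapply(d$y, d$di_cat, mean)

coef(fit)      # intercept = mean of the reference category
group_means    # dummy coefficients = differences from the reference mean
```

A different reference category can be set with `relevel()`, for example `relevel(d$di_cat, ref = "Full democracy")`.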

Let’s see this in R

References

Fox, J. (2015). Applied regression analysis and generalized linear models (Third edition). SAGE Publications, Inc.