Simple linear regression

Last updated in February 2022

Goals for today

Regression as a tool: Basic concepts and building blocks for regression analysis
Centering variables for better interpretation
Dummy variables for inclusion of categorical predictors
Building and interpreting simple linear regression in R (yay, we are gonna work in R!)

Linear regression: basic building blocks

The line

The simple (i.e., bivariate) regression line has the following formula:

\[ Y = \alpha + \beta*X \]

where \(\alpha\) is the intercept (the value of Y when X = 0), and \(\beta\) is the slope (the change in Y for a one-unit increase in X).

The line and the points

The formula \(Y = \alpha + \beta*X\) represents the line.
But the actual value of Y is usually below or above the line.
The actual value of Y = linear model + random error term
\(y_i = \alpha + \beta*x_i + \epsilon_i\)
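The decomposition of each observation into a point on the line plus a random error can be sketched with simulated data. This is a minimal illustration only: the sample size, the "true" values of the intercept and slope, the error distribution, and the seed are all made-up assumptions.

```r
# Simulate data from y_i = alpha + beta * x_i + epsilon_i
set.seed(42)            # arbitrary seed, for reproducibility only
n     <- 1000           # hypothetical sample size
alpha <- 2              # hypothetical true intercept
beta  <- 0.5            # hypothetical true slope

x       <- runif(n, min = 0, max = 10)   # predictor
epsilon <- rnorm(n, mean = 0, sd = 1)    # random error term
y       <- alpha + beta * x + epsilon    # actual (observed) value of Y

# Value on the line vs. actual value for the first few observations
line_value <- alpha + beta * x
head(data.frame(x, line_value, y))

# Share of observations lying above the line (should be close to one half)
mean(y > line_value)
```

Each actual value differs from the value on the line by exactly its error term, so roughly half of the points end up above the line and half below.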

Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/

Error vs. Residual

Error (theoretical) = difference between the observed value of Y and the true (population) mean of Y for the given X; it cannot be observed directly
Residual (empirical) = difference between the observed value of Y and the value fitted by our estimated model
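The distinction is easiest to see in a simulation, because only there do we know the true line. The sketch below assumes made-up true coefficients and normally distributed errors; with real data the errors stay unknown and only the residuals can be computed.

```r
set.seed(1)                       # arbitrary seed
n     <- 200                      # hypothetical sample size
alpha <- 2                        # hypothetical true intercept
beta  <- 0.5                      # hypothetical true slope
x <- runif(n, 0, 10)
y <- alpha + beta * x + rnorm(n)

fit <- lm(y ~ x)                  # estimated line

errors    <- y - (alpha + beta * x)   # deviation from the TRUE line
resid_hat <- residuals(fit)           # deviation from the ESTIMATED line

# Residuals sum to zero (up to floating-point error); errors generally do not
sum(resid_hat)
sum(errors)

# The two are closely related, but not identical
cor(errors, resid_hat)
```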

Ordinary least squares (OLS)

The standard estimation method for linear regression (the two names are often used interchangeably)
The model is estimated by minimizing the sum of squared residuals. (Whenever the model includes an intercept, the residuals then sum to 0.)
\[ \text{Residual sum of squares} = (-2)^2 + 1.1^2 + 2.8^2 + (-4)^2 + 1.6^2 + 1.8^2 + (-0.3)^2 + (-0.2)^2 + (-0.1)^2 + (-0.7)^2 = 35.48 \]
\[ \text{Sum of residuals} = -2 + 1.1 + 2.8 - 4 + 1.6 + 1.8 - 0.3 - 0.2 - 0.1 - 0.7 = 0 \]

For details on how the OLS estimates are calculated, see Fox (2015, p. 83).
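For simple regression the OLS estimates have a well-known closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept is the mean of Y minus the slope times the mean of X. The sketch below checks this against `lm()` on a small made-up dataset (these are not the data behind the residuals listed above) and shows that moving away from the OLS estimates increases the residual sum of squares. The helper `rss()` is introduced here for illustration only.

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(3, 7, 10, 5, 12, 14, 13, 15, 17, 19)

# Closed-form OLS estimates for simple regression
beta_hat  <- cov(x, y) / var(x)
alpha_hat <- mean(y) - beta_hat * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)
coef(fit)
c(alpha_hat, beta_hat)

# Residual sum of squares for any candidate intercept a and slope b
rss <- function(a, b) sum((y - (a + b * x))^2)

rss(alpha_hat, beta_hat)         # the minimum
rss(alpha_hat, beta_hat + 0.1)   # larger
rss(alpha_hat + 0.5, beta_hat)   # larger

# With an intercept in the model, residuals sum to zero
# (up to floating-point error)
sum(residuals(fit))
```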

Regression and t-test

The line in the model can be thought of as the conditional mean of Y for each value of X (as in the picture below). Simple regression with one binary predictor is equivalent to a two-sample t-test that assumes equal variances in both groups.
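The equivalence can be checked directly in R. The sketch below uses simulated data with made-up group means and spread; note that `t.test()` has to be called with `var.equal = TRUE`, because its default (the Welch test) does not assume equal variances and would give slightly different results.

```r
set.seed(123)                                   # arbitrary seed
group <- rep(c(0, 1), each = 50)                # hypothetical binary predictor
y     <- 10 + 2 * group + rnorm(100, sd = 3)    # hypothetical outcome

fit <- lm(y ~ group)
tt  <- t.test(y[group == 1], y[group == 0], var.equal = TRUE)

# Intercept = mean of group 0; slope = difference between the group means
coef(fit)
c(mean(y[group == 0]), mean(y[group == 1]) - mean(y[group == 0]))

# The slope's t statistic and p-value match those of the t-test
summary(fit)$coefficients["group", c("t value", "Pr(>|t|)")]
c(tt$statistic, tt$p.value)
```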

Centering predictors for better interpretation

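Centering in OLS serves interpretation: it changes the intercept, not the slope
Centering a predictor by subtracting its mean: the intercept becomes the expected value of Y when X is at its mean
Centering at a conventional reference point instead (e.g., subtracting 100 from IQ scores): the intercept becomes the expected value of Y at that reference point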
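Both kinds of centering can be compared side by side. The data below are simulated, and the variable names (`iq`, `score`) and all numeric values are hypothetical.

```r
set.seed(7)                                       # arbitrary seed
iq    <- round(rnorm(100, mean = 100, sd = 15))   # hypothetical IQ scores
score <- 20 + 0.3 * iq + rnorm(100, sd = 5)       # hypothetical outcome

iq_mean_centered <- iq - mean(iq)   # center at the sample mean
iq_100_centered  <- iq - 100        # center at a conventional value

fit_raw  <- lm(score ~ iq)
fit_mean <- lm(score ~ iq_mean_centered)
fit_100  <- lm(score ~ iq_100_centered)

# The slopes are identical; only the intercepts differ
rbind(raw           = coef(fit_raw),
      mean_centered = coef(fit_mean),
      centered_100  = coef(fit_100))

# With mean centering, the intercept equals the mean of the outcome
c(coef(fit_mean)[1], mean(score))
```

In the uncentered model the intercept refers to a person with an IQ of 0, which is not a meaningful value; after centering it refers to a person with an average IQ (or an IQ of 100).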

Dummy variables

Categorical variables among predictors are usually treated as dummies:

Binary factors: transformed to one dummy variable
Factors with more than two categories (n categories): transformed into a set of n - 1 binary dummy variables
Regression coefficients identify differences in group means compared to one reference group

Sets of binary dummy variables

Example of part of a data matrix: the original variable di_cat and the dummy variables derived from it. Each row represents one observation.

di_cat	di_cat_Flawed democracy	di_cat_Full democracy	di_cat_Hybrid regime
Hybrid regime	0	0	1
Full democracy	0	1	0
Full democracy	0	1	0
Flawed democracy	1	0	0
Full democracy	0	1	0

Choose a reference category and leave its dummy out of the model
The mean of the reference category is captured by the intercept
The coefficient of each dummy = difference between that category's mean and the mean of the reference category
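A minimal sketch of how this plays out in R, which creates the dummies automatically when a factor is used as a predictor. The data are made up: the category labels are borrowed from the table above, but the outcome values and the choice of "Hybrid regime" as the reference category are assumptions for illustration only.

```r
# Hypothetical data: a three-category factor and a numeric outcome
regime <- factor(c("Hybrid regime", "Full democracy", "Full democracy",
                   "Flawed democracy", "Full democracy", "Hybrid regime",
                   "Flawed democracy", "Hybrid regime"))
outcome <- c(4.1, 8.2, 7.9, 6.5, 8.6, 3.8, 6.1, 4.4)

# Choose "Hybrid regime" as the reference category
regime <- relevel(regime, ref = "Hybrid regime")

# The design matrix R builds: an intercept plus n - 1 = 2 dummies
model.matrix(~ regime)

fit <- lm(outcome ~ regime)
coef(fit)

# Compare with the group means (returned in factor-level order)
group_means <- sapply(split(outcome, regime), mean)
group_means["Hybrid regime"]                     # equals the intercept
group_means[-1] - group_means["Hybrid regime"]   # equal the dummy coefficients
```

Changing the reference category with `relevel()` changes the intercept and the coefficients, but not the fitted group means.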

Let’s see this in R

References

Fox, J. (2015). Applied regression analysis and generalized linear models (Third edition). SAGE Publications, Inc.