First, we will be using both the familiar ‘countries’ data set for this set of exercises (task 1), and then a new dataset, United Nations, for the rest. You will need to download this new dataset from the course webpage and make yourself familiar with it.

Tasks

Use the countries dataset. Regress per-capita GDP (must create first) on the percentage of university educated (‘uni_prc’). Then regress per-capita GDP on both percentage of university educated (‘uni_prc’) and democratic index (‘dem_index’) in a single multiple regression model. Interpret the results.

# compute percapita_gdp
countries <-
  countries %>% mutate(percapita_gdp = (gdp*1000000) / population)

mod1 <- lm(percapita_gdp ~ uni_prc, data = countries)
summary(mod1)

"This model is not very nice to intepret at its face value. It says that percapita GDP of a country with no university educated people is -26404 EUR. A non-sensical value which result from the fact that there are no countries without university educated people (yet we extrapolate to such hypothetical countires). It further says that the difference between the hypothetical countries with no uni educated and countries with all people having university degree is 197313 EUR. From this, we can easily calculate that the difference in percapita GDP between two countries which differ by 1 percentage point in their proportion of uni educated people is 1973 EUR (197313 / 100). The difference is statistically significant from 0."


mod2 <- lm(percapita_gdp ~ uni_prc + dem_index, data = countries)
summary(mod2)

"The difference in per-capita GDP between countries which differ by one point in their democratic index (ten point scale) AND have the same proportion of university educated is 16310 EUR on average. This is a huge difference substantially and it is also statistically significant. Notice that the coefficient for 'uni_prc' went down dramatically to only 34543 EUR. In other words, for two countries with the same democratic index and differing by 1 percentage point in their proportion of university educated, the difference in per capita GDP only 345 EUR, a number which is likely some random noise given the broad standard errors."

Use the UN dataset. Regress total fertility rate (tfr) on illiteracy.

un %>% glimpse() # when inspecting the dataset, you will notice there are two variables for illiteracy - one for males and one for females. Let's see what happens when we use the male illiteracy, the female illiteracy and when we use both.

mod3 <- lm(tfr ~ illiteracyMale, data = un)
summary(mod3)

"Male illiteracy is a fairly strong predictor of total fertility rate. For each percetnage point of male illiteracy, total fertility rate goes up by 0.068 children per woman. In other words, if two countries differ by 15 (1/0.068) percentage points in male illiteracy, the one with more illiteracy is expected to have 1 more child born per woman."

ggplot(aes(x=illiteracyMale, y = tfr), data = un) + geom_point() + geom_smooth(method="lm")


mod4 <- lm(tfr ~ illiteracyFemale, data = un)
summary(mod4)

"A similar story with slightly different numbers unfolds for female illitaracy, only the beta is a bit smaller."

ggplot(aes(x=illiteracyFemale, y = tfr), data = un) + geom_point() + geom_smooth(method="lm")

mod5 <- lm(tfr ~ illiteracyMale + illiteracyFemale, data = un)
summary(mod5)

"A peculiar thing happnes when we use both. The male illiteracy stops being statistically significant. We know that regression coefficients in MLR are in part dependent on other predictors in the model. This, however, is an extreme example. When two variables are very strongly correlated, like male and female illiteracy, it becomes difficult for the regression analysis to evaluate their independent (net) effect. This is called multicolinearity and we will address this issue in more detail later in the course. Generally, it is better to just use on of the two variables in the final model as the second one, while being a strong predictor in simple linear model, adds (almost) no new information."


cor(un$illiteracyFemale, un$illiteracyMale, use = "complete") # correlation is extreamly high between the two predictors


# You may have decided to center the variables:
un <-
  un %>% mutate(imc = illiteracyMale - mean(illiteracyMale, na.rm = TRUE),
                ifc = illiteracyFemale - mean(illiteracyFemale, na.rm=TRUE))

mod6 <- lm(tfr ~ imc + ifc, data = un)
summary(mod6)

We have mostly (or even only) used two predictor terms in our examples so far to keep things simple. Try now to regress ‘infantMortality’ on ‘tfr’, ‘economicActivityFemale’, and ‘illiteracyFemale’.

Before we start modeling, plot the dependent variable using geom_density.

ggplot(aes(x=infantMortality), data=un) + geom_density() # it is skewed, but we will proceed with the variable as is, without transformation. However, it is especially important to check assumptuions when this kind of distribution happens for the dependent variable. (We wil cover testing for assumptions later in the course.)

Now, plot the bivariate relationship between each predictor variable and the dependent variable, i.e. three plots in total. Use combination of geom_point and geom_smooth with the defualt method, i.e. loess curve.

ggplot(aes(y=infantMortality, x = tfr), data=un) + geom_point() + geom_smooth()
# no surprise here, higher fertility associated with higher infant mortality (both typical of less developed countries). However, notice the spread seems to get bigger for higher values of total fertility rate. This could indicate issues to be covered later in the course.  

ggplot(aes(y=infantMortality, x = economicActivityFemale), data=un) + geom_point() + geom_smooth()
# female economic activity is not lineary associated with infant mortality, but there might be some U-shape non-linear relationship. Again, modelling non-linearity will be introduced further in the course.

ggplot(aes(y=infantMortality, x = economicActivityFemale), data=un) + geom_point() + geom_smooth(method = "lm")
# see the non-effect if the line to fit is forced to linear

ggplot(aes(y=infantMortality, x = illiteracyFemale), data=un) + geom_point() + geom_smooth(method = "lm")
# this is a fairly good looking linear association

Run the model. Regress ‘infantMortality’ on ‘tfr’, ‘economicActivityFemale’, and ‘illiteracyFemale’. Only use the main effects. Interpret the model.

fit1 <- lm(infantMortality ~ tfr + economicActivityFemale + illiteracyFemale, data = un)
summary(fit1)

# total fertility rate, female economic activity, and illiteracy among females all have their independent linear effect on infantMortality. This may be surprising for the female economic activity, where there was no linear association for the bivariate relationship. But this is not too rare: conditional main effects in MLR may differ a lot from bivariate main effects. 

# Interpreting the coefficients: when two hypothetical countries had the same level of female economic activity and female illiteracy rate, but differed in total fertility rate by one child per woman, the country with higher fertility rate is predicted to have 9.6 higher infant mortality rate (infant deaths per 1000 live births), on average. 

# Next, when 2 countries have the same level of total fertility rate and female illiteracy rate, but differ in female activity rate by one percentage point, the country with higher fertility rate will have 0.395 higher infant mortality rate, on average. 

# Finally, when 2 countries have the same level of total fertility rate and female economic activity, but differ in female illiteracy rate by one percentage point, the country with higher female illiteracy rate will have 0.68 higher infant mortality rate, on average.

# Note, correlation is not causation. Even if controlling for other variables, we cannot say that female economic activity causally increases infant mortality. This model is not based on any theory and there may be many other variables we would have to control for before being able to make even a tentative causal statement. The same is true for the other variables. We have no grounds to interpret this model causally at its face value. 

# In addition, we have not checked the assumption of normality (we will learn how to do it later in the course). So we have no guarantee that the linear fit is the best model specification.

There is of course some uncertainty in our estimates. Find 90%, 95% and 99% confidence intervals of the three betas. Use the function confint for that and use help to learn about the function parameter you need to change to change the confidence interval from the 95% default. Before you do that, say which one set of them will be the narrowest and why.

# 95
confint(fit1)

# 90
confint(fit1, level = 0.9) # the narrowest intervals are these because they provide us with least certainty. The more certain we want to be with our interval, the broader it will be.

# 99
confint(fit1, level = 0.99)

We now want to compare the coefficients to each other to compare the relative size of independent contributions of each of our predictors to the explanatory power of our model. Use standardized betas to do that.

fit2 <- lm(scale(infantMortality) ~ scale(tfr) + scale(economicActivityFemale) + scale(illiteracyFemale), 
           data = un)
summary(fit2)

# we see tfr and illiteracyFemale have about the same explanatory power, economicActivityFemale has less

Multiple linear regression - Excercises

Tasks