library(tidyverse) # for pipes, count function and ggplot
countries = read.csv("data/countries.csv")
  1. Using the countries data, create a regression model predicting the percentage of people at risk of poverty (poverty_risk) from the value of democracy index (dem_index).
countries %>% ggplot(aes(x=dem_index, y=poverty_risk))+geom_point()+geom_smooth(method = "lm") # remember to visualize before modeling 

mod1 = lm(poverty_risk ~ dem_index, data = countries)
  1. Interpret regression coefficients of the model.
summary(mod1)

"Intercept - the expected proportion of people at the risk of poverty in a country with the democracy index of zero. In such country, we would expect to see 63% (0.63) of people at the risk of poverty."

"dem_index - the difference in the expected proportion of people at the risk of povery when the democracy index raises by one. According to our model, countries with 1 additional point on the democracy index have on average 5 percentage points smaller proportion of people at the risk of poverty"
  1. Is the interpretation of the intercept in our model (as stated in previous solution) meaningful?
"Not really, it is only technical. On inspection of the data, you will notice there are no countries in the sample even close to the democracy index of 0. So the interpetation is, in fact, a wild extrapolation to an area for which we have no data."
  1. Center the democracy index variable and fit the model again. How does the interpretation of the coefficients changes?
countries$dem_index_c = countries$dem_index - mean(countries$dem_index, na.rm = TRUE)

mod2 = lm(poverty_risk ~ dem_index_c, data = countries)
summary(mod2)

"The intercept now represents the proportion of people at the risk of poverty in a country with average democracy index (average within our sample, of course). This is much more useful. The interpretation of the slope coefficient does not change."
  1. Use the categorized democracy index instead, i.e., the variable di_cat as predictor. What do you need to do to include a categorical predictor in linear regression analysis? Interpret your model.
str(countries$di_cat) # check that the variable is not coded as numeric
countries %>% count(di_cat) # out of curiosity, we look at a basic frequency table

mod3 = lm(poverty_risk ~ di_cat, data = countries)
summary(mod3)

"We did not have to do anything special before running the model, just check that that the di_cat variable is not coded as numeric. Since it is not, R automatically transforms it into factor (if it is not a factor yet) and then to dummies. It automatically includes only k-1 of the dummies in the model. We see, looking at the intercepts, that full democracy have 6 percentage points less people living in poverty, on average, compared to Flawed democracies. Hybrid democracies, on the other hand, have 16 percentage points higher povery rate the Flawed democracies, on average. But what is the poverty rate of Flawed democracies to begin with? It is the intercept, i.e. 25.5 %."
  1. Regress per-capita gdp on eu_membership. Interpret the results. Note the terminology. We sometimes say regress Y on X. This means use Y as dependent and X as independent variable. You may need to first create the per capita variable.
countries <-
  countries %>% mutate(percapita_gdp = (gdp*1000000) / population)

mod1  = lm(percapita_gdp ~ eu_member, data = countries)
summary(mod1)

"The average per-capita GDP in countries which are not EU members is 36 432 EUR. The average per-capita income in countries which are EU members is 5956 EUR less, that is 30476. However, the precision of the estimate of the difference is poor. Standard error is huge and p value is very far from the standard significance treshold of 0.05. Hence the result is purely descriptive for the sample. Also remember that regression estimates for one binary predictor are just a form of expressing the group means. Compare with the code below."

countries %>% group_by(eu_member) %>% summarise(mean = mean((gdp*1000000) / population, na.rm=TRUE))
  1. You may be surprised by the result of the previous task. Try to plot the data to get some additional insights.
countries %>% 
  mutate(eu_member = as.numeric(eu_member=="yes")) %>% # need to transform to numeric for geom_smooth to work
  ggplot(aes(x=eu_member, y = (gdp*1000000) / population)) + geom_point() + geom_smooth(method = "lm", se = FALSE)

"There are many ways to plot data and get insights. Our solution shows that the countries which are not in EU are of two distinct kinds. They either have very low per-capita GDP, or very high per-capita GDP. On avarage, they have a little higher per-capita GDP than EU members, but there are actually no 'average' non-EU countries."
  1. Given that Czech Republic is in the EU, by how much is its per-capita GDP bigger or smaller than that predicted by the model?
countries %>% filter(code == "CZ") # Czechia is the 3rd row (variable X is the row number)

mod1$residuals[3] # this is the answer, Czechia's per-capita GDP is almost 11 000 EUR smaller than the mean of EU countries 
mod1$fitted.values[3] # this is the predicted values, i.e. the mean of EU countries
countries %>% group_by(eu_member) %>% summarise(mean(percapita_gdp), na.rm = TRUE) # you can check OLs peridiction in this case is nothing else than a group mean
mod1$model[3,] # this is the actual value of Czechia's per-capita GDP