library(tidyverse)
un <- read.table("data/UnitedNations.txt")
We will be using the ‘un’ data set, i.e., country-level data from the United Nations.
model1 <- lm(infantMortality ~ lifeFemale + illiteracyFemale, data = un)
summary(model1)
# The answer is 89.84 %. That is a lot. Such a high value of R-squared rarely occurs in sociological data at the individual level, but it can happen at the aggregate level (such as here, where the data are at the country level).
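# As a side note, you can pull these numbers straight out of the summary object
# (the r.squared and adj.r.squared components of a summary.lm object):
summary(model1)$r.squared      # multiple R-squared
summary(model1)$adj.r.squared  # adjusted R-squared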
model2 <- lm(infantMortality ~ lifeFemale*illiteracyFemale, data = un)
model3 <- lm(infantMortality ~ lifeFemale*illiteracyFemale + region, data = un)
summary(model1)
summary(model2)
summary(model3)
# model 1: 0.8971
# model 2: 0.9018
# model 3: 0.907
# Note: since we are comparing models with different numbers of predictors, the formally correct approach is to compare adjusted R-squared. This changes nothing about the fact that with each added layer of complexity, the R-squared increases a little. So it looks like the most complex model has some value added over the less complex models, beyond just random noise. However, the increase in predictive power is only very small. Depending on the modeling purpose, it would be perfectly justifiable here to opt for the simplest model, with its simple and straightforward interpretation. If you are competing to make the best predictions and interpretability is secondary, you will probably want to pick the most complex model here.
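# A quick way to line up the adjusted R-squared values of all three models in one call
# (just a convenience sketch, equivalent to reading them off the three summaries above):
sapply(list(model1 = model1, model2 = model2, model3 = model3),
       function(m) summary(m)$adj.r.squared)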
anova(model1, model2)
anova(model2, model3)
# Quite in accord with the comparison in task 2, comparing the models using anova also tells us that moving from model1 to model2 leads to a statistically significant improvement in model fit (the p-value is smaller than 0.05, so we can reject the null hypothesis of no difference in model fit at the commonly used significance level). Similarly, moving from model2 to model3 also means a statistically significant improvement in model fit. But again, we should also reflect that the improvement, while *statistically* significant, is *not practically significant* for many purposes.
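# An optional cross-check, not required by the task: information criteria tell a similar story
# (lower AIC/BIC means better fit, with a penalty for extra parameters):
AIC(model1, model2, model3)
BIC(model1, model2, model3)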
In the following tasks, you will not need to code anything. Just think and try to give your best answer before you look at our solution.
# Of course, you DO need to control for outside temperature, because the only plausible explanation is that outside temperature precedes and causes both ice-cream consumption (people are hot and want to eat ice-cream) and drowning (people are hot and want to swim).
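# Although this task needs no code, a tiny simulation can illustrate the point; all variable
# names and effect sizes below are made up purely for illustration:
set.seed(42)
temperature <- rnorm(500, mean = 25, sd = 5)    # the common cause
icecream    <- 2 * temperature + rnorm(500)     # driven by temperature
drownings   <- 0.5 * temperature + rnorm(500)   # also driven by temperature, NOT by ice-cream
coef(lm(drownings ~ icecream))                  # spurious positive "effect" of ice-cream
coef(lm(drownings ~ icecream + temperature))    # controlling for temperature removes it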
# The correct answer is that you should only control for the attitudes measured before the intervention. They preceded the intervention, so you want to put them in the model to compare the intervention and control groups with their pre-intervention attitudes held constant. On the other hand, you should not control for the attitudes measured after the intervention, because they are a mediator - they are your causal mechanism (you believe it is the change in attitudes caused by the intervention which ultimately results in a better chance of finding a job). Controlling for them would undermine your analysis - you would effectively filter out the very effect that you are trying to find.
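# Again purely illustrative (made-up variables and effect sizes): a small simulation of why
# controlling for the post-intervention attitudes, i.e. the mediator, hides the effect:
set.seed(42)
intervention    <- rbinom(500, 1, 0.5)                  # treatment indicator
attitudes_after <- 0.8 * intervention + rnorm(500)      # attitudes improved by the intervention
job_prospects   <- 0.5 * attitudes_after + rnorm(500)   # job prospects improved via attitudes
coef(lm(job_prospects ~ intervention))                    # the total effect shows up
coef(lm(job_prospects ~ intervention + attitudes_after))  # conditioning on the mediator wipes it out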
# No. The reason is still the same as above. In fact, when you failed to randomize and you also failed to measure the attitudes prior to the intervention, there is no way of knowing whether your intervention had any effect. You do not know if any difference between the two groups in finding a job is caused by your intervention or by prior differences in attitudes. A botched research design prevents you from making any conclusions. Well, it should prevent you from making any conclusions. It does not mean there are no such faulty conclusions out there...
# No, such a conclusion could be unwarranted, because it could suffer from *conditioning on a collider* bias. What you would, in fact, be doing is setting up a model in your head where the dependent variable (less progressive views) would be caused by being a sociology student, BUT you would only observe the subset of people who are your friends. That is a big problem, because how do people become your friends? For simplicity, we can assume two mechanisms: (a) they are sociology students, they spend time with you, you have beers together, so you become friends; (b) they are similar to you, have similar political views, so you pick them as friends. In other words, studying sociology is one cause of friendship with you, and being politically progressive is another. The causal arrows thus point to "friendship with you" from both "being a sociology student" and "being progressive". This means that being friends with you is a collider in this analysis. And analyzing the relationship between studying sociology and political views ONLY among your friends is a form of conditioning on a collider.
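# Once more an illustration only (made-up data): in the simulated population, studying sociology
# and being progressive are independent, but both raise the chance of being my friend, so
# conditioning on friendship induces a spurious negative association between them:
set.seed(42)
sociology   <- rbinom(2000, 1, 0.5)
progressive <- rbinom(2000, 1, 0.5)
friend      <- rbinom(2000, 1, plogis(-2 + 1.5 * sociology + 1.5 * progressive))
cor(sociology, progressive)                            # roughly zero in the whole population
cor(sociology[friend == 1], progressive[friend == 1])  # negative among my friends only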