REGRESSION

Linear regression is closely related to correlation. Recall that in correlation we sought to evaluate the relationship between two variables, which we will call X and Y for simplicity. By the rule of thumb used here, a relationship is present if the Pearson r value is less than -0.1 or greater than 0.1; if r falls between -0.1 and 0.1, we treat the variables as unrelated.
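As a quick refresher, Pearson r can be computed in R with cor(). The sketch below uses simulated data only; the variable names x, y and z and the way they are generated are assumptions made for illustration and have nothing to do with the course data.

```r
# Simulated data for illustration only (not the course data set)
set.seed(1)
x <- rnorm(100)            # first variable
y <- 2 * x + rnorm(100)    # built to depend on x, plus random noise
z <- rnorm(100)            # generated independently of x

r_xy <- cor(x, y)          # Pearson r is the default method of cor()
r_xz <- cor(x, z)

print(r_xy)                # should be strongly positive
print(r_xz)                # should be much closer to zero
```

The first pair was constructed to be related, so its r should sit well above 0.1; the second pair was generated independently, so its r should be small.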

In regression, we seek to determine whether X can predict Y. For instance, do GRE scores predict success at graduate school? Do MCAT scores predict success at medical school? Does income predict happiness?

The general form of regression is Y = B0 + B1X. Hopefully you remember that this is essentially the equation of a line: the formula you learned in high school would have been Y = MX + B, which can be rewritten as Y = B + MX. In regression models, B0 is a constant (the intercept, B in the high school formula) and B1 is the coefficient for X (the slope, M in the high school formula). Think of it this way: income may range from $0 to $1,000,000 in our data and our happiness score might only range from 1 to 5. Thus, the regression model needs to rescale the income scores by multiplying them by B1 and adding B0 to predict a score between 1 and 5.
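To make the arithmetic concrete, here is a minimal sketch of Y = B0 + B1X applied to the income and happiness example. The values of b0 and b1 are hypothetical, picked by hand so that incomes between $0 and $1,000,000 land on the 1 to 5 scale; they are not estimated from any data.

```r
# Hypothetical coefficients, chosen only to illustrate the arithmetic
b0 <- 1        # constant (intercept)
b1 <- 4e-06    # coefficient for X (change in Y per dollar of income)

income <- c(0, 250000, 500000, 1000000)   # example X values
happiness <- b0 + b1 * income             # predicted Y = B0 + B1 * X

print(happiness)
```

With these made-up coefficients the four incomes map to happiness scores of 1, 2, 3 and 5.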

Load the data HERE into a table in R called data.
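One possible way to load such a file is sketched below. It assumes the data is a plain-text file of whitespace-separated numbers with no header row, which would explain the column names V1 and V2 used in the commands that follow. The file contents and the temporary path are made up so that the sketch runs on its own; substitute the path of the file you actually downloaded.

```r
# Write a tiny made-up file so the sketch is self-contained.
# The real course file, its name, and its layout are NOT known here.
path <- tempfile(fileext = ".txt")
writeLines(c("52000 10", "61000 18", "47000 7"), path)

# With header = FALSE, read.table() names the columns V1, V2, ...
data <- read.table(path, header = FALSE)

str(data)     # inspect the column names and types
head(data)    # look at the first few rows
```

If your file has a header row or uses commas as separators, the header and sep arguments of read.table() would need to change accordingly.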

Running a regression is simple. All you need to do is use the following commands, where the formula data$V1~data$V2 means that V1 (our Y) is predicted from V2 (our X):
model = lm(data$V1~data$V2)
summary(model)

You should see output that looks like this:

Call:
lm(formula = data$V1 ~ data$V2)

Residuals:
   Min     1Q Median     3Q    Max
-36783  -9544  -1284   7467  74017

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  38504.2    11889.9   3.238  0.00141 **
data$V2       1260.2      234.9   5.364 2.29e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16040 on 195 degrees of freedom
Multiple R-squared: 0.1286, Adjusted R-squared: 0.1241
F-statistic: 28.77 on 1 and 195 DF, p-value: 2.294e-07

Essentially, R is telling us that the model fits: the p-value on the F-statistic line (2.294e-07) is less than 0.05. You will notice that it returns a multiple R-squared value, which for a regression with a single predictor is the square of the correlation coefficient r (here 0.1286, so r is about 0.36). It also returns B0 and B1 in the Estimate column, which in this case are 38504.2 and 1260.2, respectively. Thus, the regression equation would be Y = 38504.2 + 1260.2X for this data.
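The two ideas in this paragraph can be checked with a short sketch. The first part plugs the reported B0 and B1 into the regression equation for a few X values; those X values are made up, since the meaning of V2 is not stated. The second part uses simulated data (again an assumption, not the course data) to confirm that R-squared from lm() matches the squared Pearson r when there is one predictor.

```r
# Part 1: predictions from the reported regression equation
b0 <- 38504.2                 # (Intercept) estimate from the output
b1 <- 1260.2                  # data$V2 estimate from the output
x_new <- c(0, 10, 50)         # made-up X values for illustration
y_hat <- b0 + b1 * x_new      # Y = B0 + B1 * X
print(y_hat)

# Part 2: R-squared equals r squared for a one-predictor model
set.seed(42)
x <- rnorm(50)                          # simulated predictor
y <- 3 + 2 * x + rnorm(50)              # simulated outcome
fit <- lm(y ~ x)

print(coef(fit))                        # B0 and B1 as a named vector
r2_model <- summary(fit)$r.squared      # multiple R-squared
r2_cor <- cor(x, y)^2                   # squared Pearson r
print(all.equal(r2_model, r2_cor))
```

On your own model, coef(model) and summary(model)$r.squared pull out the same numbers that appear in the printed summary.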

Note that a model might not always fit the data. You can see the linear model by plotting the data, with Y (V1) on the vertical axis and X (V2) on the horizontal axis, and then drawing the fitted line over it:
plot(data$V1~data$V2)
abline(model)

Assignment
Using the data HERE, construct regression models for all of the variables in columns 2 to 6 against column 1. Thus, you will be running 5 separate linear regressions. Hand in the results of each model and a plot of each model.