The focus of this blog post is on simple linear regression using R. Simple linear regression is useful for examining or modelling the relationship between two numeric variables. Before looking for the type of relationship between pairs of quantities, it is recommended to conduct a correlation analysis to determine whether there is a linear relationship between these quantities. For this post we will use the classic cars data set to create a linear regression model.
Getting a feel of the data set
Before diving into model fitting, it is worthwhile to explore the data set briefly to check that simple linear regression is a sensible approach.
Have a peek at the data
The cars data set gives the speed of cars and the distances taken to stop. We will model the relationship between speed and distance. The explanatory variable will be speed while the dependent variable will be distance. Below are the first 6 rows of the data set.
data <- cars
head(data)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
We can also look at the structure of the entire data set with str(data):
'data.frame':	50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
We can see that our data consists of 50 observations of 2 variables, namely speed and dist. Both variables are numeric and, more importantly for simple linear regression, dist (the response variable) is a continuous numeric variable.
Visualising the relationship
Visualise the data on a scatterplot to see if a linear relationship exists
plot(cars$speed, cars$dist, main='Scatterplot of variables',ylab = "dist",xlab = "speed")
Computing the correlation
The next step will be to calculate the correlation coefficient to determine, more reliably than a scatterplot, whether there is a linear relationship between these quantities.
Quick review of correlation coefficient interpretation
The correlation coefficients range between -1 and 1, where:
- Values closer to 1 indicate a positive linear relationship between the variables.
- Values closer to -1 indicate a negative linear relationship.
- Values close to or equal to 0 indicate little or no linear relationship between the variables.
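For the cars data, the coefficient can be computed with base R's cor() function. This is a minimal sketch that assumes the default Pearson method, which the post does not state explicitly; the variable name r is ours.

```r
# Pearson correlation between speed and stopping distance
# (cor() uses method = "pearson" by default; assumed here)
r <- cor(cars$speed, cars$dist)
r
```

Because cor() is symmetric in its arguments, swapping speed and dist gives the same value.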
A correlation coefficient of 0.8068949 suggests that a fairly strong positive linear relationship exists between the two variables. It is important to note, however, that correlation alone does not tell us whether speed causes changes in dist.
In addition, if a correlation between two variables is not found, it does not necessarily imply that they are independent, as they might have a nonlinear relationship.
Building the model
Having established the possibility of a linear relationship between speed and dist, we now proceed to create a linear regression model for our variables.
Fit the model and produce the summary report
my_model <- lm(data$dist ~ data$speed)
summary(my_model)
Call:
lm(formula = data$dist ~ data$speed)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
data$speed    3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,	Adjusted R-squared:  0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
The summary of our model provides us with some information which we will discuss in detail below. We will have a look at the coefficients section of the model summary.
The coefficients section of the model summary
The coefficients section consists of a row for each of the terms included in the model, and the following columns:
- Estimate: The estimated value for each corresponding term in the model. Our model has two terms, the intercept and the slope.
- Std. Error: The standard error of the estimate, that is, the estimated standard deviation of the coefficient estimate. It measures how much the estimate would be expected to vary from sample to sample.
- t value: The t-statistic for each coefficient (the estimate divided by its standard error), used to test the null hypothesis that the corresponding coefficient is zero against the alternative that it is different from zero, given the other predictors in the model.
- Pr(>|t|): The p-value for that t-test, i.e. the probability of seeing a t-statistic at least this extreme if the corresponding coefficient were really zero.
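The columns of the coefficients table are linked: the t value is the estimate divided by its standard error, and the p-value follows from the t distribution on the residual degrees of freedom. The sketch below reproduces both by hand. The names fit, coefs, t_manual and p_manual are ours, and the model is refitted with the data argument, so the slope is labelled speed rather than data$speed.

```r
# Refit the model from the post; `fit` is our own name for it
fit <- lm(dist ~ speed, data = cars)

# Coefficient table as a numeric matrix
coefs <- coef(summary(fit))

# t value = Estimate / Std. Error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value from the t distribution on the residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

cbind(t_manual, p_manual)
```

The two columns printed should agree with the t value and Pr(>|t|) columns of the summary.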
In addition, we see the following information about the model we have created:
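- Residual standard error: An estimate of the standard deviation of the residuals, i.e. the typical distance between the observed values and the fitted line, in the units of the response variable.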
- Multiple R-squared: The proportion of the variability in the response variable that is explained by the model.
- Adjusted R-squared: A modified version of R-squared that has been adjusted for the number of predictors (explanatory variables) in the model.
- F-statistic: The test statistic for the F-test on the regression model. It tests for a significant linear regression relationship between the response variable and the predictor variables.
- p-value: The p-value for the F-test on the model.
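To make these definitions concrete, the quantities can be rebuilt from the residual and total sums of squares. This is a sketch using the standard textbook formulas, not a description of how summary() computes them internally; all variable names are ours.

```r
fit <- lm(dist ~ speed, data = cars)

n <- nrow(cars)  # number of observations
p <- 1           # number of predictors (speed only)

rss <- sum(residuals(fit)^2)                  # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares

rse    <- sqrt(rss / (n - p - 1))                       # residual standard error
r2     <- 1 - rss / tss                                 # multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)          # adjusted R-squared
f_stat <- ((tss - rss) / p) / (rss / (n - p - 1))       # F-statistic

round(c(rse = rse, r2 = r2, adj_r2 = adj_r2, f_stat = f_stat), 4)
```

These values should match the last three lines of the summary output up to rounding.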
The concept of statistical significance
Before interpreting the results from the summary of our model, we will discuss briefly the concept of statistical significance.
The concept of statistical significance is associated with statistical hypothesis tests, where we have a hypothesis of interest that we want to evaluate. When performing a statistical significance test, a null hypothesis is initially assumed. The null hypothesis could be something like: this particular parameter does not belong in the model, so its true value is zero. We then have to decide whether to reject the null hypothesis or fail to reject it. To make our decision, we analyse our data with a significance test. This is done by computing the p-value: the probability of observing data at least as extreme as ours if the null hypothesis were true. If this probability is sufficiently small (traditionally, less than 5% or less than 1%), we reject the null hypothesis and take this result as support for our hypothesis that the parameter should be included in our model.
It is important to note that the results of a statistical test do not provide mathematical certainty, only probability. As a result, a decision to reject the null hypothesis is probably right, but it may be wrong. The risk of wrongly rejecting a true null hypothesis is called the significance level of the test. It is the threshold against which the p-value is compared, and it is chosen by the researcher before the test is carried out.
Interpreting the results from the summary
From the summary of our model, we can see that the p-values associated with the t-tests for both coefficients are less than 0.05, which indicates that both the intercept and the slope for speed are significantly different from zero at the 5% level.
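Furthermore, the p-value for the F-test of the whole model is very small (1.49e-12), indicating a significant linear regression relationship between the response variable and the predictor variable. With a single predictor, this F-test is equivalent to the t-test on the slope, which is why the two p-values match.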
Lastly, from the Multiple R-squared value we can see that 65.11% of the variability in the response variable is explained by the model.
Plotting the data and fitting the regression line
plot(data$speed, data$dist, main = "Scatterplot of variables")
abline(my_model, col = 2, lwd = 3)
We can also calculate confidence intervals for our coefficients with confint(my_model), which uses a 95% confidence level by default:
                 2.5 %    97.5 %
(Intercept) -31.167850 -3.990340
data$speed    3.096964  4.767853
We can also change the level of confidence:
confint(my_model, level = 0.99)
                0.5 %    99.5 %
(Intercept) -35.706610 0.5484205
data$speed    2.817919 5.0468988
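These intervals take the form estimate ± critical t value × standard error. The sketch below rebuilds the 95% interval by hand under that formula; the names fit, level, t_crit and manual_ci are ours, and the model is refitted with the data argument, so the slope row is labelled speed.

```r
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))

level <- 0.95
# Critical value of the t distribution on the residual degrees of freedom
t_crit <- qt(1 - (1 - level) / 2, df = df.residual(fit))

manual_ci <- cbind(
  lower = coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
  upper = coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"]
)
manual_ci
```

Setting level to 0.99 widens the interval, because the critical value grows as the confidence level increases.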
We will now move on to diagnostic plots to assess the linear regression assumptions.
While the assumptions of a linear model are never perfectly met, we must still check if they are reasonable to work with. The assumptions are listed below:
- The Y values (or the errors) are independent.
- The Y values can be expressed as a linear function of the X variable.
- The variation of observations around the regression line is constant (homoscedasticity).
- For a given value of X, the Y values (or the errors) are normally distributed.
The residuals, which are defined as the differences between the observed y values and the predicted y values, are useful in helping us assess the linear regression assumptions. Residuals may reveal patterns in the data that the model leaves unexplained. By analysing residuals, we can not only verify whether the linear regression assumptions are met, but also improve our model in an exploratory way.
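As a quick illustration of that definition, the residuals can be computed directly as observed minus fitted values and compared with what R returns. The names fit and manual_resid are ours.

```r
fit <- lm(dist ~ speed, data = cars)

# Residual = observed response - value predicted by the model
manual_resid <- cars$dist - fitted(fit)

head(manual_resid)

# Should agree with R's own residuals() up to floating-point error
all.equal(unname(manual_resid), unname(residuals(fit)))
```

For a least-squares fit that includes an intercept, the residuals also sum to zero up to floating-point error.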
R makes the analysis of residuals fairly easy as the only thing we need to do is plot our model.
par(mfrow = c(2, 2))
plot(my_model)
By default R produces four plots: residuals versus fitted values, a Q-Q plot of standardized residuals, a scale-location plot, and a plot of residuals versus leverage that adds bands corresponding to Cook’s distances of 0.5 and 1.
We will start with the plot that shows residuals against fitted values. This gives us an indication of how well our model fits the data or if there exists a nonlinear relationship between the predictor and response variables that was not captured by the model. A good indication that our model fits the data well is if there are equally spread residuals around a horizontal line without distinct patterns.
In our case, the dotted horizontal line on the Residuals vs Fitted plot marks a residual of zero, so any point on that line is fitted exactly by the model. We can see that the residuals are not distributed randomly around this line: the red smoothing line reveals a nonlinear pattern. This could imply that we may get a better fit if we try a model with a quadratic term included.
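One way to try this is sketched below. It is an illustration rather than part of the original analysis: the names linear_fit and quad_fit are ours, and I(speed^2) is the usual way to add a squared term to an R formula.

```r
# Original straight-line model
linear_fit <- lm(dist ~ speed, data = cars)

# Same model with an added quadratic term; I() protects ^ inside the formula
quad_fit <- lm(dist ~ speed + I(speed^2), data = cars)

# Coefficient table for the larger model
summary(quad_fit)$coefficients

# F-test comparing the two nested models
anova(linear_fit, quad_fit)
```

The anova() table tests whether the quadratic term reduces the residual sum of squares by more than chance would explain, so its p-value is the number to look at before preferring the larger model.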
Let us look at the second diagnostic plot, the Normal Q-Q plot. This plot shows whether the residuals are normally distributed. The assumption is reasonable if the points lie close to the straight dashed line. Analysing the figure, we can see that the residuals do not follow the straight line well.
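A similar plot can be drawn by hand from the standardized residuals, which helps show what is being compared. This is a sketch using base R's qqnorm() and qqline(); the names fit, std_res and qq are ours, and the reference line drawn by qqline() passes through the first and third quartiles.

```r
fit <- lm(dist ~ speed, data = cars)

# Standardized residuals, as used in the Normal Q-Q diagnostic plot
std_res <- rstandard(fit)

# qqnorm() plots sample quantiles against theoretical normal quantiles
# and invisibly returns both sets of values
qq <- qqnorm(std_res, main = "Normal Q-Q")

# Reference line through the first and third quartiles
qqline(std_res, lty = 2)
```

Points that bend away from the reference line at either end suggest tails that are heavier or lighter than those of a normal distribution.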
Let us look at the third diagnostic plot, the Scale-Location plot, also called a spread-location plot. This plot shows whether the residuals are spread equally along the range of fitted values, which is how we check the assumption of equal variance (homoscedasticity). Ideally, we should see a horizontal line with randomly spread points. In our case, the residuals are not equally spread, so the homoscedasticity assumption does not look reasonable.
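The quantity on the vertical axis of this plot is the square root of the absolute standardized residuals. The sketch below rebuilds a comparable plot by hand; the names are ours, and the lowess() smoother is our choice for the trend line, so it may differ slightly from the one R draws.

```r
fit <- lm(dist ~ speed, data = cars)

# Vertical axis of the Scale-Location plot
sqrt_abs_res <- sqrt(abs(rstandard(fit)))

plot(fitted(fit), sqrt_abs_res,
     xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)",
     main = "Scale-Location")

# Smoothed trend; a roughly flat line supports constant variance
lines(lowess(fitted(fit), sqrt_abs_res), col = 2)
```

An upward or downward trend in the smoothed line indicates that the spread of the residuals changes with the fitted value.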
The last diagnostic plot is a plot of Residuals vs Leverage. This plot helps us identify influential points. It is important to note that not all outliers are necessarily influential in linear regression analysis. The influence of an observation is determined by how much the fitted values change when that observation is excluded, and Cook's distance is a good measure of this. The leverage of an observation is based on how much its value on the predictor variable differs from the mean of the predictor variable. The higher the leverage of an observation, the greater its potential to influence the fit. In our plot, Cook's distance is represented by the dashed red lines, and observations outside those lines are of interest to us: they have a large Cook's distance and are therefore more likely to be influencing our model.
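Leverage and Cook's distance can also be inspected numerically with base R's hatvalues() and cooks.distance(). The sketch below additionally checks the leverage formula for a single predictor; all variable names are ours, and the 0.5 cut-off simply mirrors the inner band drawn on the plot.

```r
fit <- lm(dist ~ speed, data = cars)

lev  <- hatvalues(fit)       # leverage of each observation
cook <- cooks.distance(fit)  # Cook's distance of each observation

# With one predictor, leverage depends only on the distance of x from its mean
x <- cars$speed
lev_manual <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(lev), lev_manual)

# Observations with the largest Cook's distances
head(sort(cook, decreasing = TRUE), 3)

# Observations beyond the 0.5 band, if any
which(cook > 0.5)
```

Sorting by Cook's distance is a convenient way to decide which observations to examine first, even when none crosses a band.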
From our diagnostic analysis, we can deduce that there may be a nonlinear relationship that is not captured by our model. The next step would be to include a quadratic term and determine whether we get a better fit. More concepts will be looked at in future blog posts.