Linear Regression

Linear Regression with One Independent Variable

Simple linear regression fits a line to a set of paired data points so that one number in each pair, called the independent variable (IV), can be used to predict the second, called the dependent variable (DV). In Equation (1) below, $X_i$ is the IV for the ith data point, and $Y_i$ is its DV. The true $\alpha_0$ (the y-axis intercept) and $\alpha_1$ (the IV’s coefficient) are unknown, but the least squares statistical procedure fits an equation to the data, Equation (2) below, to obtain their corresponding estimates $\hat{\alpha}_0$ and $\hat{\alpha}_1$, from which an estimate $\hat{Y}_i$ of $Y_i$ can be calculated.

$$Y_i = \alpha_0 + \alpha_1 X_i + \varepsilon_i \tag{1}$$
$$\hat{Y}_i = \hat{\alpha}_0 + \hat{\alpha}_1 X_i \tag{2}$$

The least squares procedure calculates the estimated parameters shown in Equation (2) by minimizing the sum (equivalently, the average) of the squared error terms, $\sum_i (Y_i - \hat{Y}_i)^2$.
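For a single independent variable, the least squares estimates have a well-known closed form: the slope is the sum of cross-deviations divided by the sum of squared IV deviations, and the intercept makes the line pass through the point of means. The following is a minimal sketch of that calculation. The function name `fit_simple_regression` and the five data points are hypothetical, chosen only for illustration; they are not the automobile data plotted in this section.

```python
def fit_simple_regression(x, y):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    # Intercept: forces the fitted line through the point of means.
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data (assumption): weight in thousands of pounds, and mpg.
weights = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [34.0, 29.0, 26.0, 21.0, 20.0]

intercept, slope = fit_simple_regression(weights, mpg)
predicted = [intercept + slope * w for w in weights]
# Sum of squared errors: the quantity that least squares minimizes.
sse = sum((m - p) ** 2 for m, p in zip(mpg, predicted))
print(f"intercept={intercept:.3f}, slope={slope:.3f}, SSE={sse:.3f}")
```

Nudging the slope or intercept away from the returned values and recomputing `sse` should always give a larger sum, which is what "least squares" means in practice.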

Figure 1 below is a scatterplot of pairs of data related to various models of automobiles. Each data pair consists of an automobile model’s weight and the miles per gallon (mpg) that model obtains. The purpose of fitting a regression to these data is to predict mpg, shown on the y-axis, based on automobile weight, shown on the x-axis. As would be expected, the data show a downward trend, with mpg decreasing as weight increases, and the regression line fit to the data therefore has a negative slope. For a particular value of weight, the vertical distance between the actual data point and the corresponding point on the line is the estimated error term for that automobile.


Figure 1. Scatterplot of Automobile Data with Simple Regression Line

If a regression model is an appropriate one with which to predict the dependent variable, which is typically the variable of interest, the error terms $\varepsilon_i$ should follow a normal distribution with mean 0. One easy way to judge the effectiveness of the model in predicting the variable of interest is to calculate and plot the residuals, the differences between the actual values and those predicted by the model. Figure 2 below shows the residuals for the automobile regression model. These appear to display no upward or downward trend and tend to cluster about the center. If one were able to tilt Figure 2’s plot onto its edge so that the data points dropped off the graph and onto a tabletop, they would form a pile that looked somewhat like a normal probability distribution. The residuals therefore suggest that the regression model provides a good fit to the data.


Figure 2. Regression Model Residuals
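The "tilt the plot onto its edge" idea can be approximated by counting residuals in bins and printing the counts as a text histogram. The sketch below does this on simulated data. The data-generating line (45 minus 6 times weight), the noise level, and the seed are assumptions made for illustration; they are not the values behind Figure 2.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data (assumption): mpg falls linearly with weight, plus normal noise.
weights = [random.uniform(1.5, 5.0) for _ in range(200)]
mpg = [45.0 - 6.0 * w + random.gauss(0.0, 2.0) for w in weights]

# Least squares fit (closed form for one independent variable).
n = len(weights)
w_mean = sum(weights) / n
m_mean = sum(mpg) / n
slope = (sum((w - w_mean) * (m - m_mean) for w, m in zip(weights, mpg))
         / sum((w - w_mean) ** 2 for w in weights))
intercept = m_mean - slope * w_mean

# Residual = actual value minus the value predicted by the fitted line.
residuals = [m - (intercept + slope * w) for w, m in zip(weights, mpg)]

# Count residuals in bins one mpg wide, centered on whole numbers,
# and print one row of '#' marks per bin.
bins = Counter(round(r) for r in residuals)
for b in sorted(bins):
    print(f"{b:+3d} | {'#' * bins[b]}")
```

If the model is appropriate, the rows should be longest near 0 and taper off on both sides. A lopsided or two-humped pile would be a reason to look more closely at the model.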

Linear Regression with Multiple Independent Variables

A linear regression having more than one independent variable functions in much the same way, although it is somewhat harder to visualize. Equations (3) and (4) correspond to Equations (1) and (2), respectively. In this model, the number of independent variables is some number n.

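$$Y_i = \alpha_0 + \alpha_1 X_{i1} + \alpha_2 X_{i2} + \alpha_3 X_{i3} + \dots + \alpha_n X_{in} + \varepsilon_i \tag{3}$$
$$\hat{Y}_i = \hat{\alpha}_0 + \hat{\alpha}_1 X_{i1} + \hat{\alpha}_2 X_{i2} + \hat{\alpha}_3 X_{i3} + \dots + \hat{\alpha}_n X_{in} \tag{4}$$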

As in simple linear regression, each IV’s coefficient indicates the relationship of the DV to that independent variable. In the automobile example above, mpg and weight were negatively correlated, meaning that as weight increased, mpg decreased. Consequently, the value of $\hat{\alpha}_1$ in the example above was negative. In a similar manner, the signs of the coefficients in Equation (4) reflect the relationships between those IVs and the DV.
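With several independent variables, the least squares estimates can be obtained by solving the normal equations, (X'X)a = X'y, where each row of X holds a leading 1 (for the intercept) followed by that data point's IV values. The following is a minimal sketch in plain Python. The helper names `solve_linear_system` and `fit_multiple_regression`, the six data rows, and the coefficients used to generate mpg are all hypothetical. Production code would normally rely on a numerical library instead of hand-written elimination.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple_regression(rows, y):
    """Return [a0, a1, ..., an] solving the normal equations (X'X) a = X'y."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical automobiles (assumption): (weight in 1000 lb, displacement in 100 cu in).
autos = [(2.0, 1.0), (2.5, 1.6), (3.0, 2.0), (3.5, 3.1), (4.0, 3.5), (4.5, 4.0)]
# Noise-free mpg built from known coefficients so the fit can be checked.
mpg = [46.0 - 5.0 * w - 1.5 * d for w, d in autos]

coefficients = fit_multiple_regression(autos, mpg)
print([round(c, 6) for c in coefficients])
```

Because the hypothetical mpg values contain no noise, the fit should recover the generating intercept and the two negative coefficients up to rounding error. With real data the estimates would only approximate the underlying relationship.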

Figure 3 below shows a three-dimensional plot of automobile weight and displacement used to predict mpg. These data are used to fit a multiple linear regression model, which in the three-dimensional case is a plane rather than a line. In the simple regression above, the model with parameters included was as follows:

The model fitted to include displacement data is as follows:

indicating that mpg and displacement are also negatively correlated. Displacement’s coefficient is much smaller in magnitude than weight’s, however, possibly indicating a much weaker correlation. Figure 3’s plot shows this: if the weight dimension were removed, the resulting line, shown on the left-hand wall of the box, would slope much less steeply than the line on the “back” wall of the box, which corresponds to the line in Figure 1.


Figure 3. Scatterplot of Three Measures Associated with an Automobile and the Regression Plane Fitted to this Data
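One caution when reading coefficient sizes: a coefficient's magnitude depends on the units in which its IV is measured, whereas the correlation does not. The sketch below illustrates this with a single predictor expressed in two different units. The function name `slope_and_correlation` and the data values are hypothetical, and it uses a simple regression in place of the two-variable model above to keep the example short.

```python
from math import sqrt


def slope_and_correlation(x, y):
    """Return (least squares slope, Pearson correlation) for paired data."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    s_xx = sum((a - x_mean) ** 2 for a in x)
    s_yy = sum((b - y_mean) ** 2 for b in y)
    return s_xy / s_xx, s_xy / sqrt(s_xx * s_yy)


# Hypothetical data (assumption): displacement in cubic inches, and mpg.
displacement_cu_in = [100.0, 160.0, 200.0, 310.0, 350.0, 400.0]
mpg = [33.0, 29.0, 27.0, 22.0, 19.0, 18.0]

# The same measurements expressed in liters.
CU_IN_PER_LITER = 61.0237
displacement_liters = [d / CU_IN_PER_LITER for d in displacement_cu_in]

slope_cu_in, r_cu_in = slope_and_correlation(displacement_cu_in, mpg)
slope_liters, r_liters = slope_and_correlation(displacement_liters, mpg)

print(f"per cubic inch: slope={slope_cu_in:.4f}, r={r_cu_in:.4f}")
print(f"per liter:      slope={slope_liters:.4f}, r={r_liters:.4f}")
```

The slope per liter is about 61 times the slope per cubic inch, yet the correlation is identical in both cases. A small coefficient is therefore only suggestive of a weak relationship unless the IVs are on comparable scales.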


This section looked at the concept and performance of multiple regression. However, the validity of a multiple regression analysis rests on a series of assumptions that should be tested. Moreover, it is not always obvious which independent variables should be included in the final regression model. We’ll look at these assumptions and how they are tested in the next section.