Regression Data

A number of assumptions govern statistical regression analysis:

Variables must be continuous or categorical. A continuous variable may take any real value, whereas a categorical variable is coded as integers. For instance, a categorical IV could be a respondent’s gender, which would be represented in the regression data as one of a pair of integers, typically 0 and 1. The DV, however, must be a real number. (See logistic regression for cases where the dependent variable is an integer.)

Multicollinearity, a high level of correlation between two or more IVs in a model, should be minimized. Some multicollinearity may be unavoidable. Its effect is to reduce confidence in the model in three ways: it increases the variability of the model parameters, which can be thought of as variables in their own right; it limits R2, the proportion of the variability present in the data for which the model accounts; and it makes the relative importance of individual IVs harder to determine. As we will see, the IVs that should be included in a model are not always obvious. A researcher will most likely have to select some subset of those available for the final model.

Two obvious ways to identify multicollinearity are (1) to calculate the correlations between IVs and (2) to calculate each IV’s variance inflation factor (VIF) or its reciprocal, the tolerance, both of which statistical software can generate. The correlation matrix of the IVs reveals highly correlated variables, allowing the researcher to eliminate all but one variable from each highly correlated group. The VIF plays a similar role.
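Both checks can be illustrated with a minimal sketch. It assumes a model with exactly two IVs, in which case regressing one IV on the other gives an R2 equal to the squared correlation, so the VIF reduces to 1 / (1 - r^2). The function names and the sample values are hypothetical and are not the article's data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """Return (VIF, tolerance) for a model with exactly two IVs.

    With two IVs, the R2 from regressing one on the other equals r^2,
    so tolerance = 1 - r^2 and VIF = 1 / tolerance.
    """
    r = pearson_r(x1, x2)
    tolerance = 1.0 - r ** 2
    return 1.0 / tolerance, tolerance


# Hypothetical values for two IVs (illustration only).
weight = [2.6, 2.9, 2.3, 3.2, 3.4, 3.5, 3.6, 3.2]
displacement = [160, 160, 108, 258, 360, 225, 360, 147]

r = pearson_r(weight, displacement)
vif, tol = vif_two_predictors(weight, displacement)
print(f"r = {r:.3f}, VIF = {vif:.2f}, tolerance = {tol:.3f}")
```

A VIF of 1 means the IV is uncorrelated with the other; larger values indicate stronger multicollinearity. With more than two IVs, each VIF would instead come from regressing that IV on all of the others, which this sketch does not attempt.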

The data should display a linear relationship. If they do not, a nonlinear regression model would be more appropriate. A horseshoe-shaped pattern in the plotted residuals is one indication of a quadratic relationship between independent and dependent variables.

The data should exhibit homoscedasticity, that is, the variance of the residuals should be constant across all levels of the IVs. Data that do not are termed heteroscedastic.

The residuals, discussed in the previous section, should be normally distributed with a mean of 0. Note in Figure 2 that the automobile regression model’s residuals tend to cluster around 0. A scatterplot or histogram is frequently enough to determine how closely a regression model’s residuals satisfy this assumption.

Not only can individual IVs be correlated, but the residuals can also be correlated with one another, a condition known as autocorrelation or serial correlation. It occurs when the data exhibit a repeating pattern, such as seasonality in sales data or a recurring signal. The run charts employed in quality management use serial correlation as an indicator of a problem in a process. The Durbin-Watson test is used to detect autocorrelation. Once autocorrelation is detected, transforming the data is one way to remove it.
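As a rough illustration of the statistic behind the Durbin-Watson test, the sketch below computes it as the sum of squared successive differences of the residuals divided by the sum of squared residuals. The residual series are hypothetical, and the sketch computes only the statistic, not the critical values needed for a formal test.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of residuals in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals (illustration only).
trending = [1.0, 0.8, 0.6, 0.3, -0.2, -0.5, -0.8, -1.0]        # neighbors alike
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]     # neighbors opposite

print(f"trending:    {durbin_watson(trending):.3f}")
print(f"alternating: {durbin_watson(alternating):.3f}")
```

The statistic lies between 0 and 4. Values near 2 suggest little first-order autocorrelation, values toward 0 suggest positive autocorrelation (as in the trending series), and values toward 4 suggest negative autocorrelation (as in the alternating series).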

There need to be sufficient data to generate a valid regression model. A rule of thumb says to have 10 or 15 cases per IV. However, this represents a simplification. A researcher determines the power of regression test statistics and can control the power level through sample size. Also, sample size influences the size of the model’s effect (i.e., how effective it is at predicting). In all these, more is better, and determining sample size is typically a tradeoff between increased accuracy and the cost of obtaining the data.

Ways to Test the Validity of a Regression Model

Statistical Tests

Various regression testing statistics exist, both to test the validity of the overall model and to test the appropriateness of including individual IVs. Among the former is the coefficient of determination R2, with 0 ≤ R2 ≤ 1, although it will rarely be exactly 0 or 1. In simple terms, R2 measures the proportion of variability present in the data that the regression model captures, and the larger R2 is, the better. The adjusted R2 attempts to adjust the coefficient of determination based on the sample to a corresponding value based on the population, if that were possible. Thus, it measures the loss in predictive power based on using the sample rather than the entire population from which the sample was drawn.

Also measuring the effectiveness of the entire model is the F statistic. As in an ANOVA, regression segments the variability present in data into that for which the model accounts and that attributable to random error, and a model’s F statistic is the ratio of the former over the latter, each divided by its degrees of freedom. The F statistic is governed by the F distribution, thus allowing testing of statistical significance of the model.
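The relationship among R2, the adjusted R2, and the F statistic can be sketched with the standard textbook formulas, which the article itself does not state and which are therefore assumptions here. The inputs are the figures this section reports for the automobile model (R2 = .7809, 32 car models, 2 IVs). The function names are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for n cases and k IVs (standard textbook formula)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """Overall F statistic with k and n - k - 1 degrees of freedom."""
    explained_per_df = r2 / k
    unexplained_per_df = (1.0 - r2) / (n - k - 1)
    return explained_per_df / unexplained_per_df


# Values reported for the automobile model in this section.
r2, n, k = 0.7809, 32, 2

print(f"adjusted R2 = {adjusted_r2(r2, n, k):.4f}")
print(f"F = {f_statistic(r2, n, k):.2f} on {k} and {n - k - 1} degrees of freedom")
```

The results should land close to the reported adjusted R2 of .7658 and F statistic of 51.69. Small differences in the last digit are expected because the R2 fed in has already been rounded to four places.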

For instance, in the multiple regression model above, the multiple R2 was .7809, the adjusted R2 was .7658, and the F statistic was 51.69 with degrees of freedom 2 and 29. (The former represents the number of IVs and the latter the sample size less the number of IVs less 1. Our automobile sample contained 32 car models, and the number of predictors, or IVs, was 2, so 32 - 2 - 1 = 29.) The probability of obtaining an F statistic at least as extreme as 51.69 with degrees of freedom of 2 and 29 was found to be 2.744e-10. Classical statistics assumes that we are testing a null hypothesis that all of the IV coefficients are 0; at a pre-selected level of statistical significance of even .001, we would reject this null hypothesis because 2.744e-10 < .001.

Among regression model statistics that validate individual parameter estimates (i.e., the model IV coefficients) are the t-values associated with individual IV coefficients. The statistical software, R in this case, will generate a table like the one in Table 1, which is a crucial part of a presentation of regression testing statistics.

Table 1. Output of a Multiple Regression Analysis

               Estimate    Std. Error   t value   Pr(>|t|)
(Intercept)    34.96055    2.16454      16.151    4.91E-16   ***
weight         -3.35082    1.16413      -2.878    0.00743    **
displacement   -0.01773    0.00919      -1.929    0.06362    .

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In judging the significance of coefficient values, the t-values and their corresponding probabilities, in columns 4 and 5, respectively, are important. We are, in a sense, treating the IVs’ coefficients as variables rather than as fixed values, and t-values, rather than z-values, are used because the variance of these coefficients is unknown and must be estimated. The probabilities in Table 1 allow statistical testing, in this case by comparing the probability of obtaining t-values as extreme as those we have with our pre-determined level of significance. As can be seen, the estimate for the variable weight has statistical significance at the .01 level, and the estimate for displacement at the .1 level. We have somewhat less confidence in the latter’s ability to predict mpg than in the former’s.
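Each t-value in Table 1 is the coefficient estimate divided by its standard error, which a short sketch can confirm from the table's own figures. The dictionary layout and function name are hypothetical. The probabilities in the last column require the t distribution and are not recomputed here.

```python
# Estimates and standard errors copied from Table 1.
table_1 = {
    "(Intercept)": (34.96055, 2.16454),
    "weight": (-3.35082, 1.16413),
    "displacement": (-0.01773, 0.00919),
}


def t_value(estimate, std_error):
    """t-value of a coefficient: estimate divided by its standard error."""
    return estimate / std_error


for name, (estimate, std_error) in table_1.items():
    print(f"{name:<13} t = {t_value(estimate, std_error):.3f}")
```

The printed values should agree with the t value column of Table 1 to within rounding, since the estimates and standard errors in the table are themselves rounded.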

Splitting the Data

Another method through which to explore the validity of a multiple regression model is to split the data and use one portion to fit the model. Once obtained, the model can then be used to predict the DV values of the other portion of the data. Comparing the predicted values with the actual ones reveals how accurate the regression is at prediction.
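The split-and-predict idea can be sketched as follows. To stay short, the sketch assumes a single IV rather than a multiple regression, uses synthetic data generated from a made-up line plus noise, and picks a 70/30 split and root-mean-square error (RMSE) as the comparison; none of these choices come from the article.

```python
import random


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for one IV."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return (sum(squared) / len(squared)) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data (illustration only): y = 35 - 5x + noise.
x = [rng.uniform(1.5, 5.5) for _ in range(40)]
y = [35.0 - 5.0 * xi + rng.gauss(0, 2.0) for xi in x]

# Shuffle the case indices, then hold out the last 30 percent.
indices = list(range(len(x)))
rng.shuffle(indices)
cut = len(indices) * 7 // 10
train_idx, test_idx = indices[:cut], indices[cut:]

# Fit on the training portion only.
intercept, slope = fit_simple_ols(
    [x[i] for i in train_idx], [y[i] for i in train_idx]
)

# Predict the held-out DV values and compare with the actual ones.
predictions = [intercept + slope * x[i] for i in test_idx]
holdout_rmse = rmse([y[i] for i in test_idx], predictions)

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"hold-out RMSE = {holdout_rmse:.2f}")
```

A hold-out error much larger than the error on the fitting portion would suggest that the model does not generalize well to data it has not seen.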

Choice of Independent Variables

Another decision a researcher employing any type of multiple regression faces is which variables to include in the model. To a certain extent, this decision is based on expert knowledge of the subject matter. Including all possibilities is another option; the t-values can then be examined to determine which IVs seem to have greater significance and should be retained. Also, examination of the correlation matrix of the IVs indicates the more highly correlated variables, only one of which should be retained. Finally, procedures exist with which to introduce IVs and judge the marginal effect of doing so.