Like ordinary least squares (OLS) regression, logistic regression employs one or more independent variables to predict a dependent variable. Unlike OLS regression, the dependent variable is not continuous but is instead a categorical variable that takes on the values 0 and 1. Like OLS, fitting a logistic regression produces coefficients for the independent variables that communicate the effect of an incremental change in each on the dependent variable.

Given that Y can assume the values 0 and 1 and that there are p independent variables, the probability of Y taking on the value 1 is as follows:

$$P(Y=1) = \frac{1}{1 + \exp\{-(\alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\}} \qquad (1)$$

In interpreting results, employing the odds of the outcome is helpful because it makes the model coefficients more accessible. Mathematically, the odds are defined as follows:

$$\text{Odds} = \frac{P(Y=1)}{1 - P(Y=1)} \qquad (2)$$

For instance, suppose the odds of a thoroughbred winning a race are 9 to 1. The probability of the horse winning is then .9, and so, of course, the probability of its losing is .1. In this case, the odds as calculated in Equation (2) equal 9.

To restate Equation (1) in terms of the odds and increase its interpretability, substitute Equation (1) into Equation (2), which shows that $\text{Odds} = \exp\{\alpha + \beta_1 X_1 + \cdots + \beta_p X_p\}$; taking the natural log of both sides then yields Equation (3) below.

$$\ln(\text{Odds}) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p \qquad (3)$$
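As a quick numerical aside (not part of the original analysis), R's built-in plogis() and qlogis() functions convert between the probability scale of Equation (1) and the log-odds scale of Equation (3); the values below revisit the horse example.

    # Move between the probability and log-odds scales.
    p <- 0.9               # probability the horse wins
    p / (1 - p)            # Equation (2): the odds, 9
    qlogis(p)              # log-odds, ln(9), about 2.197
    plogis(log(9))         # back to the probability, 0.9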

A logistic regression was fit to data containing 400 observations concerning admissions to graduate school.[1] The dataset's dependent variable is "admit," which takes the value 1 if the applicant was admitted and 0 if not; the independent variables are GRE score ("gre") and grade point average ("gpa"). The model was fitted with R's glm() function with family set to "binomial" and produced the output shown in Table 1.
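A minimal sketch of the fitting step appears below. It assumes the CSV file has been downloaded locally; the $-prefixed coefficient names in Table 1 suggest the variables were passed to glm() as vectors rather than through a data argument.

    # Read the admissions data and fit the two-variable logistic regression.
    admissions <- read.csv("binary.csv")
    fit <- glm(admissions$admit ~ admissions$gre + admissions$gpa,
               family = binomial)
    summary(fit)   # coefficient table reproduced in Table 1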

Table 1. Output of Logistic Regression with Two Independent Variables.

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -4.949378   1.075093  -4.604 4.15e-06 ***
admissions$gre  0.002691   0.001057   2.544   0.0109 *  
admissions$gpa  0.754687   0.319586   2.361   0.0182 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

All coefficients are statistically significant at the .05 level or better. The corresponding model is as follows:

$$P(Y=1) = \frac{1}{1 + \exp\{-(-4.949378 + 0.002691X_1 + 0.754687X_2)\}}$$

Thus, the probability of a student with a GRE of 500 and a GPA of 4.25 being admitted is 0.402. The odds of admittance are 0.673, or roughly 2 to 3, meaning that approximately 2 out of 5 applicants with similar credentials would be admitted.
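This figure can be checked directly from the Table 1 coefficients; a sketch follows, in which plogis() implements Equation (1):

    # Linear predictor for GRE = 500 and GPA = 4.25.
    eta <- -4.949378 + 0.002691 * 500 + 0.754687 * 4.25   # about -0.396
    plogis(eta)                        # probability of admission, about 0.402
    plogis(eta) / (1 - plogis(eta))    # odds of admittance, about 0.673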

A measure of the goodness of fit for a logistic regression is the log-likelihood. The deviance given in Table 2 is a multiple of the log-likelihood:

$$\text{deviance} = -2 \times \text{log-likelihood} \qquad (4)$$

where the difference in deviance between nested models has a Chi-square distribution.

Table 2. Deviance and AIC for the Logistic Regression with Two Independent Variables.

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 480.34  on 397  degrees of freedom
AIC: 486.34
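For ungrouped binary data such as these, the saturated model's log-likelihood is zero, so the residual deviance equals exactly -2 times the fitted model's log-likelihood. A quick check of Equation (4), assuming the fit object from the earlier sketch:

    # Equation (4): residual deviance is -2 times the log-likelihood here.
    logLik(fit)                   # log-likelihood of the fitted model
    -2 * as.numeric(logLik(fit))  # about 480.34
    deviance(fit)                 # the same value reported by glm()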

 

A new variable was added to the model, the "rank" of the applicant's undergraduate institution (rank = 1, 2, 3, or 4, with 1 indicating the highest-ranked institutions), and the resulting output is shown in Table 3.
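A sketch of the expanded fit, under the same assumptions as before; entering rank as a numeric predictor matches the single coefficient reported in Table 3:

    # Add undergraduate institution rank as a third predictor.
    fit3 <- glm(admissions$admit ~ admissions$gre + admissions$gpa +
                admissions$rank, family = binomial)
    summary(fit3)   # coefficient table reproduced in Table 3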

Table 3. Output of Logistic Regression with Three Independent Variables.

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -3.449548   1.132846  -3.045  0.00233 ** 
admissions$gre   0.002294   0.001092   2.101  0.03564 *  
admissions$gpa   0.777014   0.327484   2.373  0.01766 *  
admissions$rank -0.560031   0.127137  -4.405 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

As can be seen, the new variable is also statistically significant, at the .001 level, and its negative coefficient indicates that graduates of lower-ranked institutions have lower odds of admission. The resulting model is then the following:

$$P(Y=1) = \frac{1}{1 + \exp\{-(-3.449548 + 0.002294X_1 + 0.777014X_2 - 0.560031X_3)\}}$$

 

In this model, the probability of a student with a GRE of 500, a GPA of 4.25, and an undergraduate degree from a rank-3 college or university being admitted is 0.336. The odds of admittance are 0.506, or about 1 to 2, i.e., roughly 1 out of 3 applicants with similar credentials would be admitted (a numerical check appears after Table 4). The deviance output for this model is shown in Table 4.

Table 4. Deviance and AIC for the Logistic Regression with Three Independent Variables.

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 459.44  on 396  degrees of freedom
AIC: 467.44
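The numerical check promised above, using the Table 3 coefficients:

    # Linear predictor for GRE = 500, GPA = 4.25, rank = 3.
    eta3 <- -3.449548 + 0.002294 * 500 + 0.777014 * 4.25 - 0.560031 * 3
    plogis(eta3)                        # probability of admission, about 0.336
    plogis(eta3) / (1 - plogis(eta3))   # odds of admittance, about 0.506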

 

The deviance can be used to determine whether the new model represents an improvement over the previous one. Minus twice the log of the ratio of the two models' likelihoods, which equals the difference between their residual deviances, has a Chi-square distribution with degrees of freedom equal to the number of new variables added, 1 in this case. The statistic equals 20.90222 here and so has a p-value of 4.83e-06. Since this is statistically significant, incorporating undergraduate school rank has improved the model's effectiveness in predicting admission status.
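A sketch of the likelihood-ratio test, computed either from the reported deviances or directly from the two fitted objects defined in the earlier sketches:

    # The drop in residual deviance is chi-square with 1 degree of freedom.
    lr <- 480.34 - 459.44                   # about 20.90
    pchisq(lr, df = 1, lower.tail = FALSE)  # p-value, about 4.8e-06
    anova(fit, fit3, test = "Chisq")        # equivalent test from the models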

[1] These data are located at www.ats.ucla.edu/stat/data/binary.csv.