Logistic Regression Examples

library(mosaic)
library(tidyverse)
library(ResourceSelection)
library(pander)

Background

“The General Social Survey (GSS) conducts basic scientific research on the structure and development of American society with a data-collection program designed to both monitor societal change within the United States and to compare the United States to other nations.”\(^1\) It is a cooperative effort to survey the American people every couple years asking a wide variety of questions and has been going on since 1972. The 2012 data file is contained in the file GSS2012.csv.

#The GSS2012.csv file is actually a "tab" instead of "comma" separated file, so we have to use fancy code to read it in:
GSS <- read.table("../../../Data/GSS2012.csv", sep="\t", header=TRUE)

Each column name in the GSS dataset corresponds to a variable name in the General Social Survey. All variable names can be browsed in the 2012 General Social Survey Browser.

Hypotheses & Questions

Many questions could be answered with the GSS data. There are 1,974 individuals that responded to the 2012 survey and 818 variables were recorded for at least some of these individuals. Data for all 818 variables was not collected on every individual.

To see a list of all the variable names that were included in the 2012 survey type the command colnames(GSS) into your R Console after “Importing the Dataset” into R.

This analysis will explore the answer to the question,

Does a person’s frequency of prayer predict whether they believe divorce should be easier or more difficult to obtain than it is now?

Age could be a confounding factor, as older people may tend to feel differently about divorce laws than younger people, so it will be included in the model as a covariate. The variables divlaw, pray, and age will be used to analyze this question.

Summary of Variables Used

Looking the variables of divlaw, pray, and age up in the online GSS variable browser we gather the following information.

divlaw

divlaw: Should divorce in this country be easier or more difficult to obtain than it is now?

Response	Meaning
1	Easier
2	More difficult
3	Stay Same
0	IAP, question is inapplicable to this person for some reason.
8	DK, Don’t know
9	NA, question not asked to this individual.

barplot(table(GSS$divlaw), xlab="divlaw", col=c("gray","skyblue","skyblue","gray","gray","gray","gray"))

This barchart shows that of the people that responded to the question (answers 1, 2, and 3) the feelings are pretty strong that divorce laws should either be stricter (2) or less strict (1) although there are still a fair number that think it is just right (3) or are unsure how they feel (8). Only individuals that responded as a 1 or 2 will be used in this analysis.

pray

pray: How often does respondant pray?

Response	Meaning
1	Several times a day
2	Once a day
3	Several times a week
4	Once a week
5	Less than once a week
6	Never
0	IAP, question is inapplicable to this person for some reason.
8	DK, Don’t know
9	NA, question not asked to this individual.

barplot(table(GSS$pray), xlab="pray", col=c(rep("skyblue",6),"gray","gray"))

This barchart shows that most people that responded to this question claim they pray each day, with many praying several times a day (values of 1 and 2). Fewer pray seldomly (3-5) than those that never pray (6). Idividuals with a response of 8 or 9 will be omitted from the analysis.

age

age: Age of respondent.

Response	Meaning
18-88	Age in years
89	89 years old or older
98	DK, Don’t know
99	NA, question not asked to this individual.

hist(GSS$age, col='skyblue')

This histogram shows the distribution of ages of those that answered all survey questions of interest. A good range of values is represented with the majority being middle aged. Indiviuals with values of 98 and 99 will be omitted from this analysis.

Multiple Logistic Regression Model

To answer our question, we will use a multiple logistic regression of the form

\[ P(Y_i = 1|\, x_{i1},x_{i2},x_{i3},x_{i4},x_{i5},x_{i6}) = \frac{e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6}}}{1+e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6}}} = \pi_i \]

where

Variable	Value	Explanation
\(x_{i1}\)	18 to 88	the age of the individual
\(x_{i2}\)	1 if `pray == 2`, 0 otherwise	individual prays once a day
\(x_{i3}\)	1 if `pray == 3`, 0 o.w.	individual prays several times a week
\(x_{i4}\)	1 if `pray == 4`, 0 o.w.	individual prays once a week
\(x_{i5}\)	1 if`pray == 5`, 0 o.w.	individual prays less than once a week
\(x_{i6}\)	1 if`pray == 6`, 0 o.w.	individual never prays

\(Y_i=1\) will denote a person feels it should be more difficult to get a divorce than it is now. \(Y_i=0\) will denote a person that feels it should be easier than it is now to get a divorce.

The hypotheses for this study will concern the coefficients of the various responses for pray as well as the age covariate. Each will be tested at the \(\alpha = 0.05\) level.

\[ H_0: \beta_1 = 0 \text{ (age has no effect)} \\ H_a: \beta_1 \neq 0 \text{ (age has an effect)} \]

\[ H_0: \beta_2 = 0 \text{ (praying once a day has no effect)}\\ H_a: \beta_2 \neq 0 \text{ (praying once a day has an effect)} \]

\[ H_0: \beta_3 = 0 \text{ (praying several times a week has no effect)}\\ H_a: \beta_3 \neq 0 \text{ (praying several times a week has an effect)} \]

\[ H_0: \beta_4 = 0 \text{ (praying once a week has no effect)}\\ H_a: \beta_4 \neq 0 \text{ (praying once a week has an effect)} \]

\[ H_0: \beta_5 = 0 \text{ (praying less than once a week has no effect)}\\ H_a: \beta_5 \neq 0 \text{ (praying less than once a week has an effect)} \]

\[ H_0: \beta_6 = 0 \text{ (never praying has no effect)}\\ H_a: \beta_6 \neq 0 \text{ (never praying has an effect)} \]

Data Analysis

Because there are some missing values (NA, DK, Can’t choose, and IAP) in the data, we need to filter the data before analyzing it. (This is sometimes called data cleaning.)

Filter the Data

Click open the code button at the right to see how the data was cleaned for this study.

# Cleaned version of GSS
GSSc <- GSS %>%
  select(divlaw,age,pray) %>% #choose only 3 variables of interest
  filter(pray %in% 1:6,#keep only good values
         divlaw %in% c(1,2),
         age <= 89) %>%
  mutate(pray = as.factor(pray))

Logistic Regression Test

Now for the multiple logistic regression using GSSc.

GSSc.glm <- glm( (divlaw == 2) ~ age + pray, data=GSSc, family=binomial)
summary(GSSc.glm) %>% pander(caption="Multiple Logistic Regression Summary")

	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-0.3032	0.2322	-1.306	0.1917
age	0.01713	0.00398	4.305	1.672e-05
pray2	-0.3424	0.1724	-1.986	0.04699
pray3	-0.7159	0.254	-2.819	0.004816
pray4	-0.4237	0.2781	-1.524	0.1275
pray5	-0.3964	0.2533	-1.565	0.1176
pray6	-0.9507	0.231	-4.115	3.875e-05

(Dispersion parameter for binomial family taken to be 1 )

Null deviance:	1280 on 926 degrees of freedom
Residual deviance:	1231 on 920 degrees of freedom

See the end of this document for the interpretation on the results of this logistic regression.

Goodness-of-Fit Test

The Hosmer-Lemeshow test will be used to test the goodness of fit of this logistic regression model. The null assumes the logistic regression is a good fit. As shown below, there is insufficient evidence to reject the null \((p = 0.7823)\) so we will conclude a good logistic fit on these data.

hoslem.test(GSSc.glm$y, GSSc.glm$fitted) %>% pander(caption="Hosmer-Lemeshow Goodness-of-Fit Test")

Hosmer-Lemeshow Goodness-of-Fit Test
Test statistic	df	P value
4.766	8	0.7823

Visualization of Results

par(mai=c(1,1.3,.8,.1))

palette(rev(c("firebrick3","firebrick2","orange","wheat","skyblue","skyblue4")))
plot((divlaw == 2) + as.numeric(as.character(pray))*.01-.03 ~ age, data=GSSc, pch=16, cex=0.5, xlim=c(18,120), col=as.factor(pray), ylab="Probability of Favoring Stricter \n Divorce Laws", main="Age and Faith Lead to Greater Opposition of Divorce")

curve(exp(-0.30321 + 0.01713*x)/(1+exp(-0.30321 + 0.01713*x)), from=18, to=88, add=TRUE, col=palette()[1])

curve(exp(-0.30321-0.34240 + 0.01713*x)/(1+exp(-0.30321-0.34240 + 0.01713*x)), from=18, to=88, add=TRUE, col=palette()[2])

curve(exp(-0.30321-0.71595 + 0.01713*x)/(1+exp(-0.30321-0.71595 + 0.01713*x)), from=18, to=88, add=TRUE, col=palette()[3])

curve(exp(-0.30321-0.42372 + 0.01713*x)/(1+exp(-0.30321-0.42372 + 0.01713*x)), from=18, to=88, add=TRUE, col=palette()[4])

curve(exp(-0.30321-0.39636 + 0.01713*x)/(1+exp(-0.30321-0.39636 + 0.01713*x)), from=18, to=88, add=TRUE, col=palette()[5])

curve(exp(-0.30321-0.95067 + 0.01713*x)/(1+exp(-0.30321-0.95067 + 0.01713*x)), from=18, to=88, add=TRUE, col=palette()[6])

legend("right", legend=c("Many times a Day", "Once a Day", "Sev. Times a Week", "Once a Week", "Rarely", "Never"), title="Prays", col=palette(), lty=1, bty="n", cex=0.8)

Interpretation

Although age was not of direct interest in this study, it can be seen that for any level of pray the effect of age on the odds is \(e^{0.01713} = 1.0172776\) for every one year increase in age. Thus, the odds of supporting stricter laws on divorce is higher for the older population than it is for the younger population. This is the reason age was included as a covariate. It seemed logical from the beginning that age would have such an effect, so including it in the model allowed us to focus on the effect of pray while accounting for this assumedly known effect of age.

We interpret the pray1 level, which is the intercept in the current model, as the odds that a 0-year old person (illogical), would support stricter divorce laws, \(e^{-0.30321} = 0.738444\). Note that this interpretation isn’t especially meaningful. This is because pray1 is the baseline level to which all other references will be made.

We interpret the pray2 variable as the effect praying once a day (as opposed to several times a day) will have on the odds that a person will favor stricter divorce laws, \(e^{-0.34240} = 0.7100641\). This means there is a 29% drop in the odds of favoring stricter divorce laws for those that pray once a day instead of several times a day.

The interpretation of the other variables is similar. For example, the coefficient of the pray3 variable is interpreted as the change in the odds that a person will favor stricter divorce laws given they pray several times a week instead of several times a day. This change is \(e^{-0.71595} = 0.4887276\), showing just over a 50% drop in the odds.

The most substantial drop in the odds is for those that never pray, pray6. Which comes out to be \(e^{-0.95067} = 0.386482\), or roughly a 62% drop in the odds.

The final conclusion is that there is sufficient evidence to conclude that those that pray several times a day have higher odds of supporting stricter divorce laws than those that never pray. (The answer is not as clear for the various levels of frequency of prayer, but is very clear for these two categories.)

Here are a few interesting situations to consider.

probs <- predict(GSSc.glm, data.frame(age=c(20,50,80), pray=as.factor(c(6,1,1))), type="response")

Age	Frequency of Prayer	Probability of Favoring Strictor Divorce Laws
20	Never	0.2867533
50	Several Times a Day	0.6349388
80	Several Times a Day	0.7441144

This shows that roughly 3 out of 4 of those 80 years old that pray frequently (several times a day) will favor stricter divorce laws while those who are in their twenties and never pray sit closer to 1 in 4 who favor the stricter divorce laws.