library(mosaic)
library(tidyverse)
library(ResourceSelection)
library(pander)
“The General Social Survey (GSS) conducts basic scientific research on the structure and development of American society with a data-collection program designed to both monitor societal change within the United States and to compare the United States to other nations.”\(^1\) It is a cooperative effort to survey the American people every couple years asking a wide variety of questions and has been going on since 1972. The 2012 data file is contained in the file GSS2012.csv
.
#The GSS2012.csv file is actually a "tab" instead of "comma" separated file, so we have to use fancy code to read it in:
GSS <- read.table("../../../Data/GSS2012.csv", sep="\t", header=TRUE)
Each column name in the GSS
dataset corresponds to a variable name in the General Social Survey. All variable names can be browsed in the 2012 General Social Survey Browser.
Many questions could be answered with the GSS
data. There are 1,974 individuals that responded to the 2012 survey and 818 variables were recorded for at least some of these individuals. Data for all 818 variables was not collected on every individual.
colnames(GSS)
into your R Console after “Importing the Dataset” into R.
This analysis will explore the answer to the question,
Does a person’s frequency of prayer predict whether they believe divorce should be easier or more difficult to obtain than it is now?
Age could be a confounding factor, as older people may tend to feel differently about divorce laws than younger people, so it will be included in the model as a covariate. The variables divlaw
, pray
, and age
will be used to analyze this question.
Looking the variables of divlaw
, pray
, and age
up in the online GSS variable browser we gather the following information.
divlaw
: Should divorce in this country be easier or more difficult to obtain than it is now?
Response | Meaning |
---|---|
1 | Easier |
2 | More difficult |
3 | Stay Same |
0 | IAP, question is inapplicable to this person for some reason. |
8 | DK, Don’t know |
9 | NA, question not asked to this individual. |
barplot(table(GSS$divlaw), xlab="divlaw", col=c("gray","skyblue","skyblue","gray","gray","gray","gray"))
This barchart shows that of the people that responded to the question (answers 1, 2, and 3) the feelings are pretty strong that divorce laws should either be stricter (2) or less strict (1) although there are still a fair number that think it is just right (3) or are unsure how they feel (8). Only individuals that responded as a 1 or 2 will be used in this analysis.
pray
: How often does respondant pray?
Response | Meaning |
---|---|
1 | Several times a day |
2 | Once a day |
3 | Several times a week |
4 | Once a week |
5 | Less than once a week |
6 | Never |
0 | IAP, question is inapplicable to this person for some reason. |
8 | DK, Don’t know |
9 | NA, question not asked to this individual. |
barplot(table(GSS$pray), xlab="pray", col=c(rep("skyblue",6),"gray","gray"))
This barchart shows that most people that responded to this question claim they pray each day, with many praying several times a day (values of 1 and 2). Fewer pray seldomly (3-5) than those that never pray (6). Idividuals with a response of 8 or 9 will be omitted from the analysis.
age
: Age of respondent.
Response | Meaning |
---|---|
18-88 | Age in years |
89 | 89 years old or older |
98 | DK, Don’t know |
99 | NA, question not asked to this individual. |
hist(GSS$age, col='skyblue')
This histogram shows the distribution of ages of those that answered all survey questions of interest. A good range of values is represented with the majority being middle aged. Indiviuals with values of 98 and 99 will be omitted from this analysis.
To answer our question, we will use a multiple logistic regression of the form
\[ P(Y_i = 1|\, x_{i1},x_{i2},x_{i3},x_{i4},x_{i5},x_{i6}) = \frac{e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6}}}{1+e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6}}} = \pi_i \]
where
Variable | Value | Explanation |
---|---|---|
\(x_{i1}\) | 18 to 88 | the age of the individual |
\(x_{i2}\) | 1 if pray == 2 , 0 otherwise |
individual prays once a day |
\(x_{i3}\) | 1 if pray == 3 , 0 o.w. |
individual prays several times a week |
\(x_{i4}\) | 1 if pray == 4 , 0 o.w. |
individual prays once a week |
\(x_{i5}\) | 1 ifpray == 5 , 0 o.w. |
individual prays less than once a week |
\(x_{i6}\) | 1 ifpray == 6 , 0 o.w. |
individual never prays |
\(Y_i=1\) will denote a person feels it should be more difficult to get a divorce than it is now. \(Y_i=0\) will denote a person that feels it should be easier than it is now to get a divorce.
The hypotheses for this study will concern the coefficients of the various responses for pray
as well as the age
covariate. Each will be tested at the \(\alpha = 0.05\) level.
\[ H_0: \beta_1 = 0 \text{ (age has no effect)} \\ H_a: \beta_1 \neq 0 \text{ (age has an effect)} \]
\[ H_0: \beta_2 = 0 \text{ (praying once a day has no effect)}\\ H_a: \beta_2 \neq 0 \text{ (praying once a day has an effect)} \]
\[ H_0: \beta_3 = 0 \text{ (praying several times a week has no effect)}\\ H_a: \beta_3 \neq 0 \text{ (praying several times a week has an effect)} \]
\[ H_0: \beta_4 = 0 \text{ (praying once a week has no effect)}\\ H_a: \beta_4 \neq 0 \text{ (praying once a week has an effect)} \]
\[ H_0: \beta_5 = 0 \text{ (praying less than once a week has no effect)}\\ H_a: \beta_5 \neq 0 \text{ (praying less than once a week has an effect)} \]
\[ H_0: \beta_6 = 0 \text{ (never praying has no effect)}\\ H_a: \beta_6 \neq 0 \text{ (never praying has an effect)} \]
Because there are some missing values (NA, DK, Can’t choose, and IAP) in the data, we need to filter the data before analyzing it. (This is sometimes called data cleaning.)
Click open the code button at the right to see how the data was cleaned for this study.
# Cleaned version of GSS
GSSc <- GSS %>%
select(divlaw,age,pray) %>% #choose only 3 variables of interest
filter(pray %in% 1:6,#keep only good values
divlaw %in% c(1,2),
age <= 89) %>%
mutate(pray = as.factor(pray))
Now for the multiple logistic regression using GSSc
.
GSSc.glm <- glm( (divlaw == 2) ~ age + pray, data=GSSc, family=binomial)
summary(GSSc.glm) %>% pander(caption="Multiple Logistic Regression Summary")
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -0.3032 | 0.2322 | -1.306 | 0.1917 |
age | 0.01713 | 0.00398 | 4.305 | 1.672e-05 |
pray2 | -0.3424 | 0.1724 | -1.986 | 0.04699 |
pray3 | -0.7159 | 0.254 | -2.819 | 0.004816 |
pray4 | -0.4237 | 0.2781 | -1.524 | 0.1275 |
pray5 | -0.3964 | 0.2533 | -1.565 | 0.1176 |
pray6 | -0.9507 | 0.231 | -4.115 | 3.875e-05 |
(Dispersion parameter for binomial family taken to be 1 )
Null deviance: | 1280 on 926 degrees of freedom |
Residual deviance: | 1231 on 920 degrees of freedom |
See the end of this document for the interpretation on the results of this logistic regression.
The Hosmer-Lemeshow test will be used to test the goodness of fit of this logistic regression model. The null assumes the logistic regression is a good fit. As shown below, there is insufficient evidence to reject the null \((p = 0.7823)\) so we will conclude a good logistic fit on these data.
hoslem.test(GSSc.glm$y, GSSc.glm$fitted) %>% pander(caption="Hosmer-Lemeshow Goodness-of-Fit Test")
Test statistic | df | P value |
---|---|---|
4.766 | 8 | 0.7823 |
par(mai=c(1,1.3,.8,.1))
palette(rev(c("firebrick3","firebrick2","orange","wheat","skyblue","skyblue4")))
plot((divlaw == 2) + as.numeric(as.character(pray))*.01-.03 ~ age, data=GSSc, pch=16, cex=0.5, xlim=c(18,120), col=as.factor(pray), ylab="Probability of Favoring Stricter \n Divorce Laws", main="Age and Faith Lead to Greater Opposition of Divorce")
curve(exp(-0.30321 + 0.01713*x)/(1+exp(-0.30321 + 0.01713*x)), from=18, to=88, add=TRUE, col=palette()[1])
curve(exp(-0.30321-0.34240 + 0.01713*x)/(1+exp(-0.30321-0.34240 + 0.01713*x)), from=18, to=88, add=TRUE, col=palette()[2])
curve(exp(-0.30321-0.71595 + 0.01713*x)/(1+exp(-0.30321-0.71595 + 0.01713*x)), from=18, to=88, add=TRUE, col=palette()[3])
curve(exp(-0.30321-0.42372 + 0.01713*x)/(1+exp(-0.30321-0.42372 + 0.01713*x)), from=18, to=88, add=TRUE, col=palette()[4])
curve(exp(-0.30321-0.39636 + 0.01713*x)/(1+exp(-0.30321-0.39636 + 0.01713*x)), from=18, to=88, add=TRUE, col=palette()[5])
curve(exp(-0.30321-0.95067 + 0.01713*x)/(1+exp(-0.30321-0.95067 + 0.01713*x)), from=18, to=88, add=TRUE, col=palette()[6])
legend("right", legend=c("Many times a Day", "Once a Day", "Sev. Times a Week", "Once a Week", "Rarely", "Never"), title="Prays", col=palette(), lty=1, bty="n", cex=0.8)
Although age
was not of direct interest in this study, it can be seen that for any level of pray
the effect of age
on the odds is \(e^{0.01713} = 1.0172776\) for every one year increase in age. Thus, the odds of supporting stricter laws on divorce is higher for the older population than it is for the younger population. This is the reason age
was included as a covariate. It seemed logical from the beginning that age
would have such an effect, so including it in the model allowed us to focus on the effect of pray
while accounting for this assumedly known effect of age
.
We interpret the pray1
level, which is the intercept in the current model, as the odds that a 0-year old person (illogical), would support stricter divorce laws, \(e^{-0.30321} = 0.738444\). Note that this interpretation isn’t especially meaningful. This is because pray1
is the baseline level to which all other references will be made.
We interpret the pray2
variable as the effect praying once a day (as opposed to several times a day) will have on the odds that a person will favor stricter divorce laws, \(e^{-0.34240} = 0.7100641\). This means there is a 29% drop in the odds of favoring stricter divorce laws for those that pray once a day instead of several times a day.
The interpretation of the other variables is similar. For example, the coefficient of the pray3
variable is interpreted as the change in the odds that a person will favor stricter divorce laws given they pray several times a week instead of several times a day. This change is \(e^{-0.71595} = 0.4887276\), showing just over a 50% drop in the odds.
The most substantial drop in the odds is for those that never pray, pray6
. Which comes out to be \(e^{-0.95067} = 0.386482\), or roughly a 62% drop in the odds.
The final conclusion is that there is sufficient evidence to conclude that those that pray several times a day have higher odds of supporting stricter divorce laws than those that never pray. (The answer is not as clear for the various levels of frequency of prayer, but is very clear for these two categories.)
Here are a few interesting situations to consider.
probs <- predict(GSSc.glm, data.frame(age=c(20,50,80), pray=as.factor(c(6,1,1))), type="response")
Age | Frequency of Prayer | Probability of Favoring Strictor Divorce Laws |
---|---|---|
20 | Never | 0.2867533 |
50 | Several Times a Day | 0.6349388 |
80 | Several Times a Day | 0.7441144 |
This shows that roughly 3 out of 4 of those 80 years old that pray frequently (several times a day) will favor stricter divorce laws while those who are in their twenties and never pray sit closer to 1 in 4 who favor the stricter divorce laws.