Comparing Means

Unit 2 Review

Introduction

In this activity, you will use everything we’ve covered up to this point, including:

  • Data manipulation using tidyverse functions
  • Hypothesis tests
    • 1-Sample t-test
    • 2-sample dependent t-test
    • 2-sample independent t-test
    • ANOVA
  • Confidence Intervals where applicable

We will be using data collected about students in two Portuguese schools, including their final grades. The goal is to answer research questions using statistical methods to see which factors significantly impact final grades.

Getting to know a new dataset

In class, we have reinforced a process for approaching a new dataset. The following is a summary of activities that help us conduct good research:

  • Read in the data
  • Explore the dataset as a whole:
    • What are the column names? What do they mean? Where can I find information about them?
    • What is the response/dependent variable? Could there be more than one?
    • What are some factors that may impact the response variable? Which are likely the most important?
  • Explore specific columns
    • Start with the response variable. Are there any outliers? Obtain summary statistics (favstats()), visualize the data (histogram(), boxplot()).
    • Explore the explanatory variables you think are most important to the response variable. What type of data are they (categorical, quantitative)? For categorical variables, what are all the levels (unique())?
  • Formalize statistical hypotheses. If your factors are categorical, how many groups will you be comparing? Is it a 1-sample t-test, 2-sample t-test, ANOVA?
  • Prepare data for analysis. You may need to clean the data (e.g., data %>% filter() %>% select())
  • Perform the appropriate analysis (t.test(), aov())

All these activities are important, but we may spend more or less time on any one of them depending on the state of the data.
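
The steps above can be sketched on a small example. This is only an illustration under stated assumptions: the data frame toy and its columns grade, school, and level are made up, and the sketch uses base R only so that it runs without extra packages (subset() plays the role of filter() and select(), and summary() stands in for favstats()).

```r
# Toy data standing in for a real dataset (all names and values are made up).
toy <- data.frame(
  grade  = c(12, 15, 0, 9, 14, 11, 17, 10),
  school = c("A", "A", "A", "A", "B", "B", "B", "B"),
  level  = c(1, 2, 1, 3, 2, 3, 1, 2)
)

# Explore the dataset as a whole.
names(toy)   # column names
str(toy)     # column types

# Explore the response variable.
summary(toy$grade)

# Explore a candidate explanatory variable.
unique(toy$school)   # levels of a categorical variable

# Prepare the data: drop unwanted rows and keep only the columns needed.
toy_clean <- subset(toy, grade > 0, select = c(grade, school))

# Analyze: compare the two groups.
result <- t.test(grade ~ school, data = toy_clean)
result$p.value
```

With a real dataset, the exploration steps usually take the most time, because they determine which rows and columns survive into the analysis.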

Load the Libraries and Data

Review the data

Take some time to familiarize yourself with the data. Check the website to see what the columns are.

What is the response variable?

Create a histogram of the response variable.

What do you notice about the shape of the distribution?

What anomalies, if any, do you notice?

Calculate summary statistics for the response variable.

Question: What is the minimum?
Answer:

Question: What is the maximum?
Answer:

Question: How many students participated in the survey?
Answer:

Question: Rank order the top 5 explanatory variables you think most influence the response and identify each as a categorical or quantitative variable:

NOTE: Some of the above variables may be quantitative, which is great! The next unit will cover how to analyze those relationships. This assignment, however, focuses on comparing differences between groups and only considers categorical explanatory variables.

Preparing data for analysis

Categories Labeled as Numbers

Sometimes even correct data can have issues that must be addressed. For example, categories are often labeled as numbers. Software can’t guess when numbers are supposed to be categories, so we have to tell R when a number should be treated as a category.

To force a variable to be a category, we use the factor() function in R. We can change the variable type in the data itself or change it in the analysis. We demonstrate both methods below.

Changing a Column Type in a dataset

Father’s education, Fedu, shows up as a number in R. The website suggests that the numbers represent categories (0 = none, 1 = primary education (4th grade), 2 = 5th to 9th grade, 3 = secondary education, 4 = higher education).

To change the data type in the data itself, we can use a mutate statement in the following manner:

# Create a new dataset called new_data that begins with the student data and adds a column called Fedu_factor, which is the factorized version of the Fedu column:

new_data <- student %>%
  mutate(Fedu_factor = factor(Fedu))

# Check the column names of the new dataset.  Notice the new column
names(new_data)

# glimpse() shows us data types.  Notice after Fedu_factor, the <fct>, which shows us that this is in fact, a factor variable type.  <dbl> stands for "double" and is a numeric variable type

glimpse(new_data)

Changing the Variable Type “on the fly”

You may not want to bother changing all the variable types for each potential analysis. Fortunately, you can create a factor “on the fly” within the analysis function itself.

Because there are more than 2 levels of Father’s Education, I will demonstrate how this is done in an ANOVA:

# Force Fedu to be treated like a category in ANOVA:
fedu_anova <- aov(student$G3 ~ factor(student$Fedu))

summary(fedu_anova)
                      Df Sum Sq Mean Sq F value Pr(>F)  
factor(student$Fedu)   4    238   59.53   2.891 0.0222 *
Residuals            390   8032   20.59                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This works with most analysis functions including t.test() and aov().

NOTE: You only have to do this for variables in a dataset that are categories labeled as numbers. If the categories are text, t.test() and aov() automatically recognize the variable as categorical. However, it does no harm to put a column with text into a factor() statement.
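
To see why factor() matters, the following sketch fits the same model with and without it. The data frame toy and its columns score and code are hypothetical, made up only for this illustration.

```r
# Hypothetical data: `code` holds category labels stored as numbers.
toy <- data.frame(
  score = c(10, 12, 11, 14, 15, 13, 9, 8, 10, 16, 17, 15),
  code  = rep(c(1, 2, 3, 4), each = 3)
)

# Without factor(): code is treated as one numeric predictor (1 degree of freedom).
as_number <- aov(score ~ code, data = toy)

# With factor(): code is treated as 4 groups (4 - 1 = 3 degrees of freedom).
as_category <- aov(score ~ factor(code), data = toy)

summary(as_number)[[1]]$Df     # 1 10
summary(as_category)[[1]]$Df   # 3  8
```

The degrees of freedom are a quick check that R is comparing groups: an explanatory variable with k categories should show k - 1 degrees of freedom in the ANOVA table.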

Cleaning the Data

While exploring the data, you may have noticed a few students ended up with a final grade of zero. While it may be interesting to explore what factors lead to an incomplete grade, we want to make conclusions about students who completed the course.

Create a clean dataset called clean that excludes zeros for G3. This will be used for the following analyses.

Perform the Appropriate Analysis

Comparing Schools

Suppose the Gabriel Pereira school (GP) has more stringent admissions requirements. We suspect this would lead to higher grades, on average.

Create a side-by-side boxplot of the final grades for each school. Change the y-axis label to read “Final Grade out of 20”, the x-axis label to read “School”, and add a title.

Question: What do you notice?
Answer:

Create a table of summary statistics of final grade for each school:

Hypothesis Test

Create a qqPlot() to look at the normality of both groups:

Question: Do the grades look normally distributed for both groups? If not, should we be concerned?
Answer:

Question: Can we trust the P-value?
Answer:

State your null and alternative hypotheses and significance level.

NOTE: Recall that R uses alphabetical order to determine which group is the reference group. It is useful to put this group on the left side of the null hypothesis and set your alternative hypothesis accordingly.
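
As a minimal sketch of how the reference group affects the direction of the test, consider the toy example below. The group labels "A" and "B" and all values are made up for illustration.

```r
# Hypothetical two-group data; "A" sorts before "B" alphabetically.
toy <- data.frame(
  grade = c(13, 15, 12, 16, 14, 10, 11, 9, 12, 10),
  group = rep(c("A", "B"), each = 5)
)

# The first factor level is the reference group.
levels(factor(toy$group))

# With a formula, t.test() looks at mean(A) - mean(B), so
# alternative = "greater" corresponds to the claim that mean(A) > mean(B).
res <- t.test(grade ~ group, data = toy, alternative = "greater")
res$estimate   # group means, reference group first
res$p.value
```

If the research question is about the second group being larger, the same test is written with alternative = "less".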

\[H_o: \]

\[H_a: \]

\[\alpha = 0.\]

Perform the appropriate statistical test:

Question: What is the P-value?
Answer:

Question: What is your conclusion in context of the research question?
Answer:

Confidence Interval

Create a \((1-\alpha)\)% confidence interval and explain it in context of the research question.

Explanation:

Comparing Second Period Grade with Final Grade

We suspect there is a difference between the second period and the final grade, though we do not know if they go up or down. Carry out a hypothesis test to evaluate this suspicion.
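
When two measurements come from the same individuals, a test on the differences and a paired test are the same analysis. The sketch below illustrates this with made-up values; the names before, after, and d are hypothetical and are not columns in the student data.

```r
# Hypothetical before/after scores for the same six students.
before <- c(10, 12, 9, 14, 11, 13)
after  <- c(11, 12, 10, 13, 13, 15)

# Option 1: one-sample t-test on the differences (after minus before).
d <- after - before
one_sample <- t.test(d, mu = 0)

# Option 2: paired t-test on the two vectors directly.
paired <- t.test(after, before, paired = TRUE)

# Both give the same test statistic, P-value, and confidence interval.
all.equal(unname(one_sample$statistic), unname(paired$statistic))
all.equal(one_sample$p.value, paired$p.value)
one_sample$conf.int
```

The order of subtraction determines the sign of the differences, so it should be chosen before the results are interpreted.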

Hypothesis Test

Choose how you will define the difference between final grade and second period grade, and create a new object called diff:

diff <- 

Question: What does a negative number mean?
Answer:

Create a qqPlot() of diff and check for normality:

Question: Do the grade differences look normally distributed? If not, should we be concerned?
Answer:

Question: Can we trust the P-value?
Answer:

State your null and alternative hypotheses and choose a significance level:

\[H_o: \]

\[H_a: \]

\[\alpha = 0.\]

Perform the appropriate analysis.

Question: What is the P-value?
Answer:

Question: What conclusion do you make in context of this research question?
Answer:

Confidence Interval

Create a \((1-\alpha)\)% confidence interval for the differences and explain it in context of the research question.

Explanation:

Absenteeism in Portugal

In 2021, Portugal reported having 0% absenteeism for 15-year-olds. We suspect that the actual absenteeism is higher than the reported value (zero).
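
A one-sample test against a reported value, together with a confidence interval, might be sketched as follows. The vector absences below is made up for illustration and is not the column from the student data; alpha = 0.05 is also just an assumed choice.

```r
# Hypothetical absence counts for ten students.
absences <- c(0, 2, 4, 0, 6, 1, 3, 0, 8, 2)
alpha <- 0.05

# One-sided test: null mean is 0, alternative is that the mean is greater than 0.
test <- t.test(absences, mu = 0, alternative = "greater")
test$p.value

# A one-sided alternative yields a one-sided interval (upper bound Inf),
# so a two-sided interval is requested in a separate call.
test$conf.int
ci <- t.test(absences, mu = 0, conf.level = 1 - alpha)$conf.int
ci
```

The hypothesis test and the confidence interval answer different questions, which is why the two calls use different settings for the alternative.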

Hypothesis Test

Create a qqPlot() for absences.

Question: Do absences look normally distributed? If not, should we be concerned?
Answer:

Question: Can we trust the P-value?
Answer:

State your null and alternative hypotheses and choose a significance level:

\[H_o: \]

\[H_a: \]

\[\alpha = 0.\]

Perform the appropriate analysis.

Question: What is the P-value?
Answer:

Question: What conclusion do you make in context of this research question?
Answer:

Confidence Interval

Create a \((1-\alpha)\)% confidence interval for average absences and interpret it in context of the problem.

Explanation:

The Impact of Mother’s Education Level

The level of education of the mother in the home is thought to have a significant impact on student success.

Create a side-by-side boxplot of final grades for each level of mother’s education.

Create a table of summary statistics of final grades for each level of mother’s education.

Question: How many respondents have a mother with no formal education (level 0)?
Answer:

Create a new dataset, clean_medu, that does not include mother’s education level 0.


clean_medu <- clean %>% 

Create another boxplot with the new dataset that excludes level 0.

Create a summary table of final grades for each level of a mother’s education with the new dataset.

Question: What is the maximum standard deviation?
Answer:

Question: What is the minimum standard deviation?
Answer:

Question: Verify that the maximum is less than twice as large as the minimum to check the “equality of standard deviations”.
Answer:
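
The rule of thumb for equal standard deviations can be checked in a couple of lines. The sketch below uses a made-up data frame toy with hypothetical columns score and group.

```r
# Hypothetical scores for three groups.
toy <- data.frame(
  score = c(10, 12, 14, 9, 12, 15, 11, 13, 15),
  group = rep(c("low", "mid", "high"), each = 3)
)

# Standard deviation of score within each group.
sds <- tapply(toy$score, toy$group, sd)
sds

max(sds) / min(sds)       # ratio of the largest to the smallest SD
max(sds) < 2 * min(sds)   # TRUE means the rule of thumb is satisfied
```

Here the largest standard deviation (3) is 1.5 times the smallest (2), so the check returns TRUE.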

Hypothesis Test

State your null and alternative hypotheses and pick alpha:

\[H_o: \]

\[H_a: \]

\[\alpha = 0.\]

Perform the appropriate statistical test.

Question: What is the test statistic?
Answer:

Question: What is the P-value?
Answer:

Check the normality of the residuals.
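
As a rough sketch, residuals can be pulled from a fitted ANOVA object and plotted. The example uses made-up data and base R's qqnorm() and qqline() in place of qqPlot(), so that it runs without extra packages; the names toy, fit, and res are hypothetical.

```r
# Hypothetical scores for three groups.
toy <- data.frame(
  score = c(10, 12, 14, 9, 12, 15, 11, 13, 15),
  group = rep(c("low", "mid", "high"), each = 3)
)

# Fit the ANOVA and extract the residuals (observed value minus group mean).
fit <- aov(score ~ group, data = toy)
res <- residuals(fit)

# Q-Q plot of the residuals with a reference line.
qqnorm(res)
qqline(res)
```

Points that fall close to the reference line suggest the residuals are roughly normal.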

Question: Do the residuals appear roughly normally distributed?
Answer:

Question: Can we trust the P-value?
Answer:

State your conclusion.

Choose your own adventure

Pick another variable that was not analyzed above.

Create a side-by-side boxplot. Be sure to properly label the graph and add sufficient information so readers can know what they are looking at without having to search through the report or code.

Perform the appropriate analysis. Be sure to include a hypothesis test (and a confidence interval, if applicable) and a concise conclusion in the context of the research question.