Into The Tidyverse

Introduction

Statistics is only as interesting as the research questions we create. However, we cannot hope to answer those questions appropriately without going through the trouble of wrangling data.

The more time we spend digging into the data and noticing irregularities, the more likely we are to make better research questions and more appropriate conclusions. It doesn’t matter how sophisticated our analysis becomes if it’s based on bad data.

In this activity, you will explore a survey about happiness and related factors. By reviewing the data at a high level and then drilling into specific variables, you will gain a better idea about what this survey contains, generate interesting research questions and, with clean data, make better-informed answers to those questions.

Where do I start?

For experienced statisticians, getting a new dataset is like Christmas; we can’t wait to unwrap it and see what it’s hiding. For a novice, this can feel overwhelming.

In this section, we discuss 4 general guidelines that are useful regardless of subject matter.

  1. Load the data (and necessary libraries if using R)
  2. Explore the data and generate hypotheses
  3. Prepare the data for analysis
  4. Perform the appropriate analysis

1. Loading Data and Libraries

So far, this step has been taken for granted. You simply import() a dataset from a web source using the rio library. In the real world, this step can be quite complicated. There are different data sources, file types, permissions and more that can make loading the data more difficult.

Fortunately for you, that level of complexity is beyond the scope of this class. However, it is important to call this out as an important step in data analysis.

Load the data and libraries:

2. Data Exploration and Hypothesis Generation

Knowing what to look for when you get a new dataset takes experience and curiosity. Generally, we start by looking at the whole dataset at a high level, then drill down into specific columns.

Viewing the whole dataset

There are several ways to look at the data at a high level:

  1. View(data) opens up the whole dataset in a new window
  2. glimpse(data) shows you a list of columns and a few examples. It also shows the data types of each column. This is useful because sometimes columns that should be number show up as a category and vice versa. This can occur when there are typos in the dataset or categories as labeled as a number.
  3. head(data) prints out the first few rows of data
#View(happiness)
glimpse(happiness)
Rows: 1,004
Columns: 18
$ Gender                     <chr> "female", "female", "female", "female", "fe…
$ Age                        <int> 20, 19, 20, 22, 20, 20, 20, 19, 18, 19, 19,…
$ Height                     <int> 163, 163, 176, 172, 170, 186, 177, 184, 166…
$ Weight                     <int> 48, 58, 67, 59, 59, 77, 50, 90, 55, 60, 60,…
$ Number_of_siblings         <int> 1, 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 10, 1, …
$ Dominant_hand              <chr> "right handed", "right handed", "right hand…
$ Education                  <chr> "college/bachelor degree", "college/bachelo…
$ Only_child                 <chr> "no", "no", "no", "yes", "no", "no", "no", …
$ Fear_score                 <int> 38, 32, 22, 72, 36, 38, 40, 56, 64, 80, 34,…
$ Religiosity_score          <int> 60, 33, 83, 50, 77, 53, 73, 60, 60, 83, 60,…
$ Smoking                    <chr> "never smoked", "never smoked", "tried smok…
$ Alcohol                    <chr> "drink a lot", "drink a lot", "drink a lot"…
$ I_live_a_healthy_lifestyle <chr> "Agree", "Neither Agree nor Disagree", "Nei…
$ I_like_music               <chr> "Strongly Agree", "Agree", "Strongly Agree"…
$ Music_diversity_score      <int> 33, 44, 63, 34, 51, 58, 48, 43, 38, 58, 46,…
$ Punctuality                <chr> "i am always on time", "i am often early", …
$ Lying                      <chr> "never", "sometimes", "sometimes", "only to…
$ Happiness_score            <int> 13, 10, 11, 6, 11, 10, 12, 12, 7, 11, 10, 9…
#head(happiness)

PRO TIP: Ask a ton of questions throughout this process! You are not always the subject matter expert for an analysis. Find someone who is and learn as much as you can from them.

As you look at the whole dataset, think about what the response variable might be and start to think about what factors might impact that response. Start to mentally rank which factors you think might be most impactful.

QUESTIONS: For our data, what do you think is our response variable?

What other factors do you think might impact the response variable?

Keep in mind that there may be multiple response variables to explore.

Drill down into specific columns

Next we want to look at specific columns, checking for outliers or impossible values.

We usually begin with the response variable then work through the explanatory variables we think are most likely to impact the response variable.

NOTE: It is not always necessary to dig into every column in the data. Sometimes there are dozens (or hundreds) of columns. We need to prioritize based on our own hypotheses or the suggestions of the data provider.

A few commonly used functions for exploring specific columns are:

  • unique() prints out a list of all possible values of a categorical variable
  • table() prints out a list of all possible values of a categorical variable AND the number of times that value occurs
  • histogram() - Making a histogram of a quantitative variable can show if the values are within an expected range
  • favstats() can be used to evaluate if the max and min values are in an acceptable range.

NOTE: We may not always know what “acceptable range” means, but it is our responsibility to find out either by asking a subject matter expert or relying on personal experience.

Make a histogram of happiness score and see if everything looks in order.

hist(happiness$Happiness_score)

Check favstats() to look at the max an min:

favstats(happiness$Happiness_score)
 min Q1 median Q3 max     mean       sd    n missing
 -40  9     11 12  75 10.59661 3.218161 1004       0

Look at the unique values of Gender, Alcohol and Lying.

unique(happiness$Gender)
[1] "female"         "male"           "nunya business"
unique(happiness$Alcohol)
[1] "drink a lot"    "social drinker" "never"          "nunya business"
unique(happiness$Lying)
[1] "never"                         "sometimes"                    
[3] "only to avoid hurting someone" "everytime it suits me"        
[5] "#N/A"                         
table(happiness$Gender)

        female           male nunya business 
           591            412              1 
table(happiness$Alcohol)

   drink a lot          never nunya business social drinker 
           222            120              5            657 
table(happiness$Lying)

                         #N/A         everytime it suits me 
                            2                           137 
                        never only to avoid hurting someone 
                           50                           270 
                    sometimes 
                          545 

Are there any values we may need to filter out?

Excel uses #N/A as a missing value, but in R, this is treated like another level of a categorical variable.

Pick 2 more categorical variables and list their distinct values:


unique(happiness$)

unique(happiness$)
Error: <text>:2:18: unexpected ')'
1: 
2: unique(happiness$)
                    ^

NOTE: For now, we are focusing on categorical explanatory variables. Later in the semester we will learn how to compare 2 quantitative variables.

Generate Hypotheses

Now that we’ve spent time with the data, we can formalize our hypotheses and get ready to prepare the data to perform the proper analysis.

QUESTION: Which 3 explanatory variables do you think have the biggest impact on happiness?
Answer:
1. 2. 3.

3. Prepare the Data

We noticed some irregularities that may need to be addressed to get the most appropriate analysis.

Let’s begin by seeing if there are differences in happiness for different levels of alcohol use.

Review: What are the irregularities in happiness score and alcohol use?

favstats(happiness$Happiness_score)
 min Q1 median Q3 max     mean       sd    n missing
 -40  9     11 12  75 10.59661 3.218161 1004       0
# min = -40, max = 75

unique(happiness$Alcohol)
[1] "drink a lot"    "social drinker" "never"          "nunya business"
# Includes response:   "nunya business"

Suppose Happiness Scores are supposed to be between 0 and 15.

We also want to compare those who “drink a lot” and those who “never drink”.

Replace the blanks below to create a new dataset called alcohol_use that removes the incorrect happiness scores and includes only those who “drink a lot” and “never drink” and includes only those 2 columns


alcohol_use <- happiness %>%
  filter(Happiness_score __ 15, 
         Happiness_score >= __,
         Alcohol ____ c("drink a lot", "never")) %>%
  select(Alcohol, Happiness_score)

#View(alcohol_use)
Error: <text>:3:26: unexpected input
2: alcohol_use <- happiness %>%
3:   filter(Happiness_score _
                            ^

Check the histogram of happiness scores in the alcohol_use data:

histogram(alcohol_use$Happiness_score)
Error in eval(expr, envir, enclos): object 'alcohol_use' not found

How would you describe the shape of the distribution?

Create a side-by-side boxplot and summary statistics table for each level of alcohol use:

boxplot(alcohol_use$Happiness_score ~ alcohol_use$Alcohol)
Error in eval(predvars, data, env): object 'alcohol_use' not found

How different do the happiness scores appear to be between these two groups?

Are you disappointed?

It’s important to understand that sometimes finding nothing is finding something.

Lying

Repeat the above analysis comparing how peoples’ happiness scores depend on their attitudes about lying.

  1. Create a new dataset called lying that excludes the #N/A values in Lying and the Happiness scores outliers, and includes only “Happiness_score” and “Lying”:

lying_data <- happiness %>%
  filter(Happiness_score <= __, 
         _______________ >= 0, 
         Lying __ "#N/A"
  ) %>%
  ______(Happiness_score, Lying)

#View(lying_data)
Error: <text>:3:30: unexpected input
2: lying_data <- happiness %>%
3:   filter(Happiness_score <= __
                                ^
  1. Create a side-by-side boxplot and summary statistics (using favstats()) table for each attitude about Lying:

boxplot(____________________ ~ lying_data$Lying)

favstats(lying_data$Happiness_score ~ _______________)
Error: <text>:2:10: unexpected input
1: 
2: boxplot(__
            ^

Choose your own adventure

Pick another categorical variable with multiple groups that may impact happiness score.

Be sure to clean out any ‘nunya business’ or ‘#N/A’ that might be there and clean out the outliers in happiness:

new_data <- happiness %>%
  filter(
    
  ) %>%
  select()


boxplot()
Error in boxplot.default(): argument "x" is missing, with no default

What conclusions can you make?