# Load libraries and data
library(rio)
library(mosaic)
library(tidyverse)
library(car)
<- import('https://github.com/byuistats/Math221D_Cannon/raw/master/Data/HighSchoolSeniors_subset.csv') %>% tibble() survey
filter()
Choosing Rows
Good scientists NEVER delete data from original records. The tidyverse allows us to create a new, clean dataset with a transparent set of steps from which we can create graphs, visualizations and analyses without losing any of the original data.
We here demonstrate how to define criteria for choosing which rows to include from the data. Consider the High School survey data which consists of responses to 60 questions from 312 high school students.
Logical Operators
Logical operators are used extensively in computer programming to evaluate if a specified condition is met. They return a “True” or a “False” but are often treated as 1’s and 0’s respectively.
The most common logical operators used to filter rows are:
<
and<=
means “less than” and “less than or equal to” respectively>
and>=
means “greater than” and “greater than or equal to” respectively==
means “equal to” (NOTE: we use double equals because in most computer languages, a single=
is an assignment operator. This avoids ambiguity)!=
means “not equal to”; this one is useful if you want to eliminate one level of a variable%in%
selects specified levels you want to include
Removing Outliers
One common use offilter()
is to remove outliers. Favstats of Height_cm shows that the tallest person is 999 cm. This means there is at least one impossible value we may not want to include in a visualization or analysis.
favstats(survey$Height_cm)
min Q1 median Q3 max mean sd n missing
1.68 161 170 178.125 999 169.2412 53.54382 312 0
We can create a new dataset, clean
, that excludes responses of people who are taller than 250cm (about 8 feet):
<- survey %>%
clean filter(Height_cm < 250)
# the dim() function shows us how many rows and columns in a dataset
dim(survey)
[1] 312 60
dim(clean)
[1] 310 60
The above code will return a new dataset without the outliers.
How many rows does the original dataset have?
How many rows does the filtered dataset have?
Filtering on Categorical Data
Suppose for some reason, we only want to include right- or left-handed people (excluding ambidextrous). We can add multiple conditions in the filter()
function separated by a comma:
# See what the distinct values are in the Handed column
unique(survey$Handed)
[1] "Left-Handed" "Right-Handed" "Ambidextrous"
%>%
survey filter(Height_cm < 250,
!= "Ambidextrous") Handed
# A tibble: 302 × 60
Country Region DataYear ClassGrade Gender Ageyears Handed Height_cm
<chr> <chr> <int> <int> <chr> <dbl> <chr> <dbl>
1 USA FL 2022 12 Male 18 Left-Handed 182
2 USA IN 2022 12 Male 17 Right-Handed 190
3 USA GA 2022 12 Female 17 Right-Handed 172
4 USA NC 2022 11 Female 15 Right-Handed 163
5 USA CO 2022 12 Female 17 Left-Handed 51
6 USA MO 2022 11 Male 17 Right-Handed 181
7 USA SC 2022 11 Female 18 Right-Handed 160
8 USA WA 2022 11 Female 16 Right-Handed 156
9 USA WA 2022 12 Female 17 Right-Handed 169
10 USA WA 2022 11 Male 18 Right-Handed 160
# ℹ 292 more rows
# ℹ 52 more variables: Footlength_cm <dbl>, Armspan_cm <dbl>,
# Languages_spoken <dbl>, Travel_to_School <chr>,
# Travel_time_to_School <int>, Reaction_time <dbl>,
# Score_in_memory_game <dbl>, Favourite_physical_activity <chr>,
# Imprtance_reducing_pllutin <int>, Imprtance_recycling_rubbish <int>,
# Imprtance_cnserving_water <int>, Imprtance_saving_energy <int>, …
# We could also try using %in% instead of "!="
%>%
survey filter(Height_cm < 250,
%in% c("Left-Handed", "Right-Handed")) Handed
# A tibble: 302 × 60
Country Region DataYear ClassGrade Gender Ageyears Handed Height_cm
<chr> <chr> <int> <int> <chr> <dbl> <chr> <dbl>
1 USA FL 2022 12 Male 18 Left-Handed 182
2 USA IN 2022 12 Male 17 Right-Handed 190
3 USA GA 2022 12 Female 17 Right-Handed 172
4 USA NC 2022 11 Female 15 Right-Handed 163
5 USA CO 2022 12 Female 17 Left-Handed 51
6 USA MO 2022 11 Male 17 Right-Handed 181
7 USA SC 2022 11 Female 18 Right-Handed 160
8 USA WA 2022 11 Female 16 Right-Handed 156
9 USA WA 2022 12 Female 17 Right-Handed 169
10 USA WA 2022 11 Male 18 Right-Handed 160
# ℹ 292 more rows
# ℹ 52 more variables: Footlength_cm <dbl>, Armspan_cm <dbl>,
# Languages_spoken <dbl>, Travel_to_School <chr>,
# Travel_time_to_School <int>, Reaction_time <dbl>,
# Score_in_memory_game <dbl>, Favourite_physical_activity <chr>,
# Imprtance_reducing_pllutin <int>, Imprtance_recycling_rubbish <int>,
# Imprtance_cnserving_water <int>, Imprtance_saving_energy <int>, …
NOTE: The %in%
and the !=
will not always give you the same results. If there are misspellings or other options, using %in%
will limit the data to only those with the exact spelling in the list provided. For example, someone responding “left handed” (all lower case), would not be included in the clean data. Misspellings would, however, be included if I use != "Ambitextrous"
because that only removes rows written exactly that way. Things like, ambidextrious
or RightHanded
would still be included.
I could further limit my data to students from Florida and Missouri:
<- survey %>%
new_data filter(Height_cm < 250,
%in% c("Left-Handed", "Right-Handed"),
Handed %in% c("MO", "FL"))
Region
dim(new_data)
[1] 27 60
How many rows does our latest dataset have?
Your turn
How many vegetarians in the survey say that meat is their favorite food?
The Vegetarian
column has “Yes” and “No” as possible options.
The Favorite_Food
column has “Meat”, “Pizza/Pasta”, “Breads/Sandwiches”, “Rice/Noodle dishes”, “Poultry”, “Seafood”, “Vegetables”, “Desserts”, “Fruit”, “Salads”, “No favorite”, “Other” and “Soups” as options.
<- survey %>%
vegetarians filter()