filter()

Choosing Rows

Good scientists NEVER delete data from original records. The tidyverse allows us to create a new, clean dataset with a transparent set of steps from which we can create graphs, visualizations and analyses without losing any of the original data.

We here demonstrate how to define criteria for choosing which rows to include from the data. Consider the High School survey data which consists of responses to 60 questions from 312 high school students.

# Load libraries and data

library(rio)
library(mosaic)
library(tidyverse)
library(car)

survey <- import('https://github.com/byuistats/Math221D_Cannon/raw/master/Data/HighSchoolSeniors_subset.csv') %>% tibble()

Logical Operators

Logical operators are used extensively in computer programming to evaluate if a specified condition is met. They return a “True” or a “False” but are often treated as 1’s and 0’s respectively.

The most common logical operators used to filter rows are:

< and <= means “less than” and “less than or equal to” respectively
> and >= means “greater than” and “greater than or equal to” respectively
== means “equal to” (NOTE: we use double equals because in most computer languages, a single = is an assignment operator. This avoids ambiguity)
!= means “not equal to”; this one is useful if you want to eliminate one level of a variable
%in% selects specified levels you want to include

Removing Outliers

One common use offilter() is to remove outliers. Favstats of Height_cm shows that the tallest person is 999 cm. This means there is at least one impossible value we may not want to include in a visualization or analysis.

favstats(survey$Height_cm)

  min  Q1 median      Q3 max     mean       sd   n missing
 1.68 161    170 178.125 999 169.2412 53.54382 312       0

We can create a new dataset, clean, that excludes responses of people who are taller than 250cm (about 8 feet):

clean <- survey %>% 
  filter(Height_cm < 250)

# the dim() function shows us how many rows and columns in a dataset
dim(survey)

[1] 312  60

dim(clean)

[1] 310  60

The above code will return a new dataset without the outliers.

How many rows does the original dataset have?

How many rows does the filtered dataset have?

Filtering on Categorical Data

Suppose for some reason, we only want to include right- or left-handed people (excluding ambidextrous). We can add multiple conditions in the filter() function separated by a comma:

# See what the distinct values are in the Handed column
unique(survey$Handed)

[1] "Left-Handed"  "Right-Handed" "Ambidextrous"

survey %>%
  filter(Height_cm < 250,
         Handed != "Ambidextrous")

# A tibble: 302 × 60
   Country Region DataYear ClassGrade Gender Ageyears Handed       Height_cm
   <chr>   <chr>     <int>      <int> <chr>     <dbl> <chr>            <dbl>
 1 USA     FL         2022         12 Male         18 Left-Handed        182
 2 USA     IN         2022         12 Male         17 Right-Handed       190
 3 USA     GA         2022         12 Female       17 Right-Handed       172
 4 USA     NC         2022         11 Female       15 Right-Handed       163
 5 USA     CO         2022         12 Female       17 Left-Handed         51
 6 USA     MO         2022         11 Male         17 Right-Handed       181
 7 USA     SC         2022         11 Female       18 Right-Handed       160
 8 USA     WA         2022         11 Female       16 Right-Handed       156
 9 USA     WA         2022         12 Female       17 Right-Handed       169
10 USA     WA         2022         11 Male         18 Right-Handed       160
# ℹ 292 more rows
# ℹ 52 more variables: Footlength_cm <dbl>, Armspan_cm <dbl>,
#   Languages_spoken <dbl>, Travel_to_School <chr>,
#   Travel_time_to_School <int>, Reaction_time <dbl>,
#   Score_in_memory_game <dbl>, Favourite_physical_activity <chr>,
#   Imprtance_reducing_pllutin <int>, Imprtance_recycling_rubbish <int>,
#   Imprtance_cnserving_water <int>, Imprtance_saving_energy <int>, …

# We could also try using %in% instead of  "!="

survey %>%
  filter(Height_cm < 250,
         Handed %in% c("Left-Handed", "Right-Handed"))

# A tibble: 302 × 60
   Country Region DataYear ClassGrade Gender Ageyears Handed       Height_cm
   <chr>   <chr>     <int>      <int> <chr>     <dbl> <chr>            <dbl>
 1 USA     FL         2022         12 Male         18 Left-Handed        182
 2 USA     IN         2022         12 Male         17 Right-Handed       190
 3 USA     GA         2022         12 Female       17 Right-Handed       172
 4 USA     NC         2022         11 Female       15 Right-Handed       163
 5 USA     CO         2022         12 Female       17 Left-Handed         51
 6 USA     MO         2022         11 Male         17 Right-Handed       181
 7 USA     SC         2022         11 Female       18 Right-Handed       160
 8 USA     WA         2022         11 Female       16 Right-Handed       156
 9 USA     WA         2022         12 Female       17 Right-Handed       169
10 USA     WA         2022         11 Male         18 Right-Handed       160
# ℹ 292 more rows
# ℹ 52 more variables: Footlength_cm <dbl>, Armspan_cm <dbl>,
#   Languages_spoken <dbl>, Travel_to_School <chr>,
#   Travel_time_to_School <int>, Reaction_time <dbl>,
#   Score_in_memory_game <dbl>, Favourite_physical_activity <chr>,
#   Imprtance_reducing_pllutin <int>, Imprtance_recycling_rubbish <int>,
#   Imprtance_cnserving_water <int>, Imprtance_saving_energy <int>, …

NOTE: The %in% and the != will not always give you the same results. If there are misspellings or other options, using %in% will limit the data to only those with the exact spelling in the list provided. For example, someone responding “left handed” (all lower case), would not be included in the clean data. Misspellings would, however, be included if I use != "Ambitextrous" because that only removes rows written exactly that way. Things like, ambidextrious or RightHanded would still be included.

I could further limit my data to students from Florida and Missouri:

new_data <- survey %>%
  filter(Height_cm < 250,
         Handed %in% c("Left-Handed", "Right-Handed"),
         Region %in% c("MO", "FL"))

dim(new_data)

[1] 27 60

How many rows does our latest dataset have?

Your turn

How many vegetarians in the survey say that meat is their favorite food?

The Vegetarian column has “Yes” and “No” as possible options.

The Favorite_Food column has “Meat”, “Pizza/Pasta”, “Breads/Sandwiches”, “Rice/Noodle dishes”, “Poultry”, “Seafood”, “Vegetables”, “Desserts”, “Fruit”, “Salads”, “No favorite”, “Other” and “Soups” as options.

vegetarians <- survey %>%
  filter()