4 Text and Time

The tasks in this unit will help prepare you to do the following:

  1. These will be updated soon.

The case studies, which appear at the end of this unit, give you an opportunity to demonstrate mastery of the objectives above.

4.1 Task: String Reading

Wrangling human-created text can be one of the most difficult parts of data preparation. This task will have you focus your efforts on learning to work with library(stringr) in the tidyverse.

  1. Read R4DS: Strings. As you read, do at least two of the practice exercises from each section. Get as far as you can in one hour. You are welcome to perform your computations in a script (.R) or markdown (.Rmd) file.
  2. When you are done, push your work to your GitHub repo. Then pick at least two exercises you would like to discuss (perhaps ones that were very helpful, or very difficult). Be prepared to share them with your team.

4.2 Task: String Practice

“Global regular expression print” (grep) and “regular expressions” (regex) are used to find character string patterns. These tools are available with all operating systems and many different programming languages. Once understood, they are valuable and powerful tools for data analysis. The library(stringr) package makes these tools much easier to use.
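
As a quick taste (a minimal sketch using a made-up character vector), here are three core stringr verbs applied to the regular expression gr[ae]y, which matches either “gray” or “grey”:

str_detect(c("gray", "grey", "green"), "gr[ae]y")  #does each string match? TRUE TRUE FALSE
str_subset(c("gray", "grey", "green"), "gr[ae]y")  #keep only the matching strings
str_count(c("banana", "apple"), "an")              #count matches per string: 2 0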

Here are a few other resources that may help wrangle this messy text beast.

In library(tidyverse) is a vector called sentences that contains 720 sentences on which we can practice wrangling text. Before running a command on all 720 sentences, we can validate our code on a smaller subset. We can examine the first 10 sentences using sentences[1:10], or we could pick a random sample of 10 sentences using sample(sentences,10). Beware that when using a random sample, your work will not be reproducible on someone else’s computer unless you use the set.seed() command. The following chunk of code shows the length of the sentences vector, as well as how to get a subset of these sentences in various ways.

length(sentences)
sentences[1:10]
sample(sentences,10)
set.seed(123)
sample(sentences,10)

Because the sentences object is a vector, we cannot use dplyr tools on it unless we first put the vector into a tibble (a tidy table). The following chunk of code places the first 4 sentences into a tibble, and then counts the number of times the string “the” shows up in each sentence. It then uses str_view_all() to show you a visual of where the string appeared.

tibble(
  sentence = sentences[1:4],
  count = str_count(sentence, "the")
)
str_view_all(sentences[1:4],"the")

  1. Modify the code above so that it counts the number of times the word “the” appears in any form (“The” or “the”), but make sure that words such as “These” or “them” are not included in your counts. Once you have validated that your code works on the first 4 sentences, try grabbing a random sample of a few other sentences and validate your code on them. Then run the code on the entire set of 720 sentences to count the number of times the word “the” appears in each sentence. End by summarizing your results: state the number of sentences that have the word “the” exactly 0 times, exactly 1 time, exactly 2 times, exactly 3 times, etc. A possible starting point is sketched below.
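
    One possible starting point for the sketch mentioned above (the \b word boundary and the ignore_case option of regex() are one approach among several; validate the counts yourself before scaling up):

    tibble(
      sentence = sentences[1:4],
      count = str_count(sentence, regex("\\bthe\\b", ignore_case = TRUE))
    )
    str_view_all(sentences[1:4], regex("\\bthe\\b", ignore_case = TRUE))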

  2. Run the following code and explain what each line of code does.

    month.abb
    (months <- str_c(month.abb, collapse = "|"))
    (sentences_with_month <- str_subset(sentences, months))
    str_view_all(sentences_with_month,months)
    
    tibble(sentence = sentences) %>% 
      mutate(location = str_locate(sentence,months)) %>% 
      mutate(id = row_number()) %>% 
      drop_na(location)
    
    sentences[1]
    sentences[1] %>% str_split(pattern = "")
    (my_vec <- sentences[1] %>% str_split(pattern = "") %>% unlist())
    seq(0,10,3)
    my_vec[seq(0,10,2)]
    
    sentences[2]
    sentences[2] %>% str_split(pattern = "\\s") %>% unlist()
  3. Use the readr::read_lines() function to read in these two files: randomletters.txt and randomletters_wnumbers.txt. We’ll be using these text files to locate some hidden messages.

    • With the randomletters.txt file, pull out every 1700th letter and find the quote that is hidden. The quote ends with a period (there may be some extra letters at the end). You should be pulling out the 1st letter, then the 1700th, then the 3400th, and continue counting by 1700. [Hint: The str_split("") and unlist() commands, from the example above, will let you convert the text into a vector. Then you can use vector notation to pick out the letters you want. A sketch appears after this list.]
    • With the randomletters_wnumbers.txt file, find all the hidden numbers and convert them to letters using each number’s position in the alphabet (1 = a, 2 = b, and so on) to decipher the message. The message starts with “experts”.
    • With the randomletters.txt file, remove all the spaces and periods from the string and then find the longest sequence of vowels.
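
    For the first bullet, here is a minimal sketch (letters_string is a hypothetical name for the single string you read in; adjust the seq() arguments to match the positions described above):

    letter_vec <- letters_string %>% str_split(pattern = "") %>% unlist()
    letter_vec[seq(1, length(letter_vec), by = 1700)] %>% str_flatten()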
  4. With any remaining time, pick something else you might count/extract/detect/locate in each sentence of the sentences vector. Then practice counting, detecting, extracting, and locating. Challenge yourself. Here are some examples (a small sketch follows this list).

    • How many characters are in each sentence? How many words are in each sentence?
    • How many times does a vowel show up in each sentence? How many consonants are there? How many punctuation marks?
    • How many times does a pair of identical letters show up next to each other (such as “oo” or “ll” or “ss”)?
    • Detect whether or not the sentence contains a number (so one/two/three/etc. in written form).
    • Detect if the sentence contains a month (you can use month.name and/or month.abb to help). Where in the sentence does the month appear?
    • What is the longest word in each sentence? Where does the longest word appear? How long is the longest word in each sentence?
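
    For instance, here is a small sketch computing a few of these at once (str_length() and str_count() are vectorized over the whole sentences vector, and the doubled-letter pattern uses the backreference \1):

    tibble(sentence = sentences) %>% 
      mutate(
        n_chars = str_length(sentence),
        n_vowels = str_count(sentence, "[aeiouAEIOU]"),
        n_doubles = str_count(sentence, "([a-z])\\1")
      )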

4.3 Task: Counting words

In 1978, Susan Easton Black penned an article in the Ensign titled Names of Christ in the Book of Mormon, which claims that “even statistically, he is the dominant figure of the Book of Mormon”. Similar to Susan Black, we are going to use our string skills to count words and occurrences in the New Testament and in the Book of Mormon.

  1. What is the average verse length (number of words) in the New Testament compared to the Book of Mormon?
  2. How often is the word “Jesus” in the New Testament compared to the Book of Mormon?
  3. What does the distribution of verse word counts look like for each book in the Book of Mormon?

These functions are worth exploring to help you answer the three questions.

The forcats cheat sheet might help with your graph for question 3, if you want the book names to appear in the same order as they do in the Book of Mormon; one way is sketched below.
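
For instance, one way to impose that order is to set explicit factor levels (a sketch; book_title is an assumed column name, and scriptures_data is created in exercise 2 below):

books_in_order <- c("1 Nephi", "2 Nephi", "Jacob", "Enos", "Jarom", "Omni",
                    "Words of Mormon", "Mosiah", "Alma", "Helaman",
                    "3 Nephi", "4 Nephi", "Mormon", "Ether", "Moroni")
scriptures_data %>% 
  mutate(book_title = factor(book_title, levels = books_in_order))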

  1. Examine the documentation for the stringi::stri_stats_latex() function. Then run the following lines of code to examine various ways to use this function.

    sentences[1]
    stringi::stri_stats_latex(sentences[1])
    stringi::stri_stats_latex(sentences[1])["Words"]
    
    sentences[1:4]
    stringi::stri_stats_latex(sentences[1:4])
    stringi::stri_stats_latex(sentences[1:4])["Words"]
    
    tibble(sentence = sentences[1:4]) %>% 
      mutate(words = stringi::stri_stats_latex(sentence)["Words"])
    
    #We can group_by() a unique identifier before calling an unvectorized function
    tibble(sentence =  sentences[1:4]) %>% 
      mutate(id = row_number()) %>% 
      group_by(id) %>% 
      mutate(words = stringi::stri_stats_latex(sentence)["Words"]) %>% 
      ungroup()
    
    #For those familiar with the map() command in functional programming languages... 
    tibble(sentence =  sentences[1:4]) %>% 
      mutate(words = map_int(sentence, function(x){stringi::stri_stats_latex(x)["Words"]})) 

    Note that the function returns a named vector of statistics, so to get just the number of words we use ["Words"] after the function call to select that element. In addition, stringi::stri_stats_latex() is not vectorized to return one output per input string (it computes a single set of statistics for the entire input), which is why we had to group_by() a unique identifier above before running the command.

  2. Import the scripture data from http://scriptures.nephi.org/downloads/lds-scriptures.csv.zip, and make sure it is a tibble (tidy table).

    scriptures_data <- rio::import("http://scriptures.nephi.org/downloads/lds-scriptures.csv.zip") %>% 
      as_tibble()
  3. Write code that outputs the answers to questions 1 and 2. One possible shape is sketched below.
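
    One possible shape for questions 1 and 2 (a sketch only; it assumes volume_title and scripture_text are the column names in the csv, with the volume names as they appear in the data, and it uses stringr’s boundary("word") as the word counter in place of stri_stats_latex()):

    scriptures_data %>% 
      filter(volume_title %in% c("New Testament", "Book of Mormon")) %>% 
      mutate(words = str_count(scripture_text, boundary("word"))) %>% 
      group_by(volume_title) %>% 
      summarise(avg_verse_length = mean(words),
                jesus_count = sum(str_count(scripture_text, "Jesus")))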

  4. Create a visualization that addresses question 3. Write a short paragraph describing what you learn about books in the Book of Mormon.

  5. With any time remaining, come up with your own questions about the scriptures which we could analyze. Then answer some of them (by making tables and/or visualizations).

4.4 Task: Regex Look Arounds

In Case Study 5, we’ll be analyzing the headlines from two different news outlets, one in California (KCRA) and one in New York (ABC7NY). The data spans July 18, 2017 - Jan 16, 2018. Our goal will be to identify the 15 cities with the highest headline count overall.

For today’s task, we’ll analyze the headlines from one of these news outlets, namely ABC7NY, and focus on how we can use regular expression look arounds to help us refine our searches. The following code loads the data, glimpses it, and then provides a quick summary of how many times a few cities appear in the headlines, utilizing str_flatten() to turn a vector of city names into a search pattern (by inserting “|” between each city).

abc_data <- read_csv("https://storybench.org/reinventingtv/abc7ny.csv")
abc_data %>% glimpse()
cities <- c("Sandy","Las Vegas","Charlotte","Stockton","New York","Rexburg","Moore") %>% 
  str_flatten(collapse = "|")
cities
abc_data %>% 
  mutate(city = str_extract(headline, pattern = cities)) %>% 
  count(city)

When you run the code above, you’ll find that Rexburg and Stockton (a city in California) don’t appear in the headlines from this New York news outlet. However, you should find that “Sandy” (which is a town in Utah) appears 117 times in these headlines. Do these references point to Sandy, UT, or do they refer to something else? Once we’ve analyzed “Sandy”, we can repeat this analysis on other town names as well.

To learn more about regular expression look arounds (lookahead and lookbehind) see the bottom of the second page of this cheat sheet, or search for a help page on regular expression look arounds.
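
As a quick illustration of the syntax: (?!...) is a negative lookahead (the match may not be followed by the given text) and (?<!...) is a negative lookbehind (the match may not be preceded by it). A minimal sketch on made-up headlines:

x <- c("Sandy Kenyon reviews a film",
       "Sandy, a town in Utah",
       "Superstorm Sandy anniversary")
str_detect(x, "Sandy")                  #TRUE TRUE TRUE
str_detect(x, "Sandy(?! Kenyon)")       #FALSE TRUE TRUE
str_detect(x, "(?<!Superstorm )Sandy")  #TRUE TRUE FALSE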

  1. Run the code above to load in the data and view the summary counts, verifying that “Sandy” appears 117 times in the headlines. Then run each line of code below, and explain to yourself (or a rubber duck) what that line of code does.

    headlines <- abc_data %>% select(headline)
    headlines %>% glimpse()
    
    headlines %>% 
      mutate(keep = str_detect(headline,"Sandy")) %>% 
      filter(keep == TRUE) 
    #Is Sandy Kenyon a person?
    
    headlines %>% 
      mutate(keep = str_detect(headline,"Sandy(?! Kenyon)")) %>% 
      filter(keep == TRUE) 
    #What was Superstorm Sandy?
    
    headlines %>% 
      mutate(keep = str_detect(headline,"(?<!Superstorm )Sandy(?! Kenyon)")) %>% 
      filter(keep == TRUE) 
    #What is Sandy Hook?
    
    headlines %>% 
      mutate(keep = str_detect(headline,"(?<!Superstorm )Sandy(?! Kenyon)(?! Hook)")) %>% 
      filter(keep == TRUE) 
    #How many of the remaining references are for Sandy, UT?

    Were you able to determine what (?! Kenyon) and (?<!Superstorm ) do in the code above? They are examples of a negative lookahead and a negative lookbehind.

    By the time you’re done, you’ll have reduced the original 117 “Sandy” references down to 7, probably none of which refer to Sandy, UT. The point here is that we were able to reduce 117 references down to 7 with very little effort. Even though “Sandy” appears in the data quite a bit, we don’t want to include it in a report of the 15 cities with the highest headline count overall.

  2. Repeat a similar analysis to explore Charlotte, a large city in North Carolina. Are the references that find “Charlotte” pointing to Charlotte, NC, or are they pointing to somewhere (or something) else? If they are pointing to a different place, what place is it?

  3. Can you explain why these two chunks of code return different results?

    abc_data %>% 
      mutate(city = str_extract(headline, "Charlotte|Charlottesville")) %>% 
      count(city)  
    abc_data %>% 
      mutate(city = str_extract(headline, "Charlottesville|Charlotte")) %>% 
      count(city)    

    How does order affect search patterns?

  4. Pick another city and repeat this analysis. Note that you will have combed through almost 10,000 headlines each time you perform your analysis.

  5. Once you feel comfortable combing through the data and using look arounds to help refine your searches, feel free to move to the next task, which is to begin Case Study 5.

4.5 Task: Begin Case Study 5

Each case study throughout the semester will ask you to demonstrate what you’ve learned in the context of an open-ended case study. The individual prep activity will always include a “being” reading, which will lead to an in-class discussion.

  1. Complete the Being Reading section of Case Study 5. This will require that you read an article (or articles) and come to class with two or three things to share.

  2. Download the data for the case study and begin exploring it. Your goal at this point is simply to understand the data better. As needed, make charts and/or summary tables to help you explore the data. Be prepared to share any questions you have about the data with your team.

  3. Identify what visualizations this case study wants you to create. For each visualization, construct a paper sketch, listing the aesthetics (x, y, color, group, label, etc.), the geometries, any faceting, etc., that you want to appear in your visualization. Then construct on paper an outline of the table of data that you’ll need to create said visualization. You do not need to actually perform any computations in R; rather, create rough hand sketches of the visualization and table that you can share with your team in class.

  4. With any remaining time, feel free to start transferring your plans into code, and perform the wrangling and visualizing with R.

4.6 Task: Dates and times Reading

  1. Read R4DS: Dates and times. As you read, do at least two of the practice exercises from each section. Get as far as you can in one hour. You are welcome to perform your computations in a script (.R) or markdown (.Rmd) file.
  2. When you are done, push your work to your GitHub repo. Then pick at least two exercises you would like to discuss (perhaps ones that were very helpful, or very difficult). Be prepared to share them with your team.

4.7 Task: Does the weather hurt my bottom line?

A car wash business in Rexburg, Idaho, wants to see if the temperature hurts its bottom line. They have point-of-sale data for the months of April, May, June, and July. You will need to aggregate the data into hourly sales totals and merge the sales data with the temperature data to provide insight into the relationship between temperature and car wash sales.

Here are some additional materials that may help you tackle this task.

Here are your tasks.

  1. Read in the car wash data from https://byuistats.github.io/M335/data/carwash.csv and format it for the needs of this task. Remember that readr::read_csv() automatically brings .csv data in as a tibble, while rio::import() does not.

    • Convert the times from UTC to Mountain Time using the correct function from library(lubridate). Note that Rexburg, ID, uses the “America/Denver” time zone.
    • Create a new hourly grouping variable using ceiling_date() from library(lubridate).
    • For each hour, summarize the data by providing the total amount of sales made that hour. This is often referred to as “aggregating the point of sale data into hourly sales totals.” (These three steps are sketched after this list.)
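
    Here is a sketch of the three bullets above (time is the column name used with this file later in the unit; amount is an assumed name for the sales column, so adjust it to what you find in the data):

    carwash <- read_csv("https://byuistats.github.io/M335/data/carwash.csv") %>% 
      mutate(time = with_tz(time, tzone = "America/Denver")) %>% 
      mutate(hour = ceiling_date(time, unit = "hour"))
    hourly_sales <- carwash %>% 
      group_by(hour) %>% 
      summarise(total_sales = sum(amount))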
  2. Use riem_measures(station = "RXE", date_start = , date_end = ) for station RXE (Rexburg) from library(riem) to get the matching temperatures for the given hours.

    • You’ll need to figure out the start/end dates to properly use the function.
    • Create a new hourly variable that matches your car wash hourly variable, so you can merge the two data sets.
    • You’ll have to decide how to deal with the situations in which more than one temperature is reported for a given hour (one option is sketched after this list).
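
    One option is sketched here (the dates are placeholders you must replace; valid and tmpf are columns returned by riem_measures(), and averaging is just one way to handle multiple readings in an hour):

    rxe_temps <- riem::riem_measures(station = "RXE",
                                     date_start = "2022-04-01",
                                     date_end = "2022-08-01") %>% 
      mutate(valid = with_tz(valid, tzone = "America/Denver")) %>% 
      mutate(hour = ceiling_date(valid, unit = "hour")) %>% 
      drop_na(tmpf) %>% 
      group_by(hour) %>% 
      summarise(tmpf = mean(tmpf))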
  3. Merge the two data sets together. For each hour, you should have a temperature and a total sales figure.

  4. Create a visualization that provides insight into the relationship between sales and temperature by hour of the day.

4.8 Task: Factors

When we want to display textual data in a pre-specified, non-alphabetical order, we need more than just characters and strings. This is the role played by factors.
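
For example, a character vector always sorts alphabetically, while a factor sorts by its levels (a minimal sketch; forcats functions such as fct_reorder() build these orderings from another variable):

x <- c("Med", "Low", "High", "Med")
sort(x)                                            #alphabetical: "High" "Low" "Med" "Med"
sort(factor(x, levels = c("Low", "Med", "High")))  #level order: Low Med Med High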

  1. Read R4DS: Factors. There are a grand total of 8 exercises in this chapter. Complete any that interest you, and then proceed to the other exercises below.

  2. The code below creates a boxplot of arrival delay for each airline carrier in the flights data set, which is part of the library(nycflights13) package we explored earlier in the semester. Use factors to reorder the name variable by the median of arr_delay. Try ordering things smallest to largest, and then largest to smallest. [Hint: you’ll have to deal with NA values in a meaningful way.]

    library(tidyverse)
    library(nycflights13)
    
    my_flights <- flights %>% 
      left_join(airlines)
    my_flights %>% 
      ggplot(aes(x = name, y = arr_delay)) +
      geom_boxplot() +
      coord_cartesian(ylim = c(-40,40)) +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
  3. Head back to a previous task where factors would help you control the display of your categorical variables (such as 2.10, 3.2, 3.8, 3.10, 3.11), and then add some code to practice using factors.

4.9 Task: Density Plots

Sometimes when we make a point plot of individual observations, there are too many points for the plot to be useful. In this task, we’ll explore several alternatives. This Stack Overflow post has a great discussion around the topic, with lots of examples to explore. We’ll focus our exploration on a few data sets we’ve encountered before.

  1. Consider the car wash sales data we just explored (whose time zone we need to shift to match Rexburg). We can construct a point plot, jitter plot, or a density plot to help us understand different things about the data. Please construct each plot below, and come up with a question that each plot could help you address.

    carwash <- read_csv("https://byuistats.github.io/M335/data/carwash.csv") %>% 
      mutate(time = with_tz(time,tzone = "America/Denver"))
    
    carwash %>% 
      ggplot(aes(x = time, y = name)) +
      geom_point()
    
    carwash %>% 
      ggplot(aes(x = time, y = name)) +
      geom_jitter(size = 0.5)
    
    carwash %>% 
      ggplot(aes(x = time, y = name)) +
      geom_bin_2d()
    
    carwash %>% 
      ggplot(aes(x = time, y = name)) +
      geom_bin_2d() +
      geom_jitter(size = 0.5, color = "white")
    
    times <- carwash %>% pull(time)
    days <- as.numeric(max(times)-min(times)) %>% round(0)
    carwash %>% 
      ggplot(aes(x = time, y = name)) +
      geom_bin_2d(bins = days)
    carwash %>% 
      ggplot(aes(x = time, y = name)) +
      geom_bin_2d(bins = days/7)
    
    carwash %>% 
      ggplot(aes(x = wday(time, label = TRUE), y = name)) +
      geom_jitter(size = 0.1)
    
    carwash %>% 
      ggplot(aes(x = wday(time, label = TRUE), y = name)) +
      geom_bin_2d()
    
    carwash %>% 
      ggplot(aes(x = wday(time, label = TRUE), y = name)) +
      geom_bin_2d() +
      geom_jitter(size = 0.1, color = "white")
    
    carwash %>% 
      mutate(time = update(time, year = 2020, month = 2, mday = 2)) %>% 
      ggplot(aes(x = time, y = name)) +
      geom_jitter(size = 0.1)
    
    carwash %>% 
      mutate(time = update(time, year = 2020, month = 2, mday = 2)) %>% 
      ggplot(aes(x = time, y = name)) +
      geom_bin_2d(binwidth = 60*60) +
      geom_jitter(color = "white", size = 0.1)
    
    carwash %>% 
      mutate(time = update(time, year = 2020, month = 2, mday = 2)) %>% 
      ggplot(aes(x = time, y = name)) +
      geom_bin_2d(binwidth = 60*60)

    Are there things that geom_bin_2d can help you see that geom_jitter does not? Are there things that geom_jitter can help you see that geom_bin_2d does not? Is it better to include both, or just one?

  2. Let’s return to the flight data in the nycflights13 package. In that dataset, there are over 300,000 flights. Can we use the scheduled arrival time to predict how long the arrival delay might be? The following lines of code are all designed to address that problem in different ways. Run each line of code, and make a list of any questions or concerns you have about each plot.

    library(nycflights13)
    my_flights <- 
      flights %>%
      select(sched_arr_time, arr_delay) %>% 
      glimpse()
    
    #This plot may take a long time to generate, as there are over 300,000 points.
    my_flights %>% 
      ggplot(aes(x=sched_arr_time,y=arr_delay))+
      geom_point()
    
    #Can you tell what the most common scheduled arrival time is?
    my_flights %>% 
      ggplot(aes(x=sched_arr_time,y=arr_delay))+
      geom_bin_2d(bins = 24)
    
    #For flights that arrive more than 60 minutes late, 
    # what's the most common scheduled arrival time?
    my_flights %>% 
      filter(arr_delay>60) %>% 
      ggplot(aes(x=sched_arr_time,y=arr_delay))+
      geom_bin_2d(bins = 24)
    
    #Double the number of bins, to examine every half hour. 
    #What problems do you see?
    my_flights %>% 
      filter(arr_delay>60) %>% 
      ggplot(aes(x=sched_arr_time,y=arr_delay))+
      geom_bin_2d(bins = 48)
    
    #What does the code below do? Why is something like this needed?
    my_flights %>% 
      separate(sched_arr_time, into = c("hour","min"), sep = -2,remove = FALSE) %>% 
      mutate(sched_arr_time_lub = hm(str_c(hour,min,sep = " "))) %>% 
      filter(arr_delay>60) %>% 
      ggplot(aes(x=as.numeric(sched_arr_time_lub, "hours"),y=arr_delay))+
      geom_bin_2d(bins = 48)
  3. The riem package lets us gather temperature data from weather stations during a given time period. What does the following code do, and what message(s) can you extract from the visualization?

    temps <- riem::riem_measures(station = "RXE", date_start = today() - years(2), date_end = today())
    temps
    temps %>% 
      mutate(valid = with_tz(valid, tzone = "America/Denver")) %>% 
      mutate(date = date(valid)) %>% 
      filter(month(valid) %in% c(1,12)) %>% 
      mutate(hour = hour(valid)) %>% 
      drop_na(tmpf) %>% 
      group_by(date,hour) %>% 
      summarise(tmpf = mean(tmpf)) %>% 
      ggplot(aes(x = hour, y = tmpf)) +
      geom_bin_2d(bins = 24) 
  4. Case study 2 asked us to look for target audiences where large numbers of gun deaths occurred. Create several density plots to help address this question. Here is the code to read in the data.

    gun_deaths <- read_csv("https://github.com/fivethirtyeight/guns-data/blob/master/full_data.csv?raw=true")
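
    Using the gun_deaths object read in above, a first density plot might look like this (a sketch; age and race are assumed column names, so verify them with glimpse() first):

    gun_deaths %>% 
      drop_na(age) %>% 
      ggplot(aes(x = age, y = race)) +
      geom_bin_2d(bins = 30)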
  5. In the next case study, Case Study 6, we’ll be adding a few more businesses to the carwash data we previously explored. Read in the data from https://byuistats.github.io/M335/data/sales.csv. Then explore the data. What can jitter and density (bin_2d) plots help you learn? Do you have the correct time zone?

4.10 Task: Begin Case Study 6

Each case study throughout the semester will ask you to demonstrate what you’ve learned in the context of an open-ended case study. The individual prep activity will always include a “being” reading, which will lead to an in-class discussion.

  1. Complete the Being Reading section of Case Study 6. This will require that you read an article (or articles) and come to class with two or three things to share.

  2. Download the data for the case study and begin exploring it. Your goal at this point is simply to understand the data better. As needed, make charts and/or summary tables to help you explore the data. Be prepared to share any questions you have about the data with your team.

  3. Identify what visualizations this case study wants you to create. For each visualization, construct a paper sketch, listing the aesthetics (x, y, color, group, label, etc.), the geometries, any faceting, etc., that you want to appear in your visualization. Then construct on paper an outline of the table of data that you’ll need to create said visualization. You do not need to actually perform any computations in R; rather, create rough hand sketches of the visualization and table that you can share with your team in class.

  4. With any remaining time, feel free to start transferring your plans into code, and perform the wrangling and visualizing with R.

4.11 Task: Improve a Previous Task

  1. Choose an old task you got stuck on, or would like to develop further. Spend this prep time improving the old task.
  2. Push your work to GitHub so you can share it with your classmates.
  3. Be prepared to share something new you learned, or something you still have questions about.

4.12 Task: Finish Case Study 5

Finish Case Study 5 and submit it. There is no other preparation task. The goal is to free up time to allow you to focus on and complete the case study.

4.13 Task: Finish Case Study 6

Finish Case Study 6 and submit it. There is no other preparation task. The goal is to free up time to allow you to focus on and complete the case study.

4.14 Task: Practice Coding Challenge

  1. Find a place where you can focus for 1 hour on the coding challenge. Make sure you have everything you need (laptop charger, water bottle, etc.).
  2. When you are ready to begin, go to our class I-Learn page for the instructions. You will submit your work in I-Learn.
  3. During class, we’ll have a discussion about the coding challenge.