38  Data Search

These readings are from the book Build a Career in Data Science. If the links below don’t work for you, you can get an electronic version of the book from the BYU-Idaho library.

Optional:

Many sites recommend places to find free data. Here is one of them: Finding data to answer your question. A couple of other fun places include:

  • Data is Plural is a newsletter that sends you interesting datasets every week. Scroll through the archive and see if any topics jump out at you!

  • Tidy Tuesday is a community of R users who explore and visualize a new dataset every Tuesday. This GitHub account contains every dataset used in Tidy Tuesday. Click on the “data” folder and then pick a year to start exploring.

Finding good data takes time, and it can take longer than tidying the data. Finding the data you need for your semester project could easily take three to six hours. You may not finish this task this week, but you should definitely start it so that you are not rushed at the end of the semester. After you find good potential data sources, pick one to focus on for this task.

Don’t be afraid to pivot or change directions if you find a dataset that sparks your interest, or if you cannot find data to answer your initial question.

Create a .qmd file or R script that has links to data sources with a description of the quality of each.

  1. Check out 3-5 free potential datasets/sources. Choose one to focus on for this assignment (later you may bring in the other sources or switch to another source).
  2. For the source you chose, build an R script or markdown file that reads in, formats, and visualizes the data using the principles of exploratory analysis.
Warning

CAUTION: If you have to download your data (as opposed to reading it directly from the web), you may want to store it outside of the project folder so it is not uploaded to GitHub. Free GitHub has a space limit, and you cannot push large datasets there. There are also other ways to make Git ignore your data.
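
One common way to make Git ignore downloaded data is to list the data folder in the project's `.gitignore` file. The following is a minimal sketch in base R, not a required part of the task; the helper name `ignore_in_git()` and the folder name `data/` are assumptions made for illustration.

```r
# Hypothetical helper: add a pattern to a .gitignore file if it is not already listed.
ignore_in_git <- function(pattern, gitignore = ".gitignore") {
  # Read existing entries, or start empty if the file does not exist yet
  existing <- if (file.exists(gitignore)) {
    readLines(gitignore, warn = FALSE)
  } else {
    character(0)
  }
  already_listed <- pattern %in% existing
  if (!already_listed) {
    # Rewrite the file with the new pattern appended as the last line
    writeLines(c(existing, pattern), gitignore)
  }
  invisible(already_listed)  # TRUE if the pattern was already present
}
```

Calling `ignore_in_git("data/")` from the project root would tell Git to skip a folder named `data/` (an assumed layout). Note that `.gitignore` only affects files Git is not already tracking, so it works best if the pattern is added before the data is first committed.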

  3. Create two to three quick visualizations to check the quality of your data. (Keep the code for at least one of them.)

  4. Think through the limitations of the compiled data in addressing your original question.

    1. Also think about, and begin to address, any follow-on or alternate questions that you could use for your project.
  5. Push your R script or markdown files to your GitHub repository.
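
The read, format, and quality-check steps in the list above might look roughly like the following sketch. It uses only base R so that it runs without extra packages, and it reads a tiny made-up inline CSV in place of a real source; the column names (`date`, `station`, `temp_f`) and values are purely illustrative assumptions.

```r
# Stand-in data: a tiny inline CSV (made up for illustration), not a real source.
# With a real dataset, pass the file path or URL to read.csv() instead of `text =`.
raw_csv <- "date,station,temp_f
2023-01-01,A,31.5
2023-01-02,A,29.0
2023-01-03,A,NA
2023-01-01,B,45.2
2023-01-02,B,451.0
2023-01-03,B,44.1"

weather <- read.csv(text = raw_csv, stringsAsFactors = FALSE)

# Format: convert text columns to proper types
weather$date <- as.Date(weather$date)
weather$station <- factor(weather$station)

# Quality check 1: column types and number of missing values per column
str(weather)
missing_counts <- colSums(is.na(weather))
print(missing_counts)

# Quality check 2: distribution of a numeric column
# (suspicious values tend to show up as isolated bars far from the rest)
hist(weather$temp_f, main = "Distribution of temp_f", xlab = "temp_f")

# Quality check 3: values over time, colored by group
# (gaps and sudden jumps are easy to spot)
plot(weather$date, weather$temp_f, col = weather$station, pch = 19,
     xlab = "date", ylab = "temp_f", main = "temp_f by date and station")
```

In this toy example the checks surface one missing value and one implausible reading, which are the kinds of limitations worth noting when describing the quality of a source.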

Submit

In I-Learn, submit a link to the script or .md file on GitHub.