Visualizing Large Distributions

Background

Before we can start answering business questions, we need to become familiar with our data. A good starting point is often the data dictionary (What is a Data Dictionary?). However, you can also dive straight into the data and build an understanding from the variable names and types.
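As a minimal sketch of diving in through names and types, the chunk below inspects nycflights13::flights, the data set used in this task. It assumes the nycflights13 and dplyr packages are installed; the object name na_counts is only illustrative.

```r
library(nycflights13)  # provides the flights data frame
library(dplyr)         # provides glimpse()

# Column names, their types, and the first few values of each
glimpse(flights)

# Number of rows (flights) and columns (variables)
dim(flights)

# Missing values per column, to see which variables need extra care
na_counts <- colSums(is.na(flights))
na_counts
```

Reading this output next to the data dictionary (?flights in R) is usually enough to decide which variables are worth a closer look.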

Beyond the description of each variable, we also need to understand how the variables relate to each other. We can create tables or visualizations that summarize these relationships. At this point, we are deepening our understanding of the data as well as beginning our analysis.
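One possible summary table, sketched here with dplyr, relates departure airport to departure delay in nycflights13::flights. The choice of origin and dep_delay, and the name delay_by_origin, are assumptions made for illustration; any pair of variables could take their place.

```r
library(nycflights13)
library(dplyr)

# One row per origin airport, summarizing the departure delay (minutes)
delay_by_origin <- flights %>%
  group_by(origin) %>%
  summarise(
    n_flights        = n(),
    median_dep_delay = median(dep_delay, na.rm = TRUE),  # NA means no recorded departure
    mean_dep_delay   = mean(dep_delay, na.rm = TRUE)
  )

delay_by_origin
```

Reporting both the median and the mean is a deliberate choice: delays are strongly right-skewed, so the two can tell different stories.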

Remember: Your job is to become the data expert, not the domain expert. You will build domain skills, but you are not going to replace domain experts. People will depend on you to have a firm understanding of what data your company has available to answer domain-specific questions.

Use nycflights13::flights to practice summarizing data and investigating it through visualization.

Tasks

  • Create a new .Rmd to do this task
  • Pick two variables (columns) whose relationship you would like to explore
    • Provide a visualization of the univariate distribution of each selected variable separately (i.e., 2 plots are needed here, 1 for each variable)
    • Build bivariate summaries of the variables you have chosen to investigate (1 plot is needed here; it should contain both variables)
  • Write one to two paragraphs in the .Rmd summarizing insights from your graphics and your data presentation choices
  • Knit your .Rmd and then push your .Rmd, .md, and .html into your git repository.
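
The plotting tasks above could be sketched as follows, assuming ggplot2 is installed and taking distance and air_time as the two variables (a choice made only for this example). The plot object names are hypothetical. Because flights has several hundred thousand rows, the bivariate plot bins the points into a 2-D count grid rather than drawing a scatterplot, which would overplot heavily.

```r
library(nycflights13)
library(ggplot2)

# Univariate plot 1: distribution of flight distance (miles)
p_distance <- ggplot(flights, aes(x = distance)) +
  geom_histogram(binwidth = 100) +
  labs(x = "Distance (miles)", y = "Number of flights")

# Univariate plot 2: distribution of air time (minutes);
# na.rm = TRUE silently drops flights with no recorded air time
p_air_time <- ggplot(flights, aes(x = air_time)) +
  geom_histogram(binwidth = 10, na.rm = TRUE) +
  labs(x = "Air time (minutes)", y = "Number of flights")

# Bivariate plot: both variables together, binned into 2-D counts
p_both <- ggplot(flights, aes(x = distance, y = air_time)) +
  geom_bin2d(bins = 60, na.rm = TRUE) +
  labs(x = "Distance (miles)", y = "Air time (minutes)", fill = "Flights")

# Printing a plot object renders it (in an .Rmd chunk, it appears in the knitted output)
p_distance
p_air_time
p_both
```

The bin widths and bin count here are starting points, not recommendations; trying a few values is part of the data presentation choices you are asked to discuss.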