Project 2: Late Flights & Missing Data (JSON)
Walkthrough
Background
We will complete six projects during the semester that each take about two weeks (four days of class). On average, a student will spend 2 hours outside of class per hour in class to complete the assigned readings, submit any Canvas items, and complete the project (for a total of 8 hours per project). The instruction for each project will be structured into sections as written on this page.
This first Background section provides context for the project. Make sure you read the background carefully to see the big picture needs and purpose of the project.
Delayed flights are not something most people look forward to. In the best case scenario you may only wait a few extra minutes for the plane to be cleaned. However, those few minutes can stretch into hours if a mechanical issue is discovered or a storm develops. Arriving hours late may result in you missing a connecting flight, job interview, or your best friend’s wedding.
In 2003 the Bureau of Transportation Statistics (BTS) began collecting data on the causes of delayed flights. The categories they use are Air Carrier, National Aviation System, Weather, Late-Arriving Aircraft, and Security. You can visit the BTS website to read definitions of these categories.
Client Request
The JSON file for this project contains information on delays at 7 airports over 10 years. Your task is to clean the data, search for insights about flight delays, and communicate your results to the Client. The Client is a CEO of a flight booking app who is interested in the causes of flight delays and wants to know which airports have the worst delays. They also want to know the best month to fly if you want to avoid delays of any length.
Data
Every data science project should start with data, and our class projects are no different. Each project will have ‘URL’ and ‘Information’ links like the ones below. Right-click the ‘URL’ link and select “Copy Link” to use it to import the data into your project. This is the preferred method to get data into your report, as you will be publishing your report to GitHub. If you choose to download the data file to your computer, you will need to save it in the same folder as your project#.qmd file for it to work correctly in GitHub.
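As a minimal sketch of the import step, the snippet below reads JSON into a pandas DataFrame. The two records and their column names are made up for illustration, and an in-memory string stands in for the project file so the example runs without a network connection.

```python
import io

import pandas as pd

# Hypothetical stand-in for the project file: two made-up records.
sample_json = io.StringIO(
    '[{"airport_code": "AAA", "month": "January", "num_of_delays_total": 10},'
    ' {"airport_code": "BBB", "month": "February", "num_of_delays_total": 7}]'
)

# pd.read_json accepts a file-like object, a local path, or a URL string,
# so the copied 'URL' link could be passed here in place of sample_json.
df = pd.read_json(sample_json)

print(df.shape)
print(df.head())
```

Passing the copied link as a string keeps the report reproducible when it is rendered on GitHub, since no local file is required.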
URL: JSON File
Information: Data Description
Subject Matter: Types of Delay
Readings
The Readings section will contain links to reading assignments that are required for each project, as well as optional references. Remember that you are reading this material to build skills. Take the time to comprehend the readings and the skills contained within.
We recommend reading through the assigned material once for a general understanding before the first day of each project. You will reread and reference the material multiple times as you complete the project.
- P4DS: CH4 Data Transformation (Read)
- P4DS: CH6 Tidy Data (Read)
- P4DS: CH11 Visualization (Read)
- P4DS: CH12 Layers (Skim)
- P4DS: CH13 Exploratory Data Analysis (Skim)
- P4DS: CH21 Missing Values (Read)
- P4DS: Ch25.3 JSON (Read)
Optional References
- Python Data Science Handbook: Missing Data
- Handling Missing Data
- Wikipedia Missing Data
- isin method
- where method
- np.where method
- replace method
- An introduction to JSON (May need to open in incognito to read.)
- The key word in ‘Data Science’ is not Data…
- How to Handle Missing Data (May need to open in incognito to read.)
- Lambda Function
Questions and Tasks (Core)
This section lists the questions and tasks that need to be completed for the project. Your work on the project must be compiled into a report, pushed to GitHub and a URL submitted in Canvas by the weekend following the last day of material for the project.
There are two types of questions: Core and Stretch. Core questions are required for each project. The course syllabus competencies require a specific number of projects with all the Stretch questions achieved, based on the grade level you are seeking.
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.
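One possible way to approach this task is sketched below on made-up data. The placeholder encodings (-999, "n/a", and the empty string) and the column names are assumptions for illustration only; inspect the real file to find the encodings it actually uses.

```python
import json

import numpy as np
import pandas as pd

# Made-up records; -999, "n/a", and "" are assumed placeholder encodings.
df = pd.DataFrame(
    {
        "airport_code": ["AAA", "BBB", "CCC"],
        "month": ["January", "n/a", "March"],
        "num_of_delays_late_aircraft": [-999.0, 120.0, 95.0],
        "airport_name": ["", "Beta Field", "Gamma Intl"],
    }
)

# Map every assumed placeholder to NaN so pandas treats them all as missing.
clean = df.replace([-999, "n/a", ""], np.nan)

# Python's json module writes a float NaN as the (non-standard) token NaN.
record = clean.to_dict(orient="records")[0]
record_json = json.dumps(record)
print(record_json)
```

Note that `DataFrame.to_json` writes missing values as `null`, so the sketch uses the standard-library `json.dumps` to keep the NaN token visible in the printed record.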
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
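The summary table described above could be built with a grouped aggregation, as in this sketch on made-up rows. The column names are assumptions, and "average delay time" is taken here to mean delay minutes per delayed flight, converted to hours; a different definition may suit your chosen metric better.

```python
import pandas as pd

# Made-up monthly rows; the column names are assumptions.
df = pd.DataFrame(
    {
        "airport_code": ["AAA", "AAA", "BBB", "BBB"],
        "num_of_flights_total": [1000, 1000, 500, 500],
        "num_of_delays_total": [200, 300, 50, 50],
        "minutes_delayed_total": [12000, 18000, 6000, 6000],
    }
)

# Sum the monthly counts for each airport.
summary = df.groupby("airport_code").agg(
    total_flights=("num_of_flights_total", "sum"),
    total_delays=("num_of_delays_total", "sum"),
    total_delay_minutes=("minutes_delayed_total", "sum"),
)

# Derive the two ratios from the summed totals, not from monthly averages.
summary["proportion_delayed"] = summary["total_delays"] / summary["total_flights"]
summary["avg_delay_hours"] = (
    summary["total_delay_minutes"] / summary["total_delays"] / 60
)

summary = summary.sort_values("proportion_delayed", ascending=False)
print(summary)
```

Computing the ratios after summing weights each month by its flight volume, which avoids giving a quiet month the same influence as a busy one.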
What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
- 100% of delayed flights in the Weather category are due to weather
- 30% of all delayed flights in the Late-Arriving category are due to weather
- From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
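The three rules above can be sketched as follows on made-up data. The column names and the month spellings are assumptions; the sketch fills missing Late Aircraft values with the column mean and uses `np.where` to pick the NAS share by month.

```python
import numpy as np
import pandas as pd

# Made-up rows; the column names and month spellings are assumptions.
df = pd.DataFrame(
    {
        "month": ["January", "June", "October"],
        "num_of_delays_weather": [10.0, 20.0, 30.0],
        "num_of_delays_late_aircraft": [100.0, np.nan, 200.0],
        "num_of_delays_nas": [100.0, 100.0, 100.0],
    }
)

# Replace missing Late Aircraft values with the mean of the observed values.
late = df["num_of_delays_late_aircraft"]
late_filled = late.fillna(late.mean())

# April through August use 40% of NAS delays; all other months use 65%.
april_to_august = ["April", "May", "June", "July", "August"]
nas_share = np.where(df["month"].isin(april_to_august), 0.40, 0.65)

df["weather_total"] = (
    df["num_of_delays_weather"]          # 100% of Weather delays
    + 0.30 * late_filled                 # 30% of Late-Arriving delays
    + nas_share * df["num_of_delays_nas"]
)

print(df.head())
```

In this toy data the June row receives the Late Aircraft mean of 150 and the 40% NAS share, while January and October use 65%.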
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
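Before plotting, the per-airport proportion needs to be computed. The sketch below shows one way to do that on made-up data; `weather_total` and the other column names are assumptions, and no charting library is called so that the choice of plotting tool stays open.

```python
import pandas as pd

# Made-up rows; weather_total is assumed to hold the combined weather delays.
df = pd.DataFrame(
    {
        "airport_code": ["AAA", "AAA", "BBB"],
        "num_of_flights_total": [1000, 1000, 500],
        "weather_total": [50.0, 70.0, 40.0],
    }
)

# Sum delays and flights per airport, then divide to get the proportion.
totals = df.groupby("airport_code")[["weather_total", "num_of_flights_total"]].sum()
weather_prop = (
    totals["weather_total"] / totals["num_of_flights_total"]
).sort_values(ascending=False)

print(weather_prop)
```

The resulting Series, indexed by airport, can be handed to whichever plotting library the report uses to draw the bars.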
Questions and Tasks (Stretch)
Here is an example Stretch question for this project. Your instructor may assign different Stretch questions. You must comment in Canvas when submitting your project if you completed any of the Stretch questions.
- Which delay is the worst delay? Create an analysis similar to the Weather Delay analysis above for Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a chart and a table. Describe your results.
Submission:
When you have completed the report, you will need to follow this process to submit your work:
- Have the Course Work Portfolio open in VS Code and open Projects/Project0.qmd
- Click the Preview button in the top right of the screen in VS Code
  - This will render the project, along with the entire course work portfolio, into HTML files for review
  - Confirm everything displays as you would like it to
  - How you see it will be how it is viewed for grading
  - If there is an error in any cell of the quarto files, the rendering will stop and you will need to fix the error before rendering again (if you get stuck, post your error in Slack)
- Once the report is confirmed, close the preview and open the GitHub Desktop application
  - Confirm you are in the correct repository in the top left corner of the screen
  - Confirm you are on the correct branch, Main, in the top left corner of the screen (Never change off the Main branch)
  - Type a summary of the changes in the Summary box
  - Click the blue Commit to main button in the bottom left corner
  - Click the blue Push origin button in the middle right of the screen
    - This will push all your changes in the project .qmd file to GitHub
    - The publish.yml file will kick off an automated process to render the project into HTML files
    - The HTML files will be published to GitHub pages in the gh-pages branch
    - The URL to the published project will be in the deployment section in GitHub
- In GitHub Desktop, click Open in GitHub to navigate to the repository
  - Click on the Actions tab and make sure there were no errors in the rendering process
  - Click on the deployment section of the main page of the repository to find the URL
  - Navigate to the URL and confirm it displays as you intended
  - Copy the URL and submit it in Canvas
Deliverables:
Deliverables are “the quantifiable goods or services that must be provided upon the completion of a project”. In this class the deliverable for each project is a GitHub published report created using Quarto files. This final section will be the same for each project.
Use this template to submit your Client Report. The template has two sections:
- A short elevator pitch that highlights key values or metrics from the results. Describe these key insights in a way that hooks the reader into wanting to read more about your work. The writing style should be more technical with some creative elements. Do not summarize what you did.
- Answers to the questions | tasks. Each should include a written description of your results, code cells with comments, charts and/or tables.
- A short summary of work must be submitted in the comments in Canvas when you submit the URL. Rate your own work on a scale of 1-5, with 1 being poor and 5 being excellent. Include a short description of why you rated your work the way you did.
Your report should be written in quarto markdown files and pushed to GitHub. Submit a URL of the rendered project in Canvas. (Do not submit the URL to the GitHub .qmd file.)
Feedback:
You will receive feedback and/or coaching notes in the form of a GitHub issue. You will need to address the feedback, re-render and resubmit the project, and mark the GitHub issue as closed.
Resubmission:
You will have one opportunity to resubmit the project after you have received feedback. The window for the resubmission will be open through the Wednesday following the due date of the project. Therefore, it is recommended that you turn in a draft of the project early on the Thursday before the due date to ensure you have time to address any feedback and resubmit the project. It is acceptable to turn in a draft that is only 80% complete. This will allow you to get feedback on the majority of the project and then focus on the final details. The closer to that Thursday you turn in the draft, the more feedback and coaching you will receive.