Background
Delayed flights are not something most people look forward to. In the best-case scenario you may only wait a few extra minutes for the plane to be cleaned. However, those few minutes can stretch into hours if a mechanical issue is discovered or a storm develops. Arriving hours late may result in you missing a connecting flight, a job interview, or your best friend’s wedding.
In 2003 the Bureau of Transportation Statistics (BTS) began collecting data on the causes of delayed flights. The categories they use are Air Carrier, National Aviation System, Weather, Late-Arriving Aircraft, and Security. You can visit the BTS website to read definitions of these categories.
The JSON file for this project contains information on delays at 7 airports over 10 years. Your task is to clean the data, search for insights about flight delays, and communicate your results using the provided template. If you have completed the checkpoints for Unit 5, then you are ready to answer the Grand Questions listed below. Refer to the readings for additional help.
Data
Download: JSON File
Information: Data Description
Readings
First Week:
Second Week:
Optional References
- isin method
- where method
- np.where method
- replace method
- An introduction to JSON
- The key word in ‘Data Science’ is not Data…
- How to Handle Missing Data (May need to open in incognito mode to read.)
Grand Questions
For Project 2, the answer to each question should include a written response and a chart or table.
Which airport has the worst delays? Discuss how you chose to define “worst”. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
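One way to build the summary table for this question is sketched below. The column names (`airport_code`, `num_of_flights_total`, `num_of_delays_total`, `minutes_delayed_total`) and the toy values are assumptions for illustration only; check the Data Description for the names actually used in the JSON file. The sketch also assumes "average delay time" means total delay minutes divided by the number of delayed flights, which is one reasonable definition among several.

```python
import pandas as pd

# Hypothetical column names and made-up values; the real file may differ.
flights = pd.DataFrame({
    "airport_code": ["ATL", "ATL", "SLC", "SLC"],
    "num_of_flights_total": [1000, 1200, 400, 600],
    "num_of_delays_total": [200, 300, 50, 70],
    "minutes_delayed_total": [12000, 18000, 3000, 5640],
})

summary = (
    flights.groupby("airport_code")
    .agg(
        total_flights=("num_of_flights_total", "sum"),
        total_delays=("num_of_delays_total", "sum"),
        total_delay_minutes=("minutes_delayed_total", "sum"),
    )
    .assign(
        # Share of all flights that were delayed.
        prop_delayed=lambda d: d["total_delays"] / d["total_flights"],
        # Average delay per delayed flight, converted from minutes to hours.
        avg_delay_hours=lambda d: d["total_delay_minutes"] / d["total_delays"] / 60,
    )
    .sort_values("prop_delayed", ascending=False)
)
print(summary)
```

Sorting by `prop_delayed` is only one possible reading of "worst"; sorting by `avg_delay_hours` instead may rank the airports differently, which is worth discussing in the written response.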
What is the best month to fly if you want to avoid delays of any length? Discuss your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
- 100% of delayed flights in the Weather category are due to weather.
- 30% of delayed flights in the Late-Arriving category are due to weather.
- From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%.
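The three rules above can be sketched with `fillna`, `isin`, and `np.where`. This is a minimal illustration, not a reference solution: the column names and values are hypothetical, months are assumed to be stored as full English names, and missing values are assumed to have already been converted to NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and made-up values.
flights = pd.DataFrame({
    "month": ["January", "April", "August", "December"],
    "num_of_delays_weather": [10, 20, 30, 40],
    "num_of_delays_late_aircraft": [100.0, np.nan, 200.0, 300.0],
    "num_of_delays_nas": [50, 60, 70, 80],
})

# Replace missing Late Aircraft counts with the column mean.
late = flights["num_of_delays_late_aircraft"]
flights["num_of_delays_late_aircraft"] = late.fillna(late.mean())

# NAS share is 40% from April to August and 65% in every other month.
april_to_august = ["April", "May", "June", "July", "August"]
nas_share = np.where(flights["month"].isin(april_to_august), 0.40, 0.65)

flights["weather_delays"] = (
    1.00 * flights["num_of_delays_weather"]
    + 0.30 * flights["num_of_delays_late_aircraft"]
    + nas_share * flights["num_of_delays_nas"]
)
print(flights.head(5))
```

Computing the mean before filling matters: `mean()` skips NaN by default, so the fill value is based only on the observed counts.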
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Discuss what you learn from this graph.
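The values behind such a barplot could be computed as in the sketch below. The column names (`airport_code`, `num_of_flights_total`, `weather_delays`) and the numbers are assumptions made for illustration; the plotting step itself is left to whichever charting library the course uses.

```python
import pandas as pd

# Hypothetical column names and made-up values.
by_month = pd.DataFrame({
    "airport_code": ["ATL", "ATL", "SLC", "SLC"],
    "num_of_flights_total": [1000, 1000, 500, 500],
    "weather_delays": [80.0, 120.0, 20.0, 30.0],
})

# Sum first, then divide, so busy months carry proportionally more weight.
totals = by_month.groupby("airport_code")[
    ["weather_delays", "num_of_flights_total"]
].sum()
weather_prop = (
    totals["weather_delays"] / totals["num_of_flights_total"]
).sort_values(ascending=False)
print(weather_prop)
```

Dividing the summed delays by the summed flights gives the proportion of all flights delayed by weather; averaging the monthly proportions instead would treat a quiet month the same as a busy one.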
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, formatted as a JSON file. Your example should have at least one missing value.
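A minimal sketch of this cleanup using the `replace` method follows. The missing-value markers shown here (`""`, `"n/a"`, `-999`) and the column names are hypothetical; inspect the actual file to find which markers it uses.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical columns and missing-value markers, for illustration only.
raw = pd.DataFrame({
    "airport_name": ["Atlanta", "", "Salt Lake City"],
    "month": ["January", "February", "n/a"],
    "num_of_delays_late_aircraft": [120, -999, 45],
})

# Map every marker to a single NaN representation.
clean = raw.replace(["", "n/a", -999], np.nan)

# Serialize one row that contains missing values.
record = json.dumps(clean.iloc[1].to_dict())
print(record)
```

`json.dumps` writes missing floats as `NaN`, which matches the display asked for here but is not strict JSON. pandas' own `to_json` writes `null` instead, so the choice between the two depends on how the record needs to be read back.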
Deliverables
Use the provided template to submit your case study. The template has three sections:
- A short summary that describes the results of the project and the tools you used. (Think “elevator pitch”.)
- Answers to the grand questions. Each answer should include a written description of your results, and may also include charts or tables.
- An appendix that provides your commented code. Your code comments should justify any decisions you had to make while programming.