The flight project data
import json
import urllib3
import pandas as pd
url_flights = 'https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json'
http = urllib3.PoolManager()
response = http.request('GET', url_flights)
# Decode the raw bytes, parse the JSON, and flatten it into a DataFrame
flights_json = json.loads(response.data.decode('utf-8'))
flights = pd.json_normalize(flights_json)
Be careful to handle the missing Late Aircraft data correctly.
What does question 5 mean?
Fix all of the varied NA types in the data to be consistent and save the file back out in the same format that was provided (this file shouldn’t have the missing values replaced with a value). Include one record example from your exported JSON file that has a missing value (No imputation in this file).
We will tackle saving our file next class.
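As a preview of that step, here is a minimal sketch of what question 5 is asking for. It uses a toy DataFrame rather than `flights`, and the missing-value markers it cleans (`'n/a'` and `-999`) are assumptions for illustration, so check the real data for the markers it actually contains. It also assumes the provided file is a list of records.

```python
import json

import numpy as np
import pandas as pd

# Toy stand-in for `flights`; the 'n/a' and -999 markers are assumed, not confirmed
toy = pd.DataFrame({
    "airport_code": ["ATL", "DEN", "SLC"],
    "month": ["January", "n/a", "March"],
    "num_of_delays_late_aircraft": [1109, -999, 640],
})

# Map each assumed marker to NaN, column by column, without imputing anything
clean = toy.replace({
    "month": {"n/a": np.nan},
    "num_of_delays_late_aircraft": {-999: np.nan},
})

# orient="records" writes a list of objects; NaN is exported as JSON null
records_json = clean.to_json(orient="records")
records = json.loads(records_json)

# One record that still has missing values
print(records[1])
```

Because every marker becomes `NaN` before export, the JSON file ends up with a single, consistent representation of "missing" (`null`).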
How do we reorder the Altair axis labels?
Fix the hours so they are in the right order.
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri'] * 4
days.sort()
df = pd.DataFrame({
    'hour': ["9am", "10am", "11am", "12pm"] * 5,
    'dayofweek': days,
    'cnt': [5, 18, 2, 3, 19, 1, 9, 0, 7, 10,
            12, 3, 1, 17, 6, 7, 10, 11, 3, 4]})
chart = alt.Chart(df, height=210).mark_rect().encode(
    x=alt.X(
        "hour",
        title="Hour"),
    y=alt.Y(
        "dayofweek",
        # Sort labels must match the data values exactly ('Thu', not 'Thurs')
        sort=["Mon", "Tue", "Wed", "Thu", "Fri"],
        title="Day of Week"),
    color=alt.Color(
        "cnt",
        scale=alt.Scale(
            range=['lightyellow', 'red']),
        legend=alt.Legend(title='Count')
    ),
    tooltip=[
        alt.Tooltip("cnt:Q", title="Count")
    ]
)
What columns do we need to use for question 4 (total number of flights delayed by weather)?
- Create a barplot showing the proportion of all flights that are delayed by weather at each airport. What do you learn from this graph (Careful to handle the missing Late Aircraft data correctly)?
num_of_delays_nas, num_of_delays_weather, num_of_delays_late_aircraft
How do we use np.where()?
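`np.where(condition, value_if_true, value_if_false)` picks element-wise between two options. A small sketch, assuming a sentinel of `-999` marks a missing count (an assumption for illustration; the series `late` is made up):

```python
import numpy as np
import pandas as pd

late = pd.Series([1109, -999, 640], name="num_of_delays_late_aircraft")

# Where the condition is True take NaN, otherwise keep the original value
late_fixed = pd.Series(np.where(late == -999, np.nan, late), name=late.name)

print(late_fixed.isna().tolist())
```

`np.where` returns a NumPy array, so it is wrapped in `pd.Series` here to keep the column name.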
The big assign
Work with a partner to get the data right for question 3.
# Skeleton only: replace each ... placeholder with your own expression
weather = flights.assign(
    severe = lambda x: ...,      # need to fix missing
    nodla_nona = lambda x: ...,
    mild_late = lambda x: ...,   # need to fix missing
    mild = lambda x: np.where(
        ...,   # use isin
        ...,   # fix missing * proportion
        ...    # fix missing * proportion
    ),
    weather = lambda x: ...,          # add up stuff
    percent_weather = lambda x: ...   # calculate percent weather over total
).filter(['airport_code', 'month', 'severe', 'mild', 'mild_late',
          'weather', 'num_of_delays_total', 'percent_weather'])
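The pattern the skeleton relies on is that each lambda in `assign` can use columns created earlier in the same call. A hedged sketch on made-up data: the `-999` marker, the mean fill, the `summer` month list, and the proportions `P_LATE`, `P_NAS_SUMMER`, and `P_NAS_OTHER` are all placeholders assumed for illustration, so substitute the rules from the project description.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for `flights`
toy = pd.DataFrame({
    "airport_code": ["ATL", "ATL", "SLC", "SLC"],
    "month": ["January", "June", "January", "June"],
    "num_of_delays_weather": [10, 20, 5, 8],
    "num_of_delays_late_aircraft": [100, -999, 50, 30],
    "num_of_delays_nas": [200, 100, 40, 60],
    "num_of_delays_total": [500, 400, 150, 160],
})

# Hypothetical values, not taken from these notes
P_LATE, P_NAS_SUMMER, P_NAS_OTHER = 0.30, 0.40, 0.65
summer = ["April", "May", "June", "July", "August"]

weather = toy.assign(
    severe=lambda x: x["num_of_delays_weather"],
    # Turn the assumed -999 marker into NaN so it cannot distort the mean
    nodla_nona=lambda x: x["num_of_delays_late_aircraft"].replace(-999, np.nan),
    # Fill the gap with the column mean, then apply the proportion
    mild_late=lambda x: x["nodla_nona"].fillna(x["nodla_nona"].mean()) * P_LATE,
    mild=lambda x: np.where(
        x["month"].isin(summer),
        x["num_of_delays_nas"] * P_NAS_SUMMER,
        x["num_of_delays_nas"] * P_NAS_OTHER,
    ),
    weather=lambda x: x["severe"] + x["mild_late"] + x["mild"],
    percent_weather=lambda x: x["weather"] / x["num_of_delays_total"],
).filter(["airport_code", "month", "severe", "mild", "mild_late",
          "weather", "num_of_delays_total", "percent_weather"])

print(weather)
```

Keeping `nodla_nona` as its own step makes it easy to check that the missing values were handled before any proportion is applied; `filter` then drops it from the final table.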
Visualizing the flight data
- What questions from the project can we answer with a chart?
- What charts could we create that support the answer to grand question 2?
- What types of charts are available to use with this data?
flights.info()
RangeIndex: 924 entries, 0 to 923
Data columns (total 17 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   airport_code                   924 non-null    object
 1   airport_name                   924 non-null    object
 2   month                          924 non-null    object
 3   year                           901 non-null    float64
 4   num_of_flights_total           924 non-null    int64
 5   num_of_delays_carrier          924 non-null    object
 6   num_of_delays_late_aircraft    924 non-null    int64
 7   num_of_delays_nas              924 non-null    int64
 8   num_of_delays_security         924 non-null    int64
 9   num_of_delays_weather          924 non-null    int64
 10  num_of_delays_total            924 non-null    int64
 11  minutes_delayed_carrier        872 non-null    float64
 12  minutes_delayed_late_aircraft  924 non-null    int64
 13  minutes_delayed_nas            893 non-null    float64
 14  minutes_delayed_security       924 non-null    int64
 15  minutes_delayed_weather        924 non-null    int64
 16  minutes_delayed_total          924 non-null    int64
dtypes: float64(3), int64(10), object(4)
memory usage: 122.8+ KB
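Two things stand out in this output: `num_of_delays_carrier` is stored as `object` even though it is a count, and three columns have fewer than 924 non-null values. A short sketch of how one might investigate both. The toy values, including the `'1500+'` string, are assumptions standing in for whatever the real columns contain.

```python
import numpy as np
import pandas as pd

# Made-up rows; the '1500+' entry is an assumed example of a non-numeric count
toy = pd.DataFrame({
    "year": [2005.0, np.nan, 2006.0],
    "num_of_delays_carrier": ["1500+", "928", "1058"],
    "minutes_delayed_nas": [31447.0, np.nan, 25770.0],
})

# Count missing values per column
missing_counts = toy.isna().sum()

# Strings that cannot be parsed as numbers become NaN instead of raising
carrier_numeric = pd.to_numeric(toy["num_of_delays_carrier"], errors="coerce")

print(missing_counts)
print(carrier_numeric)
```

Note that a column reported as fully non-null can still hide missing data if a placeholder value was used in place of `NaN`.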