Day 9: Real world replacement.

The flight project data

import json

import pandas as pd
import urllib3

url_flights = 'https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json'
http = urllib3.PoolManager()
response = http.request('GET', url_flights)
flights_json = json.loads(response.data.decode('utf-8'))
flights = pd.json_normalize(flights_json)

Careful to handle the missing Late Aircraft data correctly

What does question 5 mean?

Fix all of the varied NA types in the data to be consistent and save the file back out in the same format that was provided (this file shouldn’t have the missing values replaced with a value). Include one record example from your exported JSON file that has a missing value (No imputation in this file).
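
One way to read this question: every flavor of "missing" should end up as the same value (NaN in pandas, which is written as null in JSON). Here is a minimal sketch on a made-up three-row frame. The markers -999, "n/a", and the empty string are assumptions about what the inconsistent codes look like, and `toy` is a hypothetical stand-in for `flights`.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of `flights`.
toy = pd.DataFrame({
    "airport_code": ["ATL", "DEN", "SLC"],
    "airport_name": ["Atlanta", "", "Salt Lake City"],
    "month": ["January", "n/a", "March"],
    "num_of_delays_late_aircraft": [1109, -999, 430],
})

# Assumed missing-value markers; check the real data before reusing this list.
na_markers = [-999, "n/a", ""]
clean = toy.replace(na_markers, np.nan)

# NaN is written as null in JSON, so every missing value looks the same on export.
records = json.loads(clean.to_json(orient="records"))
print(records[1])
```

Nothing is imputed here: the missing entries stay missing, they are just spelled the same way.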

We will tackle saving our file next class.

How do we reorder the Altair axis labels?

Fix the hours so they are in the right order.


days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri'] * 4
days.sort()

df=pd.DataFrame({
    'hour': ["9am", "10am", "11am", "12pm"] * 5,
    'dayofweek': days,
    'cnt': [5, 18, 2, 3, 19, 1, 9, 0, 7, 10,
        12, 3, 1, 17, 6, 7, 10, 11, 3, 4]}) 

chart = alt.Chart(df, height=210).mark_rect().encode(
    x=alt.X(
        "hour",
        title="Hour"), 
    y=alt.Y(
        "dayofweek",
        sort=["Mon", "Tue", "Wed", "Thu", "Fri"],
        title="Day of Week"),
    color=alt.Color(
        "cnt",
        scale=alt.Scale(
            range=['lightyellow','red']),
        legend=alt.Legend(title='Count')
    ),
    tooltip=[
        alt.Tooltip("cnt:Q", title="Count")
    ]
)

What columns do we need to use for question 4 (total number of flights delayed by weather)?

  1. Create a barplot showing the proportion of all flights that are delayed by weather at each airport. What do you learn from this graph (Careful to handle the missing Late Aircraft data correctly)?

num_of_delays_nas, num_of_delays_weather, num_of_delays_late_aircraft,
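
For the proportion at each airport, one reasonable approach is to sum the weather delays and the total flights per airport first and divide afterwards. The sketch below assumes a `weather` column has already been computed for each row; the frame `monthly` and its values are made up.

```python
import pandas as pd

# Hypothetical monthly rows; `weather` is an already-computed weather delay estimate.
monthly = pd.DataFrame({
    "airport_code": ["ATL", "ATL", "SLC", "SLC"],
    "weather": [170.0, 130.0, 40.0, 80.0],
    "num_of_flights_total": [3000, 3000, 1000, 1000],
})

by_airport = (
    monthly.groupby("airport_code", as_index=False)
    .agg(weather=("weather", "sum"), flights=("num_of_flights_total", "sum"))
    # Proportion of all flights at the airport that were delayed by weather.
    .assign(prop_weather=lambda x: x["weather"] / x["flights"])
)
print(by_airport)
```

Summing before dividing weights each month by its number of flights, which is not the same as averaging the monthly proportions.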

How do we use np.where()?
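
`np.where(condition, value_if_true, value_if_false)` works element by element: wherever the condition is True it takes the second argument, otherwise the third. Here is a small sketch on made-up values; treating -999 as the missing marker is an assumption.

```python
import numpy as np
import pandas as pd

month = pd.Series(["January", "June", "August", "November"])
late = pd.Series([100, -999, 50, 80])

# Label each row by season using isin as the condition.
summer = ["April", "May", "June", "July", "August"]
season = np.where(month.isin(summer), "summer", "other")

# Swap the assumed missing marker for NaN and keep every other value.
late_clean = np.where(late == -999, np.nan, late)

print(season)
print(late_clean)
```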

The big assign

Work with a partner to get the data right for question 3.

# Skeleton only: replace each ... with your own expression.
weather = flights.assign(
    severe = lambda x: ...,  # need to fix missing
    nodla_nona = lambda x: ...,
    mild_late = lambda x: ...,  # need to fix missing
    mild = np.where(
        ...,  # use isin
        ...,  # fix missing * proportion
        ...,  # fix missing * proportion
    ),
    weather = ...,  # add up stuff
    percent_weather = ...,  # calculate percent weather over total
).filter(['airport_code', 'month', 'severe', 'mild', 'mild_late',
    'weather', 'num_of_delays_total', 'percent_weather'])
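
To check your work against something concrete, here is one possible way the pieces could fit together on a made-up three-row frame. Everything specific is an assumption for illustration: the -999 missing marker, filling with the column mean, the three proportions, and the list of months. Use the values from the project description rather than these.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `flights`.
toy = pd.DataFrame({
    "airport_code": ["ATL", "ATL", "SLC"],
    "month": ["January", "June", "March"],
    "num_of_delays_weather": [10, 20, 5],
    "num_of_delays_late_aircraft": [100, -999, 50],  # -999 assumed to mean missing
    "num_of_delays_nas": [200, 300, 100],
    "num_of_delays_total": [1000, 2000, 500],
})

# Assumed rule parameters; take the real ones from the project description.
LATE_SHARE = 0.30
NAS_SHARE_SUMMER = 0.40
NAS_SHARE_OTHER = 0.65
SUMMER = ["April", "May", "June", "July", "August"]

weather = toy.assign(
    severe=lambda x: x["num_of_delays_weather"],
    # Turn the missing marker into NaN so it cannot distort the arithmetic.
    nodla_nona=lambda x: x["num_of_delays_late_aircraft"].replace(-999, np.nan),
    # Fill the gap with the column mean, then take the late-aircraft share.
    mild_late=lambda x: x["nodla_nona"].fillna(x["nodla_nona"].mean()) * LATE_SHARE,
    # The NAS share depends on the month.
    mild=lambda x: np.where(
        x["month"].isin(SUMMER),
        x["num_of_delays_nas"] * NAS_SHARE_SUMMER,
        x["num_of_delays_nas"] * NAS_SHARE_OTHER,
    ),
    weather=lambda x: x["severe"] + x["mild_late"] + x["mild"],
    percent_weather=lambda x: x["weather"] / x["num_of_delays_total"],
).filter(["airport_code", "month", "severe", "mild", "mild_late",
          "weather", "num_of_delays_total", "percent_weather"])
print(weather)
```

Each lambda receives the frame as it stands after the earlier keyword arguments, which is why `mild_late` can use `nodla_nona` and `weather` can add up the three pieces.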

Visualizing the flight data

  • What questions from the project can we answer with a chart?
  • What charts could we create that support the answer to grand question 2?
  • What types of charts are available to use with this data?

flights.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 924 entries, 0 to 923
Data columns (total 17 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   airport_code                   924 non-null    object 
 1   airport_name                   924 non-null    object 
 2   month                          924 non-null    object 
 3   year                           901 non-null    float64
 4   num_of_flights_total           924 non-null    int64  
 5   num_of_delays_carrier          924 non-null    object 
 6   num_of_delays_late_aircraft    924 non-null    int64  
 7   num_of_delays_nas              924 non-null    int64  
 8   num_of_delays_security         924 non-null    int64  
 9   num_of_delays_weather          924 non-null    int64  
 10  num_of_delays_total            924 non-null    int64  
 11  minutes_delayed_carrier        872 non-null    float64
 12  minutes_delayed_late_aircraft  924 non-null    int64  
 13  minutes_delayed_nas            893 non-null    float64
 14  minutes_delayed_security       924 non-null    int64  
 15  minutes_delayed_weather        924 non-null    int64  
 16  minutes_delayed_total          924 non-null    int64  
dtypes: float64(3), int64(10), object(4)
memory usage: 122.8+ KB
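
The listing shows `num_of_delays_carrier` stored as `object` even though it holds counts, which usually means at least one entry is not a plain number. Non-null counts only catch true nulls, so coded or text values slip through. Here is a sketch of one way to find such entries with `pd.to_numeric`; the sample values are made up for illustration.

```python
import pandas as pd

# Made-up values imitating a count column that was read in as text.
carrier = pd.Series(["1500+", "416", "1150", "n/a"])

# errors="coerce" turns anything that is not a number into NaN.
as_number = pd.to_numeric(carrier, errors="coerce")

# The entries that failed to convert are the ones to inspect by hand.
bad = carrier[as_number.isna()]
print(bad)
```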