Day 3: Missing Data

Welcome to class!

Gratitude Journal

Announcements


Questions 1 and 2

What issues are we still running into?


How to work with missing data

What counts as missing data?
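As a rough sketch of the question above: pandas only treats NaN, None, and NaT as missing, so placeholder values such as an empty string or -999 slip through unless they are converted first. The column contents and the list of placeholders below are made up for illustration and are not taken from the class data.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing true missing markers with "disguised" ones.
raw = pd.Series([np.nan, None, "", "n/a", -999, "Batmobile"], dtype="object")

# pandas flags only NaN and None here; "", "n/a", and -999 look like real data.
flagged = raw.isna()
print(flagged.tolist())

# Assumed placeholder list: adjust it to whatever your data actually uses.
placeholders = ["", "n/a", -999]

# mask() swaps the matching entries for NaN so they are counted as missing.
cleaned = raw.mask(raw.isin(placeholders))
print(cleaned.isna().sum())
```

Checking for placeholders like these before counting missing values avoids undercounting.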


How to identify missing data

  • df.isnull().sum()
  • df.describe()
  • df.column.value_counts(dropna=False)
  • pd.crosstab()
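The four calls above can be tried on a small table. This is a minimal sketch on invented data (the column names airport, month, and delays are hypothetical), meant only to show what each call reports.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "airport": ["ATL", "ATL", "SLC", "SLC", "SLC"],
    "month": ["Jan", None, "Jan", "Feb", "Feb"],
    "delays": [10.0, np.nan, 7.0, np.nan, 3.0],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# The "count" row of describe() is the number of non-missing numeric values.
summary = df.describe()
print(summary)

# dropna=False keeps the missing category in the tally.
month_counts = df.month.value_counts(dropna=False)
print(month_counts)

# Cross-tabulate a column against the missing flag of another column
# to see where the gaps are concentrated.
missing_by_airport = pd.crosstab(df["airport"], df["delays"].isnull())
print(missing_by_airport)
```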

Option 1: Remove missing values

Be careful with .dropna(), and make sure you know what it is doing to your data!

Let’s use the pandas example:

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT]})
df.dropna()

Q: When is it a good idea to call .dropna() with no arguments?
A: Almost never! Why do you think it is a bad idea?
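To see what .dropna() does to the example above, it helps to compare row counts. The sketch below reuses the same DataFrame and contrasts the default call with the more targeted subset= and how= arguments; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

# Default: drop every row that has at least one missing value.
dropped_any = df.dropna()

# Drop a row only when the "toy" column is missing.
dropped_toy = df.dropna(subset=["toy"])

# Drop a row only when every value in it is missing.
dropped_all = df.dropna(how="all")

print(len(df), len(dropped_any), len(dropped_toy), len(dropped_all))
```

With the default call only the Batman row survives, so two of the three records are lost even though each still had a name.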

Option 2: Replacing missing values

Again, let’s use the pandas example:

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))
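A few ways to fill the gaps in the example above, as a sketch. Which replacement value is appropriate depends on the data; the constants used here are arbitrary and only illustrate the calls.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))

# Replace every missing value with the same constant.
filled_zero = df.fillna(0)

# Use a different (arbitrary) replacement value for each column.
filled_per_col = df.fillna(value={"A": 0, "B": 1, "C": 2, "D": 3})

# Replace missing values with each column's mean.
# Column C is entirely missing, so its mean is NaN and it stays missing.
filled_mean = df.fillna(df.mean())

print(filled_zero)
print(filled_per_col)
print(filled_mean)
```

Note that a mean fill cannot rescue a column with no observed values, which is a good reason to inspect the result after filling.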

Question 3

What columns do we need to use for question 3 (total number of flights delayed by weather)?

  • num_of_delays_weather
  • num_of_delays_late_aircraft
  • num_of_delays_nas

weather = flights.assign(
    severe = #????,
    mild_late = #????,
    mild_nas = np.where(#????),
    total_weather = # add up severe and mild,
).filter(['airport_code','month','severe','mild_late','mild_nas',
    'total_weather', 'num_of_delays_total'])
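The template above leaves the blanks for you to fill in. As a hint on the mechanics only, here is a sketch of how np.where can be used inside .assign() on a tiny stand-in table. Everything in it is an assumption made for illustration: the flights_demo data, the SUMMER_SHARE and OTHER_SHARE fractions, the summer_months list, and the choice to fill missing values with the mean are all hypothetical and are not the rules for question 3.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the flights data; the values are invented.
flights_demo = pd.DataFrame({
    "airport_code": ["ATL", "ATL", "SLC"],
    "month": ["January", "June", "June"],
    "num_of_delays_nas": [100, 200, np.nan],
})

# Hypothetical fractions and month list, chosen only to show the pattern.
SUMMER_SHARE = 0.5
OTHER_SHARE = 0.25
summer_months = ["June", "July", "August"]

demo = flights_demo.assign(
    # Fill the missing count first (mean fill is just one possible choice).
    nas_filled=lambda d: d.num_of_delays_nas.fillna(d.num_of_delays_nas.mean()),
    # np.where(condition, value_if_true, value_if_false), applied row by row.
    nas_share=lambda d: np.where(d.month.isin(summer_months),
                                 d.nas_filled * SUMMER_SHARE,
                                 d.nas_filled * OTHER_SHARE),
)
print(demo)
```

Using lambda inside .assign() lets a later column (nas_share) refer to one created earlier in the same call (nas_filled).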

Other resources for question 3