Welcome to class!
Gratitude Journal
Announcements
Questions 1 and 2
What issues are we still running into?
How to work with missing data
What counts as missing data?
How to identify missing data
df.isnull().sum()
df.describe()
df.column.value_counts(dropna=False)
pd.crosstab()
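The four checks above can be tried together on a small table. This is only a sketch: the DataFrame, its column names, and the `-999` placeholder value are made up for illustration and are not the class dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the class dataset)
df = pd.DataFrame({
    "month": ["January", "February", np.nan, "March"],
    "delays": [10, np.nan, 7, -999],   # -999 stands in for a coded "missing" value
    "airport": ["ATL", "ATL", "SLC", np.nan],
})

# Count of NaN values in each column
missing_per_column = df.isnull().sum()

# Summary statistics: a suspicious min or max can reveal coded missing values
summary = df.describe()

# Frequency of each value, keeping NaN as its own category
month_counts = df["month"].value_counts(dropna=False)

# Rows with a NaN key are left out of a crosstab, so label them first
cross = pd.crosstab(df["airport"].fillna("MISSING"),
                    df["month"].fillna("MISSING"))

print(missing_per_column)
print(summary)
print(month_counts)
print(cross)
```

Note that `isnull()` only counts true NaN values. A placeholder such as `-999` will only show up in `describe()` or `value_counts()`.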
Option 1: Remove missing values
Be careful with .dropna(), and make sure you know what it is doing to your data!
Let’s use the pandas example:
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT]})
A: Almost never! Why do you think it is a bad idea?
df.dropna()
A:
df.dropna(how='all')
A:
df.dropna(subset=['toy'])
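To see what each call does to the data, the three variants can be run side by side on the pandas example and the surviving row counts compared. A minimal sketch; the variable names `any_missing`, `all_missing`, and `toy_missing` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

# Default (how='any'): drop every row that has at least one missing value
any_missing = df.dropna()

# how='all': drop a row only when every value in it is missing
all_missing = df.dropna(how='all')

# subset: only look at the 'toy' column when deciding what to drop
toy_missing = df.dropna(subset=['toy'])

print(len(df), len(any_missing), len(all_missing), len(toy_missing))
# prints: 3 1 3 2
```

The default call keeps only the Batman row, because every other row has at least one missing value somewhere.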
Option 2: Replace missing values
Again, let’s use the pandas example:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))
A:
fillna()
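Two common ways to call `fillna()` on the example above are sketched here: one value for the whole table, or a dictionary with a different value per column. Filling column B with its mean is an assumption chosen for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))

# Replace every missing value with the same constant
zero_filled = df.fillna(0)

# Replace per column: 0 for A, the column mean for B; C is left untouched
per_column = df.fillna({"A": 0, "B": df["B"].mean()})

print(zero_filled)
print(per_column)
```

Column C stays entirely NaN in `per_column` because the dictionary does not mention it.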
Question 3
Which columns do we need for Question 3 (total number of flights delayed by weather)?
num_of_delays_weather
num_of_delays_late_aircraft
num_of_delays_nas
weather = flights.assign(
    severe = #????,
    mild_late = #????,
    mild_nas = np.where(#????),
    total_weather = # add up severe and mild,
).filter(['airport_code','month','severe','mild_late','mild_nas',
          'total_weather', 'num_of_delays_total'])
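As a reminder of how the pieces of that template fit together, here is the same `assign` / `np.where` / `filter` pattern on a toy table. Everything in it is hypothetical: the `toy` DataFrame, the `n` and `adjusted` columns, and the 0.5 and 0.25 multipliers are made up and are not the Question 3 answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the flights dataset)
toy = pd.DataFrame({
    "month": ["January", "June", "July"],
    "n": [100, 200, 300],
})

result = toy.assign(
    # np.where(condition, value_if_true, value_if_false), applied row by row
    adjusted=lambda d: np.where(d["month"].isin(["June", "July"]),
                                d["n"] * 0.5,
                                d["n"] * 0.25),
).filter(["month", "adjusted"])  # keep only the listed columns

print(result)
```

Passing a `lambda` to `assign` lets the new column refer to the DataFrame being built. `filter` with a list keeps the named columns in that order.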