Let’s get our JSON files into Python.
Looking at cars
# Cars
import pandas as pd

url_cars = "https://github.com/byuidatascience/data4missing/raw/master/data-raw/mtcars_missing/mtcars_missing.json"
cars = pd.read_json(url_cars)
The flight project data
import json

import pandas as pd
import urllib3

# the long way to help us understand json files and
url_flights = 'https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json'
http = urllib3.PoolManager()
response = http.request('GET', url_flights)  # download the raw bytes
flights_json = json.loads(response.data.decode('utf-8'))  # parse into Python objects
flights = pd.json_normalize(flights_json)  # flatten the records into a DataFrame
Handling Missing Data
What is missing in pandas?
Be careful with .dropna()
Let’s use the pandas example with a little extra.
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman', np.nan],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip', np.nan],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT, pd.NaT],
                   "power": [np.nan, np.nan, np.nan, np.nan]})
When would we ever use df.dropna()?
Almost never! Why do you think it is a bad idea?
df.dropna()
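A minimal sketch of why the bare call is risky, reusing the example frame from above. With its default arguments, dropna() removes every row that has at least one missing value, and here the power column is missing everywhere, so no rows survive.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alfred", "Batman", "Catwoman", np.nan],
    "toy": [np.nan, "Batmobile", "Bullwhip", np.nan],
    "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT, pd.NaT],
    "power": [np.nan, np.nan, np.nan, np.nan],
})

# Default is how="any": a row is dropped if ANY of its columns is missing.
dropped = df.dropna()

rows_before = len(df)
rows_after = len(dropped)
print(rows_before, rows_after)
```

One column that is mostly empty is enough to wipe out the whole data set, which is why the call usually needs arguments.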
dropna() with arguments
What argument do we use to drop rows where all values are NA?
df.dropna(<ARGUMENTS>)
What if we want to drop NA rows based on one column?
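One possible answer to both questions, sketched on the same example frame: the how and subset parameters of dropna(). The choice of "toy" as the column to check is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alfred", "Batman", "Catwoman", np.nan],
    "toy": [np.nan, "Batmobile", "Bullwhip", np.nan],
    "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT, pd.NaT],
    "power": [np.nan, np.nan, np.nan, np.nan],
})

# how="all" drops a row only when EVERY column in it is missing.
all_na_dropped = df.dropna(how="all")

# subset=[...] limits the missing-value check to the listed columns.
toy_required = df.dropna(subset=["toy"])

print(all_na_dropped)
print(toy_required)
```

Only the last row is empty in every column, so how="all" keeps the other three, while subset=["toy"] keeps just the rows that have a toy.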
Replacing missing values
Figuring out fillna()
What if we want to replace all the NA values with the mean weight in the wt column of the cars data?
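A small sketch of one way to do this. To keep it runnable offline, cars_demo is a hypothetical stand-in for the cars data with made-up wt values; the same two lines should apply to the real frame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cars data (values are made up).
cars_demo = pd.DataFrame({"wt": [2.0, np.nan, 3.0, np.nan, 4.0]})

# Series.mean() skips NaN by default, so this is the mean of the known weights.
wt_mean = cars_demo["wt"].mean()

# Fill only the wt column and leave every other column untouched.
cars_filled = cars_demo.assign(wt=cars_demo["wt"].fillna(wt_mean))
print(cars_filled)
```

Filling a single column is deliberate: calling fillna(wt_mean) on the whole DataFrame would push the mean weight into every other column as well.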
What if we want to replace all the 999 with 4 in the cars data?
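A sketch using DataFrame.replace(). The frame below is a hypothetical stand-in, and which columns of the real cars data contain 999 is not assumed here.

```python
import pandas as pd

# Hypothetical stand-in for the cars data (values are made up).
cars_demo = pd.DataFrame({"cyl": [6, 999, 8], "gear": [4, 3, 999]})

# replace() swaps the value wherever it appears and returns a new DataFrame.
cars_fixed = cars_demo.replace(999, 4)
print(cars_fixed)
```

To limit the change to one column, call replace() on that column instead, for example cars_demo["cyl"].replace(999, 4).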
Handling the NaNs
How do we handle the non-NaN missing values?
Interpolating with interpolate()
What if we want to replace all the NA values with a linear interpolation?
s = pd.Series([0, 1, np.nan, 3])
s2 = pd.Series([0, 1, np.nan, 3, np.nan, 8, np.nan, 6])
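A short sketch of interpolate() on the two series above. The default method is linear and treats the values as equally spaced, so each gap is filled with the midpoint of its neighbors.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, np.nan, 3])
s2 = pd.Series([0, 1, np.nan, 3, np.nan, 8, np.nan, 6])

# method="linear" is the default; it ignores the index and assumes equal spacing.
s_filled = s.interpolate()
s2_filled = s2.interpolate()

print(s_filled.tolist())   # [0.0, 1.0, 2.0, 3.0]
print(s2_filled.tolist())  # [0.0, 1.0, 2.0, 3.0, 5.5, 8.0, 7.0, 6.0]
```

Interpolation only makes sense when the row order is meaningful, such as values sorted by time.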
The flights data
Handling the non-NaN missing values
Be careful to handle the missing Late Aircraft data correctly
- Let’s list what values are being used to represent missing data.
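One way to start that list is to count candidate placeholder values in each column. This is only a sketch: flights_demo is a hypothetical stand-in, and the candidates ("n/a", -999, and the empty string) are assumptions to check against the real flights data, not confirmed values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up).
flights_demo = pd.DataFrame({
    "month": ["January", "n/a", "March", "n/a"],
    "num_of_delays_late_aircraft": [120, -999, 95, -999],
    "num_of_delays_nas": [30, 41, np.nan, 12],
})

# Assumed candidates for "missing"; adjust after inspecting the real data.
candidates = ["n/a", -999, ""]

# For every column, count how often each candidate value appears.
sentinel_counts = {
    col: {val: int((flights_demo[col] == val).sum()) for val in candidates}
    for col in flights_demo.columns
}

# True NaN values are counted separately, since NaN never equals anything.
nan_counts = flights_demo.isna().sum()

print(sentinel_counts)
print(nan_counts)
```

Once the placeholders are known, replace() with np.nan turns them into values that pandas treats as missing.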
What columns do we need to use for question 4 (total number of flights delayed by weather)?
num_of_delays_late_aircraft
num_of_delays_nas
Handling the missing months
- How many rows have missing months?
flights.month.value_counts()
Can we figure out any patterns in the missingness?
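A sketch of how the count and a first pattern check might look. flights_demo is a hypothetical stand-in with made-up values, and it assumes the missing months have already been converted to real missing values (None/NaN) and that a year column exists.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up).
flights_demo = pd.DataFrame({
    "year": [2005, 2005, 2005, 2006, 2006, 2006],
    "month": ["January", None, "March", "January", "February", None],
})

# value_counts() hides missing values unless dropna=False is passed.
month_counts = flights_demo["month"].value_counts(dropna=False)

# Number of rows with a missing month.
n_missing = int(flights_demo["month"].isna().sum())

# Missing months per year, as a first look for a pattern.
missing_by_year = flights_demo["month"].isna().groupby(flights_demo["year"]).sum()

print(month_counts)
print(n_missing)
print(missing_by_year)
```

If the rows are in calendar order, looking at the months directly before and after a gap is another way to see what belongs there.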