Day 8: Using pandas to handle missingness.

Let’s get our JSON files into Python.

import json
import urllib3
import numpy as np
import pandas as pd

Looking at cars

# Cars
url_cars = "https://github.com/byuidatascience/data4missing/raw/master/data-raw/mtcars_missing/mtcars_missing.json"
cars = pd.read_json(url_cars)

The flight project data

# The long way, to help us understand JSON files
url_flights = 'https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json'
http = urllib3.PoolManager()
response = http.request('GET', url_flights)
flights_json = json.loads(response.data.decode('utf-8'))
flights = pd.json_normalize(flights_json)
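
To see what the long way is doing without hitting the network, here is a minimal sketch on a made-up JSON string. The `month` and `delays` fields are invented for illustration and are not the real flights columns. It compares the two-step route with handing the text straight to `pd.read_json`.

```python
import io
import json

import pandas as pd

# Made-up stand-in for a downloaded JSON payload (a list of records).
raw = '[{"month": "January", "delays": 10}, {"month": "n/a", "delays": 5}]'

# Long way: decode the text into Python objects, then flatten into a table.
records = json.loads(raw)              # a list of dicts
long_way = pd.json_normalize(records)  # one row per dict

# Short way: let pandas parse the JSON text directly.
short_way = pd.read_json(io.StringIO(raw))

print(type(records), type(records[0]))
print(long_way)
print(short_way)
```

The intermediate `records` list is the useful part of the long way: it lets us inspect the raw structure before pandas turns it into columns.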

Handling Missing Data

What is missing in pandas?
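
A small sketch of what pandas does and does not treat as missing. The list of candidate values is made up for illustration. `isna()` flags the built-in missing markers but not placeholder values such as an empty string or `-999`.

```python
import numpy as np
import pandas as pd

# Candidate "missing" values: the first four are real missing markers,
# the rest are ordinary values that people sometimes use as placeholders.
candidates = pd.Series(
    [np.nan, None, pd.NaT, pd.NA, "", "n/a", 0, -999], dtype="object"
)

flags = candidates.isna()
print(pd.DataFrame({"value": candidates, "isna": flags}))
```

Only `np.nan`, `None`, `pd.NaT`, and `pd.NA` come back as missing; the placeholders need to be handled separately.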

Be careful with .dropna()

Let’s use the pandas example with a little extra.

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman', np.nan],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip', np.nan],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT, pd.NaT],
                   "power": [np.nan, np.nan, np.nan, np.nan]})

When would we ever use df.dropna()?

Almost never! Why do you think it is a bad idea?

df.dropna()
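
A quick sketch of why the bare call is risky, using the same example frame. With no arguments, `dropna()` removes every row that has at least one missing value, and here the all-missing `power` column makes that every row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Alfred", "Batman", "Catwoman", np.nan],
                   "toy": [np.nan, "Batmobile", "Bullwhip", np.nan],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT, pd.NaT],
                   "power": [np.nan, np.nan, np.nan, np.nan]})

# Default behavior: drop a row if ANY of its values is missing.
cleaned = df.dropna()

print(len(df), "rows in,", len(cleaned), "rows out")  # 4 rows in, 0 rows out
```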

dropna() with arguments

What argument do we use to drop rows where all values are NA?

reference

df.dropna(<ARGUMENTS>)
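
One possible answer, sketched on the example frame: the `how` argument switches the rule from "any value missing" to "all values missing". The same argument works along columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Alfred", "Batman", "Catwoman", np.nan],
                   "toy": [np.nan, "Batmobile", "Bullwhip", np.nan],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT, pd.NaT],
                   "power": [np.nan, np.nan, np.nan, np.nan]})

# Drop only the rows where every value is missing (the last row here).
rows_kept = df.dropna(how="all")

# Drop only the columns where every value is missing (the power column here).
cols_kept = df.dropna(axis="columns", how="all")

print(rows_kept)
print(cols_kept)
```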

What if we want to drop NA rows based on one column?

reference
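
A sketch of dropping rows based on a single column, again on the example frame. The `subset` argument limits the missing-value check to the listed columns; choosing `toy` here is just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Alfred", "Batman", "Catwoman", np.nan],
                   "toy": [np.nan, "Batmobile", "Bullwhip", np.nan],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT, pd.NaT],
                   "power": [np.nan, np.nan, np.nan, np.nan]})

# Only look at the toy column when deciding which rows to drop.
has_toy = df.dropna(subset=["toy"])

print(has_toy)  # keeps Batman and Catwoman, even though other columns have NaN
```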

Replacing missing values

Figuring out fillna()

What if we want to replace all the NA values with the mean weight in the wt column of the cars data?

reference
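
A minimal sketch of mean imputation with `fillna()`. The small `cars_demo` frame is made up so the example runs offline; only the `wt` column name comes from the cars data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cars data.
cars_demo = pd.DataFrame({"car": ["a", "b", "c", "d"],
                          "wt": [2.0, np.nan, 4.0, np.nan]})

# mean() skips missing values by default, so this is the mean of 2.0 and 4.0.
mean_wt = cars_demo["wt"].mean()

# Fill only the wt column, leaving the other columns untouched.
cars_demo["wt"] = cars_demo["wt"].fillna(mean_wt)

print(cars_demo)
```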

What if we want to replace all the 999 with 4 in the cars data?

reference
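
A sketch of swapping one value for another with `replace()`. The `cars_demo` frame and the choice of columns are made up for illustration. The first call replaces 999 everywhere; the second limits the replacement to one column.

```python
import pandas as pd

# Hypothetical stand-in for the cars data, with 999 used as a placeholder.
cars_demo = pd.DataFrame({"gear": [4, 999, 3, 999],
                          "carb": [1, 2, 999, 4]})

# Replace 999 with 4 in every column.
everywhere = cars_demo.replace(999, 4)

# Replace 999 with 4 in the gear column only.
gear_only = cars_demo.replace({"gear": {999: 4}})

print(everywhere)
print(gear_only)
```

Limiting the replacement to one column is usually the safer choice, since 999 could be a legitimate value somewhere else.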

Handling the NaNs

How do we handle missing values that are not NaN?
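
One common approach, sketched here on made-up data, is to convert the placeholder values to real NaN first so that `isna()`, `fillna()`, and `dropna()` can see them. The placeholder values below (999, an empty string, "n/a") are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where missing values are encoded with placeholders.
demo = pd.DataFrame({"gear": [4, 999, 3],
                     "name": ["Mazda", "", "n/a"]})

# Map each column's placeholder values to NaN.
sentinels = {"gear": {999: np.nan},
             "name": {"": np.nan, "n/a": np.nan}}
cleaned = demo.replace(sentinels)

print(demo.isna().sum())     # pandas sees nothing missing before the conversion
print(cleaned.isna().sum())  # the placeholders now count as missing
```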

Interpolating with interpolate()

What if we want to replace all the NA values with a linear interpolation?

s = pd.Series([0, 1, np.nan, 3])
s2 = pd.Series([0, 1, np.nan, 3, np.nan, 8, np.nan, 6])

reference
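
A sketch of linear interpolation on the two series above. With the default `method="linear"`, each missing value is placed on the straight line between its neighbors, treating the values as equally spaced.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, np.nan, 3])
s2 = pd.Series([0, 1, np.nan, 3, np.nan, 8, np.nan, 6])

s_filled = s.interpolate()
s2_filled = s2.interpolate()

print(s_filled.tolist())   # [0.0, 1.0, 2.0, 3.0]
print(s2_filled.tolist())  # [0.0, 1.0, 2.0, 3.0, 5.5, 8.0, 7.0, 6.0]
```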

The flights data

Handling the non-NaN missing values

Be careful to handle the missing Late Aircraft data correctly.

  • Let’s list which values are being used to represent missing data.
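
A sketch of one way to build that list: count real NaN values and suspected placeholder values per column. The `flights_demo` frame and the `suspects` list are made up for illustration; check the real flights data to see which placeholders it actually uses.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few flights columns.
flights_demo = pd.DataFrame({
    "month": ["January", "n/a", "March"],
    "num_of_delays_late_aircraft": [120, -999, 85],
    "minutes_delayed_carrier": [1500.0, np.nan, 900.0],
})

# Assumed placeholder values to look for.
suspects = [-999, "n/a", ""]

report = {}
for col in flights_demo.columns:
    report[col] = {
        "nan": int(flights_demo[col].isna().sum()),
        "placeholder": int(flights_demo[col].isin(suspects).sum()),
    }

print(pd.DataFrame(report).T)
```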

What columns do we need to use for question 4 (total number of flights delayed by weather)?

  • num_of_delays_late_aircraft
  • num_of_delays_nas

Handling the missing months

  • How many rows have missing months?
flights.month.value_counts()
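
A sketch of turning the value counts into a single number, on made-up data. It assumes missing months are stored as the string "n/a"; use whatever label shows up in the real `value_counts()` output.

```python
import pandas as pd

# Hypothetical stand-in for the month column of the flights data.
flights_demo = pd.DataFrame(
    {"month": ["January", "n/a", "February", "n/a", "March"]}
)

# value_counts() lists every label, including the placeholder.
counts = flights_demo.month.value_counts()

# Count the rows whose month equals the assumed placeholder.
n_missing = int((flights_demo.month == "n/a").sum())

print(counts)
print("rows with missing month:", n_missing)
```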