Welcome to class!
Gratitude Journal
Announcements
Question 5
Let’s do an example of Question 5 using the mtcars data.
Load packages and data
#%%
import pandas as pd
import numpy as np
import json
url_cars = "https://github.com/byuidatascience/data4missing/raw/master/data-raw/mtcars_missing/mtcars_missing.json"
cars = pd.read_json(url_cars)
Find all the missing values
#%%
# method 1: find "official" null values
# hp, wt, and vs
cars.isnull().sum()
#%%
# method 2: just look at the data
# car, hp, wt, vs, gear
cars.head(10)
#%%
# method 3: look at summaries
# the values in 'gear' look funny
cars.describe()
#%%
# method 4: count up categories
# looks like 4 rows are blank
cars.car.value_counts()
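The four methods above can also be combined into one table. The sketch below is only an illustration: it uses a small made-up DataFrame named toy instead of the real cars data, and the list of suspected placeholder values (suspects) is an assumption you would adjust after looking at your own data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cars data (values are made up for illustration).
toy = pd.DataFrame({
    "car": ["Mazda RX4", "", "Datsun 710"],
    "hp": [110.0, np.nan, 93.0],
    "gear": [4, 999, 4],
})

# Hypothetical list of values we suspect are being used to mean "missing".
suspects = [999, ""]

# Count official nulls and suspected placeholders, column by column.
summary = pd.DataFrame({
    "null": toy.isnull().sum(),
    "placeholder": toy.isin(suspects).sum(),
})
print(summary)
```

A column with a nonzero count in either column of the summary still needs attention before the missing values are consistent.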
Reformat the missing values
Remember, you need to reformat your missing values to make them consistent!
Reading the examples in the replace() documentation might give you some ideas.
#%%
# There are a lot of functions
# we could use to give the missing values
# a consistent format.
# `replace()` is one of the easiest
# let's change everything to np.nan
cars_new = cars.replace(999, np.nan).replace("", np.nan)
# did we get them all?
cars_new.isnull().sum()
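As a possible alternative to chaining two calls, replace() also accepts a list of values to swap out at once. This is a minimal sketch on a made-up DataFrame named toy (not the real cars data), assuming 999 and the empty string are the only placeholders.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cars data (values are made up for illustration).
toy = pd.DataFrame({
    "car": ["Mazda RX4", "", "Datsun 710"],
    "gear": [4, 999, 4],
})

# Replace every suspected placeholder with np.nan in a single call.
toy_new = toy.replace([999, ""], np.nan)

# Each column should now report its placeholder as an official null.
print(toy_new.isnull().sum())
```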
Saving JSON files from a pandas dataframe
You can save a DataFrame as a JSON file like this:
#%%
# save the new data as a json
cars_new.to_json("my_cars_data.json")
The df.to_json() documentation shows us how to change the way the JSON file is organized (by row, by column, and so on) using the orient argument.
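Here is a small sketch of saving a file organized by row. The DataFrame toy and the file name toy_cars.json are made up for illustration, and the file is written to a temporary folder so nothing is left behind. Reading the file back with the json module shows that a missing value is stored as null (None in Python).

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-in for the cleaned cars data (values are made up).
toy = pd.DataFrame({
    "car": ["Mazda RX4", "Datsun 710"],
    "hp": [110.0, np.nan],
})

# Write to a temporary folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_cars.json")  # hypothetical file name
    toy.to_json(path, orient="records")           # one JSON object per row
    with open(path) as f:
        records = json.load(f)

print(records)
```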
This is the format we would like to see in the report:
[
{
"car": "Mazda RX4",
"mpg": 21,
"cyl": 6,
"disp": 160,
"hp": 110,
"drat": 3.9,
"wt": 2.62,
"qsec": 16.46,
"vs": 0,
"am": 1,
"gear": 4,
"carb": 4
}
]
And here are the various options:
# %%
# Question 5 wants us to "include one record example"
# in our md report that "has a missing value"
# you can print out the data as a json string like this:
json_data = cars_new.to_json()
print(json_data)
# but that won't look good in our report.
# instead....
#%%
# you can load the json string with json.loads()
# and print it again with json.dumps() and an indent.
# with the default 'orient', the json is
# organized/printed by column
json_data = cars_new.to_json()
json_object = json.loads(json_data)
json_formatted_str = json.dumps(json_object, indent = 4)
print(json_formatted_str)
# %%
# we can change the format of the
# json file using 'orient'
json_data = cars_new.to_json(orient="split")
json_object = json.loads(json_data)
json_formatted_str = json.dumps(json_object, indent = 4)
print(json_formatted_str)
# %%
# by table
json_data = cars_new.to_json(orient="table")
json_object = json.loads(json_data)
json_formatted_str = json.dumps(json_object, indent = 4)
print(json_formatted_str)
# %%
# by "record" or "row"
# (this matches the format we would like to see in the report)
json_data = cars_new.to_json(orient="records")
json_object = json.loads(json_data)
json_formatted_str = json.dumps(json_object, indent = 4)
print(json_formatted_str)
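Question 5 asks for one record example that has a missing value, so printing the whole data set is more than the report needs. The sketch below shows one way to pick out a single row with a missing value and print it in the "records" format. It uses a made-up DataFrame named toy, not the real cars data, so the values are for illustration only.

```python
import json

import numpy as np
import pandas as pd

# Toy stand-in for the cleaned cars data (values are made up).
toy = pd.DataFrame({
    "car": ["Mazda RX4", "Datsun 710", "Valiant"],
    "hp": [110.0, np.nan, 105.0],
    "gear": [4.0, 4.0, np.nan],
})

# Keep only rows with at least one missing value, then take the first one.
has_missing = toy.isnull().any(axis=1)
one_row = toy[has_missing].head(1)

# Same pretty-printing steps as above, applied to the single row.
json_data = one_row.to_json(orient="records")
json_object = json.loads(json_data)
example = json.dumps(json_object, indent=4)
print(example)
```

With the real data, the same steps would use cars_new in place of toy.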