JSONs & Missing
Skill Builder
JSONs & Missiong
Data
Exercise 1
Read in the json file as a pandas dataframe. After reading in the data, you’ll want to explore it and gain some intuition. Exploring data is a very important step — the more you know about your data the better! Answer the following questions to gain some insight into this dataset.
- How many rows are there?
- How many columns?
- What does a row represent in this dataset?
- What are the different ways missing values are encoded?
- How many np.nan in each column?
# Object/Categorical Columns
df.column_name.value_counts(dropna=False)
df.column_name.unique()
# Numeric Columns
df.column_name.describe()
# Counting missing values
df.isna().sum() # Creates boolean dataframe and sums each columnExercise 2
After learning different ways our data encodes missing values, now we will neatly manage them. There are many techniques we can use to handle missing values; for example, we can drop all rows that contain a missing value, impute with mean or median, or replace missing values with a new missing category. We will use some of these techniques in this exercise.
shape_reported- replace missing values withmissingstring.distance_reported- change -999 values to np.nan. (-999 is a typical way of encoding missing values.)distance_reported- fill in missing values with the mean (imputation)were_you_abducted- replace-string withmissingstring.
The first 10 rows of your data should look like this after completion of the above steps.
| city | shape_reported | distance_reported | were_you_abducted | estimated_size | |
|---|---|---|---|---|---|
| 0 | Ithaca | TRIANGLE | 8521.9 | yes | 5033.9 |
| 1 | Willingboro | OTHER | 7438.64 | no | 5781.03 |
| 2 | Holyoke | OVAL | 7438.64 | no | 697203 |
| 3 | Abilene | DISK | 7438.64 | no | 5384.61 |
| 4 | New York Worlds Fair | LIGHT | 6615.78 | missing | 3417.58 |
| 5 | Valley City | DISK | 7438.64 | no | 4280.1 |
| 6 | Crater Lake | CIRCLE | 7377.89 | no | 528289 |
| 7 | Alma | DISK | 7438.64 | missing | 4772.75 |
| 8 | Eklutna | CIGAR | 5214.95 | no | 4534.03 |
| 9 | Hubbard | CYLINDER | 8220.34 | missing | 4653.72 |
df.column_name.replace(..., ..., inplace=True)
df.column_name.fillna(..., inplace=True)Exercise 3
Create a table that contains the following summary statistics. - median estimated size by shape - mean distance reported by shape - count of reports belonging to each shape
Your table should look like this:
| shape_reported | median_est_size | mean_distance_reported | group_count |
|---|---|---|---|
| CIGAR | 5899.68 | 6520.21 | 3 |
| CIRCLE | 266002 | 7408.26 | 2 |
| CYLINDER | 4550.58 | 8039.49 | 2 |
| DISK | 4581.8 | 7516.39 | 16 |
| FIREBALL | 5407.22 | 7097.78 | 3 |
| FLASH | 6108.34 | 7438.64 | 1 |
| FORMATION | 5104.4 | 8708.32 | 2 |
| LIGHT | 3850.25 | 7636.09 | 2 |
| OTHER | 4699.4 | 7473.98 | 4 |
| OVAL | 4943.63 | 7787.24 | 4 |
| RECTANGLE | 3668.1 | 6054.62 | 2 |
| SPHERE | 5076.78 | 7206.55 | 6 |
| TRIANGLE | 5033.9 | 8521.9 | 1 |
| missing | 250153 | 7438.64 | 2 |
(df.groupby(...)
.agg(...,
...,
...))Exercise 4
The cities listed below reported their estimated size in square inches, not square feet. Create a new column named estimated_size_sqft in the dataframe, that has all the estimated sizes reported as sqft. (Hint: divide by 144 to go from sqin -> sqft)
- Holyoke
- Crater Lake
- Los Angeles
- San Diego
- Dallas
The head of your data should look like this.
| city | shape_reported | distance_reported | were_you_abducted | estimated_size | estimated_size_sqft | |
|---|---|---|---|---|---|---|
| 0 | Ithaca | TRIANGLE | 8521.9 | yes | 5033.9 | 5033.9 |
| 1 | Willingboro | OTHER | 7438.64 | no | 5781.03 | 5781.03 |
| 2 | Holyoke | OVAL | 7438.64 | no | 697203 | 4841.69 |
| 3 | Abilene | DISK | 7438.64 | no | 5384.61 | 5384.61 |
| 4 | New York Worlds Fair | LIGHT | 6615.78 | missing | 3417.58 | 3417.58 |
| 5 | Valley City | DISK | 7438.64 | no | 4280.1 | 4280.1 |
| 6 | Crater Lake | CIRCLE | 7377.89 | no | 528289 | 3668.68 |
| 7 | Alma | DISK | 7438.64 | missing | 4772.75 | 4772.75 |
| 8 | Eklutna | CIGAR | 5214.95 | no | 4534.03 | 4534.03 |
| 9 | Hubbard | CYLINDER | 8220.34 | missing | 4653.72 | 4653.72 |
np.where(..., # Condition
..., # If condition is true
...) # If condition is falseSee the script.