UFO Sightings
Data
Exercise 1
Read in the json file as a pandas dataframe. After reading in the data, you’ll want to explore it and gain some intuition. Exploring data is a very important step — the more you know about your data the better! Answer the following questions to gain some insight into this dataset.
- How many rows are there?
- How many columns?
- What does a row represent in this dataset?
- What are the different ways missing values are encoded?
- How many np.nan in each column?
Some useful code for exploring data
# Object/Categorical Columns
data.column_name.value_counts(dropna=False)
data.column_name.unique()
# Numeric Columns
data.column_name.describe()
# Counting missing values
data.isna().sum() # Creates boolean dataframe and sums each column
Exercise 2
After learning different ways our data encodes missing values, now we will neatly manage them. There are many techniques we can use to handle missing values; for example, we can drop all rows that contain a missing value, impute with mean or median, or replace missing values with a new missing
category. We will use some of these techniques in this exercise.
shape_reported
- replace missing values withmissing
string.distance_reported
- change -999 values to np.nan. (-999 is a typical way of encoding missing values.)distance_reported
- fill in missing values with the mean (imputation)were_you_abducted
- replace-
string withmissing
string.
The first 10 rows of your data should look like this after completion of the above steps.
city | shape_reported | distance_reported | were_you_abducted | estimated_size | |
---|---|---|---|---|---|
0 | Ithaca | TRIANGLE | 8521.9 | yes | 5033.9 |
1 | Willingboro | OTHER | 7438.64 | no | 5781.03 |
2 | Holyoke | OVAL | 7438.64 | no | 697203 |
3 | Abilene | DISK | 7438.64 | no | 5384.61 |
4 | New York Worlds Fair | LIGHT | 6615.78 | missing | 3417.58 |
5 | Valley City | DISK | 7438.64 | no | 4280.1 |
6 | Crater Lake | CIRCLE | 7377.89 | no | 528289 |
7 | Alma | DISK | 7438.64 | missing | 4772.75 |
8 | Eklutna | CIGAR | 5214.95 | no | 4534.03 |
9 | Hubbard | CYLINDER | 8220.34 | missing | 4653.72 |
Some useful code for filling in missing data
data.column_name.replace(..., ..., inplace=True)
data.column_name.fillna(..., inplace=True)
Exercise 3
Create a table that contains the following summary statistics.
- median estimated size by shape
- mean distance reported by shape
- count of reports belonging to each shape
Your table should look like this:
shape_reported | median_est_size | mean_distance_reported | group_count |
---|---|---|---|
CIGAR | 5899.68 | 6520.21 | 3 |
CIRCLE | 266002 | 7408.26 | 2 |
CYLINDER | 4550.58 | 8039.49 | 2 |
DISK | 4581.8 | 7516.39 | 16 |
FIREBALL | 5407.22 | 7097.78 | 3 |
FLASH | 6108.34 | 7438.64 | 1 |
FORMATION | 5104.4 | 8708.32 | 2 |
LIGHT | 3850.25 | 7636.09 | 2 |
OTHER | 4699.4 | 7473.98 | 4 |
OVAL | 4943.63 | 7787.24 | 4 |
RECTANGLE | 3668.1 | 6054.62 | 2 |
SPHERE | 5076.78 | 7206.55 | 6 |
TRIANGLE | 5033.9 | 8521.9 | 1 |
missing | 250153 | 7438.64 | 2 |
Some useful code for grouping and getting summary statistics
(data.groupby(...)
.agg(...,
...,
...))
Exercise 4
The cities listed below reported their estimated size in square inches, not square feet. Create a new column named estimated_size_sqft
in the dataframe, that has all the estimated sizes reported as sqft. (Hint: divide by 144 to go from sqin -> sqft)
- Holyoke
- Crater Lake
- Los Angeles
- San Diego
- Dallas
The head of your data should look like this.
city | shape_reported | distance_reported | were_you_abducted | estimated_size | estimated_size_sqft | |
---|---|---|---|---|---|---|
0 | Ithaca | TRIANGLE | 8521.9 | yes | 5033.9 | 5033.9 |
1 | Willingboro | OTHER | 7438.64 | no | 5781.03 | 5781.03 |
2 | Holyoke | OVAL | 7438.64 | no | 697203 | 4841.69 |
3 | Abilene | DISK | 7438.64 | no | 5384.61 | 5384.61 |
4 | New York Worlds Fair | LIGHT | 6615.78 | missing | 3417.58 | 3417.58 |
5 | Valley City | DISK | 7438.64 | no | 4280.1 | 4280.1 |
6 | Crater Lake | CIRCLE | 7377.89 | no | 528289 | 3668.68 |
7 | Alma | DISK | 7438.64 | missing | 4772.75 | 4772.75 |
8 | Eklutna | CIGAR | 5214.95 | no | 4534.03 | 4534.03 |
9 | Hubbard | CYLINDER | 8220.34 | missing | 4653.72 | 4653.72 |
Some useful code to fix the rows reported in sqin
np.where(..., # Condition
..., # If condition is true
...) # If condition is false