JSON & Missing Values (4 days)

UFO Sightings

Data

Link to json file

Exercise 1

Read the JSON file into a pandas DataFrame. After loading the data, you'll want to explore it and build some intuition. Exploring data is a very important step: the more you know about your data, the better! Answer the following questions to gain some insight into this dataset.

  • How many rows are there?
  • How many columns?
  • What does a row represent in this dataset?
  • What are the different ways missing values are encoded?
  • How many np.nan in each column?

Some useful code for exploring data

# Object/Categorical Columns
data.column_name.value_counts(dropna=False)
data.column_name.unique()

# Numeric Columns
data.column_name.describe()

# Counting missing values
data.isna().sum()  # Creates boolean dataframe and sums each column
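
Putting those snippets together, a minimal sketch of loading and exploring might look like this. The inline JSON below is a hypothetical stand-in; in the exercise you would pass the path of the linked file instead.

```python
import io
import pandas as pd

# A sketch only -- substitute the path of the linked JSON file, e.g.
# data = pd.read_json("path/to/file.json"). The two records here are
# hypothetical but show the kinds of encodings to look for.
raw = io.StringIO("""[
  {"city": "Ithaca",  "shape_reported": "TRIANGLE", "distance_reported": 8521.9, "were_you_abducted": "yes"},
  {"city": "Holyoke", "shape_reported": null,       "distance_reported": -999,   "were_you_abducted": "-"}
]""")
data = pd.read_json(raw)

print(data.shape)                                      # (rows, columns)
print(data.shape_reported.value_counts(dropna=False))  # category counts, NaN included
print(data.distance_reported.describe())               # numeric summary
print(data.isna().sum())                               # np.nan count per column
```

Note how missing values hide in several forms at once: a JSON null becomes np.nan, while -999 and "-" load as ordinary values and only show up when you inspect value counts and summaries.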

Exercise 2

Now that we know the different ways this data encodes missing values, we will clean them up. There are many techniques for handling missing values: we can drop all rows that contain a missing value, impute with the mean or median, or replace missing values with a new "missing" category. We will use some of these techniques in this exercise.

  • shape_reported - replace missing values with the string "missing".
  • distance_reported - change -999 values to np.nan. (-999 is a common way of encoding missing values.)
  • distance_reported - fill in the remaining missing values with the column mean (imputation).
  • were_you_abducted - replace the "-" string with the string "missing".

The first 10 rows of your data should look like this after completion of the above steps.

   city                  shape_reported  distance_reported  were_you_abducted  estimated_size
0  Ithaca                TRIANGLE        8521.9             yes                5033.9
1  Willingboro           OTHER           7438.64            no                 5781.03
2  Holyoke               OVAL            7438.64            no                 697203
3  Abilene               DISK            7438.64            no                 5384.61
4  New York Worlds Fair  LIGHT           6615.78            missing            3417.58
5  Valley City           DISK            7438.64            no                 4280.1
6  Crater Lake           CIRCLE          7377.89            no                 528289
7  Alma                  DISK            7438.64            missing            4772.75
8  Eklutna               CIGAR           5214.95            no                 4534.03
9  Hubbard               CYLINDER        8220.34            missing            4653.72

Some useful code for filling in missing data

data.column_name.replace(..., ..., inplace=True)
data.column_name.fillna(..., inplace=True)
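
The four cleaning steps above can be sketched as follows on a tiny hypothetical frame (not the linked dataset). Assigning the result back to the column is a safe alternative to inplace=True, which is discouraged for chained calls in recent pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows illustrating each kind of missing encoding.
data = pd.DataFrame({
    "shape_reported": ["TRIANGLE", np.nan, "OVAL"],
    "distance_reported": [8521.9, -999.0, np.nan],
    "were_you_abducted": ["yes", "-", "no"],
})

# shape_reported: replace missing values with the string "missing"
data["shape_reported"] = data["shape_reported"].fillna("missing")

# distance_reported: -999 is an encoded missing value -> np.nan
data["distance_reported"] = data["distance_reported"].replace(-999, np.nan)

# distance_reported: impute the remaining missing values with the column mean
data["distance_reported"] = data["distance_reported"].fillna(data["distance_reported"].mean())

# were_you_abducted: replace the "-" string with "missing"
data["were_you_abducted"] = data["were_you_abducted"].replace("-", "missing")
```

Order matters here: convert -999 to np.nan before imputing, or the sentinel values will drag the mean far off.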

Exercise 3

Create a table that contains the following summary statistics.

  • median estimated size by shape
  • mean distance reported by shape
  • count of reports belonging to each shape

Your table should look like this:

shape_reported  median_est_size  mean_distance_reported  group_count
CIGAR           5899.68          6520.21                 3
CIRCLE          266002           7408.26                 2
CYLINDER        4550.58          8039.49                 2
DISK            4581.8           7516.39                 16
FIREBALL        5407.22          7097.78                 3
FLASH           6108.34          7438.64                 1
FORMATION       5104.4           8708.32                 2
LIGHT           3850.25          7636.09                 2
OTHER           4699.4           7473.98                 4
OVAL            4943.63          7787.24                 4
RECTANGLE       3668.1           6054.62                 2
SPHERE          5076.78          7206.55                 6
TRIANGLE        5033.9           8521.9                  1
missing         250153           7438.64                 2

Some useful code for grouping and getting summary statistics

(data.groupby(...)
     .agg(...,
          ...,
          ...))

Exercise 4

The cities listed below reported their estimated size in square inches, not square feet. Create a new column named estimated_size_sqft in the dataframe that has all the estimated sizes in square feet. (Hint: divide by 144 to convert sqin -> sqft.)

  • Holyoke
  • Crater Lake
  • Los Angeles
  • San Diego
  • Dallas

The head of your data should look like this.

   city                  shape_reported  distance_reported  were_you_abducted  estimated_size  estimated_size_sqft
0  Ithaca                TRIANGLE        8521.9             yes                5033.9          5033.9
1  Willingboro           OTHER           7438.64            no                 5781.03         5781.03
2  Holyoke               OVAL            7438.64            no                 697203          4841.69
3  Abilene               DISK            7438.64            no                 5384.61         5384.61
4  New York Worlds Fair  LIGHT           6615.78            missing            3417.58         3417.58
5  Valley City           DISK            7438.64            no                 4280.1          4280.1
6  Crater Lake           CIRCLE          7377.89            no                 528289          3668.68
7  Alma                  DISK            7438.64            missing            4772.75         4772.75
8  Eklutna               CIGAR           5214.95            no                 4534.03         4534.03
9  Hubbard               CYLINDER        8220.34            missing            4653.72         4653.72

Some useful code to fix the rows reported in sqin

np.where(...,  # Condition
         ...,  # If condition is true
         ...)  # If condition is false
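
A sketch of the conversion with np.where, again on hypothetical sample rows rather than the full dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; Holyoke and Crater Lake reported in sqin.
data = pd.DataFrame({
    "city": ["Ithaca", "Holyoke", "Crater Lake"],
    "estimated_size": [5033.9, 697203.0, 528289.0],
})

sqin_cities = ["Holyoke", "Crater Lake", "Los Angeles", "San Diego", "Dallas"]

data["estimated_size_sqft"] = np.where(
    data["city"].isin(sqin_cities),   # condition: city reported in square inches
    data["estimated_size"] / 144,     # if true: convert sqin -> sqft
    data["estimated_size"])           # if false: already in sqft, keep as-is

print(data)
```

Series.isin gives the boolean condition, and np.where picks element-wise between the converted and original values, so rows for other cities pass through unchanged.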