Munging data (4 days)

Intro to cleaning movies data

Link to the data

This skill builder focuses on munging (formatting) data into a machine-learning-ready dataset. We will be using an IMDB Ratings dataset that contains categorical columns. Scikit-learn models generally cannot work with string columns directly, so we need to convert them to a numerical representation. We do this by one-hot encoding, by label encoding, or by taking just one value from a provided range. There are many other ways to represent these columns as numbers, but they are beyond the scope of this course.

Once you’ve converted every column to a sensible numeric form, you will be asked to recreate a graph using Altair. Here is the head of the data you will be working with. Enjoy!

star_rating  content_rating  genre   duration  box_office_rev             major_hit
9.3          R               Crime   142       €1924521976 - €1925521976  no
9.2          R               Crime   175       €177034987 - €178034987    no
9.1          R               Crime   200       €2617541398 - €2618541398  no
9            PG-13           Action  152       €996115723 - €997115723    no
8.9          R               Crime   154       €1172054364 - €1173054364  no

Data

Link to csv file: ...


Exercise 0

  • Grab the high range value for each movie and put it into a new column called high_range_rev.
    • Make sure the data type of this new column is numeric!!
  • Remove the box_office_rev column from the dataset.

The .str.split() and .astype() methods might be of use! Also, to get the euro sign just copy it from here, €, and put it in your code.
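
As a rough illustration of that hint, the sketch below applies .str.split() and .astype() to a three-row toy frame copied from the head of the data. The variable name movies is a placeholder, and splitting on " - €" assumes every row uses exactly that separator. Treat it as one possible approach, not the required solution.

```python
import pandas as pd

# Toy frame; the strings are copied from the head of the data shown earlier.
# "movies" is a placeholder name, not something the assignment prescribes.
movies = pd.DataFrame({
    "star_rating": [9.3, 9.2, 9.1],
    "box_office_rev": [
        "€1924521976 - €1925521976",
        "€177034987 - €178034987",
        "€2617541398 - €2618541398",
    ],
})

# Split each "low - high" string on the separator plus the euro sign,
# then keep the second piece, which is the high end of the range.
high = movies["box_office_rev"].str.split(" - €").str[1]

# Cast the digit strings to integers. int64 is used because some values
# (for example 2618541398) do not fit in a 32-bit integer.
movies["high_range_rev"] = high.astype("int64")

# The original string column is no longer needed.
movies = movies.drop(columns="box_office_rev")

print(movies)
```

Because the split pattern includes the euro sign, the second piece contains only digits, so the cast needs no separate cleanup step.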

The first 5 rows of the resulting dataframe should look like this

star_rating  content_rating  genre   duration  major_hit  high_range_rev
9.3          R               Crime   142       no         1925521976
9.2          R               Crime   175       no         178034987
9.1          R               Crime   200       no         2618541398
9            PG-13           Action  152       no         997115723
8.9          R               Crime   154       no         1173054364

Exercise 1

Convert the major_hit column to 1s and 0s: yes -> 1 and no -> 0. Again, there are several ways to accomplish this, though our old friend np.where is probably the easiest.
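
For reference, here is a minimal sketch of the np.where pattern on a made-up three-row column. The frame name movies and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column name comes from the exercise.
movies = pd.DataFrame({"major_hit": ["no", "yes", "no"]})

# np.where(condition, value_if_true, value_if_false) works element-wise:
# rows equal to "yes" become 1, everything else becomes 0.
movies["major_hit"] = np.where(movies["major_hit"] == "yes", 1, 0)

print(movies)
```

Note that anything other than the exact string "yes" (including typos or missing values) ends up as 0 with this pattern, so it is worth checking the distinct values in the column first.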

The first 5 rows of the resulting dataframe should look like this

star_rating  content_rating  genre   duration  major_hit  high_range_rev
9.3          R               Crime   142       0          1925521976
9.2          R               Crime   175       0          178034987
9.1          R               Crime   200       0          2618541398
9            PG-13           Action  152       0          997115723
8.9          R               Crime   154       0          1173054364

Exercise 2

Convert the content_rating column using label encoding. We’re using label encoding in this case because the movie ratings already have a natural ordering. We will replace each rating with a number that reflects its position in that ascending order.

To be more specific, here is how we will do it.

  • G: 0
  • PG: 1
  • PG-13: 2
  • R: 3

A dictionary and the .map() method could be useful for this exercise. There are other ways of tackling this problem though. Be creative!
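
The dictionary-plus-.map() idea could look something like the sketch below. The dictionary mirrors the mapping listed above, while the names rating_order and movies and the four sample rows are placeholders chosen for illustration.

```python
import pandas as pd

# Mapping taken from the list above; the variable name is a placeholder.
rating_order = {"G": 0, "PG": 1, "PG-13": 2, "R": 3}

# Made-up sample rows covering each rating once.
movies = pd.DataFrame({"content_rating": ["R", "PG-13", "G", "PG"]})

# .map() looks up each value in the dictionary and returns the result.
movies["content_rating"] = movies["content_rating"].map(rating_order)

print(movies)
```

Any rating that is missing from the dictionary comes back as NaN from .map(), so a quick check for missing values after mapping can reveal ratings the dictionary does not cover.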

The first 5 rows of the resulting dataframe should look like

star_rating  content_rating  genre   duration  major_hit  high_range_rev
9.3          3               Crime   142       0          1925521976
9.2          3               Crime   175       0          178034987
9.1          3               Crime   200       0          2618541398
9            2               Action  152       0          997115723
8.9          3               Crime   154       0          1173054364

Exercise 3

The last column that we need to take care of is genre. We will use one hot encoding for this. Make sure to ONLY one hot encode the genre column!

A useful function for one hot encoding is pd.get_dummies(). I recommend checking out the documentation.
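
As a small, hedged example of pd.get_dummies(), the sketch below encodes only the genre column of a made-up frame. The names movies and encoded are placeholders, and passing dtype=int is an assumption made here so the output shows 1/0 rather than True/False in recent pandas versions.

```python
import pandas as pd

# Made-up rows; only the column names come from the exercise.
movies = pd.DataFrame({
    "genre": ["Crime", "Action", "Crime"],
    "duration": [142, 152, 175],
})

# columns=["genre"] restricts the encoding to that one column;
# every other column is passed through unchanged.
# dtype=int requests 1/0 indicator values instead of booleans.
encoded = pd.get_dummies(movies, columns=["genre"], dtype=int)

print(encoded)
```

Each distinct genre becomes its own genre_<name> column, and the original genre column is removed from the result.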

The resulting dataframe should look like the following example. Don’t worry if your high_range_rev column turns into scientific notation; pandas sometimes displays large numbers that way.

   star_rating  content_rating  duration  major_hit  high_range_rev  genre_Action  genre_Adventure  genre_Animation  genre_Biography  genre_Comedy  genre_Crime  genre_Drama  genre_Family  genre_Fantasy  genre_Horror  genre_Mystery  genre_Sci-Fi  genre_Thriller  genre_Western
0  9.3          3               142       0          1.92552e+09     0             0                0                0                0             1            0            0             0              0             0              0             0               0
1  9.2          3               175       0          1.78035e+08     0             0                0                0                0             1            0            0             0              0             0              0             0               0
2  9.1          3               200       0          2.61854e+09     0             0                0                0                0             1            0            0             0              0             0              0             0               0
3  9            2               152       0          9.97116e+08     1             0                0                0                0             0            0            0             0              0             0              0             0               0
4  8.9          3               154       0          1.17305e+09     0             0                0                0                0             1            0            0             0              0             0              0             0               0

Exercise 4

Recreate this graph as best you can. You’ll need to use the original data that specifies the actual rating.