Munging data

Intro to cleaning movies data

This skill builder focuses on munging (formatting) data into a machine learning ready dataset. We will be using an IMDB Ratings dataset. It contains columns that are categorical. Sklearn cannot handle columns that are strings, so we need to convert these into a numerical representation. We accomplish this by either one hot encoding, label encoding, or taking just one value of the range provided. There are many other ways to represent these columns as numbers, but they are beyond the scope of this course.

Once you’ve converted all columns to numeric, in an intelligent way, you will be asked to recreate a graph using altair. Here is the head of the data you will be working with. Enjoy!

star_rating	content_rating	genre	duration	box_office_rev	major_hit
9.3	R	Crime	142	€1924521976 - €1925521976	no
9.2	R	Crime	175	€177034987 - €178034987	no
9.1	R	Crime	200	€2617541398 - €2618541398	no
9	PG-13	Action	152	€996115723 - €997115723	no
8.9	R	Crime	154	€1172054364 - €1173054364	no

Data

Link to csv file: ...

Exercise 0

Grab the high range value for each movie and put it into a new column called high_range_rev.
- Make sure the data type of this new column is numeric!!
Remove the box_office_rev column from the dataset.

The .str.split() and .astype() methods might be of use! Also, to get the euro sign just copy it from here, €, and put it in your code.

The first 5 rows of the resulting dataframe should look like this

star_rating	content_rating	genre	duration	major_hit	high_range_rev
9.3	R	Crime	142	no	2345444803
9.2	R	Crime	175	no	2182412593
9.1	R	Crime	200	no	1604872807
9	PG-13	Action	152	no	284317976
8.9	R	Crime	154	yes	1791932201

Exercise 1

Convert the major_hit column to 1/0’s. yes -> 1 and no -> 0. Again, there are several ways to accomplish this. Using our old friend np.where is probably the easiest though.

The first 5 rows of the resulting dataframe should like this

star_rating	content_rating	genre	duration	high_range_rev
9.3	R	Crime	142	1925521976
9.2	R	Crime	175	178034987
9.1	R	Crime	200	2618541398
9	PG-13	Action	152	997115723
8.9	R	Crime	154	1173054364

Exercise 2

Convert the content_rating column using label encoding. We’re using label encoding in this case because the movie ratings already have a natural ordering to them. We will replace each rating with a number in it’s natural ascending order.

To be more specific, here is how we will do it.

G: 0
PG: 1
PG-13: 2
R: 3

A dictionary and the .map() method could be useful for this exercise. There are other ways of tackling this problem though. Be creative!

The first 5 rows of the resulting dataframe should look like

star_rating	content_rating	genre	duration	high_range_rev
9.3	3	Crime	142	1925521976
9.2	3	Crime	175	178034987
9.1	3	Crime	200	2618541398
9	2	Action	152	997115723
8.9	3	Crime	154	1173054364

Exercise 3

The last column that we need to take care of is genre. We will use one hot encoding for this. Make sure to ONLY one hot encode the genre column!

A useful function for one hot encoding is pd.get_dummies(). I recommend checking out the documentation.

The resulting dataframe should look like the following example; don’t worry if your high_range_rev column turned into scientific notation—Pandas does this sometimes.

	star_rating	content_rating	duration	high_range_rev	genre_Action	genre_Crime
0	9.3	3	142	1.92552e+09	0	1
1	9.2	3	175	1.78035e+08	0	1
2	9.1	3	200	2.61854e+09	0	1
3	9	2	152	9.97116e+08	1	0
4	8.9	3	154	1.17305e+09	0	1

Exercise 4

Recreate this graph as best you can. You’ll need to use the original data that specifies the actual rating.

After you have completed this skill builder with your team (or on your own) then compare your work to our script

See the script.