It’s fair to say that over the last decade, most people preparing to watch a feature film, whether at home or at the local theater, have consulted an online rating source to find out whether the movie is any good. As they scour the various websites and blogs, they generally come across, and end up on, the IMDb website.
IMDb, which started in the 1990s, has accumulated and cataloged almost every bit of information available on movies and TV shows from the past century. Over time it has become a reputable source for entertainment information, and it continues to be a standard for free, publicly accessible information. Interestingly, what I personally use IMDb for the most is its ratings of movies and shows. I search and read what reviewers think about them so that I know what to expect. I’m not alone in this.
Because the team at IMDb has built an extensive data set covering most movies made in the past century, questions can be answered using the data recorded for each movie. In this analysis the IMDb data set will be used to see whether the `Age` rating of each movie and the decade in which it was made are associated.
Does the number of movies with each `Age` rating (all, 7+, 13+, 16+, 18+) change from one `Decade` to the next over the past three decades? Or, in other words, has the entertainment industry pumped out more or fewer movies of a certain age rating from one decade to the next (associated), or has the mix stayed the same decade to decade (not associated)?
\[ H_0: \text{The movie Age rating and the movie Decade are independent of one another.} \] \[ H_a: \text{The movie Age rating and the movie Decade are associated.} \] \[ \text{level of significance: } \alpha = 0.05 \]
Below are the counts of movies for each `Age` rating per `Decade`. The decades in question are the past three: the nineties, the two-thousands, and the most recent decade, the twenty-tens. Looking at the table, it’s apparent that more movies were made in the most recent decade than in either of the other two, so the question becomes whether the ratings grew in step with one another. The test compares these observed counts with the counts expected in each cell if rating and decade were independent.
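#Filter area
library(tidyverse)  # filter(), mutate(), case_when(), drop_na(), %>%
library(pander)     # pander() for table output

# Keep movies from 1990 onward that have an age rating
decades <- filter(mov, Year >= 1990)
decades <- decades %>% drop_na(Age)
decades <- decades %>%
  mutate(
    Age = factor(Age, levels = c("all", "7+", "13+", "16+", "18+"), ordered = TRUE),
    Decade = case_when(
      Year >= 2010 ~ 'Twenty-Tens',
      Year >= 2000 & Year < 2010 ~ 'Two-Thousands',
      Year >= 1990 & Year < 2000 ~ 'Nineties'
    ),
    # Order the decades chronologically
    Decade = factor(
      Decade, levels = c('Nineties', 'Two-Thousands', 'Twenty-Tens'), ordered = TRUE
    )
  )
# Cross-tabulate Age rating (rows) by Decade (columns)
decades <- table(decades$Age, decades$Decade)
pander(decades)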
Age | Nineties | Two-Thousands | Twenty-Tens |
---|---|---|---|
all | 83 | 225 | 305 |
7+ | 124 | 275 | 724 |
13+ | 96 | 303 | 768 |
16+ | 7 | 37 | 272 |
18+ | 340 | 804 | 1770 |
In this bar plot of the data table the trends are much more obvious. At a glance, the count in most groups roughly doubles or more with each passing decade. The exceptions are `all` movies, which grow more slowly in the most recent decade, and `16+` movies, which grow much faster than the rest. This may be partly due to those ratings being less common than the others. A chi-squared test can check whether these differences are statistically significant.
library(RColorBrewer)  # provides brewer.pal()
barplot(decades, col = brewer.pal(n = 5, name = 'RdGy'), beside = TRUE, legend.text = TRUE,
        main = "Age Ratings by Decade", xlab = "Decade", ylab = "Count of Movies",
        args.legend = list(x = "topleft", bty = "n"))
Age | Nineties | Two-Thousands | Twenty-Tens |
---|---|---|---|
all | 83 | 225 | 305 |
7+ | 124 | 275 | 724 |
13+ | 96 | 303 | 768 |
16+ | 7 | 37 | 272 |
18+ | 340 | 804 | 1770 |
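The "doubling" impression from the plot can be checked with a little arithmetic. The following is a minimal sketch in base R that re-enters the observed counts from the table above and divides each decade's count by the previous decade's count. The names `observed` and `growth` are chosen for this illustration and are not part of the original analysis.

```r
# Observed counts copied from the table above
# (rows = Age rating, columns = Decade).
observed <- matrix(
  c( 83, 225,  305,
    124, 275,  724,
     96, 303,  768,
      7,  37,  272,
    340, 804, 1770),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Age    = c("all", "7+", "13+", "16+", "18+"),
    Decade = c("Nineties", "Two-Thousands", "Twenty-Tens")
  )
)

# Ratio of each decade's count to the previous decade's count.
# A value near 2 means the count roughly doubled.
growth <- observed[, -1] / observed[, -ncol(observed)]
colnames(growth) <- c("2000s / 1990s", "2010s / 2000s")
round(growth, 2)
```

In this sketch the `7+`, `13+`, and `18+` ratios all sit between about 2 and 3, while `all` grows by only about 1.4 times in the most recent decade and `16+` grows more than sevenfold.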
To decide whether to reject \(H_0\), we need a \(\chi^2\) test statistic and its associated p-value. The test on `Decade` and `Age` rating is as follows:
Test statistic | df | P value |
---|---|---|
136.1 | 8 | 1.499e-25 * * * |
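The code that produced this result is not shown, so the following is only a sketch of one way to obtain it, assuming the test was run with base R's `chisq.test()` on the table of counts. The matrix `observed` is re-entered by hand from the observed counts so that the block runs on its own.

```r
# Observed counts (rows = Age rating, columns = Decade),
# re-entered from the table of counts for this sketch.
observed <- matrix(
  c( 83, 225,  305,
    124, 275,  724,
     96, 303,  768,
      7,  37,  272,
    340, 804, 1770),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Age    = c("all", "7+", "13+", "16+", "18+"),
    Decade = c("Nineties", "Two-Thousands", "Twenty-Tens")
  )
)

# Pearson's chi-squared test of independence.
test <- chisq.test(observed)

test$statistic  # chi-squared test statistic
test$parameter  # degrees of freedom: (5 - 1) * (3 - 1) = 8
test$p.value    # compare against alpha = 0.05
```

The degrees of freedom come from (rows - 1) × (columns - 1), which is why a 5 × 3 table gives 8.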
With a p-value this small, far below \(\alpha = 0.05\), there is strong evidence that the rows and columns are associated. But to what degree, and what does this mean?
For the test to be appropriate, the expected count in every cell should be at least five. The expected counts below all meet this requirement; the smallest is 33.49.
Age | Nineties | Two-Thousands | Twenty-Tens |
---|---|---|---|
all | 64.97 | 164.3 | 383.7 |
7+ | 119 | 301 | 703 |
13+ | 123.7 | 312.8 | 730.5 |
16+ | 33.49 | 84.71 | 197.8 |
18+ | 308.8 | 781.1 | 1824 |
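Under independence, the expected count for a cell is (row total × column total) / grand total. As a rough sketch, the expected counts above can be reproduced in base R from the observed counts alone. The names `observed` and `expected` are chosen for this illustration.

```r
# Observed counts (rows = Age rating, columns = Decade),
# re-entered from the table of counts for this sketch.
observed <- matrix(
  c( 83, 225,  305,
    124, 275,  724,
     96, 303,  768,
      7,  37,  272,
    340, 804, 1770),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Age    = c("all", "7+", "13+", "16+", "18+"),
    Decade = c("Nineties", "Two-Thousands", "Twenty-Tens")
  )
)

# Expected count per cell: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# Check the requirement that every expected count is at least five.
all(expected >= 5)
```

For example, the `all` / Nineties cell is 613 × 650 / 6133, which is about 64.97 and matches the first entry of the table.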
Here again are the observed counts for every group. Some are close to their expected counts, and others are quite far from them.
Age | Nineties | Two-Thousands | Twenty-Tens |
---|---|---|---|
all | 83 | 225 | 305 |
7+ | 124 | 275 | 724 |
13+ | 96 | 303 | 768 |
16+ | 7 | 37 | 272 |
18+ | 340 | 804 | 1770 |
With a p-value this low, it is worth looking at the residuals to see how far the observed counts were from the expected counts, cell by cell. The further a residual is from \(0\), the larger the difference; a positive value means more movies were observed than expected, and a negative value means fewer. These Pearson residuals are calculated as (observed - expected)/sqrt(expected) for each cell. For reference this looks like: \(\frac{(O_i-E_i)}{\sqrt{E_i}}\).
Some of the Pearson residuals below are far from zero, and quite drastically so. The `all` age rating category has some of the largest residuals. With values ranging from -4.018 to 4.734, its share of movies clearly shifted from decade to decade. The other age rating that changed drastically, with residuals ranging from -5.183 to 5.276, was the `16+` category. These two groups contribute the most to the size of the test statistic because of their large residuals. The other three groups, however, show much less variability: their residuals range from -2.489 to 1.773, a considerably narrower spread than in the aforementioned groups.
Age | Nineties | Two-Thousands | Twenty-Tens |
---|---|---|---|
all | 2.237 | 4.734 | -4.018 |
7+ | 0.4565 | -1.5 | 0.7939 |
13+ | -2.489 | -0.5554 | 1.388 |
16+ | -4.578 | -5.183 | 5.276 |
18+ | 1.773 | 0.8186 | -1.265 |
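The Pearson residuals in the table above can be recomputed directly from the formula \(\frac{(O_i-E_i)}{\sqrt{E_i}}\). The following is a minimal sketch in base R; the names `observed`, `expected`, and `pearson` are chosen for this illustration. It also sums the squared residuals, which gives the chi-squared statistic, and splits that sum by age rating to show which rows contribute the most.

```r
# Observed counts (rows = Age rating, columns = Decade),
# re-entered from the table of counts for this sketch.
observed <- matrix(
  c( 83, 225,  305,
    124, 275,  724,
     96, 303,  768,
      7,  37,  272,
    340, 804, 1770),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Age    = c("all", "7+", "13+", "16+", "18+"),
    Decade = c("Nineties", "Two-Thousands", "Twenty-Tens")
  )
)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual per cell: (observed - expected) / sqrt(expected).
pearson <- (observed - expected) / sqrt(expected)
round(pearson, 3)

# The squared residuals add up to the chi-squared test statistic.
sum(pearson^2)

# Contribution of each age rating to that statistic.
round(rowSums(pearson^2), 2)
```

In this sketch the `16+` and `all` rows account for most of the total, which is consistent with the reading of the residual table.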
The sum of the squared residuals gives the test statistic of 136.1, which is large enough to produce a p-value of 1.499e-25, a strong indication that the groups are associated. As the results currently stand we can reject \(H_0\) and conclude that `Age` and `Decade` are associated. This means that as decades pass, the mix of movies across `Age` ratings also changes, which suggests that the media industry’s standards shift over time. Trends do change, though: although the `Age` rating counts fluctuated from decade to decade, there isn’t a clear pattern that can predict what will happen in decades to come. All that is known is that the counts change.