Decade of Movies Chi-Squared Test

Background

It’s fair to say that over the last decade, many people preparing to watch a feature film, whether at home or at the local theater, have consulted an online movie rating source to find out whether the movie in question is any good. As they scour the various websites and blogs, they generally come across, and end up on, the IMDb website.

IMDb, which started in the ’90s, has accumulated and quantified almost every bit of information available on movies and TV shows from the past century. Over time it has become a reputable source for entertainment information, and it continues to be a standard for free information sharing with the public. Interestingly, what I personally use IMDb for the most is its ratings of movies and shows. I search and read what reviewers think about them so that I know what to expect. I’m not alone in this.

Because the team at IMDb has built an extensive data set covering most movies made in the past century, questions can be answered using the data recorded for each movie. In this analysis the IMDb data set will be used to see whether the Age rating of a movie and the decade in which it was made are associated.

Hypotheses

Question

Has the number of movies in each Age rating (all, 7+, 13+, 16+, 18+) changed across the past three decades? In other words, has the entertainment industry shifted its output toward or away from certain age ratings from decade to decade (associated), or has the mix of ratings stayed the same (not associated)?

Hypotheses

\[ H_0: \text{The movie Age rating and the movie Decade are independent of one another.} \] \[ H_a: \text{The movie Age rating and the movie Decade are associated.} \] \[ \text{level of significance: } \alpha = 0.05 \]

Table

Totals from each group

Below are the counts of movies in each Age rating per Decade. The decades in question are the past three: the nineties, the two-thousands, and the most recent, the twenty-tens. Looking at the table, it’s apparent that far more movies were made in the most recent decade than in the other two, so the question becomes how much the counts in each rating have changed relative to one another. The test looks at the difference between these observed counts and the expected counts for each cell.

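# Packages used throughout (assumes `mov`, the IMDb data set, is already loaded)
library(tidyverse)     # filter(), mutate(), case_when(), drop_na()
library(pander)        # pander() tables
library(RColorBrewer)  # brewer.pal() for the bar plot

# Keep movies from 1990 onward that have an Age rating, then label each decade
decades <- mov %>%
  filter(Year >= 1990) %>%
  drop_na(Age) %>%
  mutate(
    Age = factor(Age, levels = c("all", "7+", "13+", "16+", "18+"), ordered = TRUE),
    Decade = case_when(
      Year >= 2010 ~ 'Twenty-Tens',
      Year >= 2000 & Year < 2010 ~ 'Two-Thousands',
      Year >= 1990 & Year < 2000 ~ 'Nineties'
    ),
    # Ordered factor so the decades sort chronologically
    Decade = factor(
      Decade, levels = c('Nineties', 'Two-Thousands', 'Twenty-Tens'), ordered = TRUE)
  )

# Cross-tabulate Age (rows) by Decade (columns)
decades <- table(decades$Age, decades$Decade)
pander(decades)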

	Nineties	Two-Thousands	Twenty-Tens
all	83	225	305
7+	124	275	724
13+	96	303	768
16+	7	37	272
18+	340	804	1770

Graph

In this bar plot of the data table the trends are much more obvious. At a glance, the count in most rating groups at least doubles with each passing decade. The exceptions are the all and 16+ groups: from the two-thousands to the twenty-tens, all grows by only about a third (225 to 305), while 16+ grows roughly sevenfold (37 to 272). This may be due to those ratings being less common than the others. A chi-squared test can check whether these differences are statistically significant.

barplot(decades, col= brewer.pal(n=5, name= 'RdGy'),beside=TRUE, legend.text=TRUE, main = "Age Ratings by Decade", xlab = "Decade", ylab = "Count of Movies", args.legend=list(x = "topleft", bty="n"))

pander(decades, caption="Totals")

Totals
	Nineties	Two-Thousands	Twenty-Tens
all	83	225	305
7+	124	275	724
13+	96	303	768
16+	7	37	272
18+	340	804	1770
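Because the total number of movies grows so much from decade to decade, raw counts can hide how the mix of ratings shifts. As a rough sketch (not part of the original analysis), the share of each rating within a decade can be computed from the table. The matrix name `observed` and the variable `shares` are introduced here for illustration, with the counts copied by hand from the table above.

```r
# Observed counts copied from the Totals table (rows = Age rating, columns = Decade).
observed <- matrix(
  c( 83, 225,  305,
    124, 275,  724,
     96, 303,  768,
      7,  37,  272,
    340, 804, 1770),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Age    = c("all", "7+", "13+", "16+", "18+"),
    Decade = c("Nineties", "Two-Thousands", "Twenty-Tens")
  )
)

# margin = 2 divides each cell by its column (decade) total,
# so every column of `shares` sums to 1.
shares <- prop.table(observed, margin = 2)

round(shares, 3)
```

If Age rating and Decade were independent, each row of `shares` would be roughly constant across the three columns.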

\(\chi^2\) Test

Statistical Test

To decide whether to reject \(H_0\), we need a \(\chi^2\) test statistic and its associated p-value. The test on Decade and Age rating is as follows:

KingCHI <- chisq.test(decades)
pander(KingCHI)

Pearson’s Chi-squared test: `decades`
Test statistic	df	P value
136.1	8	1.499e-25 * * *
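For reference, the statistic in the output above should be the usual Pearson form, summed over all 15 cells of the 5-by-3 table, where \(O_i\) is the observed count and \(E_i\) the expected count under independence for cell \(i\). The degrees of freedom follow from the number of rows and columns:

\[ \chi^2 = \sum_{i=1}^{15} \frac{(O_i - E_i)^2}{E_i}, \qquad df = (5-1)(3-1) = 8 \]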

With a p-value this minuscule, far below the 0.05 significance level, there is very strong evidence that the rows and columns are associated. But where does the association come from, and what does it mean?

Interpretation

Expected Counts

For the chi-squared test to be appropriate, the expected count in every cell should be at least five. Looking at the table below, they all are.

pander(KingCHI$expected)

	Nineties	Two-Thousands	Twenty-Tens
all	64.97	164.3	383.7
7+	119	301	703
13+	123.7	312.8	730.5
16+	33.49	84.71	197.8
18+	308.8	781.1	1824
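The expected counts above can be reproduced by hand, since under independence each cell's expected count is its row total times its column total, divided by the grand total. The following is a minimal sketch under that assumption; the names `observed` and `expected` are introduced here for illustration, and the counts are copied by hand from the observed table rather than taken from `mov`.

```r
# Observed counts copied from the table of totals (rows = Age rating, columns = Decade).
observed <- matrix(
  c( 83, 225,  305,
    124, 275,  724,
     96, 303,  768,
      7,  37,  272,
    340, 804, 1770),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Age    = c("all", "7+", "13+", "16+", "18+"),
    Decade = c("Nineties", "Two-Thousands", "Twenty-Tens")
  )
)

# Expected count for a cell = (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

round(expected, 2)

# Rule-of-thumb check: every expected count should be at least five.
all(expected >= 5)
```

For example, the all/Nineties cell works out to 613 * 650 / 6133, or about 64.97, matching the first entry of the table.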

Observed Counts

These are, again, the observed counts for every group. Some are close to their expected counts and others are quite distant.

pander(decades)

	Nineties	Two-Thousands	Twenty-Tens
all	83	225	305
7+	124	275	724
13+	96	303	768
16+	7	37	272
18+	340	804	1770

Residuals

With a p-value this low, it is worth looking at the residuals to see how far the observed counts were from the expected counts in the two tables above. The further a residual is from \(0\), the larger the difference. These Pearson residuals are calculated as (observed - expected)/sqrt(expected) for each cell. For reference, this looks like: \(\frac{O_i-E_i}{\sqrt{E_i}}\).

Some of the Pearson residuals below are far from zero, and quite drastically so. The all age rating category has some of the largest residuals: with values ranging from -4.018 to 4.734, its counts clearly depart from what independence would predict. The other age rating with large residuals, ranging from -5.183 to 5.276, is the 16+ category. These two groups contribute most to the size of the test statistic. The other three groups have much less variability; their residuals range from -2.489 to 1.773, a considerably narrower spread than in the aforementioned groups.

pander(KingCHI$residuals)

	Nineties	Two-Thousands	Twenty-Tens
all	2.237	4.734	-4.018
7+	0.4565	-1.5	0.7939
13+	-2.489	-0.5554	1.388
16+	-4.578	-5.183	5.276
18+	1.773	0.8186	-1.265
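As a check on the residual table, the residuals and each rating's contribution to the test statistic can be recomputed from the observed counts. This is a sketch rather than the original code: `observed`, `expected`, `pearson_res`, `chi_sq`, and `row_share` are names introduced here, and the counts are copied by hand from the observed table.

```r
# Observed counts copied from the observed table (rows = Age rating, columns = Decade).
observed <- matrix(
  c( 83, 225,  305,
    124, 275,  724,
     96, 303,  768,
      7,  37,  272,
    340, 804, 1770),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Age    = c("all", "7+", "13+", "16+", "18+"),
    Decade = c("Nineties", "Two-Thousands", "Twenty-Tens")
  )
)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual for each cell: (observed - expected) / sqrt(expected).
pearson_res <- (observed - expected) / sqrt(expected)
round(pearson_res, 3)

# The test statistic is the sum of the squared residuals.
chi_sq <- sum(pearson_res^2)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value from the chi-squared distribution.
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Fraction of the test statistic contributed by each Age rating.
row_share <- rowSums(pearson_res^2) / chi_sq
round(row_share, 3)
```

Looking at `row_share` makes it easy to see which ratings drive the result; here the all and 16+ rows together account for most of the statistic.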

The sum of the squared residuals is the test statistic, 136.1, which on 8 degrees of freedom gives a p-value of 1.499e-25, a massive indicator that the groups are associated. As the results currently stand, we can reject \(H_0\) and conclude that Age and Decade are associated. This means that as decades pass, the mix of movies across Age ratings also changes, which suggests that the media industry’s standards are shifting. Trends do change, though: although the counts in each Age rating fluctuated from decade to decade, there isn’t a clear pattern that can predict what will happen in decades to come. All that is known is that the counts change.
