Data Wrangling - Application Activity

Personality Revisited

Introduction

The Big 5 personality test is the most widely accepted tool for modelling personality in academic psychology. It is based on decades of statistical analysis of personality descriptions across languages and cultures. The Big 5 traits are:

  1. Openness
  2. Conscientiousness
  3. Extroversion
  4. Agreeableness
  5. Neuroticism

Brother Cannon has collected personality data from students over the past several semesters, along with a few other measures that may be associated with personality traits.

NOTE: Scores for personality traits are given in percentiles relative to the general population.

In this activity, you will practice the process for approaching a dataset outlined in class:

  1. Load the data and libraries
  2. Explore the data and generate hypotheses
  3. Prepare the data for analysis
  4. Perform the appropriate analysis

Data preparation will include using the filter() function. For now, analysis means creating good visualizations that tell a story using ggplot() and base R.

1 - Load the Data and Libraries

Run the following code to load the necessary R libraries and import the Big 5 personality data:

library(tidyverse)
library(car)
library(mosaic)
library(rio)

big5 <- import("https://github.com/byuistats/Math221D_Course/raw/main/Data/All_class_combined_personality_data.csv") %>% tibble()

2 - Explore the data and generate hypotheses

In this section, you will explore the dataset, including drilling down into individual columns.

Recall a few useful functions for data exploration:

  1. dim() to get the number of rows and columns in a dataset
  2. table() to get counts for each value of a categorical variable
  3. unique() to list the distinct values of a categorical variable
  4. favstats() to get summary statistics for quantitative data
  5. histogram() to visualize the distribution of a single quantitative variable
  6. boxplot() and/or ggplot() to compare the distributions of a quantitative variable for different groups
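
As a quick refresher, the sketch below applies the summary functions from this list to a small made-up data frame. The data frame toy and its columns group and score are hypothetical; they are not part of the big5 data.

```r
library(mosaic)  # provides favstats()

# Hypothetical toy data, made up for illustration only
toy <- data.frame(
  group = c("A", "B", "A", "C", "B", "A"),
  score = c(12, 45, 33, 78, 51, 90)
)

dim(toy)             # number of rows, then number of columns
table(toy$group)     # count of rows in each category
unique(toy$group)    # each distinct category, listed once
favstats(toy$score)  # min, quartiles, max, mean, sd, n, and missing count
```

The same calls should work on any column of big5 once you substitute the real dataset and column names.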

NOTE: Unless otherwise specified, you may use base R functions OR ggplot2.
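
To illustrate the two plotting options, here is a minimal sketch that draws the same histogram and boxplot in base R and in ggplot2. The toy data frame, its columns, and the titles and labels are all made up for illustration; choose labels that describe the real variables when you work with big5.

```r
library(ggplot2)

# Hypothetical toy data, made up for illustration only
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  score = c(12, 25, 33, 41, 58, 47, 52, 66, 79, 90)
)

# Base R: histogram of one quantitative variable, with a title and axis labels
hist(toy$score,
     main = "Distribution of Toy Scores",
     xlab = "Score",
     ylab = "Number of Observations")

# ggplot2: a comparable histogram; labs() sets the title and axis labels
p_hist <- ggplot(toy, aes(x = score)) +
  geom_histogram(bins = 5) +
  labs(title = "Distribution of Toy Scores",
       x = "Score",
       y = "Number of Observations")
p_hist

# Base R: side-by-side boxplots using the formula quantitative ~ categorical
boxplot(score ~ group, data = toy)

# ggplot2: categorical variable on the x axis, quantitative variable on the y axis
p_box <- ggplot(toy, aes(x = group, y = score)) +
  geom_boxplot()
p_box
```

The base R and ggplot2 histograms may choose slightly different bin boundaries, so the bars will not necessarily match exactly.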

QUESTION: How many students have responded to this survey (i.e., how many rows are in the dataset)?
ANSWER:

QUESTION: What is the maximum number of languages spoken among respondents?
ANSWER:

QUESTION: Create a histogram of Extroversion. Be sure to label your graph with a title and improved x and y axis labels.

# Put your histogram code here:  

QUESTION: What is the highest score for Extroversion?
ANSWER:

QUESTION: What problems, if any, do you notice?
ANSWER:

QUESTION: Make a table of birth months:

# Answer

QUESTION: What problems, if any, do you notice?
ANSWER:

3 & 4 - Prepare the Data for Analysis and Create Visualizations

In this section, you will be asked to create a clean dataset for each visualization.

Recall a few useful functions for data wrangling:

  1. filter() to include or exclude specific rows
  2. select() to include specific columns
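
As a reminder of how these two functions combine, the sketch below cleans a small made-up dataset. Everything in it is hypothetical: the toy tibble, its columns, the assumed valid score range of 0 to 100, and the valid_days vector exist only for this illustration.

```r
library(dplyr)

# Hypothetical toy data with some deliberately invalid entries
toy <- tibble(
  id    = 1:6,
  day   = c("Mon", "Tue", "Funday", "Wed", "Tue", "Mon"),
  score = c(55, 120, 40, -5, 88, 73),  # assume valid scores run from 0 to 100
  notes = c("a", "b", "c", "d", "e", "f")
)

# The values we are willing to accept for the categorical column
valid_days <- c("Mon", "Tue", "Wed", "Thu", "Fri")

toy_clean <- toy %>%
  # keep rows where ALL conditions hold (comma-separated conditions mean AND)
  filter(score >= 0, score <= 100, day %in% valid_days) %>%
  # keep only the columns needed for the plot
  select(day, score)

toy_clean
```

Filtering before selecting is a design choice here: it lets you filter on a column even if you later drop it from the final dataset.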

Extroversion

QUESTION: Create a new dataset called extro that includes only the columns for birth month and extroversion scores. Make sure it contains only valid values.

HINT: Extroversion is measured in percentiles, and you should already know what the months of the year are.


extro <- 

QUESTION: USE GGPLOT to create a side-by-side boxplot of Extroversion scores for all birth months.

ggplot()

QUESTION: Based on the boxplot, which month appears to be the least extroverted? Explain your reasoning.
ANSWER:

QUESTION: Based on the boxplot, which month appears to be the most extroverted? Explain your reasoning.
ANSWER:

Neuroticism

QUESTION: Create a dataset called neuro that includes only the Section and Neuroticism columns:


neuro <- 
  

QUESTION: USE GGPLOT to create a side-by-side boxplot comparing Neuroticism for all the different sections:

QUESTION: Based on the boxplot, which section appears to be the lowest in trait neuroticism? Explain your reasoning.
ANSWER: