library(tidyverse)
library(car)
library(mosaic)
library(rio)
big5 <- import("https://github.com/byuistats/Math221D_Course/raw/main/Data/All_class_combined_personality_data.csv") %>% tibble()
Data Wrangling - Application Activity
Personality Revisited
Introduction
The Big 5 personality test is the most widely accepted tool for modelling personality in academic psychology. It is based on decades of statistical analysis of personality descriptions across languages and cultures. The Big 5 traits are:
- Openness
- Conscientiousness
- Extroversion
- Agreeableness
- Neuroticism
Brother Cannon collected personality data on students for the past several semesters, including a few metrics that may be associated with personality traits.
NOTE: Scores for personality traits are given in percentiles relative to the general population.
In this activity, you will practice the process for approaching a dataset outlined in class:
- Load the data and libraries
- Explore the data and generate hypotheses
- Prepare the data for analysis
- Perform the appropriate analysis
Data preparation will include using the filter() function. For now, analysis means creating good visualizations that tell a story using ggplot() and base R.
1 - Load the Data and Libraries
Run the following code to load the necessary R libraries and import the Big 5 personality data:
2 - Explore the data and generate hypotheses
In this section, you will explore the dataset including drilling down into individual columns.
Recall a few useful functions for data exploration:
- dim() to get the number of rows and columns in a dataset
- table() for getting counts of categorical data
- unique() for getting a list of each of the distinct values of categorical data
- favstats() to get summary statistics for quantitative data
- histogram() to visualize the distribution of a single quantitative variable
- boxplot() and/or ggplot() to compare the distributions of a quantitative variable for different groups
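As a reminder of how these functions behave, here is a minimal sketch on a small made-up dataset. The data frame toy and its columns group and score are hypothetical and are not part of the class survey; swap in big5 and its real column names when you work the questions.

```r
library(mosaic)  # provides favstats() and attaches lattice, which supplies histogram()

# Hypothetical toy data for illustration only; this is NOT the class survey.
toy <- data.frame(
  group = c("A", "A", "B", "B", "B", "C"),
  score = c(10, 35, 50, 65, 80, 120)  # 120 is deliberately outside the 0-100 percentile range
)

dim(toy)             # number of rows, then number of columns
table(toy$group)     # how many rows fall in each group
unique(toy$group)    # the distinct group labels
favstats(toy$score)  # min, quartiles, max, mean, sd, n, and missing count

# Distribution of one quantitative variable, with a title and axis labels
histogram(~ score, data = toy,
          main = "Toy scores", xlab = "Score (percentile)", ylab = "Percent of rows")

# Base R side-by-side boxplots: one box per group
boxplot(score ~ group, data = toy,
        main = "Toy scores by group", xlab = "Group", ylab = "Score (percentile)")
```

Summaries like favstats() and table() are often the quickest way to spot values that cannot be right, such as a percentile above 100.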
NOTE: Unless specified otherwise, you may use base R functions OR ggplot.
QUESTION: How many students have responded to this survey? (i.e. how many rows)
ANSWER:
QUESTION: What is the maximum number of languages spoken among respondents?
ANSWER:
QUESTION: Create a histogram of Extroversion. Be sure to label your graph with a title and improved x and y axis labels.
# Put your histogram code here:
QUESTION: What is the highest score for Extroversion?
ANSWER:
QUESTION: What problems, if any, do you notice?
ANSWER:
QUESTION: Make a table of birth months:
# Answer
QUESTION: What problems, if any, do you notice?
ANSWER:
3 & 4 - Prepare data for analysis and Create Visualization
In this section, you will be asked to create a clean dataset for each visualization.
Recall a few useful functions for data wrangling:
- filter() to include or exclude specific rows
- select() to include specific columns
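The two functions are usually chained together with the pipe. The sketch below uses a made-up tibble; the names toy, month, score, and extra are hypothetical, and it assumes months are stored as full English names, which may not match how the survey data encodes them.

```r
library(dplyr)

# Hypothetical toy data for illustration only; this is NOT the class survey.
toy <- tibble(
  month = c("January", "Feburary", "March", "13", "March"),  # includes a misspelling and a non-month
  score = c(40, 55, 150, 70, 90),                             # 150 is not a valid percentile
  extra = c(1, 2, 3, 4, 5)                                    # a column we do not need
)

clean <- toy %>%
  # Keep rows whose month is one of R's built-in month names
  # and whose score lies in the valid percentile range
  filter(month %in% month.name, score >= 0, score <= 100) %>%
  # Keep only the two columns needed for the plot
  select(month, score)

clean
```

Conditions separated by commas inside filter() must all be true for a row to be kept, so each comma acts like "and".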
Extroversion
QUESTION: Create a new dataset called extro that includes only columns for birth month and extroversion scores. Make sure it contains only valid (real) values.
HINT: Extroversion is measured in percentiles, and you should already know what the months of the year are.
# Put your data wrangling code here:
extro <-
QUESTION: USE GGPLOT to create a side-by-side boxplot of Extroversion scores for all birth months.
ggplot()
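For reference, a side-by-side boxplot in ggplot needs a categorical variable on one axis and a quantitative variable on the other. This is a minimal sketch on made-up data; toy, group, and score are hypothetical names, not columns from the survey.

```r
library(ggplot2)

# Hypothetical toy data for illustration only; this is NOT the class survey.
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  score = c(20, 30, 40, 50,
            45, 55, 65, 75,
            10, 60, 70, 90)
)

# One box per group: categorical variable on x, quantitative variable on y
p <- ggplot(toy, aes(x = group, y = score)) +
  geom_boxplot() +
  labs(title = "Toy scores by group", x = "Group", y = "Score (percentile)")

p  # printing the object draws the plot
```

When comparing boxes, the median line and the spread of each box are usually more informative than a single extreme point.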
QUESTION: Based on the boxplot, which month appears to be the least extroverted? Explain your reasoning.
ANSWER:
QUESTION: Based on the boxplot, which month appears to be the most extroverted? Explain your reasoning.
ANSWER:
Neuroticism
QUESTION: Create a dataset called neuro that includes only the Section and Neuroticism columns:
# Put your data wrangling code here:
neuro <-
QUESTION: USE GGPLOT to create a side-by-side boxplot comparing Neuroticism for all the different sections:
QUESTION: Based on the boxplot, which section appears to be the lowest in trait neuroticism? Explain your reasoning.
ANSWER: