Response and Explanatory Variables

Introduction

In this section, we will:

  1. Review the definitions of Response/Dependent variables and Explanatory/Independent variables
  2. Learn some R tools to help us look at the data (View(), glimpse(), names())
  3. Look at several datasets and identify the response variables and explanatory variables
  4. Identify data types (categorical or quantitative)
  5. Identify each level of a categorical variable

Load the Libraries

Don’t forget, before we can use the tools in the different toolboxes, we need to retrieve them from the shed. In R, that means, loading the libraries:

library(tidyverse)
library(mosaic)
library(car)
library(rio)

Define Your Terms

Data

Datasets contain rows and columns. Rows in a dataset represent individual observations or records. Each row contains all the relevant data for a single case, instance, or subject being studied. For example, in a dataset about students, each row might represent a different student, with all their associated data (such as name, age, and grade) contained within that row.

Columns in a dataset represent the different variables or attributes being measured or recorded. Each column contains one specific type of data across all the observations. For example, in the same student dataset, there could be a column for “Name,” a column for “Age,” and another for “Grade.”

NOTE: Column names in a dataset are considered variables.

Together, rows and columns provide a structured format that makes it easy to organize, analyze, and visualize the data.

Data Types (Review)

In this class, we will use the broad categories of “quantitative” and “categorical” variables to distinguish between numerical values and data that represent group information respectively. As we will see below, it’s not always as easy as it seems to tell the difference.

Quantitative Variables

Variables that are quantitative consists of values that represent measurable quantities.

Categroical Variables

Categorical data consists of values that represent categories or groups. We refer to each of the categories as “levels” of the variable.

To see all of the levels of a categorical variable, we can use the unique() function. The code chunk below has a list of favorite pets. The unique() function takes as an input a categorical variable and returns a list of the distinct values.

favorite_pet <- c("cat", "cat", "dog", "ferret", "cat", "cat", "dog", "lizard", "dog")

unique(favorite_pet)
[1] "cat"    "dog"    "ferret" "lizard"

This lists all the “levels” of a categorical variable.

Response/Dependent Variable

The response variable, also known as the dependent variable, is the outcome or the variable that researchers are interested in explaining or predicting. Its value depends on the influence of other variables. For example, in a study examining the effect of study time on exam scores, the exam score is the response variable because it changes in response to different amounts of study time.

Explanatory/Independent Variable

The explanatory variable, also known as the independent variable, is the variable that is manipulated or controlled in a statistical experiment to observe its effect on the response variable. It is considered the cause or predictor in the relationship being studied. In the example above, the amount of study time is the explanatory variable, as it is the factor that explains exam scores.

Madison Countr Real Estate

This dataset contains information about properties in Madison County.

QUESTION: Before looking at the data, what do you think is the purpose of collecting these data?
ANSWER:

QUESTION: What do you think the Response variable is?
ASNWER:

real_estate <- import('https://github.com/byuistats/Math221D_Course/raw/refs/heads/main/Data/MadisonCountyRealEstate.xlsx') %>% tibble()

Functions in R perform a task or calculation on a given input. For example, in the code chunk above, we used the import() function which took as an input a web address where the dataset is stored.

But all that did was create a dataset called real_estate. If we want to look at the dataset, we can but the dataset name into the View() function.

View(real_estate)

The View() function takes a dataset name as an input and opens up a window that looks like an Excel spreadsheet. This allows us to see what the data contains.

Useful Functions for Seeing What’s In a Dataset

  1. View() takes a dataset name as an input and opens a window and lets you scroll around and look at the complete dataset
  2. names() takes a dataset name as an input and prints out the column names
  3. glimpse() takes a dataset name as an input and prints out a list of the column names and tells you a little aobut each column and shows the first few values of each column

Try each of the functions in the code chunk below:

names()  
Error in names(): 0 arguments passed to 'names' which requires 1
glimpse()  
Error in glimpse.default(): argument "x" is missing, with no default

QUESTION: After looking at the data, what is the response variable? (use the column name)
ANSWER:

QUESTION: Is it quantitative or categorical?
ANsWER:

QUESTION: What variables do you think are the 5 most impactful explanatory variables? List the column names and say whether or not each are categorical or quantitative.

For any categorical variables, print out the levels:

unique(real_estate$)
Error in parse(text = input): <text>:2:20: unexpected ')'
1: 
2: unique(real_estate$)
                      ^

Cadillac

This dataset contains information about different Cadillac models, their mileage and price.

Read in the data and look it over.

cadillac_data <- import('https://github.com/byuistats/Math221D_Course/raw/refs/heads/main/Data/Cadillac.txt')

QUESTION: What is the response variable?
ANSWER:

QUESTION: Is it quantitative or categorical?
ANSWER:

QUESTION: List some of the explanatory variables and say whether they are categorical or quantitative:

For any categorical variables, print out the levels:

unique(cadillac_data$)
Error in parse(text = input): <text>:2:22: unexpected ')'
1: 
2: unique(cadillac_data$)
                        ^

Chocolate

This dataset contains information about different chocolate brands.

Read in the data and look it over.

chocolate <- import('https://github.com/byuistats/Math221D_Course/raw/refs/heads/main/Data/Chocolate.csv')

QUESTION: What is the response variable?
ANSWER:

QUESTION: Is it quantitative or categorical?
ANSWER:

QUESTION: List some of the explanatory variables and say whether they are categorical or quantitative:

High School Survey

This dataset contains survey responses for high school students. This one is very large, with lots of columns.

Read in the data and look it over.

hs_survey <- import('https://github.com/byuistats/Math221D_Course/raw/refs/heads/main/Data/HighSchoolSeniors.csv')

QUESTION: What are some possible response variables? List them and label them as categorical or quantitative:

QUESTION: For one of the above response variables, list some of the explanatory variables and say whether they are categorical or quantitative:

Good Data

Good Data starts with a well-defined study. You should know ahead of time what your response and explanatory variables are.