Introducing Bivariate Data

Introduction

Bivariate data arise when you have one quantitative response variable and one quantitative explanatory variable measured on the same subjects. Because both variables are quantitative, analyzing and visualizing them requires a different approach.

By the end of this lesson, you should be able to:

Create scatterplots for two quantitative variables in base R and ggplot2
Describe what the correlation coefficient, $r$ , quantifies
Calculate $r$ using the cor() function

Two datasets will be used to illustrate these concepts. The first contains self-reported confidence in mathematics and test scores. The second contains eruption duration and time between eruptions of Old Faithful geyser in Yellowstone National Park.

# Load the libraries and data
library(tidyverse)
library(mosaic)
library(rio)

geyser <- import('https://byuistats.github.io/BYUI_M221_Book/Data/OldFaithful.xlsx')
names(geyser)

[1] "Duration" "Wait"     "Source"

math <- import('https://byuistats.github.io/BYUI_M221_Book/Data/MathSelfEfficacy.xlsx')
names(math)

[1] "Gender"               "Score"                "ConfidenceRatingMean"
[4] "Comments"

Correlation Coefficient

You may recall from algebra the equation

$y = mx + b$

which describes a line with slope $m$ and $y$-intercept $b$. The slope can be positive or negative depending on the relationship between $x$ and $y$. A positive slope means that as $x$ increases, $y$ also increases. A negative slope means the opposite: as $x$ increases, $y$ decreases. Two quantitative variables in a dataset can show the same kind of pattern.

We say that two variables are “correlated” if, as one variable changes, the other tends to change as well. In statistics, however, “correlation” has a more precise meaning than the everyday word “correlated”.

KEY DEFINITION: Correlation is a mathematical quantity describing the strength and direction of a linear relationship between 2 quantitative variables.

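The correlation coefficient is denoted $r$. It is a number between -1 and 1. A value of -1 indicates a perfect negative correlation (i.e., the points fall exactly on a straight line with negative slope), and a value of 1 indicates a perfect positive correlation. Values near 0 indicate little or no linear relationship.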
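
To make the endpoints of this scale concrete, here is a minimal sketch using made-up numbers (not the lesson's datasets). It assumes one standard way of computing $r$ that the lesson itself does not spell out: standardize each variable, multiply the paired z-scores, and divide the sum by $n - 1$. The helper name r_by_hand is hypothetical and exists only for this illustration.

```r
# Hypothetical toy data, invented for illustration only
x      <- c(1, 2, 3, 4, 5)
y_up   <- 2 * x + 1     # points fall exactly on a line with positive slope
y_down <- -3 * x + 10   # points fall exactly on a line with negative slope

# r as the sum of products of z-scores, divided by n - 1
# (assumed formula; sd() uses the n - 1 denominator as well)
r_by_hand <- function(x, y) {
  z_x <- (x - mean(x)) / sd(x)
  z_y <- (y - mean(y)) / sd(y)
  sum(z_x * z_y) / (length(x) - 1)
}

r_up   <- r_by_hand(x, y_up)     # perfect positive line
r_down <- r_by_hand(x, y_down)   # perfect negative line

# Check the hand calculation against R's built-in cor()
all.equal(r_up, cor(x, y_up))
all.equal(r_down, cor(x, y_down))
```

Up to floating-point rounding, r_up should equal 1 and r_down should equal -1. The steepness of the line does not matter, only whether the points fall exactly on it and which way it tilts.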

The images below show scatterplots for several different correlation coefficients:

Notice the panel in the bottom right corner. There is clearly a perfect relationship between $x$ and $y$ , yet $r = 0$ . This is because $r$ measures the strength and direction of a linear relationship only. When the relationship is not linear, $r$ can be zero even though $x$ and $y$ are strongly related.
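
A small sketch can reproduce this situation. The data here are invented for illustration and are not taken from the figure: $y$ is determined exactly by $x$ through a parabola that is symmetric about zero, so the upward and downward halves cancel out in the correlation.

```r
# Hypothetical data: y is completely determined by x, but not linearly
x <- -3:3
y <- x^2

# Correlation is essentially zero despite the perfect curved relationship
r_curve <- cor(x, y)
r_curve

# A scatterplot reveals the pattern that r misses
plot(y ~ x)
```

This is why the scatterplot should be examined before interpreting $r$: a value near zero rules out a linear trend, not a relationship of any kind.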

We check for linearity by inspecting the scatterplot.

Question: What does the correlation coefficient, $r$ , quantify? Answer:

Question: What does it mean if the correlation coefficient, $r$ , is negative? Answer:

Math Confidence

Scatter plot

Make a scatter plot showing the relationship between students’ self-reported confidence rating and test score.

Question: Which variable is the Explanatory variable, $x$ ?
Answer:

Question: Which is the Response variable, $y$ ?
Answer:

# Base R
plot(Score ~ ConfidenceRatingMean, data = math)

# ggplot
ggplot(math, aes(x = ConfidenceRatingMean, y = Score )) +
  geom_point(color = "darkblue") +
  theme_bw() +
  labs(
    title = "Relationship between Student Confidence Rating in Math and Test Score"
  )

Question: Before calculating the correlation coefficient, $r$ , describe in words the direction and strength of the relationship.
Answer:

Question: What’s your best guess at $r$ , based on the scatterplot?
Answer:

Question: Does the relationship look linear?
Answer:

Calculate the Correlation Coefficient, $r$ :

cor(Score ~ ConfidenceRatingMean, data = math)

[1] 0.7278648

Question: How far off was your guess?
Answer:
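
The formula-style call cor(Score ~ ConfidenceRatingMean, data = math) relies on the mosaic package being loaded. As a hedged sketch of the base R alternative, the same calculation can be written by passing the two columns directly. The data frame toy below is hypothetical: it borrows the column names from the math dataset, but its values are invented and are not the real data.

```r
# Hypothetical stand-in for the math data (values invented for illustration)
toy <- data.frame(
  ConfidenceRatingMean = c(2.0, 3.5, 4.0, 5.5, 6.0),
  Score                = c(55, 60, 72, 80, 78)
)

# Base R: pass the two columns as x and y
r_xy <- cor(toy$ConfidenceRatingMean, toy$Score)

# Swapping the roles of the variables gives the same value
r_yx <- cor(toy$Score, toy$ConfidenceRatingMean)

r_xy
r_yx
```

Because $r$ is symmetric, it does not depend on which variable is treated as explanatory and which as response. If the columns contain missing values, base R's cor() returns NA by default; adding use = "complete.obs" restricts the calculation to complete pairs.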

Old Faithful