Lesson 21: Describing Bivariate Data

Lesson Outcomes

By the end of this lesson, you should be able to:

Create a scatterplot of bivariate data
Interpret the overall pattern in a scatterplot to assess linearity and direction
Calculate the correlation coefficient, \(r\)
Identify the properties of the correlation coefficient
Interpret the correlation coefficient as a measure of the strength and direction of the linear relationship between two variables

How Confident are You?

Think about a time when you walked into an exam, having prepared carefully, knowing that you would do well. On the other hand, have you ever entered an exam feeling unprepared? How have your exam scores compared to your confidence?

Shane Goodwin and other researchers examined this question. They studied factors that affect a student’s confidence on a multiple-choice mathematics exam. A group of n = 139 students in an Intermediate Algebra course (MATH 101) at BYU-Idaho participated in the study. In addition to marking their test question responses, they evaluated their confidence for each answer on a scale of 1 to 6. The confidence rating scale is summarized in the following table:

Confidence Rating

1 = Random guess (no clue)
2 = Very unsure
3 = Somewhat unsure
4 = Somewhat sure
5 = Very sure
6 = Certain (absolutely sure)

Confidence ratings were not relayed to the instructor, and they did not affect the grade on the exam.

For each student, the mean confidence rating was computed. Each student's mean confidence rating and score on the exam (out of 100 points) are given in the file MathSelfEfficacy.xlsx.

Previously, we have been dealing with one response variable at a time. Now, we have two quantitative measurements on each unit (participant). We call these data bivariate data, since there are two (bi-) variables that we are considering simultaneously.

In the past, we have summarized quantitative data by computing summary statistics. Here are a couple of statistics computed from these data:

The mean score on the test was 74.7 points.
The mean confidence rating was 4.4.

These statistics do not provide information about the connection between the students’ scores on the exam and their confidence. If a student feels very confident, what do the data tell us about their test score? We need a new tool to help us relate the values of two quantitative observations.
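To see why separate means cannot describe a relationship, consider the following minimal sketch. The numbers are invented for illustration and are not the Goodwin data. The two sets of scores have the same mean, yet one rises with confidence and the other falls.

```python
from statistics import mean

# Hypothetical values, invented for illustration (NOT the Goodwin data).
confidence = [3.0, 4.0, 5.0, 6.0]
scores_a = [60, 70, 80, 90]  # scores rise as confidence rises
scores_b = [90, 80, 70, 60]  # same scores, paired in the opposite order

# The summary statistics are identical for both pairings...
print("mean confidence:", mean(confidence))
print("mean score (a):", mean(scores_a))
print("mean score (b):", mean(scores_b))

# ...even though the pairs themselves tell opposite stories.
for c, a, b in zip(confidence, scores_a, scores_b):
    print(c, a, b)
```

Both pairings report a mean score of 75, so the means alone cannot distinguish them. Only by looking at the paired values together does the difference appear.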

Describing Bivariate Data

Describing Form: Scatterplots

The following scatterplot illustrates the data from the Goodwin study.

Each point in the plot represents both the confidence and score of one student. The points are plotted on the X-Y coordinate plane. The position on the horizontal (X) axis represents the student’s confidence rating. The height of the point, or the value on the Y-axis, represents the student’s score on the exam. (We will explore how to create a scatterplot in the next example. For now, focus on understanding the interpretation of the graph.)

The cloud of data illustrated in this scatterplot helps us visualize the relationship between a student’s confidence rating and their score on the exam. Notice that the points tend to be higher as you move to the right. Students who have a high confidence rating (points further to the right) tend to have higher exam scores (higher vertical position). Similarly, students with lower confidence typically have lower exam scores. As a student’s confidence increases, their exam score tends to increase. We call this a positive association or a positive correlation.

Notice that there is variability in the responses. Consider the students who have a mean confidence rating of 5.0. The points above this number represent those students who reported a mean confidence value of about 5. There is variability in the exam scores of these students. They range from about 75 to approximately 100.

In the scatterplot, we see a cloud of data. Even though there is a considerable amount of variability in the data, the points tend to follow a line. If you squint, you might imagine that the data look like a fat hot dog.

When the points in a scatterplot follow a straight-line pattern, we say that there is a linear relationship in the data. The points do not have to be aligned tightly to represent a linear relationship. They can be in the shape of a long skinny cucumber or a short, fat cucumber. Both broad and narrow clouds of data can be considered linear.

How to make a scatterplot

Estuarine Crocodile Data

Data for estuarine, or saltwater, crocodiles are given in the file EstuarineCrocodiles(Modified).xlsx. We will illustrate the relationship between the head length of the crocodiles and their body lengths by creating a scatterplot.

Excel Instructions

To make a scatterplot in Excel:

Open the Math 221 Statistics Toolbox file and select the “Linear Regression” tab.
Put the variable that you want on the x-axis (head length) in column A, labeled as the “X” column.
Put the variable that you want on the y-axis (body length) in column B, labeled as the “Y” column. This is the value you want to predict.
The points in the scatterplot will update with your data.
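For readers who work outside Excel, the same steps can be sketched in Python. This is an illustration only, not part of the course toolbox. It assumes the third-party matplotlib library is installed; the function name make_scatterplot, the output file name, and the data values are hypothetical and are not taken from the crocodile file.

```python
def make_scatterplot(x, y, x_label, y_label, out_file="scatterplot.png"):
    """Draw a scatterplot of paired data and save it as an image file."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")

    # Assumption: matplotlib is installed. It is imported here so the
    # rest of the sketch can be read and run without it.
    import matplotlib
    matplotlib.use("Agg")  # draw to a file; no display window is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(x, y)        # one point per (x, y) pair
    ax.set_xlabel(x_label)  # explanatory variable on the horizontal axis
    ax.set_ylabel(y_label)  # variable to predict on the vertical axis
    fig.savefig(out_file)
    plt.close(fig)
    return out_file


# Made-up measurements for illustration (NOT the crocodile data).
head_length_cm = [24, 31, 38, 45, 52, 60]
body_length_cm = [160, 210, 255, 300, 350, 410]

# Example call (left commented out so that nothing is drawn on import):
# make_scatterplot(head_length_cm, body_length_cm,
#                  "Head length (cm)", "Body length (cm)")
```

As in the Excel steps, the variable for the x-axis is passed first and the variable to be predicted is passed second.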

Here is the scatterplot for the estuarine crocodile data. The head length in centimeters is on the horizontal (x) axis and the body length in centimeters is on the vertical (y) axis.

Notice that there is a strong positive linear relationship between the head lengths and the body lengths of the crocodiles. That fact will be important in the next lesson.

We want to be able to describe the relationship between the variables. The first thing we look for is the shape or form observed in the scatterplot. Is it linear or nonlinear?

In many cases, the points on a scatterplot do not follow a straight line. If the data form a curved shape, e.g. a banana shape, we say that there is a nonlinear relationship in the data. The methods presented in this course do not directly apply to nonlinear data. As a professional, you may encounter nonlinear data. Math 325 Intermediate Statistical Methods includes ways to handle nonlinear relationships. If you cannot take additional statistics courses, you should consult a statistician if you want to analyze the relationships observed in nonlinear bivariate data.

In addition to the shape or the form of the data observed in the scatterplot, we need to be able to describe the direction and strength of a linear relationship in data. We use the correlation coefficient to quantify the direction and strength of the relationship. These ideas are discussed below.

Describing Direction: Scatterplots and the Correlation Coefficient

We say that the direction of data in a scatterplot is positive, or that there is a positive association between two variables, when an increase in one variable tends to be accompanied by an increase in the other variable. We observed a positive association in Goodwin’s confidence data.

The correlation coefficient is a number that is used to measure the direction and strength of the linear association between two variables. The direction is either positive, negative, or neither. The strength can be described as weak, moderate, or strong.

Correlation coefficients are always between \(-1\) and \(1\).

We will use software to compute the correlation coefficient. For the Goodwin data, the correlation coefficient is: \[r = 0.728\] We use the symbol \(r\) to represent the correlation coefficient. In this reading, we will explore the correlation coefficient, including its properties and interpretation.
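This lesson relies on software to find \(r\), but it can help to see what the software is doing. One standard way to write the sample correlation coefficient is \[r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)\] where \(\bar{x}\) and \(\bar{y}\) are the sample means and \(s_x\) and \(s_y\) are the sample standard deviations. The sketch below applies this formula in Python. The function name and the data are made up for illustration; these are not the Goodwin data.

```python
from statistics import mean, stdev


def correlation_coefficient(x, y):
    """Sample correlation: average product of standardized values."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1)
    total = sum(
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)
    )
    return total / (n - 1)


# Invented data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation_coefficient(x, y)
print(round(r, 3))  # prints 0.775
```

Each term in the sum is positive when a point is above (or below) average in both variables at once, which is why data that rise together produce a positive \(r\).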

When a positive association exists in the data, the correlation coefficient will be positive. As an example, a positive association was observed in Goodwin’s data and \(r = 0.728 > 0\).

There are many examples of positive associations. It has been demonstrated that a student’s level of motivation is positively associated with academic success. Students who are highly motivated tend to do better academically. As another example, there is a positive association between the height of a person and their weight. Taller people typically weigh more than shorter people.

When an increase in one variable is associated with a decrease in the other variable, we say that there is a negative association between the two variables. Several studies have demonstrated that there is a negative association between the amount of time spent playing video games and academic performance. Students who spend a lot of time on video games tend to do worse in school than their peers who do not spend much time gaming.
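The sign of the correlation coefficient comes from the sum of the products of deviations, \(\sum (x_i-\bar{x})(y_i-\bar{y})\): the sum is positive when the variables tend to rise together and negative when one tends to fall as the other rises. The following sketch checks this on two small data sets. All values and variable names are invented for illustration and do not come from any study.

```python
from statistics import mean


def sum_of_deviation_products(x, y):
    """Sum of (x - mean of x) * (y - mean of y) over all pairs."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))


# Invented values for illustration only.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 66, 75, 84]        # tends to rise with height

gaming_hours = [0, 5, 10, 15, 20]
course_grade = [88, 85, 80, 74, 68]     # tends to fall as gaming rises

positive = sum_of_deviation_products(height_cm, weight_kg)
negative = sum_of_deviation_products(gaming_hours, course_grade)

print(positive > 0)  # True: positive association
print(negative < 0)  # True: negative association
```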

Describing Strength: The Correlation Coefficient

We also describe the relationship between two variables as weak, moderate, or strong, depending on how closely the points follow a line. The strength of the linear relationship is also described by the correlation coefficient.

The correlation coefficient is always between \(-1\) and \(1\). If there is a strong positive association, the correlation coefficient will be close to \(1\). If the correlation coefficient is positive but relatively close to 0, we say there is a weak positive association in the data.

Similarly, if the correlation coefficient is close to \(-1\), we say there is a strong negative association. A weak negative association results in a correlation coefficient that is negative but close to 0.
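To make the idea of strength concrete, the sketch below computes \(r\) for three invented data sets using the equivalent form \(r = S_{xy}/\sqrt{S_{xx}S_{yy}}\), where each \(S\) is a sum of products of deviations. The data and names are hypothetical. No cut-off values are implied; the point is only that tighter clouds give values nearer \(1\) or \(-1\).

```python
from math import sqrt
from statistics import mean


def correlation(x, y):
    """Correlation coefficient from sums of deviation products."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


# Invented data for illustration only.
x = [1, 2, 3, 4, 5, 6]
y_tight = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]  # points hug a rising line
y_loose = [5, 1, 9, 3, 12, 6]              # rising trend, much scatter
y_falling = y_tight[::-1]                  # points hug a falling line

r_tight = correlation(x, y_tight)
r_loose = correlation(x, y_loose)
r_falling = correlation(x, y_falling)

print(round(r_tight, 2), round(r_loose, 2), round(r_falling, 2))
```

The tightly clustered rising data give a value very near \(1\), the scattered rising data give a positive value much closer to \(0\), and the tightly clustered falling data give a value very near \(-1\).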

We will not establish cut-off values to determine when a correlation goes from being weak to moderate or from moderate to strong. The appropriate cut-offs depend on the application and are subjective.

Several scatterplots have been created, and the correlation coefficient summarizing the relationship between the two variables is presented. Study these graphs to see if you can infer some of the properties of the correlation coefficient.