Harmonic Seasonal Variables - Part 2

Chapter 5: Lesson 3

Learning Outcomes

Fit linear regression models to time series data
  • State the additive model with harmonic seasonal component
  • Simulate a time series with harmonic seasonal components
  • Identify an appropriate function to model the trend in a given time series
  • Identify a parsimonious set of harmonic terms for use in a regression model
  • Fit the additive model with harmonic seasonal component to real-world data
  • Evaluate residuals using a correlogram and partial correlogram to ensure they meet the assumptions
Apply model selection criteria
  • Use AIC to aid in model selection

Preparation

  • Review Section 5.6

Class Activity: Monthly Average High Temperature in Rexburg (20 min)

Visualization of the Time Series

Consider the mean monthly high temperature in Rexburg.

Show the code
weather_df <- rio::import("https://byuistats.github.io/timeseries/data/rexburg_weather_monthly.csv") |>
  mutate(dates = my(date_text)) |>
  filter(dates >= my("1/2008") & dates <= my("12/2023")) |>
  rename(x = avg_daily_high_temp) |>
  mutate(TIME = 1:n()) |>
  mutate(
    cos1 = cos(2 * pi * 1 * TIME/12),
    cos2 = cos(2 * pi * 2 * TIME/12),
    cos3 = cos(2 * pi * 3 * TIME/12),
    cos4 = cos(2 * pi * 4 * TIME/12),
    cos5 = cos(2 * pi * 5 * TIME/12),
    cos6 = cos(2 * pi * 6 * TIME/12),
    sin1 = sin(2 * pi * 1 * TIME/12),
    sin2 = sin(2 * pi * 2 * TIME/12),
    sin3 = sin(2 * pi * 3 * TIME/12),
    sin4 = sin(2 * pi * 4 * TIME/12),
    sin5 = sin(2 * pi * 5 * TIME/12),
    sin6 = sin(2 * pi * 6 * TIME/12)) |>
  as_tsibble(index = TIME)

weather_df |>
  as_tsibble(index = dates) |>
  autoplot(.vars = x) +
  geom_smooth(method = "lm", se = FALSE, color = "#F0E442") +
    labs(
      x = "Month",
      y = "Mean Daily High Temperature (Fahrenheit)",
      title = "Time Plot of Mean Daily Rexburg High Temperature by Month",
      subtitle = paste0("(", format(weather_df$dates %>% head(1), "%b %Y"), endash, format(weather_df$dates %>% tail(1), "%b %Y"), ")")
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5)
    )

Model Selection

Standardizing the Time Variable

Standardizing the Time Variable

To avoid serious floating point errors, we standardize the time variable. First, compute the sine and cosine terms using the original time variable, then transform the time variable by subtracting its mean and dividing by its standard deviation. (In other words, compute a \(z\)-score.)

The model is adjusted accordingly after fitting.

Warning

When the independent variable (the measure of time) is large, floating point errors in the computation of the regression coefficents can be substantial.

We will demonstrate standardizing the time variable below.

Computing the standardized time variable

Our time variable was a simple incremented value counting the months ranging from 1 (representing 1/2008) to 192 (representing 12/2023).

We standardize this variable by the transformation:

\[ zTIME = \frac{t - \bar t}{s_t} = \frac{t - 96.5}{55.57} \]

where the mean of the variable TIME is \(\bar t = 96.5\), and the standard deviation is \(s_t = 55.57\).

weather_df <- weather_df |>
  mutate(zTIME = (TIME - mean(TIME)) / sd(TIME))

Now, we fit the trend components of the models using zTIME instead of TIME. We will start by modeling a cubic trend term.

Cubic Trends

Cubic Trend: Full Model

Visually, we can identify a positive linear trend in the data. It is possible that there are higher-order properties of the trend. We will include a quadratic and cubic term in our search for a model.

In addition to modeling the trend, we need to include terms for the seasonal component. We start with a full model that includes all six of the the sine and cosine terms from the summation above.

Show the code
# Cubic model with standardized time variable

full_cubic_lm <- weather_df |>
  model(full_cubic = TSLM(x ~ zTIME + I(zTIME^2) + I(zTIME^3) +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 + cos4 + sin5 + cos5 + cos6 ))

full_cubic_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 15 × 7
   .model     term        estimate std.error statistic   p.value sig  
   <chr>      <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
 1 full_cubic (Intercept)  57.2        0.405   141.    7.11e-184 TRUE 
 2 full_cubic zTIME        -0.359      0.678    -0.529 5.98e-  1 FALSE
 3 full_cubic I(zTIME^2)   -0.733      0.303    -2.42  1.67e-  2 TRUE 
 4 full_cubic I(zTIME^3)    0.586      0.348     1.68  9.41e-  2 FALSE
 5 full_cubic sin1        -16.0        0.383   -41.8   1.23e- 93 TRUE 
 6 full_cubic cos1        -23.7        0.382   -62.1   4.30e-122 TRUE 
 7 full_cubic sin2          1.29       0.382     3.37  9.34e-  4 TRUE 
 8 full_cubic cos2         -2.87       0.382    -7.51  2.74e- 12 TRUE 
 9 full_cubic sin3         -0.819      0.382    -2.15  3.33e-  2 TRUE 
10 full_cubic cos3          0.608      0.382     1.59  1.13e-  1 FALSE
11 full_cubic sin4          0.377      0.382     0.987 3.25e-  1 FALSE
12 full_cubic cos4         -0.214      0.382    -0.561 5.76e-  1 FALSE
13 full_cubic sin5         -0.0810     0.382    -0.212 8.32e-  1 FALSE
14 full_cubic cos5         -0.0433     0.382    -0.113 9.10e-  1 FALSE
15 full_cubic cos6         -0.567      0.270    -2.10  3.70e-  2 TRUE 
Show the code
forecast_df <- full_cubic_lm |> forecast(weather_df, ) |> as_tibble() |> dplyr::select(zTIME, .mean) |> rename(pred = .mean)

weather_df |>
  left_join(forecast_df, by = "zTIME") |>
  as_tsibble(index = dates) |>
  autoplot(.vars = x) +
  geom_smooth(method = "lm", se = FALSE, color = "#F0E442") +
  geom_line(aes(y = pred), color = "#56B4E9", alpha = 0.75) +
    labs(
      x = "Month",
      y = "Mean Daily High Temperature (Fahrenheit)",
      title = "Time Plot of Mean Daily Rexburg High Temperature by Month",
      subtitle = paste0("(", format(weather_df$dates %>% head(1), "%b %Y"), endash, format(weather_df$dates %>% tail(1), "%b %Y"), ")")
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5)
    )
`geom_smooth()` using formula = 'y ~ x'

The cubic term in the trend is not significant, but the quadratic term is. We will fit a quadratic model.

Caution

If you choose a different range of dates, you may get a different result. The regression model is fitted to the data, not to the physical situation.

Quadratic Trends

Quadratic Trend: Full Model

We now fit a quadratic model that includes all of the seasonal terms.

Show the code
full_quadratic_lm <- weather_df |>
  model(full_quadratic = TSLM(x ~ zTIME + I(zTIME^2) +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 + cos4 + sin5 + cos5 + cos6 ))

full_quadratic_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 14 × 7
   .model         term        estimate std.error statistic   p.value sig  
   <chr>          <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
 1 full_quadratic (Intercept)  57.2        0.407  141.     2.70e-184 TRUE 
 2 full_quadratic zTIME         0.688      0.273    2.52   1.25e-  2 TRUE 
 3 full_quadratic I(zTIME^2)   -0.733      0.305   -2.40   1.73e-  2 TRUE 
 4 full_quadratic sin1        -16.1        0.384  -41.8    5.46e- 94 TRUE 
 5 full_quadratic cos1        -23.7        0.384  -61.8    3.68e-122 TRUE 
 6 full_quadratic sin2          1.26       0.384    3.29   1.19e-  3 TRUE 
 7 full_quadratic cos2         -2.86       0.384   -7.44   4.03e- 12 TRUE 
 8 full_quadratic sin3         -0.832      0.384   -2.17   3.15e-  2 TRUE 
 9 full_quadratic cos3          0.620      0.384    1.62   1.08e-  1 FALSE
10 full_quadratic sin4          0.370      0.384    0.963  3.37e-  1 FALSE
11 full_quadratic cos4         -0.202      0.384   -0.525  6.00e-  1 FALSE
12 full_quadratic sin5         -0.0844     0.384   -0.220  8.26e-  1 FALSE
13 full_quadratic cos5         -0.0307     0.384   -0.0799 9.36e-  1 FALSE
14 full_quadratic cos6         -0.561      0.271   -2.07   4.01e-  2 TRUE 

The coefficient on the quadratic trend term is very small and negative. This suggests the data may indicate a decrease in the rate of global warming.

Quadratic Trend: Reduced Model 1

Eliminating the \(i=4\) and \(i=5\) terms, we get the model:

Show the code
reduced1_quadratic_lm <- weather_df |>
  model(reduced_quadratic_1  = TSLM(x ~ zTIME + I(zTIME^2) + sin1 + cos1 + sin2 + cos2 + sin3 + cos3 + cos6))

reduced1_quadratic_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05)
# A tibble: 10 × 7
   .model              term        estimate std.error statistic   p.value sig  
   <chr>               <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
 1 reduced_quadratic_1 (Intercept)   57.2       0.404    142.   4.02e-188 TRUE 
 2 reduced_quadratic_1 zTIME          0.684     0.271      2.53 1.23e-  2 TRUE 
 3 reduced_quadratic_1 I(zTIME^2)    -0.733     0.303     -2.42 1.64e-  2 TRUE 
 4 reduced_quadratic_1 sin1         -16.1       0.381    -42.1  8.23e- 96 TRUE 
 5 reduced_quadratic_1 cos1         -23.7       0.381    -62.2  1.34e-124 TRUE 
 6 reduced_quadratic_1 sin2           1.26      0.381      3.32 1.09e-  3 TRUE 
 7 reduced_quadratic_1 cos2          -2.86      0.381     -7.50 2.71e- 12 TRUE 
 8 reduced_quadratic_1 sin3          -0.832     0.381     -2.18 3.02e-  2 TRUE 
 9 reduced_quadratic_1 cos3           0.621     0.381      1.63 1.05e-  1 FALSE
10 reduced_quadratic_1 cos6          -0.561     0.269     -2.08 3.86e-  2 TRUE 

Quadratic Trend: Reduced Model 2

Sequentially removing the cosine term for \(i = 3\), we get:

Show the code
reduced2_quadratic_lm <- weather_df |>
  model(reduced_quadratic_2  = TSLM(x ~ zTIME + I(zTIME^2) + sin1 + cos1 + sin2 + cos2 + sin3 + cos6))

reduced2_quadratic_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 9 × 7
  .model              term        estimate std.error statistic   p.value sig  
  <chr>               <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_quadratic_2 (Intercept)   57.2       0.406    141.   1.41e-188 TRUE 
2 reduced_quadratic_2 zTIME          0.689     0.272      2.54 1.20e-  2 TRUE 
3 reduced_quadratic_2 I(zTIME^2)    -0.733     0.304     -2.41 1.69e-  2 TRUE 
4 reduced_quadratic_2 sin1         -16.1       0.383    -41.9  8.35e- 96 TRUE 
5 reduced_quadratic_2 cos1         -23.7       0.383    -62.0  9.99e-125 TRUE 
6 reduced_quadratic_2 sin2           1.26      0.383      3.30 1.14e-  3 TRUE 
7 reduced_quadratic_2 cos2          -2.86      0.383     -7.47 3.23e- 12 TRUE 
8 reduced_quadratic_2 sin3          -0.832     0.383     -2.17 3.10e-  2 TRUE 
9 reduced_quadratic_2 cos6          -0.561     0.271     -2.07 3.95e-  2 TRUE 

Quadratic Trend: Reduced Model 3

This is the quadratic model obtained by eliminating the \(i=6\) term from the previous model.

Show the code
reduced3_quadratic_lm <- weather_df |>
  model(reduced_quadratic_3  = TSLM(x ~ zTIME + I(zTIME^2) + sin1 + cos1 + sin2 + cos2 + sin3))

reduced3_quadratic_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05)
# A tibble: 8 × 7
  .model              term        estimate std.error statistic   p.value sig  
  <chr>               <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_quadratic_3 (Intercept)   57.2       0.409    140.   1.12e-188 TRUE 
2 reduced_quadratic_3 zTIME          0.684     0.274      2.50 1.34e-  2 TRUE 
3 reduced_quadratic_3 I(zTIME^2)    -0.733     0.307     -2.39 1.79e-  2 TRUE 
4 reduced_quadratic_3 sin1         -16.1       0.386    -41.6  1.76e- 95 TRUE 
5 reduced_quadratic_3 cos1         -23.7       0.386    -61.4  1.63e-124 TRUE 
6 reduced_quadratic_3 sin2           1.26      0.386      3.28 1.26e-  3 TRUE 
7 reduced_quadratic_3 cos2          -2.86      0.386     -7.40 4.67e- 12 TRUE 
8 reduced_quadratic_3 sin3          -0.832     0.386     -2.16 3.25e-  2 TRUE 

Quadratic Trend: Reduced Model 4

This is similar to the previous model, except both the sine and cosine terms are included for \(i=3\).

Show the code
reduced4_quadratic_lm <- weather_df |>
  model(reduced_quadratic_4  = TSLM(x ~ zTIME + I(zTIME^2) + sin1 + cos1 + sin2 + cos2 + sin3 + cos3))

reduced4_quadratic_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05)
# A tibble: 9 × 7
  .model              term        estimate std.error statistic   p.value sig  
  <chr>               <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_quadratic_4 (Intercept)   57.2       0.408    140.   3.22e-188 TRUE 
2 reduced_quadratic_4 zTIME          0.679     0.273      2.49 1.38e-  2 TRUE 
3 reduced_quadratic_4 I(zTIME^2)    -0.733     0.305     -2.40 1.74e-  2 TRUE 
4 reduced_quadratic_4 sin1         -16.1       0.385    -41.8  1.76e- 95 TRUE 
5 reduced_quadratic_4 cos1         -23.7       0.384    -61.7  2.21e-124 TRUE 
6 reduced_quadratic_4 sin2           1.26      0.384      3.29 1.21e-  3 TRUE 
7 reduced_quadratic_4 cos2          -2.86      0.384     -7.43 3.95e- 12 TRUE 
8 reduced_quadratic_4 sin3          -0.832     0.384     -2.16 3.17e-  2 TRUE 
9 reduced_quadratic_4 cos3           0.621     0.384      1.61 1.08e-  1 FALSE

Quadratic Trend: Reduced Model 5

This quadratic model only includes the \(i=1\) and \(i=2\) terms.

Show the code
reduced5_quadratic_lm <- weather_df |>
  model(reduced_quadratic_5  = TSLM(x ~ zTIME + I(zTIME^2) + sin1 + cos1 + sin2 + cos2))

reduced5_quadratic_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05)
# A tibble: 7 × 7
  .model              term        estimate std.error statistic   p.value sig  
  <chr>               <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_quadratic_5 (Intercept)   57.2       0.413    138.   1.06e-188 TRUE 
2 reduced_quadratic_5 zTIME          0.692     0.277      2.50 1.33e-  2 TRUE 
3 reduced_quadratic_5 I(zTIME^2)    -0.733     0.310     -2.37 1.90e-  2 TRUE 
4 reduced_quadratic_5 sin1         -16.1       0.390    -41.2  4.40e- 95 TRUE 
5 reduced_quadratic_5 cos1         -23.7       0.390    -60.8  3.16e-124 TRUE 
6 reduced_quadratic_5 sin2           1.26      0.390      3.24 1.40e-  3 TRUE 
7 reduced_quadratic_5 cos2          -2.86      0.390     -7.33 6.96e- 12 TRUE 

Quadratic Trend: Reduced Model 6

This model includes a quadratic effect for time, but only includes the sine and cosine terms corresponding to \(i=1\).

Show the code
reduced6_quadratic_lm <- weather_df |>
  model(reduced_quadratic_6  = TSLM(x ~ zTIME + I(zTIME^2) + sin1 + cos1))

reduced6_quadratic_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05)
# A tibble: 5 × 7
  .model              term        estimate std.error statistic   p.value sig  
  <chr>               <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_quadratic_6 (Intercept)   57.2       0.477    120.   9.32e-179 TRUE 
2 reduced_quadratic_6 zTIME          0.646     0.319      2.02 4.45e-  2 TRUE 
3 reduced_quadratic_6 I(zTIME^2)    -0.734     0.358     -2.05 4.14e-  2 TRUE 
4 reduced_quadratic_6 sin1         -16.1       0.451    -35.7  2.32e- 85 TRUE 
5 reduced_quadratic_6 cos1         -23.7       0.450    -52.7  4.14e-114 TRUE 
Linear Trends

Linear Trend: Full Model

This is the full model with a linear time component. All of the period functions are included from \(i=1\) to \(i=6\).

Show the code
full_linear_lm <- weather_df |>
  model(full_linear = TSLM(x ~ zTIME +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
    + sin4 + cos4 + sin5 + cos5 + cos6 ))

full_linear_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 13 × 7
   .model      term        estimate std.error statistic   p.value sig  
   <chr>       <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
 1 full_linear (Intercept)  56.5        0.275  205.     2.02e-214 TRUE 
 2 full_linear zTIME         0.688      0.276    2.49   1.37e-  2 TRUE 
 3 full_linear sin1        -16.1        0.389  -41.3    2.19e- 93 TRUE 
 4 full_linear cos1        -23.7        0.389  -61.0    1.16e-121 TRUE 
 5 full_linear sin2          1.26       0.389    3.25   1.38e-  3 TRUE 
 6 full_linear cos2         -2.86       0.389   -7.35   6.86e- 12 TRUE 
 7 full_linear sin3         -0.832      0.389   -2.14   3.37e-  2 TRUE 
 8 full_linear cos3          0.620      0.389    1.59   1.13e-  1 FALSE
 9 full_linear sin4          0.370      0.389    0.950  3.43e-  1 FALSE
10 full_linear cos4         -0.202      0.389   -0.518  6.05e-  1 FALSE
11 full_linear sin5         -0.0845     0.389   -0.217  8.28e-  1 FALSE
12 full_linear cos5         -0.0307     0.389   -0.0789 9.37e-  1 FALSE
13 full_linear cos6         -0.561      0.275   -2.04   4.28e-  2 TRUE 

Linear Trend: Reduced Model 1

This linear model excludes the terms corresponding to \(i=4\) and \(i=5\).

Show the code
reduced1_linear_lm <- weather_df |>
  model(reduced_linear_1  = TSLM(x ~ zTIME + sin1 + cos1 + sin2 + cos2 + sin3 + cos3 + cos6))

reduced1_linear_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05)
# A tibble: 9 × 7
  .model           term        estimate std.error statistic   p.value sig  
  <chr>            <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_linear_1 (Intercept)   56.5       0.273    207.   6.64e-219 TRUE 
2 reduced_linear_1 zTIME          0.684     0.274      2.49 1.35e-  2 TRUE 
3 reduced_linear_1 sin1         -16.1       0.386    -41.6  3.44e- 95 TRUE 
4 reduced_linear_1 cos1         -23.7       0.386    -61.4  4.39e-124 TRUE 
5 reduced_linear_1 sin2           1.26      0.386      3.27 1.26e-  3 TRUE 
6 reduced_linear_1 cos2          -2.86      0.386     -7.40 4.65e- 12 TRUE 
7 reduced_linear_1 sin3          -0.832     0.386     -2.16 3.24e-  2 TRUE 
8 reduced_linear_1 cos3           0.620     0.386      1.61 1.10e-  1 FALSE
9 reduced_linear_1 cos6          -0.561     0.273     -2.06 4.12e-  2 TRUE 

Linear Trend: Reduced Model 2

This represents the reduction of the previous model by eliminating the cosine term for \(i=3\).

Show the code
reduced2_linear_lm <- weather_df |>
  model(reduced_linear_2  = TSLM(x ~ zTIME + sin1 + cos1 + sin2 + cos2 + sin3 + cos6))

reduced2_linear_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05) 
# A tibble: 8 × 7
  .model           term        estimate std.error statistic   p.value sig  
  <chr>            <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_linear_2 (Intercept)   56.5       0.274    206.   1.56e-219 TRUE 
2 reduced_linear_2 zTIME          0.689     0.275      2.50 1.31e-  2 TRUE 
3 reduced_linear_2 sin1         -16.1       0.388    -41.4  3.41e- 95 TRUE 
4 reduced_linear_2 cos1         -23.7       0.388    -61.2  3.22e-124 TRUE 
5 reduced_linear_2 sin2           1.26      0.388      3.26 1.32e-  3 TRUE 
6 reduced_linear_2 cos2          -2.86      0.388     -7.37 5.48e- 12 TRUE 
7 reduced_linear_2 sin3          -0.832     0.388     -2.15 3.31e-  2 TRUE 
8 reduced_linear_2 cos6          -0.561     0.274     -2.05 4.20e-  2 TRUE 

Linear Trend: Reduced Model 3

This model is similar to the previous one, but the cosine term associated with \(i=6\) has been excluded.

Show the code
reduced3_linear_lm <- weather_df |>
  model(reduced_linear_3  = TSLM(x ~ zTIME + sin1 + cos1 + sin2 + cos2 + sin3))

reduced3_linear_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05)
# A tibble: 7 × 7
  .model           term        estimate std.error statistic   p.value sig  
  <chr>            <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_linear_3 (Intercept)   56.5       0.276    204.   8.13e-220 TRUE 
2 reduced_linear_3 zTIME          0.684     0.278      2.47 1.46e-  2 TRUE 
3 reduced_linear_3 sin1         -16.1       0.391    -41.1  6.93e- 95 TRUE 
4 reduced_linear_3 cos1         -23.7       0.391    -60.6  5.06e-124 TRUE 
5 reduced_linear_3 sin2           1.26      0.391      3.23 1.45e-  3 TRUE 
6 reduced_linear_3 cos2          -2.86      0.391     -7.31 7.77e- 12 TRUE 
7 reduced_linear_3 sin3          -0.832     0.391     -2.13 3.46e-  2 TRUE 

Linear Trend: Reduced Model 4

This model consists of a linear trend with all the periodic terms associated with \(i=1\) through \(i=3\).

Show the code
reduced4_linear_lm <- weather_df |>
  model(reduced_linear_4  = TSLM(x ~ zTIME + sin1 + cos1 + sin2 + cos2 + sin3 + cos3))

reduced4_linear_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05)
# A tibble: 8 × 7
  .model           term        estimate std.error statistic   p.value sig  
  <chr>            <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_linear_4 (Intercept)   56.5       0.275    205.   3.50e-219 TRUE 
2 reduced_linear_4 zTIME          0.679     0.276      2.45 1.50e-  2 TRUE 
3 reduced_linear_4 sin1         -16.1       0.390    -41.2  7.05e- 95 TRUE 
4 reduced_linear_4 cos1         -23.7       0.389    -60.9  6.98e-124 TRUE 
5 reduced_linear_4 sin2           1.26      0.389      3.25 1.39e-  3 TRUE 
6 reduced_linear_4 cos2          -2.86      0.389     -7.34 6.63e- 12 TRUE 
7 reduced_linear_4 sin3          -0.832     0.389     -2.14 3.38e-  2 TRUE 
8 reduced_linear_4 cos3           0.620     0.389      1.59 1.13e-  1 FALSE

Linear Trend: Reduced Model 5

This model has a linear trend and only includes the periodic terms corresponding to \(i=1\) and \(i=2\).

Show the code
reduced5_linear_lm <- weather_df |>
  model(reduced_linear_5  = TSLM(x ~ zTIME + sin1 + cos1 + sin2 + cos2))

reduced5_linear_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05)
# A tibble: 6 × 7
  .model           term        estimate std.error statistic   p.value sig  
  <chr>            <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_linear_5 (Intercept)   56.5       0.279    202.   5.06e-220 TRUE 
2 reduced_linear_5 zTIME          0.692     0.280      2.47 1.44e-  2 TRUE 
3 reduced_linear_5 sin1         -16.1       0.395    -40.7  1.66e- 94 TRUE 
4 reduced_linear_5 cos1         -23.7       0.395    -60.1  9.40e-124 TRUE 
5 reduced_linear_5 sin2           1.26      0.395      3.20 1.60e-  3 TRUE 
6 reduced_linear_5 cos2          -2.86      0.395     -7.24 1.13e- 11 TRUE 

Linear Trend: Reduced Model 6

This simple model includes a linear trend and period terms associated with \(i=1\).

Show the code
reduced6_linear_lm <- weather_df |>
  model(reduced_linear_6  = TSLM(x ~ zTIME + sin1 + cos1))

reduced6_linear_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05)
# A tibble: 4 × 7
  .model           term        estimate std.error statistic   p.value sig  
  <chr>            <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_linear_6 (Intercept)   56.5       0.321    176.   1.43e-210 TRUE 
2 reduced_linear_6 zTIME          0.646     0.322      2.01 4.63e-  2 TRUE 
3 reduced_linear_6 sin1         -16.1       0.454    -35.4  5.09e- 85 TRUE 
4 reduced_linear_6 cos1         -23.7       0.454    -52.2  7.20e-114 TRUE 
Model Comparison

Model Comparison

Now, we compare all the models side-by-side.

Show the code
model_combined <- weather_df |>
  model(
    full_cubic = TSLM(x ~ TIME + I(TIME^2) + I(TIME^3) +
      sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
      + sin4 + cos4 + sin5 + cos5 + cos6),
    full_quadratic = TSLM(x ~ TIME + I(TIME^2) +
      sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
      + sin4 + cos4 + sin5 + cos5 + cos6),
    full_linear = TSLM(x ~ TIME  +
      sin1 + cos1 + sin2 + cos2 + sin3 + cos3 
      + sin4 + cos4 + sin5 + cos5 + cos6 ),
    reduced_quadratic_1  = TSLM(x ~ TIME + I(TIME^2) + sin1 + cos1 + sin2 + cos2 + sin3 + cos3 + cos6),
    reduced_quadratic_2  = TSLM(x ~ TIME + I(TIME^2) + sin1 + cos1 + sin2 + cos2 + sin3 + cos6),
    reduced_quadratic_3  = TSLM(x ~ TIME + I(TIME^2) + sin1 + cos1 + sin2 + cos2 + sin3),
    reduced_quadratic_4  = TSLM(x ~ TIME + I(TIME^2) + sin1 + cos1 + sin2 + cos2 + sin3 + cos3),
    reduced_quadratic_5  = TSLM(x ~ TIME + I(TIME^2) + sin1 + cos1 + sin2 + cos2),
    reduced_quadratic_6  = TSLM(x ~ TIME + I(TIME^2) + sin1 + cos1),
    reduced_linear_1  = TSLM(x ~ TIME + sin1 + cos1 + sin2 + cos2 + sin3 + cos3 + cos6),
    reduced_linear_2  = TSLM(x ~ TIME + sin1 + cos1 + sin2 + cos2 + sin3 + cos6),
    reduced_linear_3  = TSLM(x ~ TIME + sin1 + cos1 + sin2 + cos2 + sin3),
    reduced_linear_4  = TSLM(x ~ TIME + sin1 + cos1 + sin2 + cos2 + sin3 + cos3),
    reduced_linear_5  = TSLM(x ~ TIME + sin1 + cos1 + sin2 + cos2),
    reduced_linear_6  = TSLM(x ~ TIME + sin1 + cos1)
  )

glance(model_combined) |>
  select(.model, AIC, AICc, BIC)
Model AIC AICc BIC
full_cubic 523 526.1 575.1
full_quadratic 524.1 526.8 572.9
full_linear 528.2 530.6 573.8
reduced_quadratic_1 **517.4** **518.9** 553.3
reduced_quadratic_2 518.2 519.4 550.8
reduced_quadratic_3 520.7 521.7 550
reduced_quadratic_4 520 521.2 552.5
reduced_quadratic_5 523.5 524.2 **549.5**
reduced_quadratic_6 576.7 577.1 596.2
reduced_linear_1 521.5 522.7 554.1
reduced_linear_2 522.2 523.2 551.5
reduced_linear_3 524.5 525.3 550.6
reduced_linear_4 523.9 524.9 553.2
reduced_linear_5 527.2 527.8 550
reduced_linear_6 579 579.3 595.2

We look for the smallest value of the AIC, AICc, and BIC criteria. These methods do not have to agree with each other, and they provide different perspectives based on their various algorithms. Notice that the AIC and AICc criteria both suggest the model we titled “Reduced Quadratic 1:”

\[\begin{align*} x_t &= \beta_0 + \beta_1 \left( \frac{t - \mu_t}{\sigma_t} \right) + \beta_2 \left( \frac{t - \mu_t}{\sigma_t} \right)^2 + \beta_3 \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + \beta_4 \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + \beta_5 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + \beta_6 \cos \left( \frac{2\pi \cdot 2 t}{12} \right) \\ & ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + \beta_7 \sin \left( \frac{2\pi \cdot 3 t}{12} \right) + \beta_8 \cos \left( \frac{2\pi \cdot 3 t}{12} \right) \\ & ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + \beta_9 \cos \left( \frac{2\pi \cdot 6 t}{12} \right) \phantom{+ \beta_9 \sin \left( \frac{2\pi \cdot 6 t}{12} \right)} + z_t \end{align*}\]

The BIC criteria points to the “Reduced Quadratic 5” model:

\[\begin{align*} x_t &= \beta_0 + \beta_1 \left( \frac{t - \mu_t}{\sigma_t} \right) + \beta_2 \left( \frac{t - \mu_t}{\sigma_t} \right)^2 + \beta_3 \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + \beta_4 \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + \beta_5 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + \beta_6 \cos \left( \frac{2\pi \cdot 2 t}{12} \right) + z_t \end{align*}\]

If there are two competing models that are both satisfactory, it is usually preferable to choose the more parsimonious (or simpler) model.

Sometimes there will be a model that does not attain the lowest value of these measures, yet you may still want to use it if the AIC, AICc, and BIC values are not too much larger than the smallest values. Why might you do this? If the model is particularly interpretable or makes logical sense in the context of the physical situation, it is better than a model with a lower AIC that is not readily interpretable.

You may even choose to include terms that are not statistically significant, if you determine they are practically important. For example, if a quadratic term is significant, but the linear term is not, it is a good practice to include the linear term anyway.

Notice that the linear models corresponding to the “Reduced Quadratic 1” and “Reduced Quadratic 5” models have AIC/AICc/BIC values that are not much larger than the minimum values. Given other external evidence related to global warming, it is unlikely that the second derivative of the function representing the Earth’s mean temperature is negative. In other words, it does not seem like the rate at which the earth is warming is decreasing. Even if there is a quadratic component to the trend, it is not visible to the eye in the time plot.

For these reasons, we will apply the “Reduced Linear 5” model. This model implies a linear trend in the mean temperature of the Earth. The BIC for this model is not much bigger than the BIC for the “Reduced Quadratic 5” model. This model is simpler than the “Reduced Quadratic 1,” “Reduced Quadratic 5,” and “Reduced Linear 1” models.

\[\begin{align*} x_t &= \beta_0 + \beta_1 \left( \frac{t - \mu_t}{\sigma_t} \right) + \beta_2 \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + \beta_3 \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + \beta_4 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + \beta_5 \cos \left( \frac{2\pi \cdot 2 t}{12} \right) + z_t \end{align*}\]

Model Fitting

Model Fitting

The model we have chosen is the “Reduced Linear 5” model. For convenience, we reprint the coefficients here.

Reduced Linear 5
Show the code
reduced5_linear_lm <- weather_df |>
  model(reduced_linear_5  = TSLM(x ~ zTIME + sin1 + cos1 + sin2 + cos2))

reduced5_linear_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05)
# A tibble: 6 × 7
  .model           term        estimate std.error statistic   p.value sig  
  <chr>            <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_linear_5 (Intercept)   56.5       0.279    202.   5.06e-220 TRUE 
2 reduced_linear_5 zTIME          0.692     0.280      2.47 1.44e-  2 TRUE 
3 reduced_linear_5 sin1         -16.1       0.395    -40.7  1.66e- 94 TRUE 
4 reduced_linear_5 cos1         -23.7       0.395    -60.1  9.40e-124 TRUE 
5 reduced_linear_5 sin2           1.26      0.395      3.20 1.60e-  3 TRUE 
6 reduced_linear_5 cos2          -2.86      0.395     -7.24 1.13e- 11 TRUE 
Show the code
r5lin_coef_unrounded <- reduced5_linear_lm |>
  tidy() |>
  select(term, estimate, std.error)

r5lin_coef <- r5lin_coef_unrounded |>
  round_df(3)

The fitted model is therefore:

\[\begin{align*} x_t &= \hat \beta_0 + \hat \beta_1 \left( zTIME \right) \\ & ~~~~~~~~~~ + \hat \beta_2 \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + \hat \beta_3 \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~ + \hat \beta_4 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + \hat \beta_5 \cos \left( \frac{2\pi \cdot 2 t}{12} \right) \\ &= \hat \beta_0 + \hat \beta_1 \left( \frac{t - \bar t}{s_t} \right) \\ & ~~~~~~~~~~ + \hat \beta_2 \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + \hat \beta_3 \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~ + \hat \beta_4 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + \hat \beta_5 \cos \left( \frac{2\pi \cdot 2 t}{12} \right) \\ &= 56.467 + 0.692 \left( \frac{t - 96.5}{55.57} \right) \\ & ~~~~~~~~~~~~~~~~~ + (-16.068) \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + (-23.708) \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~~~~~~~~ + 1.264 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + (-2.858) \cos \left( \frac{2\pi \cdot 2 t}{12} \right) \\ \end{align*}\]

The linear trend term can be simplified, if desired, but it is not necessary.

\[\begin{align*} x_t &= 56.467 + 0.692 \left( \frac{t - 96.5}{55.57} \right) \\ & ~~~~~~~~~~~~~~~~~ + (-16.068) \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + (-23.708) \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~~~~~~~~ + 1.264 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + (-2.858) \cos \left( \frac{2\pi \cdot 2 t}{12} \right) \\ &= 56.467 + 0.692 \left( \frac{t}{55.57} \right) - 0.692 \left( \frac{96.5}{55.57} \right) \\ & ~~~~~~~~~~~~~~~~~ + (-16.068) \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + (-23.708) \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~~~~~~~~ + 1.264 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + (-2.858) \cos \left( \frac{2\pi \cdot 2 t}{12} \right) \\ &= 56.467 + 0.012 t - 1.201 \\ & ~~~~~~~~~~~~~~~~~ + (-16.068) \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + (-23.708) \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~~~~~~~~ + 1.264 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + (-2.858) \cos \left( \frac{2\pi \cdot 2 t}{12} \right) \\ &= 55.265 + 0.012 t \\ & ~~~~~~~~~~~~~~~~~ + (-16.068) \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + (-23.708) \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~~~~~~~~ + 1.264 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + (-2.858) \cos \left( \frac{2\pi \cdot 2 t}{12} \right) \\ \end{align*}\]

–>

–>

Reduced Quadratic 5

To illustrate how to incorporate the quadratic term, we also fit the model called “Reduced Quadratic 5”.

Show the code
reduced5_quadratic_lm <- weather_df |>
  model(reduced_quadratic_5  = TSLM(x ~ zTIME + I(zTIME^2) + sin1 + cos1 + sin2 + cos2))

reduced5_quadratic_lm |>
  tidy() |>
  mutate(sig = p.value < 0.05)
# A tibble: 7 × 7
  .model              term        estimate std.error statistic   p.value sig  
  <chr>               <chr>          <dbl>     <dbl>     <dbl>     <dbl> <lgl>
1 reduced_quadratic_5 (Intercept)   57.2       0.413    138.   1.06e-188 TRUE 
2 reduced_quadratic_5 zTIME          0.692     0.277      2.50 1.33e-  2 TRUE 
3 reduced_quadratic_5 I(zTIME^2)    -0.733     0.310     -2.37 1.90e-  2 TRUE 
4 reduced_quadratic_5 sin1         -16.1       0.390    -41.2  4.40e- 95 TRUE 
5 reduced_quadratic_5 cos1         -23.7       0.390    -60.8  3.16e-124 TRUE 
6 reduced_quadratic_5 sin2           1.26      0.390      3.24 1.40e-  3 TRUE 
7 reduced_quadratic_5 cos2          -2.86      0.390     -7.33 6.96e- 12 TRUE 
Show the code
r5quad_coef <- reduced5_quadratic_lm |>
  tidy() |>
  select(term, estimate, std.error) |>
  round_df(3)

The fitted model is:

\[\begin{align*} x_t &= \hat \beta_0 + \hat \beta_1 \left( zTIME \right) + \hat \beta_2 \left( zTIME \right)^2 \\ & ~~~~~~~~~~ + \hat \beta_3 \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + \hat \beta_4 \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~ + \hat \beta_5 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + \hat \beta_6 \cos \left( \frac{2\pi \cdot 2 t}{12} \right) \\ &= \hat \beta_0 + \hat \beta_1 \left( \frac{t - \bar t}{s_t} \right) + \hat \beta_2 \left( \frac{t - \bar t}{s_t} \right)^2 \\ & ~~~~~~~~~~ + \hat \beta_3 \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + \hat \beta_4 \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~ + \hat \beta_5 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + \hat \beta_6 \cos \left( \frac{2\pi \cdot 2 t}{12} \right) \\ &= 57.196 + 0.692 \left( \frac{t - 96.5}{55.57} \right) + (-0.733) \left( \frac{t - 96.5}{55.57} \right)^2 \\ & ~~~~~~~~~~~~~~~~~ + (-16.067) \sin \left( \frac{2\pi \cdot 1 t}{12} \right) + (-23.705) \cos \left( \frac{2\pi \cdot 1 t}{12} \right) \\ & ~~~~~~~~~~~~~~~~~ + 1.265 \sin \left( \frac{2\pi \cdot 2 t}{12} \right) + (-2.857) \cos \left( \frac{2\pi \cdot 2 t}{12} \right) \\ \end{align*}\]

If we want, we could rewrite this by expanding out the polynomial in the first three terms, but it is not necessary.

Comparison of Fitted Values

Comparison of Fitted Values

This time plot shows the original temperature data superimposed with the fitted curves based on three models.

Show the code
num_months <- weather_df |> 
  as_tibble() |> 
  dplyr::select(TIME) |> 
  tail(1) |> 
  pull()

df <- tibble( TIME = seq(1, num_months, 0.01) ) |>
  mutate(
    cos1 = cos(2 * pi * 1 * TIME/12),
    cos2 = cos(2 * pi * 2 * TIME/12),
    cos3 = cos(2 * pi * 3 * TIME/12),
    cos4 = cos(2 * pi * 4 * TIME/12),
    cos5 = cos(2 * pi * 5 * TIME/12),
    cos6 = cos(2 * pi * 6 * TIME/12),
    sin1 = sin(2 * pi * 1 * TIME/12),
    sin2 = sin(2 * pi * 2 * TIME/12),
    sin3 = sin(2 * pi * 3 * TIME/12),
    sin4 = sin(2 * pi * 4 * TIME/12),
    sin5 = sin(2 * pi * 5 * TIME/12),
    sin6 = sin(2 * pi * 6 * TIME/12)) |>
  mutate(zTIME = (TIME - mean(TIME)) / sd(TIME)) |>
  as_tsibble(index = TIME)

quad1_ts <- reduced1_quadratic_lm |>
  forecast(df) |>
  as_tibble() |>
  dplyr::select(TIME, .mean) |>
  rename(value = .mean) |>
  mutate(Model = "Quadratic 1")

quad5_ts <- reduced5_quadratic_lm |>
  forecast(df) |>
  as_tibble() |>
  dplyr::select(TIME, .mean) |>
  rename(value = .mean) |>
  mutate(Model = "Quadratic 5")

linear5_ts <- reduced5_linear_lm |>
  forecast(df) |>
  as_tibble() |>
  dplyr::select(TIME, .mean) |>
  rename(value = .mean) |>
  mutate(Model = "Linear 5")

data_ts <- weather_df |> 
  as_tibble() |>
  rename(value = x) |>
  mutate(Model = "Data") |>
  dplyr::select(TIME, value, Model)

combined_ts <- bind_rows(data_ts, quad1_ts, quad5_ts, linear5_ts) 
point_ts <- combined_ts |> filter(TIME == floor(TIME))

okabe_ito_colors <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

combined_ts |>
  ggplot(aes(x = TIME, y = value, color = Model)) +
  geom_line() +
  geom_point(data = point_ts) +
  labs(
      x = "Month Number",
      y = "Temperature (Fahrenheit)",
      title = "Monthly Average of Daily High Temperatures in Rexburg",
      subtitle = paste0("(", format(weather_df$dates %>% head(1), "%b %Y"), endash, format(weather_df$dates %>% tail(1), "%b %Y"), ")")
  ) +    
  scale_color_manual(
    values = okabe_ito_colors[1:nrow(combined_ts |> as_tibble() |> select(Model) |> unique())], 
    name = ""
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "top", # Position the legend at the top
    legend.direction = "horizontal" # Set the legend direction to horizontal
  )

Small-Group Activity: Fall River Flow Rate (35 min)

The Fall River is a tributary of the Henrys Fork of the Snake River northeast of Rexburg, Idaho. The United States Geological Survey (USGS) provides data every fifteen minutes on the flow rate (in cubic feet per second, cfs) of this river.

This map shows the location of the monitoring station. On the left, you can see where Highway 20 intersects with Main St. in Ashton, Idaho.

Map of the fall river monitoring station Here is a glimpse of the raw data:

agency_cd site_no datetime flow code
USGS 13046995 1998-10-01 00:00 MDT 662 A:[91]
USGS 13046995 1998-10-01 00:15 MDT 655 A:[91]
USGS 13046995 1998-10-01 00:30 MDT 655 A:[91]
USGS 13046995 1998-10-01 00:45 MDT 662 A:[91]
USGS 13046995 2023-09-30 23:00 MDT 547 A
USGS 13046995 2023-09-30 23:15 MDT 547 A
USGS 13046995 2023-09-30 23:30 MDT 547 A
USGS 13046995 2023-09-30 23:45 MDT 552 A

The 2023 water year goes from October 1, 2022 to September 30, 2023. The data starts with the 1999 water year and goes through the 2023 water year.

We will average the flow rates across the first and last half of each month, so there are 24 values in a year. For various reasons, there are gaps in the data, even after averaging across the first and last half of the month. We impute the missing values by linear interpolation using the zoo package. This just means that when we encounter NAs, we fit a line between the most recent observed value and the next observed value. Then, we fill all the NAs in with the value given by the liner relationship between the two points.

Here are a few of the values, after averaging:

Show the code
# load necessary packages
pacman::p_load(zoo) # for linear interpolation of missing values

# create the tsibble
fallriver_ts0 <- rio::import("https://byuistats.github.io/timeseries/data/fallriver.parquet") |>
  mutate(date = ymd_hm(datetime)) |>
  dplyr::select(datetime, date, flow) |>
  mutate(dates = ymd(paste0(year(date), "/", month(date), "/", day(date)))) |>
  group_by(dates) |>
  summarize(flow = mean(flow), .groups = "keep") |>
  ungroup() |> 
  as_tsibble(index = dates) |>
  fill_gaps() |> 
  read.zoo() %>% 
  # impute missing values by interpolation
  na.approx(xout = seq(start(.), end(.), "day")) %>% 
  fortify.zoo() |>
  rename(
    flow = ".", 
    dates = Index
  ) |>
  mutate(round_day = ifelse(day(dates) <= 15, 1, 16)) |>
  mutate(dates2 = ymd(paste0(year(dates), "/", month(dates), "/", round_day))) |>
  group_by(dates2) |>
  summarize(flow = mean(flow)) |>
  ungroup() |> 
  rename(dates = dates2) |>
  as_tsibble(index = dates, regular = FALSE)
dates flow
1998-10-01 634.682
1998-10-16 596.352
1998-11-01 578.583
1998-11-16 590.703
2023-08-01 751.708
2023-08-16 634.236
2023-09-01 625.772
2023-09-16 584.168
Show the code
fallriver_ts <- fallriver_ts0 |>
  # compute additional variables needed for the regression
  mutate(TIME = 1:n()) |>
  mutate(
    cos1 = cos(2 * pi * 1 * TIME / 24),
    cos2 = cos(2 * pi * 2 * TIME / 24),
    cos3 = cos(2 * pi * 3 * TIME / 24),
    cos4 = cos(2 * pi * 4 * TIME / 24),
    cos5 = cos(2 * pi * 5 * TIME / 24),
    cos6 = cos(2 * pi * 6 * TIME / 24),
    cos7 = cos(2 * pi * 7 * TIME / 24),
    cos8 = cos(2 * pi * 8 * TIME / 24),
    cos9 = cos(2 * pi * 9 * TIME / 24),
    cos10 = cos(2 * pi * 10 * TIME / 24),
    cos11 = cos(2 * pi * 11 * TIME / 24),
    cos12 = cos(2 * pi * 12 * TIME / 24),
    sin1 = sin(2 * pi * 1 * TIME / 24),
    sin2 = sin(2 * pi * 2 * TIME / 24),
    sin3 = sin(2 * pi * 3 * TIME / 24),
    sin4 = sin(2 * pi * 4 * TIME / 24),
    sin5 = sin(2 * pi * 5 * TIME / 24),
    sin6 = sin(2 * pi * 6 * TIME / 24),
    sin7 = sin(2 * pi * 7 * TIME / 24),
    sin8 = sin(2 * pi * 8 * TIME / 24),
    sin9 = sin(2 * pi * 9 * TIME / 24),
    sin10 = sin(2 * pi * 10 * TIME / 24),
    sin11 = sin(2 * pi * 11 * TIME / 24),
    # sin12 = sin(2 * pi * 12 * TIME / 24) # zero for all integer values of t
  ) 

# plot the time series
fallriver_ts |>
  autoplot(.vars = flow) +
  labs(
      x = "Date",
      y = "Flow (cubic feet per second, cfs)",
      title = "Fall River Flow Rate",
      subtitle = "Above the Yellowstone Canal near Squirrel, Idaho"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

Check Your Understanding

Use the Fall River flow data to do the following.

  • Explore various linear models for the flow of the Fall River over time.
  • Choose a model to represent the data, and justify your decision.

Homework Preview (5 min)

  • Review upcoming homework assignment
  • Clarify questions
Download Homework

Small-Group Activity