General Directions

We are going to do some analysis that requires confidence intervals. With your group, work out the best analysis and visualization. Make a two-slide presentation with a plot on the first slide and your numerical analysis output and interpretation on the second slide.

How strong is your metal?

Background

On page 258, Akritas introduces the Charpy impact test, which measures the amount of energy a metal absorbs during fracture. This video demonstrates the technique. NIST still uses this method[1][2].

We will use the Akritas data and some equipment validation data from NIST. They have an interesting plot here as well.

Your Job

Your employer wants to sell a tool for this type of test to the industry. The NIST HH evaluation data shows the permissible manufacturing standards based on 187 different evaluations using their test specimens. We will largely be using the mean and std_dev columns for our work.

  1. If we are validating the quality of our new equipment, which estimate is more important for us to use as a benchmark?
  2. Do you think that there is a relationship between the energy absorption of the material and variability of the repeated tests (n=5)?
  3. Your employer wants to know what standard deviation to expect on repeated measurements of the same test specimen type. What can you report to him as an interval?
    • Report the prediction interval for the next machine he builds.
    • Report the confidence interval on the median.
  4. Take the intervals in problem 3 and write them out in a statistically correct statement that your non-statistician boss can understand.
x = c(4.9,3.38,3.32,2.38,3.14,2.97,3.87,3.39,2.97,3.45,3.35,4.34,3.54,2.46,4.38,2.92)

nistHH = read.csv("http://www.nist.gov/sites/default/files/documents/mml/acmd/structural_materials/HH-105-48877.csv",stringsAsFactors = FALSE,skip = 2)

#nistSH = read.csv("https://www.nist.gov/sites/default/files/documents/mml/acmd/structural_materials/SH-31-46485.csv",stringsAsFactors = FALSE, skip=2)

#nistLL = read.csv("https://www.nist.gov/sites/default/files/documents/mml/acmd/structural_materials/LL-118-54385.csv",stringsAsFactors = FALSE,skip=2)
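One way to start on question 2 is to put an interval on the association between the mean and std_dev columns. The sketch below is only an illustration: demo is a simulated, hypothetical stand-in for nistHH (so the code runs without the download), and rel and rel.lm are names made up for this example. Only the column names mean and std_dev come from the NIST file described above.

```r
# Hypothetical stand-in for nistHH (simulated, NOT the NIST values).
# Only the column names mean and std_dev are taken from the text.
set.seed(1)
demo <- data.frame(mean = runif(187, min = 90, max = 110))
demo$std_dev <- abs(rnorm(187, mean = 3, sd = 0.8))

# Pearson correlation with a 95% confidence interval
rel <- cor.test(demo$mean, demo$std_dev, conf.level = 0.95)
rel$estimate
rel$conf.int

# The same question as a regression: does the slope interval exclude zero?
rel.lm <- lm(std_dev ~ mean, data = demo)
confint(rel.lm)["mean", ]
```

With the real data, replace demo with nistHH. A scatterplot of std_dev against mean is a natural companion for the plot slide.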

Some Code

# Intercept-only model: the fitted intercept is the mean of std_dev
nistHH.lm = lm(std_dev~1,data=nistHH)
# This will provide a confidence interval on the mean
confint(nistHH.lm)
# or we could use the predict statement as well
predict(nistHH.lm,newdata=data.frame(1),interval="confidence")

# Now we want to calculate an interval for the median
#install.packages("BSDA")
library(BSDA)
# See page 267 of Akritas for a description.
SIGN.test(nistHH$std_dev,alternative = "two.sided",conf.level=.95)
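Question 3 also asks for a prediction interval, which the code above does not compute. A minimal sketch, assuming the usual normal-theory interval is acceptable: the same predict call with interval = "prediction" gives it, and the hand formula mean ± t × s × sqrt(1 + 1/n) should agree. The example uses the x vector defined earlier so that it runs offline; x.lm, conf_int, pred_int, and by_hand are illustrative names.

```r
# The x vector from earlier in this handout
x <- c(4.9, 3.38, 3.32, 2.38, 3.14, 2.97, 3.87, 3.39,
       2.97, 3.45, 3.35, 4.34, 3.54, 2.46, 4.38, 2.92)

# Intercept-only model, as in the code above
x.lm <- lm(x ~ 1)

# Confidence interval for the mean vs. prediction interval for one new value
conf_int <- predict(x.lm, newdata = data.frame(1), interval = "confidence")
pred_int <- predict(x.lm, newdata = data.frame(1), interval = "prediction")

# The prediction interval by hand: mean +/- t * s * sqrt(1 + 1/n)
n <- length(x)
half <- qt(0.975, df = n - 1) * sd(x) * sqrt(1 + 1 / n)
by_hand <- c(lwr = mean(x) - half, upr = mean(x) + half)

conf_int
pred_int
by_hand
```

The prediction interval is always wider than the confidence interval because it covers a single future observation rather than the mean. It also leans on the normality assumption much more heavily, so check a histogram before reporting it. For the NIST data, swap in lm(std_dev ~ 1, data = nistHH).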

### For an extra push  ####
## We could use the bootstrap to provide a confidence interval for the mean
# https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading24.pdf
# Percentile Bootstrap
bs5000 = replicate(5000,mean(sample(nistHH$std_dev,length(nistHH$std_dev),replace=TRUE)))
quantile(bs5000,c(.025,.975))

### a more accurate method is as follows. It is called the empirical bootstrap

bs5000_dev = replicate(5000,mean(nistHH$std_dev)-mean(sample(nistHH$std_dev,length(nistHH$std_dev),replace=TRUE)))
mean(nistHH$std_dev)+quantile(bs5000_dev,c(.025,.975))

### And do it for the median as well

bs5000_dev_median = replicate(5000,median(nistHH$std_dev)-median(sample(nistHH$std_dev,length(nistHH$std_dev),replace=TRUE)))
median(nistHH$std_dev)+quantile(bs5000_dev_median,c(.025,.975))
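As a cross-check on the bootstrap interval for the median, the sign-test interval can also be built by hand from order statistics. The sketch below is one conservative version of that construction, not necessarily the exact interval SIGN.test prints (it may report a slightly different, interpolated one). It uses the x vector defined earlier so that it runs offline; k, xs, median_ci, and coverage are illustrative names.

```r
# The x vector from earlier in this handout
x <- c(4.9, 3.38, 3.32, 2.38, 3.14, 2.97, 3.87, 3.39,
       2.97, 3.45, 3.35, 4.34, 3.54, 2.46, 4.38, 2.92)
n <- length(x)
alpha <- 0.05

# The number of observations below the true median is Binomial(n, 0.5).
# k is chosen so that the k-th smallest and k-th largest values
# bracket the median with at least 1 - alpha coverage.
k <- qbinom(alpha / 2, size = n, prob = 0.5)
stopifnot(k >= 1)  # with very small n no such interval exists

xs <- sort(x)
median_ci <- c(lower = xs[k], upper = xs[n - k + 1])

# Exact coverage actually achieved by this interval
coverage <- 1 - 2 * pbinom(k - 1, size = n, prob = 0.5)

median_ci
coverage
```

Because the binomial distribution is discrete, the achieved coverage is usually somewhat above the nominal 95%. No distributional shape is assumed, which makes this a useful comparison for the bootstrap result.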

Rexburg Housing

Background

From the homework we had some Rexburg housing market data. Data was collected on homes for sale in Madison County as of January 2011. Information on the listings such as price, size of the home, and style were recorded. Open the data file MadisonCountyRealEstate.

Your Job

Your construction company is looking for new cities to expand into. Entering a market is not financially feasible unless the company can sell houses for over $80 per square foot. They would like you to analyze the Rexburg data to decide whether the market is worth entering. They mentioned that they do not build houses larger than 4,200 ft² or smaller than 2,000 ft².

  1. Explain two different ways you could use the Madison data to estimate the price per square foot (Hint: read the extra code below).
  2. Pick a method and calculate your point estimate.
  3. Report your confidence interval to them and provide a correct and clear statement about that interval.
library(readr)
library(ggplot2)
Madison = read.csv(file = "http://raw.githubusercontent.com/byuistats/data/master/MadisonCountyRealEstate/MadisonCountyRealEstate.csv", header = TRUE, stringsAsFactors = FALSE)

Leftover Code

# Keep only homes in the size range given in "Your Job" (2,000 to 4,200 ft²)
Madison.subset = subset(Madison,SQFT>=2000 & SQFT<=4200)
qplot(data=Madison.subset,x=SQFT,y=ListPrice)+geom_smooth(method="lm")
qplot(data=Madison.subset,x=ListPrice/SQFT)
qplot(data=Madison.subset,y=ListPrice/SQFT,x=SQFT)
madison.lm1 = lm(ListPrice~SQFT,data=Madison.subset)
madison.lm2 = lm(ListPrice/SQFT~1,data=Madison.subset)
confint(madison.lm1)
confint(madison.lm2)
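The two models above answer slightly different questions, and the $80 threshold is one-sided. The sketch below contrasts them under stated assumptions: demo is a simulated, hypothetical stand-in for Madison.subset (so the code runs offline), and ppsf, two_sided, one_sided, and slope_ci are illustrative names. Only the column names ListPrice and SQFT come from the data file.

```r
# Hypothetical stand-in for Madison.subset (simulated, NOT the real listings)
set.seed(2011)
demo <- data.frame(SQFT = round(runif(60, min = 2000, max = 4200)))
demo$ListPrice <- 85 * demo$SQFT * exp(rnorm(60, sd = 0.15))

# Way 1: average the per-house price per square foot.
# This is the same interval as confint(lm(ListPrice / SQFT ~ 1)).
ppsf <- demo$ListPrice / demo$SQFT
two_sided <- t.test(ppsf, conf.level = 0.95)
two_sided$conf.int

# The business question is one-sided: is the mean above $80?
# The interval is then a lower bound, with an upper limit of Inf.
one_sided <- t.test(ppsf, mu = 80, alternative = "greater")
one_sided$conf.int

# Way 2: the regression slope, the price of one ADDITIONAL square foot
slope_ci <- confint(lm(ListPrice ~ SQFT, data = demo))["SQFT", ]
slope_ci
```

The slope in the second approach is a marginal price, so it can sit well below or above the average price per square foot when the intercept is far from zero. Whichever method you pick, say which quantity the interval describes.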