From Automatic Finances we read,
In his popular personal finance book arguing that investors can’t consistently beat the market (A Random Walk Down Wall Street), economist Burton Malkiel says that “a blindfolded monkey throwing darts at a newspaper’s financial pages could select a portfolio that would do just as well as one carefully selected by experts.”
Sounds like a challenge.
So, in 1988, the Wall Street Journal decided to see if Malkiel’s theory would hold up, and created the Dartboard Contest.
How it worked: Wall Street Journal staffers, acting as the monkeys, threw darts at a stock table, while investment experts picked their own stocks. After six months, they compared the results of the two methods. The WSJ even solicited stock picks from some of its readers, and compared them, too.
We have the data from the contest, so let’s compare the performance of the two groups.
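library(DT)  # provides datatable()
# If the download fails over http, try https in the URL
temp = read.csv("http://github.com/byuistats/data/raw/master/Dart_Expert_Dow_6month_anova/Dart_Expert_Dow_6month_anova.csv")
# Drop the DJIA rows so that only the groups being compared remain
sdata = subset(temp,variable!="DJIA")
datatable(sdata)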
In many circumstances, a statistical test is not needed to tell whether the means of two groups differ. A good visualization (or statistical plot) can convey the message far more strongly than the results of a two-sample statistical test.
Statistics can be confusing, but plots of raw data can also be misleading, or may simply fail to show a difference that is there. When the graphics are not clear-cut, a statistical test is needed to state with confidence that a true difference exists.
Below are some examples of two groups of data simulated with a known true difference. Look through them and answer the following questions.
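library(ggplot2)  # plotting functions used below

# Simulate two normal samples whose true means differ by `diff`, then plot them.
two_groups = function(diff=1,sd=3,n=15,return_data=FALSE,mean=52,ci=FALSE){
  # Group 2 is centered `diff` units above Group 1
  g1 = round(rnorm(n=n,mean=mean,sd=sd),2)
  g2 = round(rnorm(n=n,mean=mean+diff,sd=sd),2)
  gdata = data.frame(group=rep(c("Group 1","Group 2"),each=n),value=c(g1,g2))
  mean1 = mean(g1)
  mean2 = mean(g2)
  # 95% confidence interval for each group mean
  interval1 = as.numeric(t.test(g1)$conf.int)
  interval2 = as.numeric(t.test(g2)$conf.int)
  # 95% interval for mean(Group 1) - mean(Group 2), so a positive `diff`
  # shows up as a negative interval in the printed output
  intervalD = t.test(g1,g2)$conf.int
  print(intervalD)
  print(paste("True Difference is", diff))
  conf_data = data.frame(group=c("Group 1","Group 2"),
                         value=c(mean1,mean2),
                         lower=c(interval1[1],interval2[1]),
                         upper=c(interval1[2],interval2[2]))
  p = ggplot(data=gdata,aes(x=group,y=value))+
    geom_boxplot(outlier.shape=NA)+  # hide boxplot outliers; the jitter layer already draws every point
    geom_jitter(height=0,width=.2,size=2,colour="darkred",shape=15)+
    labs(x="Group",y="Measurement")
  # Optionally overlay each group's mean and its confidence interval
  if(ci==TRUE) p = p + geom_pointrange(data=conf_data,aes(ymin=lower,ymax=upper),colour="darkgreen",lwd=1.25)
  # Return the plot alone unless the data were requested as well
  out = list(data=gdata,plot=p)
  if (return_data==FALSE) out = p
  out
}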
two_groups(diff=10,ci=FALSE)
## [1] -12.015429 -8.119238
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 10"
two_groups(diff=8)
## [1] -6.774220 -2.831113
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 8"
out = two_groups(diff=5,ci=TRUE,return_data=TRUE)
## [1] -7.168952 -2.067048
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 5"
out$plot
datatable(out$data,options = list(pageLength=2,lengthMenu=c(2,5,15)),
caption = 'Table 1: Data for Random sample with a true mean difference of 5.', rownames=FALSE)
paste(subset(out$data,group=="Group 1")$value,collapse=",")
## [1] "49.67,48.05,58.31,53.14,51.33,56.66,51.49,55.85,48.99,44.69,51.88,51.91,48.38,56.4,55.24"
paste(subset(out$data,group=="Group 2")$value,collapse=",")
## [1] "56,57.62,59.76,49.24,58.25,54.38,58.87,57.42,58.22,58,58.84,52.47,55.62,60.16,56.41"
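As a check on the interval printed for this sample, the two vectors above can be passed straight to `t.test`. This is a minimal sketch: the names `group1`, `group2`, `welch` and `ci_diff` are introduced here for illustration, and the values are copied from the output above. By default `t.test` runs Welch's two-sample test and reports the interval for the first mean minus the second, so the result should reproduce the interval shown for this sample (about -7.17 to -2.07), with a sign opposite to the simulated difference of 5.

```r
# Values copied from the printed output for the sample with a true difference of 5
group1 = c(49.67,48.05,58.31,53.14,51.33,56.66,51.49,55.85,48.99,44.69,
           51.88,51.91,48.38,56.4,55.24)
group2 = c(56,57.62,59.76,49.24,58.25,54.38,58.87,57.42,58.22,58,
           58.84,52.47,55.62,60.16,56.41)

# Welch two-sample t-test (the default when two vectors are supplied)
welch = t.test(group1, group2)

# 95% confidence interval for mean(group1) - mean(group2)
ci_diff = as.numeric(welch$conf.int)
ci_diff

# Observed difference in the direction used by the simulation (Group 2 minus Group 1)
mean(group2) - mean(group1)
```

The observed difference is about 4.62, close to the true value of 5, and the interval excludes zero, so with this sample the test agrees with what the plot suggests.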
two_groups(diff=3)
## [1] -4.93925414 0.02192081
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 3"
two_groups(diff=2)
## [1] -4.7241376 -0.2251957
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 2"
two_groups(diff=1)
## [1] -5.699707 -0.497626
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 1"
two_groups(diff=.5)
## [1] -2.099546 2.390213
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 0.5"
two_groups(diff=10,n=250)
## [1] -10.802209 -9.734271
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 10"
two_groups(diff=8,n=250)
## [1] -8.749091 -7.668349
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 8"
two_groups(diff=5,n=250)
## [1] -5.676559 -4.628561
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 5"
two_groups(diff=3,n=250)
## [1] -3.455734 -2.366426
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 3"
two_groups(diff=2,n=250,ci=TRUE)
## [1] -2.541285 -1.515675
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 2"
two_groups(diff=1,n=250,ci=TRUE)
## [1] -1.7092729 -0.6068871
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 1"
two_groups(diff=.5,n=250,ci=TRUE)
## [1] -1.171691 -0.109989
## attr(,"conf.level")
## [1] 0.95
## [1] "True Difference is 0.5"
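Comparing the runs with n=15 to those with n=250 shows the intervals narrowing as the sample grows. The sketch below gives a rough sense of why. It assumes equal group sizes, a common standard deviation equal to the simulation's `sd=3`, and a pooled degrees-of-freedom approximation; the helper `expected_ci_width` is hypothetical and not part of the original code. Under those assumptions, the typical width of the 95% interval for the difference is about twice the critical t value times sd·sqrt(2/n).

```r
# Approximate width of a confidence interval for the difference of two means,
# assuming equal group sizes n and a common standard deviation sd
expected_ci_width = function(n, sd = 3, conf.level = 0.95) {
  se = sd * sqrt(2 / n)                                   # standard error of the difference
  t_crit = qt(1 - (1 - conf.level) / 2, df = 2 * n - 2)   # two-sided critical value
  2 * t_crit * se                                         # full width: upper minus lower
}

width_small = expected_ci_width(15)
width_large = expected_ci_width(250)
round(c(n15 = width_small, n250 = width_large), 2)
```

The approximate widths are about 4.5 for n=15 and about 1.1 for n=250, in line with the simulated intervals above. This is why a true difference of 0.5 gave an interval containing zero with 15 observations per group but an interval excluding zero with 250.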
The average temperature on Earth is about 61 °F (16 °C). But temperatures vary greatly around the world depending on the time of year, ocean and wind currents, and weather conditions. Summers tend to be warmer and winters colder. Also, temperatures tend to be higher near the equator and lower near the poles[1].
Thanks to Kaggle, we have some data to work with. There appear to be several ways to calculate the average temperature of the Earth, but the important thing is how it changes.
climate = read.csv(file="http://byuistats.github.io/M330/data/USALandTemperatures.csv",stringsAsFactors = FALSE)
# The code handles two date layouts in dt: year-month-day (split on "-")
# and month/day/year (split on "/")
years1=unlist(lapply(strsplit(climate$dt,"-"),function(x) x[1]))
years2=unlist(lapply(strsplit(climate$dt,"/"),function(x) x[3]))
months1 = unlist(lapply(strsplit(climate$dt,"-"),function(x) x[2]))
months2 = unlist(lapply(strsplit(climate$dt,"/"),function(x) x[1]))
# Where the "-" split found no month, fall back to the "/" layout
years1[is.na(months1)] = years2[is.na(months1)]
months1[is.na(months1)] = months2[is.na(months1)]
climate$year = as.numeric(years1)
climate$month = as.numeric(months1)
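The date handling above can also be written as a small helper. This is only a sketch under the assumption, inferred from the code, that every value of `dt` is either year-month-day separated by "-" or month/day/year separated by "/". The function name `parse_year_month` and the two sample strings are made up for illustration and do not come from the data set.

```r
# Extract numeric year and month from dates in either
# "YYYY-MM-DD" or "M/D/YYYY" layout (assumed layouts)
parse_year_month = function(dt) {
  iso = grepl("-", dt, fixed = TRUE)       # TRUE for the "-" layout
  parts = strsplit(dt, "[-/]")             # split on either separator
  first = sapply(parts, `[`, 1)
  second = sapply(parts, `[`, 2)
  third = sapply(parts, `[`, 3)
  year = ifelse(iso, first, third)         # year comes first in "-" dates, last in "/" dates
  month = ifelse(iso, second, first)       # month is second in "-" dates, first in "/" dates
  data.frame(year = as.numeric(year), month = as.numeric(month))
}

# Hypothetical inputs, one in each layout
parsed = parse_year_month(c("1850-03-01", "7/1/1900"))
parsed
```

With the real data this would be called as `parse_year_month(climate$dt)`, and its two columns could be assigned to `climate$year` and `climate$month`.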
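library(dplyr)    # provides %>%, group_by() and summarise()
library(ggplot2)  # provides ggplot() and the histogram layers
# Mean temperature for each year, then the spread of those yearly means
climate_year = as.data.frame(climate%>%group_by(year)%>%summarise(mean=mean(AverageTemperature.f)))
sd(climate_year$mean)
# Distribution of all temperature records
hist(climate$AverageTemperature.f,breaks=45)
# The same distribution, one panel per month
ggplot(data=climate,aes(x=AverageTemperature.f))+
  geom_histogram()+
  facet_wrap(~month)+
  theme_bw()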