Day 11: Sometimes the SQL is better

Let’s look out our next project

  • What is the purpose of the project?
  • What tools are we learning?

My history with SQL and NoSQL

  1. In 2005 I started work at Pacific Northwest National Laboratory, having a master’s in statistics completed and never having seen the SQL language.
  2. In late 2005, I had to pull data from a SQL database for use in R related to our airline network. It wasn’t fun building a tidy table from a SQL database with no background in SQL. I am on the team that wins the R&D 100 Award based primarily on work by Brett Amidan.
  3. In 2012 Ryan Hafen and Jeremiah Rounds introduced me to Hadoop and the concept of key-value databases, and I am enamored.
  4. From 2013 - 2015, I am in deep with Hadoop, using it to answer climate science problems using data well over terabytes in size.
  5. I start developing the data science degree in 2016 with Scott Burton and Brent Morring, and I wonder if SQL is needed in the program. Upon researching, I realize that SQL as a language is having a resurgence regardless of the back-end database.
  6. In 2017 we used CIT 111 and CIT 225 in the DS program to give students a SQL background.
  7. In 2020 we introduce CSE 250 and CSE 451 to provide DS students more access to the SQL language for data science applications.

Google’s influence on Big Data

Big Data timeline

  1. 2004 Paper on MapReduce released by Google
  2. 2012 Hadoop 1.0 released for use.
  3. 2017 Hadoop hype has come and gone.
  4. Beyond 2017 the rise of new scalable databases that embrace SQL - Spark SQL

What is structured Query Language (SQL)?

History of SQL

Ray and I were impressed by how compactly Codd’s languages could represent complex queries. However, at the same time, we believed that it should be possible to design a relational language that would be more accessible to users without formal training in mathematics or computer programming. We believed that barriers to widespread acceptance of Codd’s languages existed on two levels.
.

  1. The first barrier came from the mathematical notation, which was hard to enter at a keyboard. This barrier was superficial and could be easily dealt with by replacing symbols with keywords.

  2. The more difficult barrier was at the semantic level. The basic concepts of Codd’s languages were adapted from set theory and symbolic logic. This was natural given Codd’s background as a mathematician, but Ray and I hoped to design a relational language based on concepts that would be familiar to a wider population of users. We also hoped to extend the language to encompass database updates and administrative tasks such as the creation of new tables and views, which had traditionally been outside the scope of a query language.

but Ray and I hoped to design a relational language based on concepts that would be familiar to a wider population of users.

When we moved to the San Jose Research Laboratory in 1973 to join the System R project, we began work on another new language that we called Sequel. Sequel allowed the well-paid-employee query to be represented in a readable form free from mathematical concepts and symbols. … In 1977, because of a trademark issue, the name Sequel was shortened to SQL.

reference

How does it work then?

From EverSQL we can get some background.

This is the logical order of operations, also known as the order of execution, for an SQL query:

  1. FROM, including JOINs
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. WINDOW functions
  6. SELECT
  7. DISTINCT
  8. UNION
  9. ORDER BY
  10. LIMIT and OFFSET

But the reality isn’t that easy nor straight forward. As we said, the SQL standard defines the order of execution for the different SQL query clauses. Modern databases are already challenging that default order by applying some optimization tricks which might change the actual order of execution, though they must end up returning the same result as if they were running the query at the default execution order.

Don’t think too hard about optimization at this point. Let the database figure out the optimized routine. If we were in Pandas, we would need to think about the optimized order.

Most SQL queries are typed in the following pattern;

SELECT -- <columns> and <column calculations>
FROM -- <table name>
  JOIN -- <table name>
  ON -- <columns to join>
WHERE -- <filter condition on rows>
GROUP BY -- <subsets for column calculations>
HAVING -- <filter conditions on groups>
ORDER BY -- <how the output is returned in sequence>
LIMIT -- <number of rows to return>

Connecting to data.world in Python

Let’s make the connection

Class reading

Let’s view our tables

data.world basebal data

import datadotworld as dw

results = dw.query('byuidss/cse-250-baseball-database', 
    'SELECT * FROM batting LIMIT 5')

batting5 = results.dataframe

I want to do a calculation in SQL and return it in a new column in Python?

Use the batting table to show the player and his team with his at batts and runs together with a calculated value of ab / r that is called runs_atbat.

  • Try do complete the above statement without using the info in the questions below.
q = '''
SELECT *
FROM batting
LIMIT 5
'''

dw.query('byuidss/cse-250-baseball-database', q).dataframe

q = '''
SELECT playerid, teamid, ab, r
FROM batting
LIMIT 5
'''

dw.query('byuidss/cse-250-baseball-database', q).dataframe

q = '''
SELECT playerid, teamid, ab, r, ab/r 
FROM batting
LIMIT 5
'''

batting_calc = dw.query('byuidss/cse-250-baseball-database', q).dataframe

q = '''
SELECT playerid, teamid, ab, r, ab/r as runs_atbat
FROM batting
LIMIT 5
'''

batting_calc = dw.query('byuidss/cse-250-baseball-database', q).dataframe