CSE 451: Big Data Programming & Analytics

Applications using Spark, Databricks, and Docker in the Cloud for Data Science.

Outcomes:

By the end of the semester, each student will be able to:

  1. Integrate and extend previously learned data science tools to analyze remote and distributed data in business contexts.
  2. Explore, interpret, conceptualize, and validate assumptions of data at scale.
  3. Understand the differences and benefits of current industry technologies for big data storage and analysis.
  4. Leverage parallel processing for analysis.

Integrate and extend …

Integrate and extend previously learned data science tools to analyze remote and distributed data in business contexts.

There are many environments and tools for big data analysis. Companies like Oracle and SAS have been around for decades, while Microsoft Azure, Amazon's AWS, and Google Cloud now have an established presence in this space.

Historically, each of those systems had its own language and proprietary data storage architecture. Now, Platform as a Service (PaaS) tools built on AWS and Azure meet the demand for data-driven decision making using R, Python, and SQL. Databricks (https://databricks.com/) is the PaaS we will use to learn Apache Spark for big data analytics.

Spark and Spark SQL let us use Python and R to access cloud computing for big data.

Data at scale

Explore, interpret, conceptualize, and validate assumptions of data at scale.

Topics around big data are often summarized with V words. Early on, it was the 3 V’s of big data (volume, velocity, and variety), then the 4 V’s, and from there the lists grew to 7, to 10, and even to 42 V’s.

Big data introduces different stresses. We are forced to rethink how we validate our work, debug our code, or even conceptualize our project’s data and process.¹

Differences and benefits

Understand the differences and benefits of current industry technologies for big data storage and analysis.

Once you move into cloud computing and big data tools, there are so many technologies and companies in use that it is hard to keep track of them all. We hear names like Azure ML, Amazon S3, Amazon EC2, Hadoop, HDFS, Docker, Kubernetes, Apache Spark, Google BigQuery, Cloudera, and Databricks and wonder if we can ever learn them all.

We will see how some of these tools can be leveraged with our data science languages of R, Python, and SQL.

Parallel processing

Leverage parallel processing for analysis.

Both R and Python require additional packages to unlock most of the functionality associated with parallel processing. In R, we have future, foreach, and furrr, to name a few; in Python, multiprocessing, Ray, and Dask are commonly used.

Many of the concepts associated with computing in the cloud align with performing parallel processing on your personal computer. In both settings, we need to think carefully about data transfer and memory size as we try to speed up our calculations.
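As a small illustration in Python (the `square` function and data are hypothetical stand-ins for real work), the standard-library `multiprocessing.Pool` spreads a CPU-bound function across cores, and its `chunksize` argument batches work to reduce the data-transfer overhead between processes:

```python
# Parallel map over a CPU-bound function using only the standard library.
# Larger chunksize values send fewer, bigger batches to each worker,
# reducing inter-process data-transfer overhead.
from multiprocessing import Pool

def square(x):
    # Stand-in for an expensive, CPU-bound computation.
    return x * x

def parallel_squares(data, workers=2, chunksize=100):
    with Pool(processes=workers) as pool:
        # pool.map preserves input order in its results.
        return pool.map(square, data, chunksize=chunksize)

if __name__ == "__main__":
    print(parallel_squares(range(5)))  # → [0, 1, 4, 9, 16]
```

The same trade-off shows up in the cloud: shipping many tiny tasks to workers wastes time on communication, so Spark and Dask likewise group work into partitions or chunks.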


  1. https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/
