Projects

The big project

Nonprofit organizations receive a unique benefit from the US government: they are not required to pay federal income taxes. The Internal Revenue Service created Form 990, an annual information return, to keep these organizations accountable and prevent them from abusing their tax-exempt status.

We will use varied data sets built from these Form 990 filings throughout the semester to develop skills with PySpark, SparkSQL, and SparkML.

Your team will present to the entire class at each step. As a full ‘company,’ we will then agree on a standard path for all groups before moving to the next step.

Step 0: Deciding on a cloud compute engine

We need to decide whether we should use AWS, Azure, or Google Cloud. Create a compare-and-contrast presentation that is about 5 minutes long. We will make a decision as a class.

Step 1: Exploring 990 Tax Forms and Nonprofits

Create a 4-6 minute presentation on nonprofit entities and the IRS Form 990. Imagine you are the leader of a data science team whose group is about to start working with this data. You need to provide enough background that they understand what the data entails. Additionally, you need to find available data sources that we can use for the project.

Step 2: Connecting to the Data

Now that we have our data, we need to connect our Docker data science environment to it and provide a few descriptions of what we have.
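
As a minimal sketch of what “connected” could look like from inside the container, assuming the filings arrive as a CSV extract at a mounted path such as `data/irs_990_extract.csv` (both the path and the CSV format are placeholders until we settle on a source):

```python
# Quick connectivity check from inside the Jupyter/Spark container.
# The file path and CSV format are assumptions; swap in the real source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("irs990-connect").getOrCreate()

filings = spark.read.csv("data/irs_990_extract.csv",
                         header=True, inferSchema=True)
filings.printSchema()                # what columns did we actually get?
print(f"rows: {filings.count():,}")  # confirm the load looks complete
filings.show(5, truncate=False)
```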

Step 3: Matching Others’ Stories

The Foundation Trustee Fees: Use and Abuse report, The Nonprofit Sector in Brief, and Fast Company’s analysis through Civis provide some analyses and data summaries that we can try to recreate to validate our data source.
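
As one hedged sketch of what recreating a published summary could look like, assuming our extract carries `tax_year` and `total_revenue` columns (both hypothetical names to replace with the real schema):

```python
# Rebuild one published-style summary (total revenue by tax year) so we
# can compare our numbers to the report's. Path and columns are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("irs990-validate").getOrCreate()
filings = spark.read.csv("data/irs_990_extract.csv",
                         header=True, inferSchema=True)

(filings
    .groupBy("tax_year")
    .agg(F.sum("total_revenue").alias("total_revenue"))
    .orderBy("tax_year")
    .show())
```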

Step 4: Telling our own story

This element has two pieces: 1) story proposals and 2) the analysis.

Story Proposals

This piece will take 3-5 of the project days.

With your team, work through the data looking for a good data science story that we could tell with it. You will build a proposal that you would like the entire class to work on for the analysis.

Analysis

This piece will take 13-15 of the project days.

Start working out the analysis and story from the database. The report you provide must include the following elements:

  1. Well-documented final scripts.
  2. An executive summary of your results to present in class (4-8 minutes).
  3. A detailed report for non-technical readers.
  4. An appendix that supports the detailed report for technical readers.

Step 5: Reporting Results

Each team will be responsible for drafting a short article on nonprofits, based on its technical report, that we can publish. Think of this as a short blog post that you can use on your personal website. The class will pick the best one from all the groups to highlight.

The small projects

As we get into this course’s tools, we will complete 3-8 ‘small projects’ in groups of 2-3 where each member must commit to working equally on the task.

At some point, we need to Docker

Docker is used everywhere the cloud is used. Both data scientists and computer scientists leverage Docker in their workflows. We will gain familiarity with the Docker workflow but primarily use Docker as a pre-built tool to access Jupyter Notebooks, Spark, Python, and R without needing to configure our own computers.

Docker 101

Complete the Docker 101 Tutorial and put together a one-slide overview of what Docker is and how it is used.

Figuring out Docker for Data Science

We will use the all-spark-notebook image to create our data science container (typically started with something like `docker run -p 8888:8888 jupyter/all-spark-notebook`), where we will examine the IRS 990 Master Files. You can download the files from the links in the readme or use your BYUI Google login to see the files in Google Drive.

The IRS provides a few Master Files that together list all exempt organizations, more than 2 million in total.

Complete an exploratory analysis that provides a summary and story of the data, using each of the following tool sets (a pandas sketch of one starting point follows the list). The report should include at least ten charts, three tables, and multiple paragraphs.

  1. Python and pandas
  2. R and dplyr
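
Here is a pandas sketch of one possible starting point; the R/dplyr version would mirror the same steps. The file path is a placeholder, and the `EIN`, `STATE`, and `ASSET_AMT` column names follow the public Exempt Organizations Business Master File layout, so verify them against the files you actually download.

```python
# Exploratory starting point for the EO Master File in pandas.
# Path and column names are assumptions to check against the real file.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

eo = pd.read_csv("data/eo_master_file.csv", dtype={"EIN": str})

# Candidate table: exempt organizations per state.
print(eo["STATE"].value_counts().head(15))

# Candidate chart: reported assets on a log10 scale (zeros clipped to 1).
np.log10(eo["ASSET_AMT"].dropna().clip(lower=1)).plot.hist(bins=50)
plt.xlabel("log10 of reported assets (USD)")
plt.ylabel("Organizations")
plt.show()
```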

Finding that Spark

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast queries against data of any size. Simply put, Spark is a fast and general engine for large-scale data processing.

The fast part means that it’s faster than previous approaches to working with big data, like classical MapReduce. The secret to that speed is that Spark runs in memory (RAM), which makes processing much faster than working from disk.

The general part means that it can be used for multiple things like running distributed SQL, creating data pipelines, ingesting data into a database, running machine learning algorithms, working with graphs or data streams, and much more (ref).
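
A tiny local example of both claims, in-memory caching plus the mix of SQL and DataFrame code, using only a throwaway session (nothing here touches the project data):

```python
# Demonstrate Spark's two headline traits: caching data in RAM and
# querying the same data through SQL or the DataFrame API.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("spark-demo")
         .getOrCreate())

df = spark.range(1_000_000).withColumnRenamed("id", "n")
df.cache()  # keep the data in memory for the repeated queries below

df.createOrReplaceTempView("numbers")
spark.sql("SELECT COUNT(*) AS evens FROM numbers WHERE n % 2 = 0").show()
print(df.filter("n % 2 = 1").count())  # same data, DataFrame API this time
```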

Spark and SQL

Repeat the Figuring out Docker for Data Science project using each of the following (a PySpark sketch follows the list):

  1. PySpark and Spark SQL
  2. sparklyr or SparkR
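
Here is a hedged PySpark and Spark SQL version of the state count from the pandas sketch above; the path and column names remain placeholders to verify.

```python
# The organizations-per-state table again, this time through Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eo-sparksql").getOrCreate()

eo = spark.read.csv("data/eo_master_file.csv", header=True, inferSchema=True)
eo.createOrReplaceTempView("eo")

spark.sql("""
    SELECT STATE, COUNT(*) AS organizations
    FROM eo
    GROUP BY STATE
    ORDER BY organizations DESC
    LIMIT 15
""").show()
```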

Machine Learning with Spark (MLlib)

We will work through two examples to make sure we understand how Machine Learning works in Spark.
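
Before those examples, here is a minimal sketch of the moving parts an MLlib pipeline involves: a label, assembled features, and a fitted estimator. The `FOUNDATION`, `INCOME_AMT`, and `ASSET_AMT` columns follow the Master File layout, but the label rule (treating foundation code 4 as a private non-operating foundation) is an assumption to verify against the IRS documentation.

```python
# Minimal MLlib pipeline: derive a 0/1 label, assemble numeric features,
# and fit a logistic regression. Column names and the code-4 label rule
# are assumptions to check before using this for real.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
eo = (spark.read.csv("data/eo_master_file.csv", header=True, inferSchema=True)
      .withColumn("label", (F.col("FOUNDATION") == 4).cast("double"))
      .na.drop(subset=["label"]))

assembler = VectorAssembler(inputCols=["INCOME_AMT", "ASSET_AMT"],
                            outputCol="features", handleInvalid="skip")
lr = LogisticRegression(featuresCol="features", labelCol="label")

train, test = eo.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(test).select("label", "prediction").show(5)
```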

Brick by Databricks

We are going to use Azure Databricks to experience Spark in the cloud.

Making sparks with bricks

  1. Redoing our Master File project with cloud compute
  2. Connecting our IRS 990 subset to Databricks
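
As a rough sketch of what the first step could look like inside a Databricks notebook, where a `spark` session and the `display` helper already exist; the storage URL is a placeholder that depends on how we upload or mount the subset:

```python
# Read the Master File from cloud storage inside Databricks; `spark` and
# `display` are provided by the notebook. The abfss:// URL is a placeholder.
eo = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("abfss://data@ouraccount.dfs.core.windows.net/irs990/eo_master_file.csv"))

display(eo.groupBy("STATE").count().orderBy("count", ascending=False))
```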

References

Airline Delays and Weather

Amazon Customer Reviews

Yelp Customer Reviews

Health Insurance Marketplace

NYC Taxi Data

Clothing Data Set

Public data from San Francisco

Credit Card Fraud