Projects

The big project

Nonprofit organizations receive a unique benefit from the US government: they are not required to pay federal income taxes. The Internal Revenue Service created Form 990, an annual information return, to keep these organizations accountable and prevent them from abusing their tax-exempt status.

We will use varied data sets built from these Form 990 filings throughout the semester to develop skills with PySpark, SparkSQL, and SparkML.

Your team will present to the entire class at each step. As a full ‘company,’ we will then agree on a standard path for all groups before moving to the next step.

Step 0: Deciding on a cloud compute engine

We need to decide whether we should use AWS, Azure, or Google Cloud. Create a compare-and-contrast presentation that is about 5 minutes long. We will make a decision as a class.

Step 1: Exploring 990 Tax Forms and Nonprofits

Create a 4-6 minute presentation on nonprofit entities and the IRS Form 990. Imagine you are the leader of a data science team whose group is about to start working with this data. You need to provide enough background that they understand what the data entails. Additionally, you need to find available data sources that we can use for the project.

Step 2: Connecting to the Data

Now that we have our data, we need to connect our Docker data science environment to it and provide a few descriptions of what we have.
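
As a minimal sketch of what “connected” could look like from inside the container, assuming the filings arrive as a CSV extract at a mounted path such as `data/irs_990_extract.csv` (both the path and the CSV format are placeholders until we settle on a source):

```python
# Quick connectivity check from inside the Jupyter/Spark container.
# The file path and CSV format are assumptions; swap in the real source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("irs990-connect").getOrCreate()

filings = spark.read.csv("data/irs_990_extract.csv",
                         header=True, inferSchema=True)
filings.printSchema()                # what columns did we actually get?
print(f"rows: {filings.count():,}")  # confirm the load looks complete
filings.show(5, truncate=False)
```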

Step 3: Matching Others’ Stories

The Foundation Trustee Fees: Use and Abuse report, The Nonprofit Sector in Brief, and Fast Company’s analysis through Civis provide some analyses and data summaries that we can try to recreate to validate our data source.
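
As one hedged sketch of what recreating a published summary could look like, assuming our extract carries `tax_year` and `total_revenue` columns (both hypothetical names to replace with the real schema):

```python
# Rebuild one published-style summary (total revenue by tax year) so we
# can compare our numbers to the report's. Path and columns are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("irs990-validate").getOrCreate()
filings = spark.read.csv("data/irs_990_extract.csv",
                         header=True, inferSchema=True)

(filings
    .groupBy("tax_year")
    .agg(F.sum("total_revenue").alias("total_revenue"))
    .orderBy("tax_year")
    .show())
```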

Step 4: Telling our own story

This element has two pieces: 1) story proposals and 2) the analysis.

Story Proposals

This piece will take 3-5 of the project days.

With your team, work through the data looking for a good data science story that we could tell with it. You will build a proposal that you would like the entire class to work on for the analysis.

Analysis

This piece will take 13-15 of the project days.

Start working out the analysis and story from the database. The report you provide must include the following elements:

  1. Well-documented final scripts.
  2. An executive summary of your results to present in class (4-8 minutes).
  3. A detailed report for non-technical readers.
  4. An appendix that supports the detailed report for technical readers.

Step 5: Reporting Results

Each team will be responsible for drafting a short article on nonprofits, based on its technical report, that we can publish. Think of this as a short blog post that you can use on your personal website. The class will pick the best one from all the groups to highlight.

The small projects

As we get into this course’s tools, we will complete 3-8 ‘small projects’ in groups of 2-3 where each member must commit to working equally on the task.

At some point, we need to Docker

Docker is used everywhere the cloud is used. Both data scientists and computer scientists leverage Docker in their workflows. We will gain familiarity with the Docker workflow but primarily use Docker as a pre-built tool to access Jupyter Notebooks, Spark, Python, and R without needing to configure our own computers.

Docker 101

Complete the Docker 101 Tutorial and put together a one-slide overview of what Docker is and how it is used.

Figuring out Docker for Data Science

We will use the all-spark-notebook image to create our data science container (typically started with something like `docker run -p 8888:8888 jupyter/all-spark-notebook`), where we will examine the IRS 990 Master Files. You can download the files from the links in the readme or use your BYUI Google login to see the files in Google Drive.

The IRS provides a few Master Files that together list all exempt organizations, more than 2 million in total.

Complete an exploratory analysis that provides a summary and story of the data, using each of the following tool sets (a pandas sketch of one starting point follows the list). The report should include at least ten charts, three tables, and multiple paragraphs.

  1. Python and pandas
  2. R and dplyr
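
Here is a pandas sketch of one possible starting point; the R/dplyr version would mirror the same steps. The file path is a placeholder, and the `EIN`, `STATE`, and `ASSET_AMT` column names follow the public Exempt Organizations Business Master File layout, so verify them against the files you actually download.

```python
# Exploratory starting point for the EO Master File in pandas.
# Path and column names are assumptions to check against the real file.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

eo = pd.read_csv("data/eo_master_file.csv", dtype={"EIN": str})

# Candidate table: exempt organizations per state.
print(eo["STATE"].value_counts().head(15))

# Candidate chart: reported assets on a log10 scale (zeros clipped to 1).
np.log10(eo["ASSET_AMT"].dropna().clip(lower=1)).plot.hist(bins=50)
plt.xlabel("log10 of reported assets (USD)")
plt.ylabel("Organizations")
plt.show()
```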

Finding that Spark

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast queries against data of any size. Simply put, Spark is a fast and general engine for large-scale data processing.

The fast part means that it’s faster than previous approaches to working with big data, like classical MapReduce. The secret to that speed is that Spark runs in memory (RAM), which makes processing much faster than working from disk.

The general part means that it can be used for multiple things like running distributed SQL, creating data pipelines, ingesting data into a database, running machine learning algorithms, working with graphs or data streams, and much more (ref).
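
A tiny local example of both claims, in-memory caching plus the mix of SQL and DataFrame code, using only a throwaway session (nothing here touches the project data):

```python
# Demonstrate Spark's two headline traits: caching data in RAM and
# querying the same data through SQL or the DataFrame API.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("spark-demo")
         .getOrCreate())

df = spark.range(1_000_000).withColumnRenamed("id", "n")
df.cache()  # keep the data in memory for the repeated queries below

df.createOrReplaceTempView("numbers")
spark.sql("SELECT COUNT(*) AS evens FROM numbers WHERE n % 2 = 0").show()
print(df.filter("n % 2 = 1").count())  # same data, DataFrame API this time
```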

Spark and SQL

Repeat the Figuring out Docker for Data Science project using each of the following (a PySpark sketch follows the list):

  1. PySpark and Spark SQL
  2. sparklyr or SparkR
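
Here is a hedged PySpark and Spark SQL version of the state count from the pandas sketch above; the path and column names remain placeholders to verify.

```python
# The organizations-per-state table again, this time through Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eo-sparksql").getOrCreate()

eo = spark.read.csv("data/eo_master_file.csv", header=True, inferSchema=True)
eo.createOrReplaceTempView("eo")

spark.sql("""
    SELECT STATE, COUNT(*) AS organizations
    FROM eo
    GROUP BY STATE
    ORDER BY organizations DESC
    LIMIT 15
""").show()
```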

Machine Learning with Spark (MLlib)

We will work through two examples to make sure we understand how Machine Learning works in Spark.
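
Before those examples, here is a minimal sketch of the moving parts an MLlib pipeline involves: a label, assembled features, and a fitted estimator. The `FOUNDATION`, `INCOME_AMT`, and `ASSET_AMT` columns follow the Master File layout, but the label rule (treating foundation code 4 as a private non-operating foundation) is an assumption to verify against the IRS documentation.

```python
# Minimal MLlib pipeline: derive a 0/1 label, assemble numeric features,
# and fit a logistic regression. Column names and the code-4 label rule
# are assumptions to check before using this for real.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
eo = (spark.read.csv("data/eo_master_file.csv", header=True, inferSchema=True)
      .withColumn("label", (F.col("FOUNDATION") == 4).cast("double"))
      .na.drop(subset=["label"]))

assembler = VectorAssembler(inputCols=["INCOME_AMT", "ASSET_AMT"],
                            outputCol="features", handleInvalid="skip")
lr = LogisticRegression(featuresCol="features", labelCol="label")

train, test = eo.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(test).select("label", "prediction").show(5)
```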

Brick by Databricks

We are going to use Azure Databricks to experience Spark in the cloud.

Making sparks with bricks

  1. Redoing our Master File project with cloud compute
  2. Connecting our IRS 990 subset to Databricks
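
As a rough sketch of what the first step could look like inside a Databricks notebook, where a `spark` session and the `display` helper already exist; the storage URL is a placeholder that depends on how we upload or mount the subset:

```python
# Read the Master File from cloud storage inside Databricks; `spark` and
# `display` are provided by the notebook. The abfss:// URL is a placeholder.
eo = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("abfss://data@ouraccount.dfs.core.windows.net/irs990/eo_master_file.csv"))

display(eo.groupBy("STATE").count().orderBy("count", ascending=False))
```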

References

Airline Delays and Weather

Amazon Customer Reviews

Yelp Customer Reviews

Health Insurance Marketplace

NYC Taxi Data

Clothing Data Set

Public data from San Francisco

Credit Card Fraud