Nonprofit organizations receive a unique benefit from the US Government: they are not required to pay taxes. The Internal Revenue Service created Form 990 for these organizations to prevent abuse of their tax-exempt status.
We will use varied data sets built from these Form 990 filings throughout the semester to build skills with PySpark, SparkSQL, and SparkML.
Your team will present to the entire class at each step. As a full ‘company,’ we will then agree on a standard path for all groups to follow into the next step.
- 5-team, 3-days
We need to decide whether we should use AWS, Azure, or Google Cloud. Create a compare-and-contrast presentation that is about 5 minutes long. We will make a decision as a class.
- 5-team, 3-days
Create a 4-6 minute presentation on nonprofit entities and the IRS Form 990. Imagine you are the leader of a data science team, and your group will start working with this data. You need to provide them with enough background to understand what the data entails. Additionally, you need to find available data sources that we can use for the project.
- 5-team, 3-days
Now that we have our data, we need to connect our Docker data science environment to it and provide a few descriptions of what we have.
- 5-team, 9-days
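Once the container can see the data, the "few descriptions" step can start with some one-line summaries. A minimal sketch using pandas inside the Docker Jupyter environment; the `EIN`/`NAME`/`STATE` columns are illustrative stand-ins, not the actual IRS Master File layout:

```python
import pandas as pd

# Hypothetical stand-in for a slice of the IRS 990 Master File;
# column names are assumptions for illustration only.
records = pd.DataFrame({
    "EIN": ["010000001", "010000002", "010000003"],
    "NAME": ["Org A", "Org B", "Org C"],
    "STATE": ["ID", "UT", "ID"],
})

print(records.shape)                     # row and column counts
print(records.dtypes)                    # column types
print(records["STATE"].value_counts())   # quick categorical summary
```

In practice you would replace the inline frame with `pd.read_csv(...)` pointed at the mounted data directory.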
The _Foundation Trustee Fees: Use and Abuse_ report, _The Nonprofit Sector in Brief_, and Fast Company (through Civis) provide some analyses and data summaries that we can try to recreate to validate our data source.
- 5-team, 18-days
This element has two pieces: 1) story proposals and 2) the analysis.
The proposals will take 3-5 of the days.
With your team, work through the data looking for a good data science story that our team could tell with the data. You will build a proposal you would like the entire class to work on for the analysis.
The analysis will take 13-15 of the days.
Start working out the analysis and story from the database. The report you are going to provide must include the following elements.
- 5-team, 3-days
Each team will be responsible for drafting a short article on nonprofits, based on its technical report, that we can publish. Think of this as a short blog post that you could use on your personal website. The class will pick the best one from all the groups to highlight.
As we get into this course’s tools, we will complete 3-8 ‘small projects’ in 2-3 person groups, where each member must commit to contributing equally to the task.
Docker is used everywhere the cloud is used. Both data scientists and computer scientists leverage Docker in their workflows. We will gain familiarity with the Docker process but primarily use Docker as a pre-built tool to access Jupyter Notebooks, Spark, Python, and R without needing to configure our computers.
Complete the Docker 101 Tutorial and put together a one-slide overview of what Docker is and how it is used.
We will use the all-spark-notebook image to create our data science container, where we will examine the IRS 990 Master Files. You can download the files from the links in the readme or use your BYUI login to access the files in Google Drive.
The IRS provides a few Master Files that together list over 2 million exempt organizations.
Complete an exploratory analysis that provides a summary and story of the data provided. This report should include at least ten charts, three tables, and multiple paragraphs.
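One of the report's tables might be a grouped summary like the sketch below, written in pandas. The `STATE` and `REVENUE` columns and their values are made up for illustration; they are not the actual Master File schema:

```python
import pandas as pd

# Hypothetical organization-level data; real values would come from the
# IRS 990 Master Files loaded in the Docker environment.
orgs = pd.DataFrame({
    "STATE": ["ID", "UT", "ID", "AZ", "UT", "ID"],
    "REVENUE": [120_000, 55_000, 98_000, 210_000, 75_000, 40_000],
})

# Counts and mean revenue by state, largest groups first -- the kind of
# table that anchors a paragraph in the exploratory report.
summary = (orgs.groupby("STATE")["REVENUE"]
               .agg(n="count", mean_revenue="mean")
               .sort_values("n", ascending=False))
print(summary)
```

Each chart in the report can follow the same pattern: build a small summary frame, then plot it (for example with `summary.plot.bar()`).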
Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast queries against data of any size. Simply put, Spark is a fast and general engine for large-scale data processing.
The fast part means that it is faster than earlier approaches to working with big data, such as classical MapReduce. The secret to its speed is that Spark runs in memory (RAM), making processing much faster than on disk drives.
The general part means that it can be used for many things: running distributed SQL, creating data pipelines, ingesting data into a database, running machine learning algorithms, working with graphs or data streams, and much more.
Repeat the Figuring out Docker for Data Science project using:
We will work through two examples to make sure we understand how Machine Learning works in Spark.
We are going to use Azure Databricks to experience Spark in the cloud.
Airline Delays and Weather
Amazon Customer Reviews
Yelp Customer Reviews
Health Insurance Marketplace
NYC Taxi Data
Clothing Data Set
Public data from San Francisco
Credit Card Fraud