Catching the cheat

Finishing the Semester

  • Case Study 8: Messy data and data science
    • The next project will be done with a partner. How do we want to pick partners?
    • Partners can work heavily together on the data request.
    • Each person needs to do their own slides and research on the data science languages and tools.
  • I am expecting a detailed conversation about the Call Center example on Tuesday.
  • Next Thursday will be spent with you guys requesting data and figuring out what needs to be cleaned.

Only one week after that!

Case Study

Background

You have recently been hired by the U.S. Internal Revenue Service (IRS) to catch corporate cheaters. You have been given three institutions to investigate. For each one, you will need to decide whether the IRS should build a legal case to investigate the institution for fraud.

  • Sino Forest Corporation: You have the financial statement values from Sino Forest Corporation’s 2010 report.
  • Government Entity: A dataset containing a government entity’s card transactions for 2010.
  • General Motors: The amounts paid to vendors in the 90 days preceding General Motors’ 2009 liquidation.

Our challenge

You will be responsible for reporting as much evidence as possible with the data provided for each institution above. The government entity has more available data than the other two, which will require you to dig deeper to find additional clues.

You can find varied data sources available for your use on the data page in Canvas. You will need to use more than one of the data sets provided, but you are not expected to use them all.

Deliverables

  • An 8-12 slide presentation to your IRS managers on the case against each entity.
  • At least one slide that shows the statistical test results from the analysis you performed.
  • At least one slide per institution that visualizes its first-digit distribution compared to Benford’s law.
  • At least one slide for one of the institutions that compares the last-digit distribution to what would be expected.
  • Multiple visualizations of the Government Entity data to find other interesting insights.
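
The first-digit and last-digit comparisons asked for above can be sketched in a few lines of Python. This is only a minimal sketch, not a required method: the helper names and the sample amounts are made up for illustration, amounts are assumed to be currency values with two decimals, and the last digit is assumed to be the final digit of the whole-number part. Benford’s law gives the expected share of first digit d as log10(1 + 1/d); a uniform 10% per digit is a common baseline for last digits.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit proportions under Benford's law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(amount):
    """Leading non-zero digit of a two-decimal amount, or None if it rounds to zero."""
    digits = f"{abs(amount):.2f}".replace(".", "").lstrip("0")
    return int(digits[0]) if digits else None

def last_digit(amount):
    """Final digit of the whole-number part (an assumption about what 'last digit' means)."""
    return int(abs(amount)) % 10

def proportions(digits, support):
    """Observed share of each digit in `support`, ignoring None values."""
    counts = Counter(d for d in digits if d is not None)
    total = sum(counts[d] for d in support)
    return {d: counts[d] / total for d in support}

# Hypothetical amounts, for illustration only (not from the course data).
amounts = [1250.00, 1875.40, 23.10, 310.99, 1420.00,
           96.50, 2100.00, 118.75, 47.20, 1690.30]

observed_first = proportions([first_digit(a) for a in amounts], range(1, 10))
observed_last = proportions([last_digit(a) for a in amounts], range(0, 10))
expected_first = benford_expected()
expected_last = {d: 0.10 for d in range(10)}  # uniform baseline

for d in range(1, 10):
    print(d, round(observed_first[d], 3), round(expected_first[d], 3))
```

The observed and expected dictionaries can be passed to any plotting tool as side-by-side bars, one pair per digit.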

Data Exploration

What can we do to orient our managers to the client?

The first few slides need to provide a background. Always include a chart, if possible, on each slide.

  • Who are the companies?
  • Which company are you going to focus in on, and why?
  • More details about the focus company.

What can we do to help them understand Benford’s law and our statistical test?

We need to give them a background on Benford’s distribution and the Chi-Square Goodness of Fit Test.

  • How does it work?
  • A short explanation of each.
  • Visuals for each of our companies.
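
To make the “how does it work?” explanation concrete, here is a minimal sketch of the Chi-Square Goodness of Fit statistic against Benford’s law. The function name and the digit counts are hypothetical, chosen only for illustration. The statistic sums (observed − expected)² / expected over the nine first digits, where expected = n × log10(1 + 1/d). With nine categories there are 8 degrees of freedom, and the 5% critical value is about 15.507.

```python
import math

def chi_square_stat(observed_counts, expected_props):
    """Sum of (observed - expected)^2 / expected over all expected categories."""
    n = sum(observed_counts.get(d, 0) for d in expected_props)
    stat = 0.0
    for d, p in expected_props.items():
        expected = n * p
        stat += (observed_counts.get(d, 0) - expected) ** 2 / expected
    return stat

# Benford's expected first-digit proportions.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Hypothetical first-digit counts for 1,000 values (illustration only).
observed = {1: 301, 2: 176, 3: 125, 4: 97, 5: 79, 6: 67, 7: 58, 8: 51, 9: 46}

stat = chi_square_stat(observed, benford)

# Critical value of the chi-square distribution with 8 degrees of freedom at alpha = 0.05.
CRITICAL_05_DF8 = 15.507
reject = stat > CRITICAL_05_DF8

print(f"chi-square = {stat:.3f}, reject Benford fit at 5%: {reject}")
```

A large statistic means the digits deviate from Benford’s law more than chance would explain; it is a flag for further investigation, not proof of fraud. If SciPy is available, `scipy.stats.chisquare(f_obs, f_exp)` returns the same statistic along with a p-value.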

What additional arguments can we provide for the ‘Government Entity’?

Can we guide the IRS where to search?

  • Additional Benford plots within sub-categories.
  • Other visualizations?
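
One possible way to point the IRS toward where to search is to compute the Benford fit separately for each sub-category and rank the groups by how far they deviate. The sketch below assumes the transactions can be reduced to (category, amount) pairs; the category names, amounts, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(amount):
    """Leading non-zero digit of a two-decimal amount, or None if it rounds to zero."""
    digits = f"{abs(amount):.2f}".replace(".", "").lstrip("0")
    return int(digits[0]) if digits else None

def benford_chi_square(amounts):
    """Chi-square statistic of first digits against Benford's law, or None if no usable values."""
    counts = Counter(first_digit(a) for a in amounts)
    counts.pop(None, None)
    n = sum(counts.values())
    if n == 0:
        return None
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items())

def rank_groups(rows, min_size=1):
    """Group (category, amount) rows and sort categories by deviation, largest first."""
    groups = defaultdict(list)
    for category, amount in rows:
        groups[category].append(amount)
    scores = {}
    for category, amounts in groups.items():
        if len(amounts) >= min_size:
            score = benford_chi_square(amounts)
            if score is not None:
                scores[category] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical rows, for illustration only (not from the course data).
rows = [
    ("Travel", 940.00), ("Travel", 910.50), ("Travel", 985.20),
    ("Travel", 99.00), ("Travel", 9.75), ("Travel", 930.00),
    ("Supplies", 12.50), ("Supplies", 180.00), ("Supplies", 23.40),
    ("Supplies", 310.00), ("Supplies", 1450.00), ("Supplies", 47.00),
]

for category, score in rank_groups(rows):
    print(category, round(score, 2))
```

The chi-square statistic is unreliable for very small groups, so with real data it is worth raising `min_size` well above the toy value used here before reading anything into the ranking.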

A conclusions slide

Wrap it up. Make conclusions and recommendations based on the work you have done.