Overview

Welcome to CSE 150.

This course will introduce you to data science and provide insight into how to use data to make decisions using visualization and statistical inference. Data scientists spend a significant amount of their time cleaning and manipulating data for use in decision making.1 In this course, we have crafted real-world data for use in our learning to offset much of the work around data cleaning and manipulation. CSE 250, CSE 350/Math 335, CIT 111, CIT 225 and other advanced analytics courses like CSE 450 and Math 488 can help you build those skills.

We will learn the principles of data storage and management for data analysis and visualization through Google Sheets and Tableau. Neither tool requires a programming language for the way we use it in our class. Both are used heavily in the data science space, and they have many connections to R and Python.2

If you have signed up for this class, you are most likely driven by curiosity and interested in how data decisions are made (sometimes called data intuition). Possibly, you have a more empathetic approach to how the world works and how problems can be solved. Finally, you have an eye for visualization and how data is communicated to make impactful decisions.3 The course follows these principles of teaching data science:

  • Organize the course around a set of diverse case studies
  • Integrate computing into every aspect of the course
  • Teach abstraction, but minimize reliance on mathematical notation
  • Structure course activities to realistically mimic a data scientist’s experience
  • Demonstrate the importance of critical thinking/skepticism through examples

See here for other great quotes about data science and learning. It would also be valuable for you to read my learning manifesto.

Course Outcomes

As a successful learner, you will be able to:

  1. Organize and store tabular data for time-series, spatial, and measured variables.
  2. Calculate data summaries and produce visualizations from data.
  3. Communicate about data with people of varied backgrounds (e.g., novices, database administrators, data scientists, business decision-makers).
  4. Describe the implications of data visualization and summaries in the decision-making process.

Course Materials and Requirements

We will not focus on traditional statistical hypothesis testing or sophisticated statistical modeling. Instead, we will draw on statistics for the concepts of how to visualize uncertainty and variability, while keeping our focus on visualization, data handling, and ‘safe-stats.’4

Course Materials

We will be using two textbooks throughout the semester. Beyond the textbooks, we recommend dual monitors if you are taking this class remotely.

Good Charts: You can purchase a digital or hard copy of the book. It is well reviewed in industry and is structured to be much easier to read than standard textbooks.

CSE 150 Data Intuition and Insight Supplement: We will be covering more than visualization in the course, and we have written supplemental materials that we will use in tandem with Good Charts. This book is online and is free.

We do expect that you have familiarity with using web-based software.

Competency Assumptions

There are no course prerequisites for this course. However, we do assume that you love using data to make decisions.

Course Format

We will meet for 1 hour twice a week and use the following weekly rhythm. After the first week, we will complete multiple data projects, each spanning four class days. We expect 1-3 hours of work to be completed outside of class for each class period.

Preparation

In my experience, lecture-based training outside of college is even more expensive than it is in college. A week’s worth of training can cost more than a semester of school here at BYUI. Because of this expense, learning how to digest online material and get up to speed on a topic before going to the expert with questions is a valuable skill to develop. I expect that you have completed the assigned reading material before class begins. You will also have work to complete after class.

Class time

We will use class time to reinforce the analytics and visualization concepts needed for the weekly topic covered in the case study.

Grading

Grading is a nasty side effect of mass learning and academia. We are in a class at a university and will have to manage this side effect. However, we do not have to let it control our learning, thinking, or this class. Learning and thinking should motivate each activity.

We will complete work in the following areas.

  1. Weekly reading (30%)
  2. Case Study Preparation (5%)
  3. Case Study (40%)
  4. Teach one another (15%)
  5. Visualization challenge (10%)

Weekly reading

To keep up with the conversation in class, you will need to keep up to date with the reading. We will have a couple of quiz questions about the readings, along with some self-reporting on your reading completion.

Case Study

Our class is focused on six data case studies. This section will count for over 50% of your grade.

  • Preparation for the case study (5%): You will have two checkpoints for each project.
  • Case Study completion (40%): Each student will submit their final presentations based on the requirements of the case study.
  • Teach one another (10%): Each student will be responsible for presenting to another student and then providing questions and feedback on another project as a part of teaching one another.

Teach one another

This section includes three elements: 1) Tooltips presentations, 2) Case study presentations, and 3) Case study comments. Each is worth 5% of your overall grade. There will be one quiz that you can take repeatedly over the semester to document your completion.

Visualization challenge

We expect to provide an in-class and take-home challenge that you can start on the last day of class. It will cover three elements.

  1. A messy data set that requires you to describe its issues and what you would do to clean it up for use in visualization.
  2. A data journalism article review where you identify the strengths and weaknesses of the visualizations used to tell their story.
  3. A visualization request to be done in Tableau with the data set that we provide.

Semester deliverables

  1. A cover letter stating the key concepts and techniques that you learned during our projects and your goals to continue learning in this area. Include a grade request that represents your knowledge and task completion.
  2. A dream data science resume that covers the skills you desire upon graduation.

Goals

After reviewing the material above, please make a list of your learning goals for this class. We would enjoy talking with you about those goals in the first few weeks of the semester. We look forward to working with you this semester.

Additional disclaimers

Our class will uphold the following as well.


  1. See the New York Times article. They say, ‘Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.’ They also quote Monica Rogati, ‘Data wrangling is a huge — and surprisingly so — part of the job. It’s something that is not appreciated by data civilians. At times, it feels like everything we do.’↩︎

  2. Plot.ly for Python, Plot.ly for R, Google Sheets for R, and Google Sheets for Python. See Matthew Lincoln’s great post on using Google Sheets in his data science work.↩︎

  3. What makes a good data scientist-engineer?↩︎

  4. Donoho correctly confirms that applied statisticians regularly engage in all the activities touted in press releases, like the one he quotes: “the collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of … applications.” But there is currently a big gap between what statisticians do and what is considered worthy of study. The incentive structures of academic statistics still signal that mathematical statistics and the creation of new models and inferential procedures are more valuable than work related to data manipulation, visualization, and programming. This is reflected in the content of for-credit courses, qualifying exams, and standards for funding and promotion. Graduate students and junior faculty are caught between a rock and a hard place (Waller 2017). It can be very difficult to present modern data scientific work as impactful scholarly activity when the system still defines that primarily as theory and methodology papers. - Wickham and Bryan -↩︎