Finding a Dataset

Introduction

So far in class, we have encountered only fairly clean data examples, carefully managed and stored in an easy-to-access location. We have been able to use the import() function from the rio package with a link to a well-contained data resource.
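
As a reminder of that workflow, here is a minimal sketch of importing a file and taking a first look at it. The name data_location and the tiny table of made-up values are assumptions for illustration only; a temporary CSV file stands in for an online dataset so the example runs without an internet connection. With a real dataset, data_location would be the link you found.

```r
library(rio)

# Hypothetical stand-in for an online dataset: write a tiny CSV of
# made-up values to a temporary file. With real data, set
# data_location to the link of the dataset instead.
data_location <- tempfile(fileext = ".csv")
write.csv(
  data.frame(site = c("A", "B", "C"), rainfall_in = c(1.2, 0.8, 1.5)),
  data_location,
  row.names = FALSE
)

# import() picks a reader based on the file extension (.csv here)
dat <- import(data_location)

dim(dat)    # number of rows and columns
names(dat)  # variable names
head(dat)   # first few rows
```

If import() fails on a link, or the result does not look like a table of rows and columns, the data will probably need extra extraction or cleaning before you can use it.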

When we encounter data in the wild, it can be much more complicated to extract and clean up. However, good research should be transparent, with data sources cited and, where possible, published. Published graphs often link to the raw data behind them.

For this assignment, you will find a dataset that interests you online and submit a link to the location of the data.

Lesson Objectives:

  1. Start with a research question that interests you. It can be about any topic (finance, music, video games, hunting, weather, mental health, AI…seriously anything!)
  2. Use the resources provided in this document to find a dataset related to your research question
  3. Refine your research question

Research Question

For this project, a good research question will be interesting to you AND feasible to answer with available data. You can certainly think of exciting research questions for which the data are impossible to find or collect. Expect to refine your research question as you begin the data search.

Once you have a research question in mind, start looking for data relating to it.

Finding Data

There are several online resources for data searches. Below are some good places to start.

Google has a search engine specifically for datasets (Google Dataset Search).

Statista has datasets on a wide range of topics but is particularly well suited for government policy-related data such as health, crime, and social science.

Kaggle runs competitions for companies that outsource data challenges. It has also compiled a large library of datasets on a range of topics. Because businesses run competitions through it, many of its datasets relate to specific challenges that businesses face.

Refining the Question

We are not always able to find the exact data that addresses our research question. This is a limitation of not designing your own study from scratch.

If you are unable to find the perfect data, refine your question.

It’s still a good idea to start with a question that interests you. You’re more likely to end up with a dataset that you enjoy working with.

Once you’ve found a dataset of interest, submit the link to the website in Canvas.