When I started out in data I taught myself how to scrape websites. I didn't just do it because it was fun to learn how to scrape; it was my only source of interesting data that I could use to practice my analysis skills. There wasn't as much open data back then, so I had to find it by hand.

Open datasets are important not just for transparency and insights but also for education. Having a fun dataset is the best way to start discovering the joy of analyzing it.

In this document I will describe datasets that I like to use whenever I teach, simply because they are fun to analyse. This is a working document: I will mainly use this page for reference, and more datasets will be added over time. It will contain both content that I've collected and content from others.

Each dataset has a description, a download link, a suggestion for an analysis and a source. Please be nice and refer to the source if you are going to share the data or use it in your research.

Chick Weight

My absolute favorite starter dataset. In this dataset you try to predict how much a chicken will weigh right before it is slaughtered. You get to analyse the effect of different diets on the chicken's weight as well as the effect of time.

You can download the data here.
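A version of this data also ships with R as the built-in ChickWeight dataset, so a quick way to get a first model going is to pull that copy through statsmodels. A minimal sketch, assuming you have statsmodels installed and that the columns match the R version (weight, Time, Chick, Diet); the download above may be organised differently:

import statsmodels.api as sm
import statsmodels.formula.api as smf

# fetch the R "ChickWeight" dataset from the Rdatasets repository
chicks = sm.datasets.get_rdataset("ChickWeight", "datasets").data

# model weight as a function of time on the diet and of the diet itself
model = smf.ols("weight ~ Time + C(Diet)", data=chicks).fit()
print(model.summary())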

Affairs

This dataset from 1978 contains information about politicians having affairs. The goal of the dataset is to find characteristics that make it more likely for a politician to have an affair. Does the number of children matter? Does it matter how religious the politician is? Or the education level and marriage rating?

You can download the data here.

Source: Fair, R. (1977) "A note on the computation of the tobit estimator", Econometrica, 45, 1723-1727. http://fairmodel.econ.yale.edu/rayfair/pdf/1978A200.PDF.
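As it happens, statsmodels bundles a version of Fair's affairs data as sm.datasets.fair, which makes for a quick start. A hedged sketch of a logistic regression on that copy (if you use the download link above instead, the column names will likely differ):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fair's extramarital affairs data as bundled with statsmodels
affairs = sm.datasets.fair.load_pandas().data

# turn the affairs measure into a simple yes/no outcome
affairs["had_affair"] = (affairs["affairs"] > 0).astype(int)

# which characteristics make an affair more likely?
model = smf.logit(
    "had_affair ~ children + religious + educ + rate_marriage", data=affairs
).fit()
print(model.summary())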

Computers

This dataset contains data from 1993 to 1995 about the prices of computers. Because it was the nineties, you can check the effect of adding a cd-rom drive on the price of a computer, or the effect of clock speed on the price. You could even check the effect of ad placements.

lm(formula = price ~ speed + hd + ram + screen, data = Computers)

Residuals:
     Min       1Q   Median       3Q      Max 
-1048.21  -297.09   -48.29   214.30  2538.47

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.33311   88.22222   0.117    0.907    
speed         5.24930    0.27836  18.858   <2e-16 ***
hd           -0.57936    0.03507 -16.520   <2e-16 ***
ram          76.74545    1.53586  49.969   <2e-16 ***
screen      105.52592    6.18879  17.051   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 427.5 on 6254 degrees of freedom
Multiple R-squared:  0.4586,    Adjusted R-squared:  0.4583 
F-statistic:  1325 on 4 and 6254 DF,  p-value: < 2.2e-16
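If you prefer Python over R, the same regression is easy to reproduce with statsmodels, and you can throw in the cd-rom dummy while you are at it. A sketch, assuming the csv follows the column names of the R/Ecdat version of this dataset (price, speed, hd, ram, screen and a yes/no cd column); the file path is a placeholder:

import pandas as pd
import statsmodels.formula.api as smf

# placeholder path: point this at wherever you saved the download
computers = pd.read_csv("computers.csv")

# same regression as above, plus a dummy for having a cd-rom drive
model = smf.ols("price ~ speed + hd + ram + screen + C(cd)", data=computers).fit()
print(model.summary())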

You can download the data here.

Source: Stengos, T. and E. Zacharias (2005) “Intertemporal pricing and price discrimination : a semiparametric hedonic analysis of the personal computer market”, Journal of Applied Econometrics, forthcoming

Cigarettes

This dataset contains data about cigarette consumption in the United States. I have a hunch that certain states might have more smokers than others, and it may also be that the price elasticity of these smokes is not consistent across states.

The dataset contains state, year, consumer price index, state population, number of packs per capita, state personal income (total, nominal), average combined local tax, average price and excise tax including sales tax.
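To chase the elasticity hunch, a log-log regression of packs on real price is a reasonable first step. A hedged sketch, assuming the csv uses short column names like packs, price and cpi; both the path and the names are placeholders that may not match the actual download:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# placeholder path: point this at wherever you saved the download
cigs = pd.read_csv("cigarettes.csv")

# deflate the nominal price with the consumer price index, then take logs
cigs["log_packs"] = np.log(cigs["packs"])
cigs["log_price"] = np.log(cigs["price"] / cigs["cpi"])

# the coefficient on log_price is the pooled price elasticity of demand
model = smf.ols("log_packs ~ log_price", data=cigs).fit()
print(model.params["log_price"])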

You can download the data here.

Source: Stock, James H. and Mark W. Watson (2003) Introduction to Econometrics, Addison-Wesley Educational Publishers, http://wps.aw.com/aw_stockwatsn_economtrcs_1, chapter 10.

Stocks

Stocks are always fun to analyse, and as always, pandas is your best friend.

> # pandas.io.data has since moved into the separate pandas-datareader package
> from pandas_datareader import data as web
> import datetime
> start = datetime.datetime(2010, 1, 1)
> end = datetime.datetime(2013, 1, 27)
> f = web.DataReader("F", 'yahoo', start, end)

This will give you a dataset which you can then analyze.

> print(f.head())
             Open   High    Low  Close     Volume  Adj Close
Date                                                        
2010-01-04  10.17  10.28  10.05  10.28   60855800    9.42672
2010-01-05  10.45  11.24  10.40  10.96  215620200   10.05028
2010-01-06  11.21  11.46  11.13  11.37  200070600   10.42625
2010-01-07  11.46  11.69  11.32  11.66  130201700   10.69218
2010-01-08  11.67  11.74  11.46  11.69  130463000   10.71969

Or even quickly visualize.

> f['Open'].plot()
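From there, daily returns and a rolling average are one-liners on the same frame (a small sketch, nothing more):

> returns = f['Close'].pct_change()          # day-over-day returns
> returns.rolling(window=30).mean().plot()   # smooth them with a 30 day rolling mean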

Pokemon

Even though most fans are nearing their thirties, pokemon is still an obsession, which is why the internet created a reliable pokemon api along with a python package (pykemon) that you can use to download pokemon data.

import pykemon
import pandas as pd

# fetch the first 300 pokemon, one api call per pokemon (this takes a while)
pokemons = []
for i in range(1, 300):
    pokemons.append(pykemon.get(pokemon_id=i))

# keep the name, hit points, attack and first listed type of each pokemon
data = [[p.name, p.hp, p.attack, list(p.types)[0]] for p in pokemons]
df = pd.DataFrame(data, columns=["name", "hp", "attack", "type"])

You might be able to create a model that predicts which pokemon beats another, or perhaps you would like to analyze the hp/attack ratio for each pokemon ... the possibilities are endless.
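The hp/attack ratio, for instance, is one extra column on the frame built above:

# a rough "tankiness" measure: hit points per point of attack
df["hp_attack_ratio"] = df["hp"] / df["attack"]
print(df.sort_values("hp_attack_ratio", ascending=False).head(10))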

Xbox at eBay

Auctions are interesting from a game theory/behavioral science perspective, which is why ebay decided to share some data a while ago. This dataset contains bidding information on (amongst other things) Xbox sales. You can find the data here.

Deaths in movies

Ever wonder how many people died in that action movie? Yep, there's a dataset for that too, thanks to the fine folks at moviebodycount. For inspiration on how to analyze this dataset I refer you to an excellent notebook by Ramiro Gomez.

A small preview of this notebook: it turns out that there is a relationship between movie deaths and IMDB rating.


Github

Github has a pretty great api that you can mine from the command line. If you have curl, jq and csvkit you can turn any github user's repositories into csv data with just two lines on the command line.

curl https://api.github.com/users/mbostock/repos | jq '.' > github.json
in2csv github.json > github.csv
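If you'd rather stay in Python, pandas can read the same endpoint directly; a sketch (the selected columns are fields I expect in the response, trim to taste):

import pandas as pd

# read the same endpoint the curl call above hits
repos = pd.read_json("https://api.github.com/users/mbostock/repos")
print(repos[["name", "language", "stargazers_count"]].head())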

Conclusion

This list is far from complete, but these datasets should help motivate students to learn about data analysis.

Feel free to use any of these datasets to teach! Please do keep the sources in mind; others have spent a lot of time on them.
