Author : Vincent D. Warmerdam @ GoDataDriven
This notebook and my talk will discuss:
This notebook (and supplementary references) can be found on my blog: http://koaning.github.io
Things you learn when working with data:
Things that are nice about python:
Things that are nice about jupyter notebooks:
Excel fails in many areas. Besides the fact that it cannot handle files larger than 1 GB, the main problem with Excel is...
Documentation is done via markdown in the actual notebook file. It tends to read better than comments in Sublime and it makes it extremely easy to share with colleagues. Notice that it also has support for formulas and other languages.
$$ a^2 + b^2 = c^2 $$
%%javascript
console.log("This is normal logged output")
console.log("%cDont make me angry. You wont like me when Im angry.","color: green; font-size:35px;font-weight: bold;")
By the way, did you know about this trick?
%%javascript
console.log("moving kitten %c", "background: url('http://i.imgur.com/TanUtXo.gif'); padding-right: 300px; font-size: 250px; text-align: center")
%%ruby
puts(1+1)
%load_ext rmagic
%%R
summary(ChickWeight)
You can install things via pip FROM the IPython notebook if you really want to.
import pip
def install(package):
    pip.main(['install', package])
# install('pandas')
import numpy as np
import pandas as pd
from ggplot import *
A list of some useful ones:
ctrl + s : save the notebook

While in command mode the shortcuts work a bit differently:

up/down arrows : browse through the different cells
Pandas gives you a datastructure called a DataFrame. This is the object that contains the data and the methods that you'll use the most. At first glance, the DataFrame will look like a dictionary of arrays with extra functionality, but in real life it is a high-performance data wrangler that provides a more flexible API than Excel. Let's go and create a simple DataFrame.
# first create three random arrays, the first two numeric, the last one string based
a = np.random.randn(10)
b = np.random.randn(10)
c = [ 'foo' if x > 0.5 else 'bar' for x in np.random.rand(10)]
d = { 'a' : pd.Series(a), 'b' : pd.Series(b), 'c': pd.Series(c) }
d
Note the difference between a and pd.Series(a). Pandas translates arrays into pd.Series objects. These objects have the characteristics you would expect from an array, but they also allow for a flexible index. Normal arrays only have numerical indices, whereas a pd.Series also allows dates and strings as indices. This means that you can use those indices for more flexible selections of data.

Also note that pandas lists the dtype of the array; a normal python array will not do this. Pandas keeps track of what datatype is in the array, again to help you make selections. If the types aren't clear to pandas it will refer to them as an object array.
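As a small illustrative sketch (the labels and values here are made up), a pd.Series with a string index:

s = pd.Series([0.5, 1.5, 2.5], index=['mon', 'tue', 'wed'])
# select by string label instead of a numeric position
s['tue']
# pandas tracks the datatype of the values: float64
s.dtype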
Let's now use this dictionary to create a DataFrame object.
df = pd.DataFrame(d)
df
This DataFrame object will be the main object you will talk to when using pandas. Note that the column names are just like those assigned in the dictionary. Also note that the indices of this data frame are the same indices as the original pd.Series objects. This object should feel like the traditional excel table.
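A small made-up sketch of what happens when the pd.Series indices don't match: pandas aligns them on the union of the indices and fills the gaps with NaN.

s1 = pd.Series([1, 2], index=['x', 'y'])
s2 = pd.Series([3, 4], index=['y', 'z'])
# 'x' only exists in s1 and 'z' only in s2, so those cells become NaN
pd.DataFrame({'left': s1, 'right': s2})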
# select a column from the dataframe, dict-style
df['a']
# so you can create a boolean list
df.a < 0
# this column behaves just like a numpy array
df['e'] = 1 + df.a
df
# this true-false list can be used on the dataframe for selection
df[df.a < 0]
# you can combine true false lists
df[ (df.a < 0) & (df.b < 0) ]
# and we could again only get a certain column back
df[ (df.a < 0) & (df.b < 0) ].c
Notice for this last query that we don't just get array values back. We also get the original indices from the dataframe.
# we can also ask things directly to the DataFrame
df.head(1)
# what has a head, usually has a tail
df.tail(1)
Note the indices for these two methods.
# if you want to know the number of rows/columns
df.shape
# quick summary of data
df.describe()
# there are some basic functions you can use to aggregate
# axis = 0 indicates that we apply a function per column
df.mean(axis=0)
# there are some basic functions you can use to aggregate
# axis = 1 indicates that we apply a function per row
df.mean(axis=1)
Pandas was written by Wes McKinney while he was working at a financial institution. Financial data analysis usually involves working a lot with dates, so it is no coincidence that pandas has great support for working with time series. We can easily apply functions per week or per month.
The time series support is vast. Pandas only requires that the index of a dataframe consists of timestamps; it then allows you to perform any function on any time-grouped part of the data. So the index that is supplied doesn't have to be a number, it can also be a timestamp!
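As a small sketch with made-up data, computing monthly means of a daily series could look like this (assuming a reasonably recent pandas; older versions used resample('M', how='mean') instead):

# a series of 90 daily values, indexed by timestamps
dates = pd.date_range('2014-01-01', periods=90, freq='D')
ts = pd.Series(np.random.randn(90), index=dates)
# group the daily values per month and take the mean
ts.resample('M').mean()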
A moment of abstraction.
Very often we will want to perform group operations on data. In the financial example we saw that we wanted to perform operations for every week or for every month. In another example we might want to apply methods per geolocation, per type of person, per website, etcetera.

Whenever we are doing such an operation, we might look at it as a split-apply-combine operation, shown visually below:
Pandas has great support for these kinds of operations. Based on the key of a dataframe we will split the data and then apply functions to the grouped data.
For this next bit we will use the chickweight dataset. This dataset contains information about different diets for chickens. The goal of the dataset is to find out which diet will get the chickens as fat as possible.
chickdf = pd.read_csv('http://koaning.s3-website-us-west-2.amazonaws.com/data/pydata/chickweight.csv')
chickdf.describe()
Imagine doing just that operation in excel.
The goal is to find out which diet causes the most weight gain for the chickens, so let's group the chickens per diet. You do this by creating a grouped object, which splits the dataframe into groups based on column values. This grouped variable can be iterated over.
grouped = chickdf.groupby(chickdf.Diet)
for thing in grouped:
    print(thing)
Each group still behaves like a dataframe, so we can also apply the describe function here.
grouped.describe()
We can use built-in functions on our grouped objects, but we can also apply our own functions. Note that these functions need to be applicable to a DataFrame object.
def show_size(x):
    return "Dude we have " + str(len(x)) + " chickens here!"
grouped.apply(show_size)
Realize that this means that any function can be used here, although you will want to think about performance when dealing with large datasets. This flexibility is one of the things that makes the pandas API very powerful.
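As a small illustrative sketch, a lambda works just as well; here we compute the spread of the weights within each diet group:

# apply a custom aggregation per group: max weight minus min weight
grouped.weight.apply(lambda x: x.max() - x.min())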
We can also create groups based on two columns in the table.
chickdf.groupby(['Diet','Time']).weight.apply(show_size)
You can also combine pandas with plotting tools like ggplot for fast interactive data exploration.
%pylab inline
ggplot(aes(x='Time', y='weight'), data=chickdf) + geom_point()
agg = chickdf.groupby(['Diet','Time']).weight.aggregate(np.mean)
agg = agg.reset_index()
agg
ggplot(aes(x='Time', y='weight', color="Diet"), data=agg) + geom_line()
But in all seriousness: you might want to start thinking about the ethics of working with data.