Pandas/Jupyter Talk @ Amsterdam Python Meetup

Author : Vincent D. Warmerdam @ GoDataDriven

This notebook and my talk will discuss

  • how pandas can help you eliminate the bother out of simple data task
  • how jupyter notebooks might change the way you write scripts
  • what sugar both these tools have to offe

This notebook (and supplementary references) can be found on my blog: http://koaning.github.io "

Why Notebook? Why Pandas?

Things you learn when working with data:

  • data is domain specific
  • I like to see what steps my colleague took
  • I need things quick and flexible
  • lots of steps are similar
  • I like to apply the DRY principle
  • I prefer typing to mouseclicking

Things that are nice about python:

  • flexible language
  • large community
  • performance
  • tends to written quickly

Things that are nice about jupyter notebooks:

  • documentation is gratis!
  • support for more than just python
  • sugar for data analysis/quick scripting

Excel fails in many areas. Besides the fact that it cannot handle files larger than 1Gb the main problem with excel is...

Documentation

Documentation is done via markdown in the actual notebook file. Tends to read better than comments in sublime and it makes it extremely easy to share with collegues. Notice that it also has support for formulas and other languages.

$$ a^2 + b^2 = c^2 $$

The use of other languages:

In [6]:
%%javascript 
console.log("This is normal logged output")
console.log("%cDont make me angry. You wont like me when Im angry.","color: green; font-size:35px;font-weight: bold;")

By the way, did you know about this trick?

In [7]:
%%javascript 
console.log("moving kitten %c", "background: url('http://i.imgur.com/TanUtXo.gif'); padding-right: 300px; font-size: 250px; text-align: center")
In [8]:
%%ruby 
puts(1+1)
2

In [9]:
%load_ext rmagic 
In [10]:
%%R 
summary(ChickWeight)
     weight           Time           Chick     Diet   
 Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
 1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
 Median :103.0   Median :10.00   20     : 12   3:120  
 Mean   :121.8   Mean   :10.72   10     : 12   4:118  
 3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
 Max.   :373.0   Max.   :21.00   19     : 12          
                                 (Other):506          

Package Management

You can install things via pip FROM the ipython notebook if you really wanted to.

In [4]:
import pip

def install(package):
   pip.main(['install', package])

# install('pandas') 
In [11]:
import numpy as np 
import pandas as pd 
from ggplot import *

Notebook Shortcuts

A list of some useful ones:

  • ctrl + enter : run cell
  • shift + enter : run cell and create/select the one after
  • ctrl + m : enter command mode
  • ctrl + s :save notebook While in command mode the shortcuts work a bit different:

  • up/down arrows browser through different cells

  • y : changes cell to a code cell
  • m : changes cell to a markdown cell
  • a : create new cell above
  • b : create new cell below
  • l : toggle line numbers
  • dd : delete cell

Pandas

Pandas gives you a datastructure called a DataFrame. This is the object that contains the data and methods that you'll use the most. Initially, the DataFrame will look like a dictionary that contains arrays with extra functionality but in real life it is an high performant data wrangler that provides a more flexible API than excel. Let's go and create a simple DataFrame.

In [13]:
# first create three random arrays, the first two numeric, the last one string based
a = np.random.randn(10)
b = np.random.randn(10)
c = [ 'foo' if x > 0.5 else 'bar' for x in np.random.rand(10)]
d = { 'a' : pd.Series(a), 'b' : pd.Series(b), 'c': pd.Series(c) } 
d
Out[13]:
{'a': 0    0.527814
 1   -1.014339
 2   -0.116806
 3    0.650322
 4   -0.629597
 5    0.517408
 6   -0.472925
 7    0.332460
 8    1.011738
 9   -1.713077
 dtype: float64, 'b': 0   -1.461418
 1    0.597200
 2    1.059903
 3    1.932816
 4   -0.036146
 5    1.839549
 6   -1.766969
 7    1.510172
 8    0.831622
 9    0.436959
 dtype: float64, 'c': 0    foo
 1    foo
 2    foo
 3    bar
 4    bar
 5    foo
 6    foo
 7    bar
 8    foo
 9    bar
 dtype: object}

Note that difference between a and pd.Series(a). Pandas translates arrays into a pd.Series object. These objects have characteristics that you would expect from an array but it allows for a flexible index. Normal arrays only have numerical indices but this object also allows for dates and strings. This also means that you can use said indices for more flexible selections of data.

Source : https://www.youtube.com/watch?v=qbYYamU42Sw

Also note that pandas lists the dtype of the array, a normal python array will not do this. Pandas keeps track of what datatype is in the array, again to help you make selections. If the types aren't clear to pandas it will refer to it as an object array.

Let's now use this dictionary to create a DataFrame object.

In [15]:
df = pd.DataFrame(d) 
df
Out[15]:
a b c
0 0.527814 -1.461418 foo
1 -1.014339 0.597200 foo
2 -0.116806 1.059903 foo
3 0.650322 1.932816 bar
4 -0.629597 -0.036146 bar
5 0.517408 1.839549 foo
6 -0.472925 -1.766969 foo
7 0.332460 1.510172 bar
8 1.011738 0.831622 foo
9 -1.713077 0.436959 bar

10 rows × 3 columns

This DataFrame object will be the main object you will talk to when using pandas. Note the column names are just like those assigned in the dictionary. Also note that the indices of this data frame at the same indices as the original pd.Series objects. This object should feel like the traditional excel table.

List of simple queries we can ask to this object.

In [16]:
# select a column from the dataframe, dict-style
df['a']
Out[16]:
0    0.527814
1   -1.014339
2   -0.116806
3    0.650322
4   -0.629597
5    0.517408
6   -0.472925
7    0.332460
8    1.011738
9   -1.713077
Name: a, dtype: float64
In [32]:
# so you can create a boolean list 
df.a < 0
Out[32]:
0    False
1    False
2     True
3    False
4     True
5     True
6     True
7    False
8     True
9    False
Name: a, dtype: bool
In [21]:
# this column behaves just like a numpy array 
df['e'] = 1 + df.a
df
Out[21]:
a b c e
0 0.527814 -1.461418 foo 1.527814
1 -1.014339 0.597200 foo -0.014339
2 -0.116806 1.059903 foo 0.883194
3 0.650322 1.932816 bar 1.650322
4 -0.629597 -0.036146 bar 0.370403
5 0.517408 1.839549 foo 1.517408
6 -0.472925 -1.766969 foo 0.527075
7 0.332460 1.510172 bar 1.332460
8 1.011738 0.831622 foo 2.011738
9 -1.713077 0.436959 bar -0.713077

10 rows × 4 columns

In [23]:
# this true-false list can be used on the dataframe for selection
df[df.a < 0]
Out[23]:
a b c e
1 -1.014339 0.597200 foo -0.014339
2 -0.116806 1.059903 foo 0.883194
4 -0.629597 -0.036146 bar 0.370403
6 -0.472925 -1.766969 foo 0.527075
9 -1.713077 0.436959 bar -0.713077

5 rows × 4 columns

In [26]:
# you can combine true false lists
df[ (df.a < 0) & (df.b < 0) ] 
Out[26]:
a b c e
4 -0.629597 -0.036146 bar 0.370403
6 -0.472925 -1.766969 foo 0.527075

2 rows × 4 columns

In [28]:
# and we could again only get a certain column back 
df[ (df.a < 0) & (df.b < 0) ].c
Out[28]:
4    bar
6    foo
Name: c, dtype: object

Notice for this last query that we don't just get array values back. We also get the original indices from the dataframe.

In [62]:
# we can also ask things directly to the DataFrame 
df.head(1)
Out[62]:
a b c g e
0 1.342864 -0.871284 foo 0.471581 2.342864
In [63]:
# what has a head, usually has a tail
df.tail(1)
Out[63]:
a b c g e
9 -0.838586 -0.220757 bar -1.059343 0.161414

Note the indices for these two methods.

In [64]:
# if you want to know the number of rows/columns 
df.shape
Out[64]:
(10, 5)
In [65]:
# quick summary of data 
df.describe()
Out[65]:
a b g e
count 10.000000 10.000000 10.000000 10.000000
mean 0.087573 0.203792 0.291366 1.087573
std 0.767731 0.499275 0.757998 0.767731
min -1.045761 -0.871284 -1.059343 -0.045761
25% -0.459889 0.080989 -0.127380 0.540111
50% 0.185772 0.274838 0.536605 1.185772
75% 0.549000 0.371325 0.704920 1.549000
max 1.342864 0.852610 1.340977 2.342864
In [66]:
# there are some basic functions you can use to aggregate 
# axis = 0 indicates that we applying a function per column
df.mean(axis=0)
Out[66]:
a    0.087573
b    0.203792
g    0.291366
e    1.087573
dtype: float64
In [67]:
# there are some basic functions you can use to aggregate 
# axis = 1 indicates that we applying a function per row
df.mean(axis=1)
Out[67]:
0    0.821506
1    1.046932
2    0.575634
3    0.015595
4    0.934312
5    0.238567
6   -0.371889
7    0.683327
8    0.721095
9   -0.489318
dtype: float64
Small Historial Note

Pandas was written by Wes McKinney, who wrote it while working at a financial institution. Financial data analysis usually involves working a lot with dates. It is no coincidence that pandas has great support for working with time series. We can easily apply functions per week or per month.

The timeseries support is vast. Pandas only requires that the index of a dataframe is a timestamp. It then allows you to perform any function on any grouped part of data. So the index that is supplied doesn't have to be a number, it can also be a timestamp!

Split-Apply-Combine

A moment of abstraction.

Very often we will want to perform group operations on data. In the financial example we saw that we wanted to perform operations for every week or for every month. In another example we might want to apply methods per geolocation/type of person/per website... etc

Whenever we are doing such an operation, we might look at it as a split-apply-combine operation, shown visually below:

Pandas has great support for these kinds of operations. Based on the key of a dataframe we will split the data and then apply functions to the grouped data.

Enter ChickWeight

For this next bit we will use the chickweight dataset. This dataset contains information about different diets for chickens. The goal of the dataset is to find out which diet will get the chickens as fat as possible.

In [33]:
chickdf = pd.read_csv('http://koaning.s3-website-us-west-2.amazonaws.com/data/pydata/chickweight.csv')
chickdf.describe()
Out[33]:
rownum weight Time Chick Diet
count 578.000000 578.000000 578.000000 578.000000 578.000000
mean 289.500000 121.818339 10.717993 25.750865 2.235294
std 166.998503 71.071960 6.758400 14.568795 1.162678
min 1.000000 35.000000 0.000000 1.000000 1.000000
25% 145.250000 63.000000 4.000000 13.000000 1.000000
50% 289.500000 103.000000 10.000000 26.000000 2.000000
75% 433.750000 163.750000 16.000000 38.000000 3.000000
max 578.000000 373.000000 21.000000 50.000000 4.000000

8 rows × 5 columns

Imagine doing just that operation in excel.

The goal is to find out which diet causes the most weight gain for chickens, so let's group chickens per diet. You do this by creating a grouped object. This groups the dataframe to groups based on column values. This grouped variable can be iterated over.

In [70]:
grouped = chickdf.groupby(chickdf.Diet)

for thing in grouped:
    print thing
(1,         weight  Time  Chick  Diet
rownum                           
1           42     0      1     1
2           51     2      1     1
3           59     4      1     1
4           64     6      1     1
5           76     8      1     1
6           93    10      1     1
7          106    12      1     1
8          125    14      1     1
9          149    16      1     1
10         171    18      1     1
11         199    20      1     1
12         205    21      1     1
13          40     0      2     1
14          49     2      2     1
15          58     4      2     1
16          72     6      2     1
17          84     8      2     1
18         103    10      2     1
19         122    12      2     1
20         138    14      2     1
21         162    16      2     1
22         187    18      2     1
23         209    20      2     1
24         215    21      2     1
25          43     0      3     1
26          39     2      3     1
27          55     4      3     1
28          67     6      3     1
29          84     8      3     1
30          99    10      3     1
31         115    12      3     1
32         138    14      3     1
33         163    16      3     1
34         187    18      3     1
35         198    20      3     1
36         202    21      3     1
37          42     0      4     1
38          49     2      4     1
39          56     4      4     1
40          67     6      4     1
41          74     8      4     1
42          87    10      4     1
43         102    12      4     1
44         108    14      4     1
45         136    16      4     1
46         154    18      4     1
47         160    20      4     1
48         157    21      4     1
49          41     0      5     1
50          42     2      5     1
51          48     4      5     1
52          60     6      5     1
53          79     8      5     1
54         106    10      5     1
55         141    12      5     1
56         164    14      5     1
57         197    16      5     1
58         199    18      5     1
59         220    20      5     1
60         223    21      5     1
           ...   ...    ...   ...

[220 rows x 4 columns])
(2,         weight  Time  Chick  Diet
rownum                           
221         40     0     21     2
222         50     2     21     2
223         62     4     21     2
224         86     6     21     2
225        125     8     21     2
226        163    10     21     2
227        217    12     21     2
228        240    14     21     2
229        275    16     21     2
230        307    18     21     2
231        318    20     21     2
232        331    21     21     2
233         41     0     22     2
234         55     2     22     2
235         64     4     22     2
236         77     6     22     2
237         90     8     22     2
238         95    10     22     2
239        108    12     22     2
240        111    14     22     2
241        131    16     22     2
242        148    18     22     2
243        164    20     22     2
244        167    21     22     2
245         43     0     23     2
246         52     2     23     2
247         61     4     23     2
248         73     6     23     2
249         90     8     23     2
250        103    10     23     2
251        127    12     23     2
252        135    14     23     2
253        145    16     23     2
254        163    18     23     2
255        170    20     23     2
256        175    21     23     2
257         42     0     24     2
258         52     2     24     2
259         58     4     24     2
260         74     6     24     2
261         66     8     24     2
262         68    10     24     2
263         70    12     24     2
264         71    14     24     2
265         72    16     24     2
266         72    18     24     2
267         76    20     24     2
268         74    21     24     2
269         40     0     25     2
270         49     2     25     2
271         62     4     25     2
272         78     6     25     2
273        102     8     25     2
274        124    10     25     2
275        146    12     25     2
276        164    14     25     2
277        197    16     25     2
278        231    18     25     2
279        259    20     25     2
280        265    21     25     2
           ...   ...    ...   ...

[120 rows x 4 columns])
(3,         weight  Time  Chick  Diet
rownum                           
341         42     0     31     3
342         53     2     31     3
343         62     4     31     3
344         73     6     31     3
345         85     8     31     3
346        102    10     31     3
347        123    12     31     3
348        138    14     31     3
349        170    16     31     3
350        204    18     31     3
351        235    20     31     3
352        256    21     31     3
353         41     0     32     3
354         49     2     32     3
355         65     4     32     3
356         82     6     32     3
357        107     8     32     3
358        129    10     32     3
359        159    12     32     3
360        179    14     32     3
361        221    16     32     3
362        263    18     32     3
363        291    20     32     3
364        305    21     32     3
365         39     0     33     3
366         50     2     33     3
367         63     4     33     3
368         77     6     33     3
369         96     8     33     3
370        111    10     33     3
371        137    12     33     3
372        144    14     33     3
373        151    16     33     3
374        146    18     33     3
375        156    20     33     3
376        147    21     33     3
377         41     0     34     3
378         49     2     34     3
379         63     4     34     3
380         85     6     34     3
381        107     8     34     3
382        134    10     34     3
383        164    12     34     3
384        186    14     34     3
385        235    16     34     3
386        294    18     34     3
387        327    20     34     3
388        341    21     34     3
389         41     0     35     3
390         53     2     35     3
391         64     4     35     3
392         87     6     35     3
393        123     8     35     3
394        158    10     35     3
395        201    12     35     3
396        238    14     35     3
397        287    16     35     3
398        332    18     35     3
399        361    20     35     3
400        373    21     35     3
           ...   ...    ...   ...

[120 rows x 4 columns])
(4,         weight  Time  Chick  Diet
rownum                           
461         42     0     41     4
462         51     2     41     4
463         66     4     41     4
464         85     6     41     4
465        103     8     41     4
466        124    10     41     4
467        155    12     41     4
468        153    14     41     4
469        175    16     41     4
470        184    18     41     4
471        199    20     41     4
472        204    21     41     4
473         42     0     42     4
474         49     2     42     4
475         63     4     42     4
476         84     6     42     4
477        103     8     42     4
478        126    10     42     4
479        160    12     42     4
480        174    14     42     4
481        204    16     42     4
482        234    18     42     4
483        269    20     42     4
484        281    21     42     4
485         42     0     43     4
486         55     2     43     4
487         69     4     43     4
488         96     6     43     4
489        131     8     43     4
490        157    10     43     4
491        184    12     43     4
492        188    14     43     4
493        197    16     43     4
494        198    18     43     4
495        199    20     43     4
496        200    21     43     4
497         42     0     44     4
498         51     2     44     4
499         65     4     44     4
500         86     6     44     4
501        103     8     44     4
502        118    10     44     4
503        127    12     44     4
504        138    14     44     4
505        145    16     44     4
506        146    18     44     4
507         41     0     45     4
508         50     2     45     4
509         61     4     45     4
510         78     6     45     4
511         98     8     45     4
512        117    10     45     4
513        135    12     45     4
514        141    14     45     4
515        147    16     45     4
516        174    18     45     4
517        197    20     45     4
518        196    21     45     4
519         40     0     46     4
520         52     2     46     4
           ...   ...    ...   ...

[118 rows x 4 columns])

Each group still behaves like a dataframe, so we can also apply the describe function here.

In [29]:
grouped.describe()
Out[29]:
Chick Time weight
Diet
1 count 220.000000 220.000000 220.000000
mean 9.963636 10.481818 102.645455
std 5.700160 6.754984 56.656553
min 1.000000 0.000000 35.000000
25% 5.000000 4.000000 57.750000
50% 10.000000 10.000000 88.000000
75% 14.000000 16.000000 136.500000
max 20.000000 21.000000 305.000000
2 count 120.000000 120.000000 120.000000
mean 25.500000 10.916667 122.616667
std 2.884324 6.804081 71.607495
min 21.000000 0.000000 39.000000
25% 23.000000 5.500000 65.500000
50% 25.500000 11.000000 104.500000
75% 28.000000 16.500000 163.000000
max 30.000000 21.000000 331.000000
3 count 120.000000 120.000000 120.000000
mean 35.500000 10.916667 142.950000
std 2.884324 6.804081 86.541761
min 31.000000 0.000000 39.000000
25% 33.000000 5.500000 67.500000
50% 35.500000 11.000000 125.500000
75% 38.000000 16.500000 198.750000
max 40.000000 21.000000 373.000000
4 count 118.000000 118.000000 118.000000
mean 45.525424 10.754237 135.262712
std 2.902141 6.744338 68.828714
min 41.000000 0.000000 39.000000
25% 43.000000 4.500000 71.250000
50% 46.000000 10.000000 129.500000
75% 48.000000 16.000000 184.750000
max 50.000000 21.000000 322.000000

Applying your own

We can use built in functions on our grouped objects, but we can also just apply our own functions. Note that these functions need to be able to be applied to a DataFrame object.

In [71]:
def show_size(x):
    return "Dude we have " + str(len(x)) + " chickens here!" 

grouped.apply(show_size)
Out[71]:
Diet
1       Dude we have 220 chickens here!
2       Dude we have 120 chickens here!
3       Dude we have 120 chickens here!
4       Dude we have 118 chickens here!
dtype: object

Realize that this means that any function can be used here. You will want to think about performance when dealing with large datasets. This functionality is one of the things that make the pandas API very powerful.

Advanced Groups

We can also created groups based on two columns in the table.

In [72]:
chickdf.groupby(['Diet','Time']).weight.apply(show_size)
Out[72]:
Diet  Time
1     0       Dude we have 20 chickens here!
      2       Dude we have 20 chickens here!
      4       Dude we have 19 chickens here!
      6       Dude we have 19 chickens here!
      8       Dude we have 19 chickens here!
      10      Dude we have 19 chickens here!
      12      Dude we have 19 chickens here!
      14      Dude we have 18 chickens here!
      16      Dude we have 17 chickens here!
      18      Dude we have 17 chickens here!
      20      Dude we have 17 chickens here!
      21      Dude we have 16 chickens here!
2     0       Dude we have 10 chickens here!
      2       Dude we have 10 chickens here!
      4       Dude we have 10 chickens here!
      6       Dude we have 10 chickens here!
      8       Dude we have 10 chickens here!
      10      Dude we have 10 chickens here!
      12      Dude we have 10 chickens here!
      14      Dude we have 10 chickens here!
      16      Dude we have 10 chickens here!
      18      Dude we have 10 chickens here!
      20      Dude we have 10 chickens here!
      21      Dude we have 10 chickens here!
3     0       Dude we have 10 chickens here!
      2       Dude we have 10 chickens here!
      4       Dude we have 10 chickens here!
      6       Dude we have 10 chickens here!
      8       Dude we have 10 chickens here!
      10      Dude we have 10 chickens here!
      12      Dude we have 10 chickens here!
      14      Dude we have 10 chickens here!
      16      Dude we have 10 chickens here!
      18      Dude we have 10 chickens here!
      20      Dude we have 10 chickens here!
      21      Dude we have 10 chickens here!
4     0       Dude we have 10 chickens here!
      2       Dude we have 10 chickens here!
      4       Dude we have 10 chickens here!
      6       Dude we have 10 chickens here!
      8       Dude we have 10 chickens here!
      10      Dude we have 10 chickens here!
      12      Dude we have 10 chickens here!
      14      Dude we have 10 chickens here!
      16      Dude we have 10 chickens here!
      18      Dude we have 10 chickens here!
      20       Dude we have 9 chickens here!
      21       Dude we have 9 chickens here!
Name: weight, dtype: object

Visualisation

You can also combine pandas with plotting tools like ggplot for fast interactive data exploration.

In [76]:
%pylab inline

ggplot(aes(x='Time', y='weight'), data=chickdf) + geom_point() 
Populating the interactive namespace from numpy and matplotlib

Out[76]:
<ggplot: (287054765)>
In [102]:
agg = chickdf.groupby(['Diet','Time']).weight.aggregate(mean)
agg = agg.reset_index()
agg
Out[102]:
Diet Time weight
0 1 0 41.400000
1 1 2 47.250000
2 1 4 56.473684
3 1 6 66.789474
4 1 8 79.684211
5 1 10 93.052632
6 1 12 108.526316
7 1 14 123.388889
8 1 16 144.647059
9 1 18 158.941176
10 1 20 170.411765
11 1 21 177.750000
12 2 0 40.700000
13 2 2 49.400000
14 2 4 59.800000
15 2 6 75.400000
16 2 8 91.700000
17 2 10 108.500000
18 2 12 131.300000
19 2 14 141.900000
20 2 16 164.700000
21 2 18 187.700000
22 2 20 205.600000
23 2 21 214.700000
24 3 0 40.800000
25 3 2 50.400000
26 3 4 62.200000
27 3 6 77.900000
28 3 8 98.400000
29 3 10 117.100000
30 3 12 144.400000
31 3 14 164.500000
32 3 16 197.400000
33 3 18 233.100000
34 3 20 258.900000
35 3 21 270.300000
36 4 0 41.000000
37 4 2 51.800000
38 4 4 64.500000
39 4 6 83.900000
40 4 8 105.600000
41 4 10 126.000000
42 4 12 151.400000
43 4 14 161.800000
44 4 16 182.000000
45 4 18 202.900000
46 4 20 233.888889
47 4 21 238.555556

48 rows × 3 columns

In [109]:
ggplot(aes(x='Time', y='weight', color="Diet"), data=agg) + geom_line() 
Out[109]:
<ggplot: (288808145)>

But in all seriousness, you might want to start thinking about the ethics used in data.

More Resources

In []: