Author : Vincent D. Warmerdam @ GoDataDriven
This notebook and my talk will discuss:
This notebook (and supplementary references) can be found on my blog: http://koaning.github.io
Things you learn when working with data:
Things that are nice about python:
Things that are nice about jupyter notebooks:
Excel fails in many areas. Besides the fact that it cannot handle files larger than 1 GB, the main problem with Excel is...
Documentation is done via markdown in the actual notebook file. It tends to read better than comments in Sublime and it makes it extremely easy to share with colleagues. Notice that it also has support for formulas and other languages.
$$ a^2 + b^2 = c^2 $$
%%javascript
console.log("This is normal logged output")
console.log("%cDont make me angry. You wont like me when Im angry.","color: green; font-size:35px;font-weight: bold;")
By the way, did you know about this trick?
%%javascript
console.log("moving kitten %c", "background: url('http://i.imgur.com/TanUtXo.gif'); padding-right: 300px; font-size: 250px; text-align: center")
%%ruby
puts(1+1)
%load_ext rmagic
%%R
summary(ChickWeight)
You can install things via pip FROM the IPython notebook if you really want to.
import pip
def install(package):
    pip.main(['install', package])
# install('pandas')
import numpy as np
import pandas as pd
from ggplot import *
A list of some useful ones:
ctrl + s : save the notebook

While in command mode the shortcuts work a bit differently:

up/down arrows : browse through the different cells
Pandas gives you a datastructure called a DataFrame. This is the object that contains the data and the methods that you'll use the most. At first glance, the DataFrame will look like a dictionary of arrays with extra functionality, but in real life it is a high-performance data wrangler that provides a more flexible API than Excel. Let's go and create a simple DataFrame.
# first create three random arrays, the first two numeric, the last one string based
a = np.random.randn(10)
b = np.random.randn(10)
c = [ 'foo' if x > 0.5 else 'bar' for x in np.random.rand(10)]
d = { 'a' : pd.Series(a), 'b' : pd.Series(b), 'c': pd.Series(c) }
d
Note the difference between a and pd.Series(a). Pandas translates arrays into pd.Series objects. These objects have the characteristics you would expect from an array, but they also allow for a flexible index. Normal arrays only have numerical indices, whereas a pd.Series also allows dates and strings as indices. This means that you can use those indices for more flexible selections of data.

Also note that pandas lists the dtype of the array; a normal python array will not do this. Pandas keeps track of what datatype is in the array, again to help you make selections. If the types aren't clear to pandas it will refer to them as an object array.
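As a small illustrative sketch (the labels and values here are made up), a pd.Series with a string index:

s = pd.Series([0.5, 1.5, 2.5], index=['mon', 'tue', 'wed'])
# select by string label instead of a numeric position
s['tue']
# pandas tracks the datatype of the values: float64
s.dtype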
Let's now use this dictionary to create a DataFrame object.
df = pd.DataFrame(d)
df
This DataFrame object will be the main object you will talk to when using pandas. Note that the column names are just like those assigned in the dictionary. Also note that the indices of this data frame are the same indices as the original pd.Series objects. This object should feel like the traditional excel table.
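A small made-up sketch of what happens when the pd.Series indices don't match: pandas aligns them on the union of the indices and fills the gaps with NaN.

s1 = pd.Series([1, 2], index=['x', 'y'])
s2 = pd.Series([3, 4], index=['y', 'z'])
# 'x' only exists in s1 and 'z' only in s2, so those cells become NaN
pd.DataFrame({'left': s1, 'right': s2})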
# select a column from the dataframe, dict-style
df['a']
# so you can create a boolean list
df.a < 0
# this column behaves just like a numpy array
df['e'] = 1 + df.a
df
# this true-false list can be used on the dataframe for selection
df[df.a < 0]
# you can combine true false lists
df[ (df.a < 0) & (df.b < 0) ]
# and we could again only get a certain column back
df[ (df.a < 0) & (df.b < 0) ].c
Notice for this last query that we don't just get array values back. We also get the original indices from the dataframe.
# we can also ask things directly to the DataFrame
df.head(1)
# what has a head, usually has a tail
df.tail(1)
Note the indices for these two methods.
# if you want to know the number of rows/columns
df.shape
# quick summary of data
df.describe()
# there are some basic functions you can use to aggregate
# axis = 0 indicates that we apply a function per column
df.mean(axis=0)
# there are some basic functions you can use to aggregate
# axis = 1 indicates that we apply a function per row
df.mean(axis=1)
Pandas was written by Wes McKinney while he was working at a financial institution. Financial data analysis usually involves working a lot with dates, so it is no coincidence that pandas has great support for working with time series. We can easily apply functions per week or per month.
The time series support is vast. Pandas only requires that the index of a dataframe consists of timestamps; it then allows you to perform any function on any time-grouped part of the data. So the index that is supplied doesn't have to be a number, it can also be a timestamp!
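As a small sketch with made-up data, computing monthly means of a daily series could look like this (assuming a reasonably recent pandas; older versions used resample('M', how='mean') instead):

# a series of 90 daily values, indexed by timestamps
dates = pd.date_range('2014-01-01', periods=90, freq='D')
ts = pd.Series(np.random.randn(90), index=dates)
# group the daily values per month and take the mean
ts.resample('M').mean()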
A moment of abstraction.
Very often we will want to perform group operations on data. In the financial example we saw that we wanted to perform operations for every week or for every month. In another example we might want to apply methods per geolocation, per type of person, per website, etcetera.

Whenever we are doing such an operation, we might look at it as a split-apply-combine operation, shown visually below:
Pandas has great support for these kinds of operations. Based on the key of a dataframe we will split the data and then apply functions to the grouped data.
For this next bit we will use the chickweight dataset. This dataset contains information about different diets for chickens. The goal of the dataset is to find out which diet will get the chickens as fat as possible.
chickdf = pd.read_csv('http://koaning.s3-website-us-west-2.amazonaws.com/data/pydata/chickweight.csv')
chickdf.describe()
Imagine doing just that operation in excel.
The goal is to find out which diet causes the most weight gain for the chickens, so let's group the chickens per diet. You do this by creating a grouped object, which splits the dataframe into groups based on column values. This grouped variable can be iterated over.
grouped = chickdf.groupby(chickdf.Diet)
for thing in grouped:
    print(thing)
Each group still behaves like a dataframe, so we can also apply the describe function here.
grouped.describe()
We can use built-in functions on our grouped objects, but we can also apply our own functions. Note that these functions need to be applicable to a DataFrame object.
def show_size(x):
    return "Dude we have " + str(len(x)) + " chickens here!"
grouped.apply(show_size)
Realize that this means that any function can be used here, although you will want to think about performance when dealing with large datasets. This flexibility is one of the things that makes the pandas API very powerful.
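As a small illustrative sketch, a lambda works just as well; here we compute the spread of the weights within each diet group:

# apply a custom aggregation per group: max weight minus min weight
grouped.weight.apply(lambda x: x.max() - x.min())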
We can also create groups based on two columns in the table.
chickdf.groupby(['Diet','Time']).weight.apply(show_size)
You can also combine pandas with plotting tools like ggplot for fast interactive data exploration.
%pylab inline
ggplot(aes(x='Time', y='weight'), data=chickdf) + geom_point()
agg = chickdf.groupby(['Diet','Time']).weight.aggregate(np.mean)
agg = agg.reset_index()
agg
ggplot(aes(x='Time', y='weight', color="Diet"), data=agg) + geom_line()
But in all seriousness: you might want to start thinking about the ethics of working with data.