4. Python Data Wrangling I¶

Damian Trilling and Penny Sheets

This notebook outlines the

(3) Enrichment
(4) Analysis

of two CBS datasets. We made a different notebook (5. Python Data Wrangling II) that helps you reconstructing how we did the

(1) Retrieval
(2) Preprocessing

to construct the files for this examples.

import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline

Obtain both datasets by either working through Notebook 5. or by downloading both files from here:

population=pd.read_json('population.json')
economy=pd.read_json('economy.json')

Your Task¶

use methods like .head(), .describe() and/or .value_counts() to get a sense of both datasets.
what are the common characteristics between the datasets, what are the differences?

# your code here

population.head()

	Regions	Periods	LiveBornChildren_2	NetMigrationExcludingAdministrative_19
0	Groningen	1960	8868.0	-1748.0
1	Groningen	1961	9062.0	-1087.0
10	Groningen	1970	9774.0	196.0
100	Friesland	2002	7987.0	2339.0
101	Friesland	2003	7932.0	1196.0

economy.head()

	Regions	Periods	GDPVolumeChanges_1
0	Groningen	1996	9.3
1	Groningen	1997	-2.0
10	Groningen	2006	1.1
100	Flevoland	2008	-0.8
101	Flevoland	2009	-5.4

population['Periods'].value_counts()

  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
  12
Name: Periods, dtype: int64

population.describe()

	Periods	LiveBornChildren_2	NetMigrationExcludingAdministrative_19
count	696.000000	670.000000	670.000000
mean	1988.500000	17076.928358	3086.132836
std	16.752708	12387.755648	5049.005650
min	1960.000000	3357.000000	-15648.000000
25%	1974.000000	6511.250000	272.000000
50%	1988.500000	13359.000000	1784.500000
75%	2003.000000	26237.250000	4970.750000
max	2017.000000	55295.000000	31545.000000

economy['Regions'].value_counts().sort_index()

Drenthe          22
Flevoland        22
Friesland        22
Gelderland       22
Groningen        22
Limburg          22
Noord-Brabant    22
Noord-Holland    22
Overijssel       22
Utrecht          22
Zeeland          22
Zuid-Holland     22
Name: Regions, dtype: int64

Discuss: What type of join?¶

Discuss with your neighbor

what type of join (inner, outer, left, right) you want; and
which column(s) to join on

Then, create a combined dataframe with a command along the lines of

df = population.merge(economy, on='columnname'], how='left/right/inner/outer')

or if you have multiple columns to join on:

df = population.merge(economy, on=['columnname','columnname'], how='left/right/inner/outer')

df = economy.merge(population, on= ['Periods', 'Regions'], how='left')

df

	Regions	Periods	GDPVolumeChanges_1	LiveBornChildren_2	NetMigrationExcludingAdministrative_19
0	Groningen	1996	9.3	6148.0	-336.0
1	Groningen	1997	-2.0	6336.0	-647.0
2	Groningen	2006	1.1	5838.0	65.0
3	Flevoland	2008	-0.8	5101.0	1984.0
4	Flevoland	2009	-5.4	5292.0	1519.0
5	Flevoland	2010	3.2	5310.0	1359.0
6	Flevoland	2011	2.0	5090.0	1162.0
7	Flevoland	2012	-1.0	4991.0	339.0
8	Flevoland	2013	-2.6	4687.0	-234.0
9	Flevoland	2014	3.0	4922.0	-685.0
10	Flevoland	2015	2.9	4735.0	475.0
11	Flevoland	2016	2.6	4706.0	1955.0
12	Flevoland	2017	4.2	4565.0	2045.0
13	Groningen	2007	-2.1	5677.0	234.0
14	Gelderland	1996	2.9	23171.0	4434.0
15	Gelderland	1997	5.1	23461.0	3860.0
16	Gelderland	1998	4.3	24425.0	4213.0
17	Gelderland	1999	3.9	24508.0	5521.0
18	Gelderland	2000	4.5	25370.0	7517.0
19	Gelderland	2001	2.6	24626.0	8246.0
20	Gelderland	2002	-0.6	24826.0	5530.0
21	Gelderland	2003	0.6	24271.0	2037.0
22	Gelderland	2004	0.9	23235.0	1708.0
23	Gelderland	2005	1.8	22202.0	437.0
24	Groningen	2008	8.7	5908.0	827.0
25	Gelderland	2006	5.2	22213.0	-282.0
26	Gelderland	2007	3.2	21191.0	1434.0
27	Gelderland	2008	1.8	21695.0	3614.0
28	Gelderland	2009	-2.3	21350.0	4736.0
29	Gelderland	2010	-0.2	21142.0	4866.0
...	...	...	...	...	...
234	Overijssel	2002	-0.8	14663.0	3024.0
235	Overijssel	2003	0.9	14857.0	1254.0
236	Overijssel	2004	1.4	14098.0	367.0
237	Overijssel	2005	2.6	13942.0	1081.0
238	Overijssel	2006	3.1	13667.0	275.0
239	Overijssel	2007	4.1	13368.0	363.0
240	Overijssel	2008	3.5	13411.0	1747.0
241	Overijssel	2009	-2.0	13260.0	1763.0
242	Groningen	2004	2.2	6141.0	1273.0
243	Overijssel	2010	-0.1	13180.0	1620.0
244	Overijssel	2011	3.3	12651.0	1316.0
245	Overijssel	2012	-3.4	12367.0	-161.0
246	Overijssel	2013	-1.1	12151.0	-812.0
247	Overijssel	2014	1.4	12092.0	-360.0
248	Overijssel	2015	2.8	11765.0	2694.0
249	Overijssel	2016	2.7	11728.0	3110.0
250	Overijssel	2017	3.4	11345.0	3629.0
251	Flevoland	1996	3.3	4170.0	6331.0
252	Flevoland	1997	8.2	4365.0	9135.0
253	Groningen	2005	-0.7	5943.0	-33.0
254	Flevoland	1998	7.5	4785.0	10310.0
255	Flevoland	1999	11.4	4933.0	7733.0
256	Flevoland	2000	5.3	5064.0	8728.0
257	Flevoland	2001	6.4	5330.0	9605.0
258	Flevoland	2002	2.0	5347.0	6636.0
259	Flevoland	2003	7.0	5452.0	5849.0
260	Flevoland	2004	3.4	5302.0	3376.0
261	Flevoland	2005	3.6	5290.0	2603.0
262	Flevoland	2006	8.6	5179.0	1615.0
263	Flevoland	2007	4.7	5219.0	1783.0

264 rows × 5 columns

Then, give some information about the resulting dataframe.

# your code here

df.describe()

	Periods	GDPVolumeChanges_1	LiveBornChildren_2	NetMigrationExcludingAdministrative_19
count	264.000000	264.000000	264.000000	264.000000
mean	2006.500000	1.954924	15572.280303	4499.102273
std	6.356339	2.875221	11710.891211	5658.304660
min	1996.000000	-8.300000	3439.000000	-2831.000000
25%	2001.000000	0.300000	5582.000000	863.750000
50%	2006.500000	2.250000	12096.000000	2407.000000
75%	2012.000000	3.525000	24250.000000	6343.250000
max	2017.000000	11.400000	44022.000000	31545.000000

df

	Regions	Periods	GDPVolumeChanges_1	LiveBornChildren_2	NetMigrationExcludingAdministrative_19
0	Groningen	1996	9.3	6148.0	-336.0
1	Groningen	1997	-2.0	6336.0	-647.0
2	Groningen	2006	1.1	5838.0	65.0
3	Flevoland	2008	-0.8	5101.0	1984.0
4	Flevoland	2009	-5.4	5292.0	1519.0
5	Flevoland	2010	3.2	5310.0	1359.0
6	Flevoland	2011	2.0	5090.0	1162.0
7	Flevoland	2012	-1.0	4991.0	339.0
8	Flevoland	2013	-2.6	4687.0	-234.0
9	Flevoland	2014	3.0	4922.0	-685.0
10	Flevoland	2015	2.9	4735.0	475.0
11	Flevoland	2016	2.6	4706.0	1955.0
12	Flevoland	2017	4.2	4565.0	2045.0
13	Groningen	2007	-2.1	5677.0	234.0
14	Gelderland	1996	2.9	23171.0	4434.0
15	Gelderland	1997	5.1	23461.0	3860.0
16	Gelderland	1998	4.3	24425.0	4213.0
17	Gelderland	1999	3.9	24508.0	5521.0
18	Gelderland	2000	4.5	25370.0	7517.0
19	Gelderland	2001	2.6	24626.0	8246.0
20	Gelderland	2002	-0.6	24826.0	5530.0
21	Gelderland	2003	0.6	24271.0	2037.0
22	Gelderland	2004	0.9	23235.0	1708.0
23	Gelderland	2005	1.8	22202.0	437.0
24	Groningen	2008	8.7	5908.0	827.0
25	Gelderland	2006	5.2	22213.0	-282.0
26	Gelderland	2007	3.2	21191.0	1434.0
27	Gelderland	2008	1.8	21695.0	3614.0
28	Gelderland	2009	-2.3	21350.0	4736.0
29	Gelderland	2010	-0.2	21142.0	4866.0
...	...	...	...	...	...
234	Overijssel	2002	-0.8	14663.0	3024.0
235	Overijssel	2003	0.9	14857.0	1254.0
236	Overijssel	2004	1.4	14098.0	367.0
237	Overijssel	2005	2.6	13942.0	1081.0
238	Overijssel	2006	3.1	13667.0	275.0
239	Overijssel	2007	4.1	13368.0	363.0
240	Overijssel	2008	3.5	13411.0	1747.0
241	Overijssel	2009	-2.0	13260.0	1763.0
242	Groningen	2004	2.2	6141.0	1273.0
243	Overijssel	2010	-0.1	13180.0	1620.0
244	Overijssel	2011	3.3	12651.0	1316.0
245	Overijssel	2012	-3.4	12367.0	-161.0
246	Overijssel	2013	-1.1	12151.0	-812.0
247	Overijssel	2014	1.4	12092.0	-360.0
248	Overijssel	2015	2.8	11765.0	2694.0
249	Overijssel	2016	2.7	11728.0	3110.0
250	Overijssel	2017	3.4	11345.0	3629.0
251	Flevoland	1996	3.3	4170.0	6331.0
252	Flevoland	1997	8.2	4365.0	9135.0
253	Groningen	2005	-0.7	5943.0	-33.0
254	Flevoland	1998	7.5	4785.0	10310.0
255	Flevoland	1999	11.4	4933.0	7733.0
256	Flevoland	2000	5.3	5064.0	8728.0
257	Flevoland	2001	6.4	5330.0	9605.0
258	Flevoland	2002	2.0	5347.0	6636.0
259	Flevoland	2003	7.0	5452.0	5849.0
260	Flevoland	2004	3.4	5302.0	3376.0
261	Flevoland	2005	3.6	5290.0	2603.0
262	Flevoland	2006	8.6	5179.0	1615.0
263	Flevoland	2007	4.7	5219.0	1783.0

264 rows × 5 columns

Setting an index¶

While our columns have a descriptive names (headers), our rows don’t right now. They are just numbers. However, we could actually give them meaningful names. A nice side-effect is that you will get better plots, with meaningful axis labels later on.

df.index=df['Periods']

See the difference?

df.head()

	Regions	Periods	GDPVolumeChanges_1	LiveBornChildren_2	NetMigrationExcludingAdministrative_19
Periods
1996	Groningen	1996	9.3	6148.0	-336.0
1997	Groningen	1997	-2.0	6336.0	-647.0
2006	Groningen	2006	1.1	5838.0	65.0
2008	Flevoland	2008	-0.8	5101.0	1984.0
2009	Flevoland	2009	-5.4	5292.0	1519.0

Analyze the data¶

Let’s train a bit with .groupby() and .agg().

df.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f86517e5fd0>

../_images/04-join_and_aggregate_25_1.png

df['GDPVolumeChanges_1'].plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x7f864f63af28>

../_images/04-join_and_aggregate_26_1.png

Discuss: Why does the above not work?¶

OK, got it?

Let’s try this instead:

df[['GDPVolumeChanges_1','Regions']].groupby(
    'Regions').agg(np.mean).plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x7f864dc8c0f0>

../_images/04-join_and_aggregate_29_1.png

df['LiveBornChildren_2'].groupby('Periods').agg(sum).plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f864d83e4a8>

../_images/04-join_and_aggregate_30_1.png

Discuss: which aggregation function?¶

Why did we choose np.mean?
What function should we choose for analyzing df['LiveBornChildren_2']? Why?

Some more example code for plotting, feel free to play around¶

Pay attention to what works well and what doesn’t, and how you can use

groupby and/or
subsetting

to make plots clearer.

df.groupby('Regions')['LiveBornChildren_2'].plot()
df.groupby('Regions')['GDPVolumeChanges_1'].plot(secondary_y=True)

Regions
Drenthe          AxesSubplot(0.125,0.125;0.775x0.755)
Flevoland        AxesSubplot(0.125,0.125;0.775x0.755)
Friesland        AxesSubplot(0.125,0.125;0.775x0.755)
Gelderland       AxesSubplot(0.125,0.125;0.775x0.755)
Groningen        AxesSubplot(0.125,0.125;0.775x0.755)
Limburg          AxesSubplot(0.125,0.125;0.775x0.755)
Noord-Brabant    AxesSubplot(0.125,0.125;0.775x0.755)
Noord-Holland    AxesSubplot(0.125,0.125;0.775x0.755)
Overijssel       AxesSubplot(0.125,0.125;0.775x0.755)
Utrecht          AxesSubplot(0.125,0.125;0.775x0.755)
Zeeland          AxesSubplot(0.125,0.125;0.775x0.755)
Zuid-Holland     AxesSubplot(0.125,0.125;0.775x0.755)
Name: GDPVolumeChanges_1, dtype: object

../_images/04-join_and_aggregate_34_1.png

df.groupby(df.index)['LiveBornChildren_2'].agg(sum).plot(legend = True)
df.groupby(df.index)['GDPVolumeChanges_1'].agg(np.mean).plot(legend=True, secondary_y=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f864d69a630>

../_images/04-join_and_aggregate_35_1.png

df.groupby('Regions')['NetMigrationExcludingAdministrative_19'].plot(legend=True, figsize = [10,10] )

Regions
Drenthe          AxesSubplot(0.125,0.125;0.775x0.755)
Flevoland        AxesSubplot(0.125,0.125;0.775x0.755)
Friesland        AxesSubplot(0.125,0.125;0.775x0.755)
Gelderland       AxesSubplot(0.125,0.125;0.775x0.755)
Groningen        AxesSubplot(0.125,0.125;0.775x0.755)
Limburg          AxesSubplot(0.125,0.125;0.775x0.755)
Noord-Brabant    AxesSubplot(0.125,0.125;0.775x0.755)
Noord-Holland    AxesSubplot(0.125,0.125;0.775x0.755)
Overijssel       AxesSubplot(0.125,0.125;0.775x0.755)
Utrecht          AxesSubplot(0.125,0.125;0.775x0.755)
Zeeland          AxesSubplot(0.125,0.125;0.775x0.755)
Zuid-Holland     AxesSubplot(0.125,0.125;0.775x0.755)
Name: NetMigrationExcludingAdministrative_19, dtype: object

../_images/04-join_and_aggregate_36_1.png

df[df['Regions']=='Flevoland']['NetMigrationExcludingAdministrative_19'].plot(legend=False, figsize = [4,4] )
df[df['Regions']=='Zuid-Holland']['NetMigrationExcludingAdministrative_19'].plot(legend=False )

<matplotlib.axes._subplots.AxesSubplot at 0x7f864d5e2ba8>

../_images/04-join_and_aggregate_37_1.png

df['Regions']=='Flevoland'

Periods
  False
  False
  False
   True
   True
   True
   True
   True
   True
   True
   True
   True
   True
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
        ...  
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
  False
   True
   True
  False
   True
   True
   True
   True
   True
   True
   True
   True
   True
   True
Name: Regions, Length: 264, dtype: bool

df.groupby(df.index)['NetMigrationExcludingAdministrative_19'].agg(sum).plot(legend = True)
df.groupby(df.index)['GDPVolumeChanges_1'].agg(np.mean).plot(legend=True, secondary_y=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f864d4d17b8>

../_images/04-join_and_aggregate_39_1.png

Discuss¶

I personally find this last plot a pretty cool one. Do you agree?

df[['NetMigrationExcludingAdministrative_19','GDPVolumeChanges_1']].corr() # we probably should have lagged one of the variables by a year or so for this.

	NetMigrationExcludingAdministrative_19	GDPVolumeChanges_1
NetMigrationExcludingAdministrative_19	1.000000	0.108005
GDPVolumeChanges_1	0.108005	1.000000

Correlational analysis¶

We could also look into some bivariate plots….

df.plot(y='LiveBornChildren_2', x='GDPVolumeChanges_1', kind='scatter')

<matplotlib.axes._subplots.AxesSubplot at 0x7f864d4d1978>

../_images/04-join_and_aggregate_44_1.png

sns.lmplot(y='LiveBornChildren_2', x='GDPVolumeChanges_1', data=df,
           fit_reg=True, lowess=False, robust=True) 

<seaborn.axisgrid.FacetGrid at 0x7f864d42fb38>

../_images/04-join_and_aggregate_45_1.png

3. Basic Statistics

5. Python Data Wrangling II

Data Journalism