5. Python Data Wrangling II

Damian Trilling and Penny Sheets

This notebook outlines the retrieval and preprocessing steps we did to construct the files for our examples. Thus, this notebook contains the steps

  • (1) Retrieval

  • (2) Preprocessing,

while the other notebook contains the steps

  • (3) Enrichment

  • (4) Analysis

EXTRA: How did we prepare the dataset?

Population data

Go to https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS&tableId=37259eng&_theme=1066

Select the following options:

  • Topics: Net migration and live born children (2/25)

  • Sex: only total (1/3)

  • Regions: all provinces (12/1225)

This should give you 1392 rows and result in the file

37259eng_UntypedDataSet_20112018_132903.csv

Also download the metadata for later reference.

Economic data

Go to https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS&tableId=82800ENG&_theme=1064

Select the following options:

  • Topics: GDP (volume change) (1/2)

  • Economic sectors: A-U all economic activities (1/15)

  • Regions: all provinces (12/77)

This should give you 264 rows and result in the file

82800ENG_UntypedDataSet_20112018_133529.csv

Also download the metadata for later reference.

Preprocessing the data

Population data

import pandas as pd
%matplotlib inline
population = pd.read_csv('37259eng_UntypedDataSet_20112018_132903.csv', delimiter=';')
population.head()
ID Sex Regions Periods LiveBornChildren_2 NetMigrationExcludingAdministrative_19
0 290 T001038 PV20 1960JJ00 8868.0 -1748.0
1 291 T001038 PV20 1961JJ00 9062.0 -1087.0
2 292 T001038 PV20 1962JJ00 9165.0 -623.0
3 293 T001038 PV20 1963JJ00 9518.0 547.0
4 294 T001038 PV20 1964JJ00 9671.0 305.0

We first remove all columns that are not necessary.

population['Sex'].value_counts()
T001038    696
Name: Sex, dtype: int64
population.drop(['Sex','ID'], axis = 1, inplace = True)
population
Regions Periods LiveBornChildren_2 NetMigrationExcludingAdministrative_19
0 PV20 1960JJ00 8868.0 -1748.0
1 PV20 1961JJ00 9062.0 -1087.0
2 PV20 1962JJ00 9165.0 -623.0
3 PV20 1963JJ00 9518.0 547.0
4 PV20 1964JJ00 9671.0 305.0
5 PV20 1965JJ00 9632.0 -148.0
6 PV20 1966JJ00 9558.0 667.0
7 PV20 1967JJ00 9707.0 -1617.0
8 PV20 1968JJ00 9366.0 -2358.0
9 PV20 1969JJ00 10088.0 -1952.0
10 PV20 1970JJ00 9774.0 196.0
11 PV20 1971JJ00 9219.0 -203.0
12 PV20 1972JJ00 8895.0 -190.0
13 PV20 1973JJ00 7776.0 -660.0
14 PV20 1974JJ00 7341.0 925.0
15 PV20 1975JJ00 6952.0 1750.0
16 PV20 1976JJ00 6862.0 2387.0
17 PV20 1977JJ00 6624.0 1781.0
18 PV20 1978JJ00 6716.0 1949.0
19 PV20 1979JJ00 6644.0 823.0
20 PV20 1980JJ00 6898.0 1345.0
21 PV20 1981JJ00 6730.0 569.0
22 PV20 1982JJ00 6648.0 102.0
23 PV20 1983JJ00 6449.0 -612.0
24 PV20 1984JJ00 6366.0 -1625.0
25 PV20 1985JJ00 6537.0 -2135.0
26 PV20 1986JJ00 6551.0 -2756.0
27 PV20 1987JJ00 6515.0 -2720.0
28 PV20 1988JJ00 6298.0 -2679.0
29 PV20 1989JJ00 6284.0 -2204.0
... ... ... ... ...
666 PV31 1988JJ00 12979.0 505.0
667 PV31 1989JJ00 12875.0 1027.0
668 PV31 1990JJ00 13690.0 1906.0
669 PV31 1991JJ00 13479.0 2191.0
670 PV31 1992JJ00 13029.0 1434.0
671 PV31 1993JJ00 12899.0 2810.0
672 PV31 1994JJ00 12700.0 1919.0
673 PV31 1995JJ00 12474.0 2086.0
674 PV31 1996JJ00 12100.0 1476.0
675 PV31 1997JJ00 11984.0 982.0
676 PV31 1998JJ00 12445.0 411.0
677 PV31 1999JJ00 12213.0 1584.0
678 PV31 2000JJ00 12235.0 615.0
679 PV31 2001JJ00 11884.0 642.0
680 PV31 2002JJ00 11484.0 -311.0
681 PV31 2003JJ00 11190.0 -1579.0
682 PV31 2004JJ00 10753.0 -1338.0
683 PV31 2005JJ00 10051.0 -2831.0
684 PV31 2006JJ00 9857.0 -2389.0
685 PV31 2007JJ00 9458.0 -2099.0
686 PV31 2008JJ00 9401.0 986.0
687 PV31 2009JJ00 9741.0 1264.0
688 PV31 2010JJ00 9620.0 2288.0
689 PV31 2011JJ00 9288.0 2666.0
690 PV31 2012JJ00 9109.0 1900.0
691 PV31 2013JJ00 9011.0 1255.0
692 PV31 2014JJ00 8841.0 1176.0
693 PV31 2015JJ00 8739.0 2627.0
694 PV31 2016JJ00 8994.0 4882.0
695 PV31 2017JJ00 8821.0 3955.0

696 rows × 4 columns

The values of the column Regions contain weird spaces at the end:

population.iloc[0,0]
'PV20  '

We are going to remove them:

population['Regions'] = population['Regions'].map(lambda x: x.strip())

By having a look at the metadata (using CTRL-F for looking for PV20), we can find out what the province codes actually mean. Let’s recode that by using a dict to map the keys to more meaningful values.

provinces = {"PV20":"Groningen",
"PV21":"Friesland",
"PV22":"Drenthe",
"PV23":"Overijssel",
"PV24":"Flevoland",
"PV25":"Gelderland",
"PV26":"Utrecht",
"PV27":"Noord-Holland",
"PV28":"Zuid-Holland",
"PV29":"Zeeland",
"PV30":"Noord-Brabant",
"PV31":"Limburg"}
population['Regions'] = population['Regions'].map(provinces)
population.head()
Regions Periods LiveBornChildren_2 NetMigrationExcludingAdministrative_19
0 Groningen 1960 8868.0 -1748.0
1 Groningen 1961 9062.0 -1087.0
2 Groningen 1962 9165.0 -623.0
3 Groningen 1963 9518.0 547.0
4 Groningen 1964 9671.0 305.0

Let’s also represent the Period in a better way. It’s a string now, and only the first four digits are meaningful. Let’s convert these to an integer. Alternatively, we could opt to convert it to a date (a so-called datetime object).

population['Periods'] = population['Periods'].map(lambda x: int(x[:4]))
population.head()
Regions Periods LiveBornChildren_2 NetMigrationExcludingAdministrative_19
0 Groningen 1960 8868.0 -1748.0
1 Groningen 1961 9062.0 -1087.0
2 Groningen 1962 9165.0 -623.0
3 Groningen 1963 9518.0 547.0
4 Groningen 1964 9671.0 305.0

Let’s save this:

population.to_csv('population.csv')
population.to_json('population.json')

Economic data

We just do exactly the same for our economic dataset

economy = pd.read_csv('82800ENG_UntypedDataSet_20112018_133529.csv', delimiter=';')
economy
ID EconomicSectorsSIC2008 Regions Periods GDPVolumeChanges_1
0 132 T001081 PV20 1996JJ00 9.3
1 133 T001081 PV20 1997JJ00 -2.0
2 134 T001081 PV20 1998JJ00 -0.9
3 135 T001081 PV20 1999JJ00 -0.7
4 136 T001081 PV20 2000JJ00 1.5
5 137 T001081 PV20 2001JJ00 3.9
6 138 T001081 PV20 2002JJ00 2.1
7 139 T001081 PV20 2003JJ00 2.3
8 140 T001081 PV20 2004JJ00 2.2
9 141 T001081 PV20 2005JJ00 -0.7
10 142 T001081 PV20 2006JJ00 1.1
11 143 T001081 PV20 2007JJ00 -2.1
12 144 T001081 PV20 2008JJ00 8.7
13 145 T001081 PV20 2009JJ00 -4.2
14 146 T001081 PV20 2010JJ00 10.4
15 147 T001081 PV20 2011JJ00 -2.9
16 148 T001081 PV20 2012JJ00 -0.5
17 149 T001081 PV20 2013JJ00 5.7
18 150 T001081 PV20 2014JJ00 -7.6
19 151 T001081 PV20 2015JJ00 -8.3
20 152 T001081 PV20 2016JJ00 -1.8
21 153 T001081 PV20 2017JJ00 -0.6
22 154 T001081 PV21 1996JJ00 2.6
23 155 T001081 PV21 1997JJ00 2.6
24 156 T001081 PV21 1998JJ00 6.8
25 157 T001081 PV21 1999JJ00 3.0
26 158 T001081 PV21 2000JJ00 3.4
27 159 T001081 PV21 2001JJ00 2.5
28 160 T001081 PV21 2002JJ00 -1.9
29 161 T001081 PV21 2003JJ00 1.1
... ... ... ... ... ...
234 366 T001081 PV30 2010JJ00 3.3
235 367 T001081 PV30 2011JJ00 3.5
236 368 T001081 PV30 2012JJ00 -0.6
237 369 T001081 PV30 2013JJ00 -0.9
238 370 T001081 PV30 2014JJ00 1.9
239 371 T001081 PV30 2015JJ00 3.4
240 372 T001081 PV30 2016JJ00 2.5
241 373 T001081 PV30 2017JJ00 3.7
242 374 T001081 PV31 1996JJ00 2.5
243 375 T001081 PV31 1997JJ00 5.8
244 376 T001081 PV31 1998JJ00 4.9
245 377 T001081 PV31 1999JJ00 5.5
246 378 T001081 PV31 2000JJ00 2.7
247 379 T001081 PV31 2001JJ00 1.6
248 380 T001081 PV31 2002JJ00 0.4
249 381 T001081 PV31 2003JJ00 -0.8
250 382 T001081 PV31 2004JJ00 2.8
251 383 T001081 PV31 2005JJ00 0.3
252 384 T001081 PV31 2006JJ00 2.9
253 385 T001081 PV31 2007JJ00 4.5
254 386 T001081 PV31 2008JJ00 0.3
255 387 T001081 PV31 2009JJ00 -5.3
256 388 T001081 PV31 2010JJ00 2.2
257 389 T001081 PV31 2011JJ00 2.2
258 390 T001081 PV31 2012JJ00 -1.5
259 391 T001081 PV31 2013JJ00 -0.7
260 392 T001081 PV31 2014JJ00 0.2
261 393 T001081 PV31 2015JJ00 2.7
262 394 T001081 PV31 2016JJ00 2.2
263 395 T001081 PV31 2017JJ00 3.2

264 rows × 5 columns

# We only downloaded the total, so we can safely delete:
economy['EconomicSectorsSIC2008'].value_counts()
T001081       264
Name: EconomicSectorsSIC2008, dtype: int64
economy.drop(['EconomicSectorsSIC2008','ID'], axis = 1, inplace = True)
economy['Regions'] = economy['Regions'].map(lambda x: x.strip())
economy['Regions'] = economy['Regions'].map(provinces)
economy['Periods'] = economy['Periods'].map(lambda x: int(x[:4]))
economy.head()
Regions Periods GDPVolumeChanges_1
0 Groningen 1996 9.3
1 Groningen 1997 -2.0
2 Groningen 1998 -0.9
3 Groningen 1999 -0.7
4 Groningen 2000 1.5
economy.to_csv('economy.csv')
economy.to_json('economy.json')