5. Python Data Wrangling II¶
Damian Trilling and Penny Sheets
This notebook documents the retrieval and preprocessing steps we took to construct the data files for our examples. It covers the steps
(1) Retrieval
(2) Preprocessing,
while the other notebook covers the steps
(3) Enrichment
(4) Analysis.
EXTRA: How did we prepare the dataset?¶
Population data¶
Go to https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS&tableId=37259eng&_theme=1066
Select the following options:
Topics: Net migration and live born children (2/25)
Sex: only total (1/3)
Regions: all provinces (12/1225)
This should give you 1392 selected values (696 rows with two topic columns each) and result in the file
37259eng_UntypedDataSet_20112018_132903.csv
Also download the metadata for later reference.
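Alternatively, the same table can be retrieved programmatically through the CBS OData API. A minimal sketch using the third-party cbsodata package (install it with pip install cbsodata; note that the returned values may be formatted slightly differently from the manually downloaded CSV):
import cbsodata
import pandas as pd
# the table ID is the same one that appears in the StatLine URL above
population_raw = pd.DataFrame(cbsodata.get_data('37259eng'))
The same approach works for the economic data below (table 82800ENG).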
Economic data¶
Go to https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS&tableId=82800ENG&_theme=1064
Select the following options:
Topics: GDP (volume change) (1/2)
Economic sectors: A-U all economic activities (1/15)
Regions: all provinces (12/77)
This should give you 264 rows and result in the file
82800ENG_UntypedDataSet_20112018_133529.csv
Also download the metadata for later reference.
Preprocessing the data¶
Population data¶
import pandas as pd
%matplotlib inline
# CBS exports use semicolons rather than commas as separators, hence delimiter=';'
population = pd.read_csv('37259eng_UntypedDataSet_20112018_132903.csv', delimiter=';')
population.head()
| ID | Sex | Regions | Periods | LiveBornChildren_2 | NetMigrationExcludingAdministrative_19
---|---|---|---|---|---|---
0 | 290 | T001038 | PV20 | 1960JJ00 | 8868.0 | -1748.0 |
1 | 291 | T001038 | PV20 | 1961JJ00 | 9062.0 | -1087.0 |
2 | 292 | T001038 | PV20 | 1962JJ00 | 9165.0 | -623.0 |
3 | 293 | T001038 | PV20 | 1963JJ00 | 9518.0 | 547.0 |
4 | 294 | T001038 | PV20 | 1964JJ00 | 9671.0 | 305.0 |
We first remove all columns that we do not need. A quick check confirms that Sex contains only a single value (we downloaded the total only), so it carries no information and can be dropped together with the ID column:
population['Sex'].value_counts()
T001038 696
Name: Sex, dtype: int64
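Instead of inspecting each column by hand, we could also detect constant columns programmatically. A minimal sketch (the variable name constant_cols is ours):
# list all columns that contain at most one distinct value
constant_cols = [col for col in population.columns if population[col].nunique() <= 1]
print(constant_cols)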
population.drop(['Sex', 'ID'], axis=1, inplace=True)
population
| Regions | Periods | LiveBornChildren_2 | NetMigrationExcludingAdministrative_19
---|---|---|---|---
0 | PV20 | 1960JJ00 | 8868.0 | -1748.0 |
1 | PV20 | 1961JJ00 | 9062.0 | -1087.0 |
2 | PV20 | 1962JJ00 | 9165.0 | -623.0 |
3 | PV20 | 1963JJ00 | 9518.0 | 547.0 |
4 | PV20 | 1964JJ00 | 9671.0 | 305.0 |
5 | PV20 | 1965JJ00 | 9632.0 | -148.0 |
6 | PV20 | 1966JJ00 | 9558.0 | 667.0 |
7 | PV20 | 1967JJ00 | 9707.0 | -1617.0 |
8 | PV20 | 1968JJ00 | 9366.0 | -2358.0 |
9 | PV20 | 1969JJ00 | 10088.0 | -1952.0 |
10 | PV20 | 1970JJ00 | 9774.0 | 196.0 |
11 | PV20 | 1971JJ00 | 9219.0 | -203.0 |
12 | PV20 | 1972JJ00 | 8895.0 | -190.0 |
13 | PV20 | 1973JJ00 | 7776.0 | -660.0 |
14 | PV20 | 1974JJ00 | 7341.0 | 925.0 |
15 | PV20 | 1975JJ00 | 6952.0 | 1750.0 |
16 | PV20 | 1976JJ00 | 6862.0 | 2387.0 |
17 | PV20 | 1977JJ00 | 6624.0 | 1781.0 |
18 | PV20 | 1978JJ00 | 6716.0 | 1949.0 |
19 | PV20 | 1979JJ00 | 6644.0 | 823.0 |
20 | PV20 | 1980JJ00 | 6898.0 | 1345.0 |
21 | PV20 | 1981JJ00 | 6730.0 | 569.0 |
22 | PV20 | 1982JJ00 | 6648.0 | 102.0 |
23 | PV20 | 1983JJ00 | 6449.0 | -612.0 |
24 | PV20 | 1984JJ00 | 6366.0 | -1625.0 |
25 | PV20 | 1985JJ00 | 6537.0 | -2135.0 |
26 | PV20 | 1986JJ00 | 6551.0 | -2756.0 |
27 | PV20 | 1987JJ00 | 6515.0 | -2720.0 |
28 | PV20 | 1988JJ00 | 6298.0 | -2679.0 |
29 | PV20 | 1989JJ00 | 6284.0 | -2204.0 |
... | ... | ... | ... | ... |
666 | PV31 | 1988JJ00 | 12979.0 | 505.0 |
667 | PV31 | 1989JJ00 | 12875.0 | 1027.0 |
668 | PV31 | 1990JJ00 | 13690.0 | 1906.0 |
669 | PV31 | 1991JJ00 | 13479.0 | 2191.0 |
670 | PV31 | 1992JJ00 | 13029.0 | 1434.0 |
671 | PV31 | 1993JJ00 | 12899.0 | 2810.0 |
672 | PV31 | 1994JJ00 | 12700.0 | 1919.0 |
673 | PV31 | 1995JJ00 | 12474.0 | 2086.0 |
674 | PV31 | 1996JJ00 | 12100.0 | 1476.0 |
675 | PV31 | 1997JJ00 | 11984.0 | 982.0 |
676 | PV31 | 1998JJ00 | 12445.0 | 411.0 |
677 | PV31 | 1999JJ00 | 12213.0 | 1584.0 |
678 | PV31 | 2000JJ00 | 12235.0 | 615.0 |
679 | PV31 | 2001JJ00 | 11884.0 | 642.0 |
680 | PV31 | 2002JJ00 | 11484.0 | -311.0 |
681 | PV31 | 2003JJ00 | 11190.0 | -1579.0 |
682 | PV31 | 2004JJ00 | 10753.0 | -1338.0 |
683 | PV31 | 2005JJ00 | 10051.0 | -2831.0 |
684 | PV31 | 2006JJ00 | 9857.0 | -2389.0 |
685 | PV31 | 2007JJ00 | 9458.0 | -2099.0 |
686 | PV31 | 2008JJ00 | 9401.0 | 986.0 |
687 | PV31 | 2009JJ00 | 9741.0 | 1264.0 |
688 | PV31 | 2010JJ00 | 9620.0 | 2288.0 |
689 | PV31 | 2011JJ00 | 9288.0 | 2666.0 |
690 | PV31 | 2012JJ00 | 9109.0 | 1900.0 |
691 | PV31 | 2013JJ00 | 9011.0 | 1255.0 |
692 | PV31 | 2014JJ00 | 8841.0 | 1176.0 |
693 | PV31 | 2015JJ00 | 8739.0 | 2627.0 |
694 | PV31 | 2016JJ00 | 8994.0 | 4882.0 |
695 | PV31 | 2017JJ00 | 8821.0 | 3955.0 |
696 rows × 4 columns
The values of the Regions column contain trailing spaces:
population.iloc[0,0]
'PV20 '
We are going to remove them:
population['Regions'] = population['Regions'].map(lambda x: x.strip())
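Equivalently, we could use pandas' vectorized string methods via the .str accessor instead of a lambda:
# same effect as the .map(lambda ...) above
population['Regions'] = population['Regions'].str.strip()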
By having a look at the metadata (using CTRL-F to search for PV20), we can find out what the province codes actually mean. Let’s recode them by using a dict to map the codes to more meaningful values.
provinces = {"PV20":"Groningen",
"PV21":"Friesland",
"PV22":"Drenthe",
"PV23":"Overijssel",
"PV24":"Flevoland",
"PV25":"Gelderland",
"PV26":"Utrecht",
"PV27":"Noord-Holland",
"PV28":"Zuid-Holland",
"PV29":"Zeeland",
"PV30":"Noord-Brabant",
"PV31":"Limburg"}
population['Regions'] = population['Regions'].map(provinces)
population.head()
| Regions | Periods | LiveBornChildren_2 | NetMigrationExcludingAdministrative_19
---|---|---|---|---
0 | Groningen | 1960JJ00 | 8868.0 | -1748.0 |
1 | Groningen | 1961JJ00 | 9062.0 | -1087.0 |
2 | Groningen | 1962JJ00 | 9165.0 | -623.0 |
3 | Groningen | 1963JJ00 | 9518.0 | 547.0 |
4 | Groningen | 1964JJ00 | 9671.0 | 305.0 |
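Note that .map() silently returns NaN for any code that is missing from the dict, so it is good practice to verify that every province was matched. A quick sanity check (a sketch):
# all codes should have been mapped; unmapped codes would show up as NaN
assert population['Regions'].notna().all(), 'some province codes were not mapped'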
Let’s also represent the Periods column in a better way. It is a string now, and only the first four characters (the year) are meaningful. Let’s convert those to an integer. Alternatively, we could convert the column to dates (so-called datetime objects); a sketch of that approach follows below.
population['Periods'] = population['Periods'].map(lambda x: int(x[:4]))
population.head()
| Regions | Periods | LiveBornChildren_2 | NetMigrationExcludingAdministrative_19
---|---|---|---|---
0 | Groningen | 1960 | 8868.0 | -1748.0 |
1 | Groningen | 1961 | 9062.0 | -1087.0 |
2 | Groningen | 1962 | 9165.0 | -623.0 |
3 | Groningen | 1963 | 9518.0 | 547.0 |
4 | Groningen | 1964 | 9671.0 | 305.0 |
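For reference, the datetime alternative mentioned above would look like this (a sketch; it operates on the original '1960JJ00' strings, so it would replace the integer conversion rather than follow it):
# alternative: turn the year into a datetime object instead of an integer
# population['Periods'] = pd.to_datetime(population['Periods'].str[:4], format='%Y')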
Let’s save this:
population.to_csv('population.csv')
population.to_json('population.json')
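Note that to_csv() also writes the row index as an unnamed first column. When reading the file back, you can either treat that column as the index again or suppress it at writing time. A sketch:
# read the file back, treating the first (unnamed) column as the index ...
population_check = pd.read_csv('population.csv', index_col=0)
# ... or avoid writing the index in the first place:
# population.to_csv('population.csv', index=False)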
Economic data¶
We now do exactly the same for our economic dataset. (A small helper function that bundles these recurring cleaning steps is sketched at the end of this section.)
economy = pd.read_csv('82800ENG_UntypedDataSet_20112018_133529.csv', delimiter=';')
economy
| ID | EconomicSectorsSIC2008 | Regions | Periods | GDPVolumeChanges_1
---|---|---|---|---|---
0 | 132 | T001081 | PV20 | 1996JJ00 | 9.3 |
1 | 133 | T001081 | PV20 | 1997JJ00 | -2.0 |
2 | 134 | T001081 | PV20 | 1998JJ00 | -0.9 |
3 | 135 | T001081 | PV20 | 1999JJ00 | -0.7 |
4 | 136 | T001081 | PV20 | 2000JJ00 | 1.5 |
5 | 137 | T001081 | PV20 | 2001JJ00 | 3.9 |
6 | 138 | T001081 | PV20 | 2002JJ00 | 2.1 |
7 | 139 | T001081 | PV20 | 2003JJ00 | 2.3 |
8 | 140 | T001081 | PV20 | 2004JJ00 | 2.2 |
9 | 141 | T001081 | PV20 | 2005JJ00 | -0.7 |
10 | 142 | T001081 | PV20 | 2006JJ00 | 1.1 |
11 | 143 | T001081 | PV20 | 2007JJ00 | -2.1 |
12 | 144 | T001081 | PV20 | 2008JJ00 | 8.7 |
13 | 145 | T001081 | PV20 | 2009JJ00 | -4.2 |
14 | 146 | T001081 | PV20 | 2010JJ00 | 10.4 |
15 | 147 | T001081 | PV20 | 2011JJ00 | -2.9 |
16 | 148 | T001081 | PV20 | 2012JJ00 | -0.5 |
17 | 149 | T001081 | PV20 | 2013JJ00 | 5.7 |
18 | 150 | T001081 | PV20 | 2014JJ00 | -7.6 |
19 | 151 | T001081 | PV20 | 2015JJ00 | -8.3 |
20 | 152 | T001081 | PV20 | 2016JJ00 | -1.8 |
21 | 153 | T001081 | PV20 | 2017JJ00 | -0.6 |
22 | 154 | T001081 | PV21 | 1996JJ00 | 2.6 |
23 | 155 | T001081 | PV21 | 1997JJ00 | 2.6 |
24 | 156 | T001081 | PV21 | 1998JJ00 | 6.8 |
25 | 157 | T001081 | PV21 | 1999JJ00 | 3.0 |
26 | 158 | T001081 | PV21 | 2000JJ00 | 3.4 |
27 | 159 | T001081 | PV21 | 2001JJ00 | 2.5 |
28 | 160 | T001081 | PV21 | 2002JJ00 | -1.9 |
29 | 161 | T001081 | PV21 | 2003JJ00 | 1.1 |
... | ... | ... | ... | ... | ... |
234 | 366 | T001081 | PV30 | 2010JJ00 | 3.3 |
235 | 367 | T001081 | PV30 | 2011JJ00 | 3.5 |
236 | 368 | T001081 | PV30 | 2012JJ00 | -0.6 |
237 | 369 | T001081 | PV30 | 2013JJ00 | -0.9 |
238 | 370 | T001081 | PV30 | 2014JJ00 | 1.9 |
239 | 371 | T001081 | PV30 | 2015JJ00 | 3.4 |
240 | 372 | T001081 | PV30 | 2016JJ00 | 2.5 |
241 | 373 | T001081 | PV30 | 2017JJ00 | 3.7 |
242 | 374 | T001081 | PV31 | 1996JJ00 | 2.5 |
243 | 375 | T001081 | PV31 | 1997JJ00 | 5.8 |
244 | 376 | T001081 | PV31 | 1998JJ00 | 4.9 |
245 | 377 | T001081 | PV31 | 1999JJ00 | 5.5 |
246 | 378 | T001081 | PV31 | 2000JJ00 | 2.7 |
247 | 379 | T001081 | PV31 | 2001JJ00 | 1.6 |
248 | 380 | T001081 | PV31 | 2002JJ00 | 0.4 |
249 | 381 | T001081 | PV31 | 2003JJ00 | -0.8 |
250 | 382 | T001081 | PV31 | 2004JJ00 | 2.8 |
251 | 383 | T001081 | PV31 | 2005JJ00 | 0.3 |
252 | 384 | T001081 | PV31 | 2006JJ00 | 2.9 |
253 | 385 | T001081 | PV31 | 2007JJ00 | 4.5 |
254 | 386 | T001081 | PV31 | 2008JJ00 | 0.3 |
255 | 387 | T001081 | PV31 | 2009JJ00 | -5.3 |
256 | 388 | T001081 | PV31 | 2010JJ00 | 2.2 |
257 | 389 | T001081 | PV31 | 2011JJ00 | 2.2 |
258 | 390 | T001081 | PV31 | 2012JJ00 | -1.5 |
259 | 391 | T001081 | PV31 | 2013JJ00 | -0.7 |
260 | 392 | T001081 | PV31 | 2014JJ00 | 0.2 |
261 | 393 | T001081 | PV31 | 2015JJ00 | 2.7 |
262 | 394 | T001081 | PV31 | 2016JJ00 | 2.2 |
263 | 395 | T001081 | PV31 | 2017JJ00 | 3.2 |
264 rows × 5 columns
# We only downloaded the total over all economic sectors, so this column is constant and can safely be dropped:
economy['EconomicSectorsSIC2008'].value_counts()
T001081 264
Name: EconomicSectorsSIC2008, dtype: int64
economy.drop(['EconomicSectorsSIC2008', 'ID'], axis=1, inplace=True)
economy['Regions'] = economy['Regions'].map(lambda x: x.strip())
economy['Regions'] = economy['Regions'].map(provinces)
economy['Periods'] = economy['Periods'].map(lambda x: int(x[:4]))
economy.head()
| Regions | Periods | GDPVolumeChanges_1
---|---|---|---
0 | Groningen | 1996 | 9.3 |
1 | Groningen | 1997 | -2.0 |
2 | Groningen | 1998 | -0.9 |
3 | Groningen | 1999 | -0.7 |
4 | Groningen | 2000 | 1.5 |
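In retrospect, both datasets needed exactly the same cleaning steps, so they could be bundled in a small helper function. A sketch (clean() and its arguments are hypothetical names, not part of the original notebook):
def clean(df, drop_cols):
    """Drop unneeded columns, tidy the region codes, and extract the year."""
    df = df.drop(drop_cols, axis=1)
    df['Regions'] = df['Regions'].str.strip().map(provinces)
    df['Periods'] = df['Periods'].str[:4].astype(int)
    return df

# usage on freshly loaded data, e.g.:
# economy = clean(pd.read_csv('82800ENG_UntypedDataSet_20112018_133529.csv', delimiter=';'),
#                 drop_cols=['EconomicSectorsSIC2008', 'ID'])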
economy.to_csv('economy.csv')
economy.to_json('economy.json')