1. On Jupyter Notebooks

Written by Damian Trilling, Penny Sheets, Frederic Hopp

This notebook is meant to show you what you can do with a Jupyter Notebook. Feel free to play around!

Downloading this notebook

You can download any notebook from this site by navigating to the download button in the top-right corner, right-clicking the .ipynb button, and selecting “save link as”. This allows you to save the notebook directly to a location on your computer. Alternatively, you can find all of these notebooks on the corresponding GitHub page. GitHub is a very popular platform for sharing code, and it plays a central role in open-source software development and in open science. People who want to make their data analyses transparent usually share their code there.

To execute this notebook on your local machine, make sure to open it FROM WITHIN a Jupyter Notebook session on your computer.


Cell types

There are different types of cells: Code cells and Markdown cells (there are two more, but they are not necessary for our purposes). Markdown/Text cells contain text, and code cells contain, well, code. You can edit a cell by double-clicking on it.

To ‘run’ a cell in Anaconda (in the case of markdown/text, to format it), press CTRL-Enter. (You can also hit ‘run’ up in the menu buttons.)

Note: Depending on your keyboard layout and configuration, cells might also be executable by hitting SHIFT+Enter.

Try it out now! Double-click on this cell, and then hit CTRL-Enter to format it in Anaconda.

(Tip: If you press CTRL-Enter and nothing happens, it most likely means you’re not in a markdown/text cell, but in some other kind of cell.)

If you want to know more about formatting with markdown, have a look at https://guides.github.com/features/mastering-markdown/

This is a markdown cell. Write whatever you want in it.

To create a new cell in Anaconda

…you hit the ‘plus sign’ button in the toolbar at the top left of the Jupyter Notebook. By default, this creates a code cell, but you can change it to a markdown cell via the dropdown menu on the right of that same toolbar. Try creating a new markdown cell below this one. Don’t forget CTRL+Enter to format it. (You can also move cells around, using the up and down arrows in the toolbar.)

Note that there are various ways to format things; using hashtags allows for bigger, bolder fonts.

More hashtags give smaller fonts, though these are still bigger and bolder than text without hashtags.
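For instance, the raw markdown behind headings like the two above looks roughly like this (a small sketch; you can double-click any formatted cell to see its actual source):

# One hashtag gives the biggest heading
## Two hashtags give a somewhat smaller heading
#### Four hashtags give a smaller heading still
Text without any hashtags is rendered as a normal paragraph.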

Try creating your new markdown/text cell below here.

Running Python code in Jupyter

Now we can start with some actual Python commands, actual code, instead of markdown/text.

Let’s try to print something… don’t forget to hit CTRL+Enter to run the command (or use the ‘run’ button in the toolbar).

Once a code cell has been run, a number appears next to it (reflecting the order of execution). This can be very useful for checking whether you’ve already run a cell or not - for instance, when you import certain packages or load specific datasets.

print('Hello world')
Hello world

Now create your own print command in the next cell. Print whatever you want. The key is to format the command correctly: you need parentheses and quotation marks, and you need to close them all afterwards. Python helps you with this quite a bit (for example, look at how the colors change when you format something (in)correctly), but you have to practice.

#Note, in a code cell, you can also preface a line of text with a hashtag, and python will ignore it as code.
#This can be useful for very short notes within particular code cells, rather than creating separate markdown/text cells.
#But watch the length of your cells!

Python also allows us to do very simple calculations. Just tell it the values and make it do the work:

a = 5
b = 10
c = a + b
print(c)
15

Many commands have output, but some won’t (loading a package of tools or a dataset, for example). But remember to check the number next to the cell to see whether you’ve actually run it or not. Often, you have to start back at the beginning because you missed a simple but important step.

Good to know: you can always clear what you’ve done and re-run various (or all) cells by using the “Cell” menu in Anaconda.

Note: If you clear everything, then all imported data and modules (see next point) are also cleared. So you can’t just start running your commands in the middle of the notebook; you’ll often have to go back to the earlier cells and start from scratch. (Since the code is already written, this sometimes takes literally only seconds to get back to where you were.)


Importing Modules

Because we want to do a lot more than printing words and running simple calculations, we can import modules that help us do fancier things - and in particular, help us read data easily. Our main module in this course is called “pandas”. Whenever you import anything into Python, it needs to have a name. You can either keep the original name - pandas - (by just typing import pandas) or shorten it so you don’t have to type it again and again. One commonly used shorthand is pd for pandas. Try importing it now with the following command. You’ll see that no output appears, but - if you’ve run it correctly - you should end up with a number next to the cell.

import pandas as pd

We’ll explain a bit more about pandas in class, but, in short, pandas allows us to work with data more easily. So anytime you see a line of code below that starts with pd., it means pandas is at work.
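To see what the shorthand buys you, compare the following two (equivalent) ways of reading a file; the file name some_file.csv is just a placeholder for illustration:

# without an alias, you type the full module name every time:
import pandas
pandas.read_csv('some_file.csv')

# with an alias, the same call becomes shorter:
import pandas as pd
pd.read_csv('some_file.csv')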

Here, pandas can help us read in a dataset from the web - just a random dataset that is often used to illustrate things like statistical programming. The command is simply pd.read_ followed by the type of the file, with its URL as the argument. In this case, it is a CSV file, but Python can also read many other file types - we’ll address more of those in a minute (see also the short sketch below the output).

In this case, the dataset comes from the URL listed here, and, because the dataset is originally called ‘iris’, we will also tell Python to call the dataset iris. But we could call it anything else we want.

As for the second line of code here: if we just type the name of the dataset after having read it in, Jupyter also displays a chunk of the dataset for us to see. This is handy, as long as you don’t have insanely huge datasets.

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns
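As an aside: if a dataset is large, you can display just its first rows instead of the whole thing, and pandas ships similar read_ functions for many other file formats. A short sketch (the Excel and JSON file names are placeholders, not real files):

iris.head()                      # shows only the first five rows
# pd.read_excel('myfile.xlsx')   # reads an Excel spreadsheet (placeholder file name)
# pd.read_json('myfile.json')    # reads a JSON file (placeholder file name)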

Methods

In Python, basically everything is an “object”. That’s also why we assign names to them: It makes the objects re-usable. In the cell above, we created an object (more specifically, a pandas dataframe) that we called iris.

Objects can have “methods” that are associated with them. Pandas dataframes, for example, have some methods that allow you to directly run some simple analyses on them. One of them is .describe().

Note the () at the end. If you want to “call” (= execute, run) a method, you need to end with these parentheses. They also allow you to give some additional “arguments” (parameters, options). Compare the following two method calls:

iris.describe()
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
iris.describe(percentiles=[0.1, 0.9])
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
10% 4.800000 2.500000 1.400000 0.200000
50% 5.800000 3.000000 4.350000 1.300000
90% 6.900000 3.610000 5.800000 2.200000
max 7.900000 4.400000 6.900000 2.500000
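By the way, arguments can change not only the numbers you get, but also which columns are included. For instance (a variation we haven’t run above), the following call would also describe the non-numeric species column:

iris.describe(include='all')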

One more note: just as SPSS offers syntax help, Python is happy to help you. You can type a command, put a question mark after it, and run the cell; Python will then explain that command to you. Try it here:

iris.describe?
Signature:
iris.describe(
    percentiles=None,
    include=None,
    exclude=None,
    datetime_is_numeric=False,
) -> 'FrameOrSeries'
Docstring:
Generate descriptive statistics.

Descriptive statistics include those that summarize the central
tendency, dispersion and shape of a
dataset's distribution, excluding ``NaN`` values.

Analyzes both numeric and object series, as well
as ``DataFrame`` column sets of mixed data types. The output
will vary depending on what is provided. Refer to the notes
below for more detail.

Parameters
----------
percentiles : list-like of numbers, optional
    The percentiles to include in the output. All should
    fall between 0 and 1. The default is
    ``[.25, .5, .75]``, which returns the 25th, 50th, and
    75th percentiles.
include : 'all', list-like of dtypes or None (default), optional
    A white list of data types to include in the result. Ignored
    for ``Series``. Here are the options:

    - 'all' : All columns of the input will be included in the output.
    - A list-like of dtypes : Limits the results to the
      provided data types.
      To limit the result to numeric types submit
      ``numpy.number``. To limit it instead to object columns submit
      the ``numpy.object`` data type. Strings
      can also be used in the style of
      ``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
      select pandas categorical columns, use ``'category'``
    - None (default) : The result will include all numeric columns.
exclude : list-like of dtypes or None (default), optional,
    A black list of data types to omit from the result. Ignored
    for ``Series``. Here are the options:

    - A list-like of dtypes : Excludes the provided data types
      from the result. To exclude numeric types submit
      ``numpy.number``. To exclude object columns submit the data
      type ``numpy.object``. Strings can also be used in the style of
      ``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
      exclude pandas categorical columns, use ``'category'``
    - None (default) : The result will exclude nothing.
datetime_is_numeric : bool, default False
    Whether to treat datetime dtypes as numeric. This affects statistics
    calculated for the column. For DataFrame input, this also
    controls whether datetime columns are included by default.

    .. versionadded:: 1.1.0

Returns
-------
Series or DataFrame
    Summary statistics of the Series or Dataframe provided.

See Also
--------
DataFrame.count: Count number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.mean: Mean of the values.
DataFrame.std: Standard deviation of the observations.
DataFrame.select_dtypes: Subset of a DataFrame including/excluding
    columns based on their dtype.

Notes
-----
For numeric data, the result's index will include ``count``,
``mean``, ``std``, ``min``, ``max`` as well as lower, ``50`` and
upper percentiles. By default the lower percentile is ``25`` and the
upper percentile is ``75``. The ``50`` percentile is the
same as the median.

For object data (e.g. strings or timestamps), the result's index
will include ``count``, ``unique``, ``top``, and ``freq``. The ``top``
is the most common value. The ``freq`` is the most common value's
frequency. Timestamps also include the ``first`` and ``last`` items.

If multiple object values have the highest count, then the
``count`` and ``top`` results will be arbitrarily chosen from
among those with the highest count.

For mixed data types provided via a ``DataFrame``, the default is to
return only an analysis of numeric columns. If the dataframe consists
only of object and categorical data without any numeric columns, the
default is to return an analysis of both the object and categorical
columns. If ``include='all'`` is provided as an option, the result
will include a union of attributes of each type.

The `include` and `exclude` parameters can be used to limit
which columns in a ``DataFrame`` are analyzed for the output.
The parameters are ignored when analyzing a ``Series``.

Examples
--------
Describing a numeric ``Series``.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical ``Series``.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp ``Series``.

>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a ``DataFrame``. By default only numeric fields
are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a ``DataFrame`` regardless of data type.

>>> df.describe(include='all')  # doctest: +SKIP
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a ``DataFrame`` by accessing it as
an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a ``DataFrame`` description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a ``DataFrame`` description.

>>> df.describe(include=[object])  # doctest: +SKIP
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a ``DataFrame`` description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.number])  # doctest: +SKIP
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[object])  # doctest: +SKIP
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
File:      ~/opt/anaconda3/envs/dj21/lib/python3.7/site-packages/pandas/core/generic.py
Type:      method

Functions

Besides methods, there are also functions. Just like methods, functions take one or more “arguments” (i.e., some input) between (). They then return some output. But unlike methods, they are not directly associated with a specific object.

You already know one function: print().

Let’s try out some functions. First, create two objects to play with:

mystring = "Hello world"
mylist = [22, 3, 4]

Now, try out the following functions and see what they do to those objects. Can you explain what each of them does?

len(mystring)
11
len(mylist)
3
sum(mylist)
29

Now, let’s combine these techniques. Each string has the method .split() that splits it into words. With this knowledge, can you calculate the number of words in mystring by using the output of this method as input for the len() function?
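If you want to check your answer: one way to combine the two is to nest the method call inside the function call.

# .split() turns the string into a list of words;
# len() then counts the elements of that list
len(mystring.split())
2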

You see…

It’s actually not that difficult. It can seem overwhelming not to know the code for things, but that’s what we’re going to teach you. And there are tons of resources online to help as well (check the resources tab on the left!).

What’s wonderful about Jupyter Notebook is that we have code, results, and explanations/notes in one single file. You will also format your assignments this way, using markdown cells to provide notes. The ‘output’ or results don’t matter that much in the file itself, because we can always re-run the code each time we open your files. But the markdown and code cells are essential.

We are looking forward to exploring the possibilities of Jupyter Notebook, Python, and Pandas with you in the next weeks!