{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Basic Statistics\n", "\n", "*Damian Trilling and Penny Sheets*\n", "\n", "This notebook is designed to show you some ways to use Python for basic statistical analysis of numbers, and to explore some methods that go beyond `df.describe()` or `Counter()`, which we used last week. In particular, we are going to look into analyzing numerical data. Next week, we will focus on textual data.\n", "\n", "The dataset we use in this example is a subset of the data presented in Trilling, D. (2013). *Following the news. Patterns of online and offline news use*. PhD thesis, University of Amsterdam. http://hdl.handle.net/11245/1.394551\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import our tools/modules/libraries\n", "\n", "As always, we first import the tools we'll need. Today, we'll use pandas (usually imported as \"pd\") for working with dataframes, statsmodels and scipy for statistical tests and models, and numpy for numerical operations. We also use matplotlib and seaborn for some visualizations. Several of these imports are only needed for specific analyses later on; you don't have to worry about all of them right now.\n", "\n", "If you want to learn more about these modules, each of them has extensive online documentation." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import seaborn as sns\n", "import statsmodels as sm\n", "import statsmodels.formula.api as smf\n", "from statsmodels.stats.weightstats import ttest_ind\n", "from scipy.stats import kendalltau\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read data into a dataframe\n", "We will read a dataset based on Trilling (2013). 
It contains some sociodemographic variables as well as the number of days per week on which the respondent uses a specific medium to access information about news and current affairs.\n", "\n", "You can either download the dataset (with the 'save page as' method, making sure .txt isn't appended to the file extension) into the same folder as this Jupyter notebook, or read it directly from its URL, as the code below does: https://raw.githubusercontent.com/damian0604/bdaca/master/ipynb/mediause.csv\n", "\n", "Remember that the name 'df' here is arbitrary; last week we used the names 'iris' and 'stockdata' and others; this week we're going more basic and just saying 'df' for dataframe." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# df = pd.read_csv('mediause.csv') # if you downloaded and stored the file locally \n", "df = pd.read_csv('https://raw.githubusercontent.com/damian0604/bdaca/master/ipynb/mediause.csv') # if directly reading it from source " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `.keys()` method is a way to find out what the columns are in your dataframe (it returns the same thing as `df.columns`). Sometimes they have nice labels already, and sometimes they don't. In this case, we're in luck." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['gender', 'age', 'education', 'radio', 'newspaper', 'tv', 'internet'], dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that for a dataframe or any other object in Python, you can simply type its name in a code cell and the notebook will display it as best it can. (In this case, it works well.) " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " gender age education radio newspaper tv internet\n", "0 1 71 4.0 5 6 5 0\n", "1 1 40 2.0 6 0 0 0\n", "2 1 41 2.0 4 3 7 3\n", "3 0 65 5.0 0 0 5 0\n", "4 0 39 2.0 0 1 7 7\n", "... ... ... ... ... ... .. ...\n", "2076 0 49 5.0 3 6 6 0\n", "2077 0 51 4.0 7 7 5 5\n", "2078 1 31 6.0 3 5 5 6\n", "2079 0 58 6.0 3 3 1 0\n", "2080 1 21 3.0 2 6 6 4\n", "\n", "[2081 rows x 7 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explore the dataset\n", "Let's do some descriptive statistics, using the .describe() method we saw last week. This would be important if you wanted to describe the dataset to your audience, for example." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " gender age education radio newspaper \\\n", "count 2081.000000 2081.000000 2065.000000 2081.000000 2081.000000 \n", "mean 0.481499 46.073522 4.272639 3.333974 3.111004 \n", "std 0.499778 18.267401 1.661451 2.699082 2.853082 \n", "min 0.000000 13.000000 1.000000 0.000000 0.000000 \n", "25% 0.000000 31.000000 3.000000 0.000000 0.000000 \n", "50% 0.000000 46.000000 4.000000 4.000000 3.000000 \n", "75% 1.000000 61.000000 6.000000 6.000000 6.000000 \n", "max 1.000000 95.000000 7.000000 7.000000 7.000000 \n", "\n", " tv internet \n", "count 2081.000000 2081.000000 \n", "mean 4.167227 2.684286 \n", "std 2.517039 2.786262 \n", "min 0.000000 0.000000 \n", "25% 2.000000 0.000000 \n", "50% 5.000000 2.000000 \n", "75% 7.000000 5.000000 \n", "max 7.000000 7.000000 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to find out how many possible values there are for a specific variable, you can use the `.value_counts()` method. In this case, you select the dataframe (which we've called `df`), select the column/variable you want to examine, and then apply the method.\n", "\n", "The output shows us that there are two values - 0 and 1 - for the 'gender' variable. It gives us how many instances (aka frequencies) of each of these values exist in the dataset." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1079\n", "1 1002\n", "Name: gender, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['gender'].value_counts()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.0 667\n", "2.0 323\n", "5.0 178\n", "6.0 396\n", "3.0 214\n", "7.0 219\n", "1.0 68\n", "Name: education, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# As with any method, value_counts() has parameters we can adjust.\n", "# By default, the results are sorted by the size of the count, but we can\n", "# switch that sorting off with sort=False. The result is then unsorted (not random):\n", "# here the values seem to come in the order in which they first appear in the data.\n", "# Compare the results.\n", "\n", "df['education'].value_counts(sort=False)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.0 667\n", "6.0 396\n", "2.0 323\n", "7.0 219\n", "3.0 214\n", "5.0 178\n", "1.0 68\n", "Name: education, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['education'].value_counts(sort=True)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0 68\n", "2.0 323\n", "3.0 214\n", "4.0 667\n", "5.0 178\n", "6.0 396\n", "7.0 219\n", "Name: education, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# If it is useful to sort by the index - i.e. the education level here - then you can specify that as follows:\n", "df['education'].value_counts().sort_index()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Signature:\n", "test.value_counts(\n", "    normalize: 'bool' = False,\n", "    sort: 'bool' = True,\n", "    ascending: 'bool' = False,\n", "    bins=None,\n", "    dropna: 'bool' = True,\n", ")\n", "Docstring:\n", "Return a Series containing counts of unique values.\n", "\n", "The resulting object will be in descending order so that the\n", "first element is the most frequently-occurring element.\n", "Excludes NA values by default.\n", "\n", "Parameters\n", "----------\n", "normalize : bool, default False\n", " If True then the object returned will contain the relative\n", " frequencies of the unique values.\n", "sort : bool, default True\n", " Sort by frequencies.\n", "ascending : bool, default False\n", " Sort in ascending order.\n", "bins : int, optional\n", " 
Rather than count values, group them into half-open bins,\n", " a convenience for ``pd.cut``, only works with numeric data.\n", "dropna : bool, default True\n", " Don't include counts of NaN.\n", "\n", "Returns\n", "-------\n", "Series\n", "\n", "See Also\n", "--------\n", "Series.count: Number of non-NA elements in a Series.\n", "DataFrame.count: Number of non-NA elements in a DataFrame.\n", "DataFrame.value_counts: Equivalent method on DataFrames.\n", "\n", "Examples\n", "--------\n", ">>> index = pd.Index([3, 1, 2, 3, 4, np.nan])\n", ">>> index.value_counts()\n", "3.0 2\n", "1.0 1\n", "2.0 1\n", "4.0 1\n", "dtype: int64\n", "\n", "With `normalize` set to `True`, returns the relative frequency by\n", "dividing all values by the sum of values.\n", "\n", ">>> s = pd.Series([3, 1, 2, 3, 4, np.nan])\n", ">>> s.value_counts(normalize=True)\n", "3.0 0.4\n", "1.0 0.2\n", "2.0 0.2\n", "4.0 0.2\n", "dtype: float64\n", "\n", "**bins**\n", "\n", "Bins can be useful for going from a continuous variable to a\n", "categorical variable; instead of counting unique\n", "apparitions of values, divide the index in the specified\n", "number of half-open bins.\n", "\n", ">>> s.value_counts(bins=3)\n", "(0.996, 2.0] 2\n", "(2.0, 3.0] 2\n", "(3.0, 4.0] 1\n", "dtype: int64\n", "\n", "**dropna**\n", "\n", "With `dropna` set to `False` we can also see NaN index values.\n", "\n", ">>> s.value_counts(dropna=False)\n", "3.0 2\n", "1.0 1\n", "2.0 1\n", "4.0 1\n", "NaN 1\n", "dtype: int64\n", "File: ~/opt/anaconda3/envs/dj21/lib/python3.7/site-packages/pandas/core/base.py\n", "Type: method\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# You can also use the notebook's help syntax (a trailing ?) to print info about this method.\n", "# In this case an additional step may be needed: the ? syntax looks up a name and may not\n", "# evaluate an expression such as df['education'].value_counts for you.\n", "# So we first assign the column to a variable, and then ask for help on that variable's method.\n", "\n", "test = df['education']\n", "test.value_counts?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also display value counts for multiple variables at once, to get an overview of your data. To do so, we use a loop that repeats the same commands for each of the four media types. We'll do this next, and we'll also set a few options so that it prints out nicely. \n", "\n", "See if you can figure out what each of these print commands is doing." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RADIO\n", "0 0.292167\n", "7 0.199904\n", "5 0.169149\n", "3 0.082653\n", "4 0.075925\n", "2 0.066314\n", "6 0.058145\n", "1 0.055742\n", "Name: radio, dtype: float64\n", "-------------------------------------------\n", "\n", "NEWSPAPER\n", "0 0.356559\n", "6 0.252763\n", "7 0.126862\n", "1 0.081211\n", "2 0.061028\n", "5 0.055262\n", "3 0.038443\n", "4 0.027871\n", "Name: newspaper, dtype: float64\n", "-------------------------------------------\n", "\n", "TV\n", "7 0.271024\n", "5 0.149447\n", "0 0.143681\n", "6 0.112446\n", "4 0.095147\n", "3 0.082653\n", "2 0.074003\n", "1 0.071600\n", "Name: tv, dtype: float64\n", "-------------------------------------------\n", "\n", "INTERNET\n", "0 0.389716\n", "7 0.197021\n", "1 0.090822\n", "2 0.083614\n", "3 0.072081\n", "5 0.069678\n", "4 0.049976\n", "6 0.047093\n", "Name: internet, dtype: float64\n", "-------------------------------------------\n", "\n" ] } ], "source": [ "for medium in ['radio','newspaper','tv','internet']:\n", "    print(medium.upper())\n", "    print(df[medium].value_counts(sort=True, normalize=True))\n", "    print('-------------------------------------------\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So that's one way to start exploring a dataset generally.\n", "\n", "## Groupby\n", "\n", "Let's say you'd like to 
compare the media use of men and women in the dataset. Eventually we'll move toward statistical comparison, but for now we can start by looking at their descriptive statistics - separately for men and women.\n", "\n", "In Python, this is quite easy, using the pandas `.groupby()` method.\n", "\n", "First, we group the dataframe by the 'gender' variable, and then apply a method to that grouped dataframe; this is called 'chaining' multiple methods together. (We saw a bit of this chaining idea last week already.)\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age education \\\n", " count mean std min 25% 50% 75% max count \n", "gender \n", "0 1079.0 44.744208 17.508053 13.0 30.0 44.0 59.0 86.0 1072.0 \n", "1 1002.0 47.504990 18.956032 13.0 33.0 47.0 63.0 95.0 993.0 \n", "\n", " ... tv internet \\\n", " mean ... 75% max count mean std min 25% 50% \n", "gender ... \n", "0 4.212687 ... 7.0 7.0 1079.0 2.370714 2.682610 0.0 0.0 1.0 \n", "1 4.337362 ... 7.0 7.0 1002.0 3.021956 2.856809 0.0 0.0 2.0 \n", "\n", " \n", " 75% max \n", "gender \n", "0 5.0 7.0 \n", "1 6.0 7.0 \n", "\n", "[2 rows x 48 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('gender').describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With a wide table like this one, it is often more useful to transpose the result, turning columns into rows and vice versa, which makes the display much easier to read. To do so, we add `.T` at the end, after the `describe()` method. This doesn't change the dataframe `df` in any way; it only changes how the result is displayed here." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "gender 0 1\n", "age count 1079.000000 1002.000000\n", " mean 44.744208 47.504990\n", " std 17.508053 18.956032\n", " min 13.000000 13.000000\n", " 25% 30.000000 33.000000\n", " 50% 44.000000 47.000000\n", " 75% 59.000000 63.000000\n", " max 86.000000 95.000000\n", "education count 1072.000000 993.000000\n", " mean 4.212687 4.337362\n", " std 1.600510 1.723294\n", " min 1.000000 1.000000\n", " 25% 3.000000 3.000000\n", " 50% 4.000000 4.000000\n", " 75% 6.000000 6.000000\n", " max 7.000000 7.000000\n", "radio count 1079.000000 1002.000000\n", " mean 3.068582 3.619760\n", " std 2.697646 2.672636\n", " min 0.000000 0.000000\n", " 25% 0.000000 0.000000\n", " 50% 3.000000 4.000000\n", " 75% 5.000000 6.000000\n", " max 7.000000 7.000000\n", "newspaper count 1079.000000 1002.000000\n", " mean 2.936052 3.299401\n", " std 2.838018 2.858672\n", " min 0.000000 0.000000\n", " 25% 0.000000 0.000000\n", " 50% 2.000000 3.000000\n", " 75% 6.000000 6.000000\n", " max 7.000000 7.000000\n", "tv count 1079.000000 1002.000000\n", " mean 4.075996 4.265469\n", " std 2.529193 2.501425\n", " min 0.000000 0.000000\n", " 25% 2.000000 2.000000\n", " 50% 5.000000 5.000000\n", " 75% 7.000000 7.000000\n", " max 7.000000 7.000000\n", "internet count 1079.000000 1002.000000\n", " mean 2.370714 3.021956\n", " std 2.682610 2.856809\n", " min 0.000000 0.000000\n", " 25% 0.000000 0.000000\n", " 50% 1.000000 2.000000\n", " 75% 5.000000 6.000000\n", " max 7.000000 7.000000" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('gender').describe().T" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "#try this again here, using a different variable as the grouping variable.\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Signature:\n", "df.groupby(\n", "    by=None,\n", "    axis: 'Axis' = 0,\n", "    level: 'Level | None' = None,\n", "    as_index: 'bool' = True,\n", "    sort: 'bool' = True,\n", "    group_keys: 'bool' = True,\n", "    squeeze: 'bool | lib.NoDefault' = <no_default>,\n", "    observed: 'bool' = False,\n", "    dropna: 'bool' = True,\n", ") -> 'DataFrameGroupBy'\n", "Docstring:\n", "Group DataFrame using a mapper or by a Series of columns.\n", "\n", "A groupby operation involves some combination of splitting the\n", "object, applying a function, and combining the results. This can be\n", "used to group large amounts of data and compute operations on these\n", "groups.\n", "\n", "Parameters\n", "----------\n", "by : mapping, function, label, or list of labels\n", " Used to determine the groups for the groupby.\n", " If ``by`` is a function, it's called on each value of the object's\n", " index. If a dict or Series is passed, the Series or dict VALUES\n", " will be used to determine the groups (the Series' values are first\n", " aligned; see ``.align()`` method). If an ndarray is passed, the\n", " values are used as-is to determine the groups. A label or list of\n", " labels may be passed to group by the columns in ``self``. Notice\n", " that a tuple is interpreted as a (single) key.\n", "axis : {0 or 'index', 1 or 'columns'}, default 0\n", " Split along rows (0) or columns (1).\n", "level : int, level name, or sequence of such, default None\n", " If the axis is a MultiIndex (hierarchical), group by a particular\n", " level or levels.\n", "as_index : bool, default True\n", " For aggregated output, return object with group labels as the\n", " index. Only relevant for DataFrame input. as_index=False is\n", " effectively \"SQL-style\" grouped output.\n", "sort : bool, default True\n", " Sort group keys. Get better performance by turning this off.\n", " Note this does not influence the order of observations within each\n", " group. 
Groupby preserves the order of rows within each group.\n", "group_keys : bool, default True\n", " When calling apply, add group keys to index to identify pieces.\n", "squeeze : bool, default False\n", " Reduce the dimensionality of the return type if possible,\n", " otherwise return a consistent type.\n", "\n", " .. deprecated:: 1.1.0\n", "\n", "observed : bool, default False\n", " This only applies if any of the groupers are Categoricals.\n", " If True: only show observed values for categorical groupers.\n", " If False: show all values for categorical groupers.\n", "dropna : bool, default True\n", " If True, and if group keys contain NA values, NA values together\n", " with row/column will be dropped.\n", " If False, NA values will also be treated as the key in groups\n", "\n", " .. versionadded:: 1.1.0\n", "\n", "Returns\n", "-------\n", "DataFrameGroupBy\n", " Returns a groupby object that contains information about the groups.\n", "\n", "See Also\n", "--------\n", "resample : Convenience method for frequency conversion and resampling\n", " of time series.\n", "\n", "Notes\n", "-----\n", "See the `user guide\n", "`__ for more.\n", "\n", "Examples\n", "--------\n", ">>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',\n", "... 'Parrot', 'Parrot'],\n", "... 'Max Speed': [380., 370., 24., 26.]})\n", ">>> df\n", " Animal Max Speed\n", "0 Falcon 380.0\n", "1 Falcon 370.0\n", "2 Parrot 24.0\n", "3 Parrot 26.0\n", ">>> df.groupby(['Animal']).mean()\n", " Max Speed\n", "Animal\n", "Falcon 375.0\n", "Parrot 25.0\n", "\n", "**Hierarchical Indexes**\n", "\n", "We can groupby different levels of a hierarchical index\n", "using the `level` parameter:\n", "\n", ">>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],\n", "... ['Captive', 'Wild', 'Captive', 'Wild']]\n", ">>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))\n", ">>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},\n", "... 
index=index)\n", ">>> df\n", " Max Speed\n", "Animal Type\n", "Falcon Captive 390.0\n", " Wild 350.0\n", "Parrot Captive 30.0\n", " Wild 20.0\n", ">>> df.groupby(level=0).mean()\n", " Max Speed\n", "Animal\n", "Falcon 370.0\n", "Parrot 25.0\n", ">>> df.groupby(level=\"Type\").mean()\n", " Max Speed\n", "Type\n", "Captive 210.0\n", "Wild 185.0\n", "\n", "We can also choose to include NA in group keys or not by setting\n", "`dropna` parameter, the default setting is `True`:\n", "\n", ">>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]\n", ">>> df = pd.DataFrame(l, columns=[\"a\", \"b\", \"c\"])\n", "\n", ">>> df.groupby(by=[\"b\"]).sum()\n", " a c\n", "b\n", "1.0 2 3\n", "2.0 2 5\n", "\n", ">>> df.groupby(by=[\"b\"], dropna=False).sum()\n", " a c\n", "b\n", "1.0 2 3\n", "2.0 2 5\n", "NaN 1 4\n", "\n", ">>> l = [[\"a\", 12, 12], [None, 12.3, 33.], [\"b\", 12.3, 123], [\"a\", 1, 1]]\n", ">>> df = pd.DataFrame(l, columns=[\"a\", \"b\", \"c\"])\n", "\n", ">>> df.groupby(by=\"a\").sum()\n", " b c\n", "a\n", "a 13.0 13.0\n", "b 12.3 123.0\n", "\n", ">>> df.groupby(by=\"a\", dropna=False).sum()\n", " b c\n", "a\n", "a 13.0 13.0\n", "b 12.3 123.0\n", "NaN 12.3 33.0\n", "File: ~/opt/anaconda3/envs/dj21/lib/python3.7/site-packages/pandas/core/frame.py\n", "Type: method\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# You can use help again here, to see all the parameters that groupby() accepts.\n", "\n", "df.groupby?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And, as we did last week, you can plot a simple histogram of the distribution of a variable across the dataset. For example, if you want to look at how 'radio' (that is, how many days per week a person uses radio) is distributed in your sample, you can use a histogram." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#Here, 'bins' refers to how many bars we want, essentially. If you don't specify, pandas falls back on\n", "#its default of 10 bins, which can be misleading. So if you know how many you want to display, you should specify.\n", "\n", "df['radio'].hist(bins=7)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "#Try to plot a histogram of internet news use here:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the modules we imported above helps us to make prettier plots (but no, it's not called \"pretty plot\" like \"pretty print\"). Here we can plot the value counts for internet news use in a bar chart, again sorted by the index.\n", "\n", "In particular, the histogram above works well for continuous variables that we want to group into a smaller number of bins (= bars). But if we only have a small number of discrete values (like here: the integers from 0 to 7), the alignment of the bars and axis labels can be confusing. 
\n", "\n", "Let's try to use `.plot()` to make a bar chart:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df['internet'].value_counts().sort_index().plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## POP QUIZ!\n", "\n", "Can you integrate this plotting method in your for-loop (from above) to get a nice series of plots? Fill in the missing line of code, below. But keep the plt.show() command afterward, in order to display all plots.\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RADIO\n", "0 0.292167\n", "7 0.199904\n", "5 0.169149\n", "3 0.082653\n", "4 0.075925\n", "2 0.066314\n", "6 0.058145\n", "1 0.055742\n", "Name: radio, dtype: float64\n", "-------------------------------------------\n", "\n", "NEWSPAPER\n", "0 0.356559\n", "6 0.252763\n", "7 0.126862\n", "1 0.081211\n", "2 0.061028\n", "5 0.055262\n", "3 0.038443\n", "4 0.027871\n", "Name: newspaper, dtype: float64\n", "-------------------------------------------\n", "\n", "TV\n", "7 0.271024\n", "5 0.149447\n", "0 0.143681\n", "6 0.112446\n", "4 0.095147\n", "3 0.082653\n", "2 0.074003\n", "1 0.071600\n", "Name: tv, dtype: float64\n", "-------------------------------------------\n", "\n", "INTERNET\n", "0 0.389716\n", "7 0.197021\n", "1 0.090822\n", "2 0.083614\n", "3 0.072081\n", "5 0.069678\n", "4 0.049976\n", "6 0.047093\n", "Name: internet, dtype: float64\n", "-------------------------------------------\n", "\n" ] } ], "source": [ "for medium in ['radio','newspaper','tv','internet']:\n", " print(medium.upper())\n", " print(df[medium].value_counts(sort=True, normalize=True))\n", " print('-------------------------------------------\\n')\n", " #YOUR CODE HERE\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And instead of (or in addition to) the plt.show(), you can also save these plots to your folder on your computer. 
The saved images are high quality and could be used in a published piece (provided you add appropriate axis titles, etc.), and you can specify the figure size and DPI.\n", "\n", "Note that we've added a 'figsize' specification to the end of the plot method in your missing line of code. You can play around with different figure sizes and see what happens when you display them here using plt.show()." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RADIO\n", "0 0.292167\n", "7 0.199904\n", "5 0.169149\n", "3 0.082653\n", "4 0.075925\n", "2 0.066314\n", "6 0.058145\n", "1 0.055742\n", "Name: radio, dtype: float64\n", "-------------------------------------------\n", "\n" ] }, { "data": { "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "NEWSPAPER\n", "0 0.356559\n", "6 0.252763\n", "7 0.126862\n", "1 0.081211\n", "2 0.061028\n", "5 0.055262\n", "3 0.038443\n", "4 0.027871\n", "Name: newspaper, dtype: float64\n", "-------------------------------------------\n", "\n" ] }, { "data": { "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "TV\n", "7 0.271024\n", "5 0.149447\n", "0 0.143681\n", "6 0.112446\n", "4 0.095147\n", "3 0.082653\n", "2 0.074003\n", "1 0.071600\n", "Name: tv, dtype: float64\n", "-------------------------------------------\n", "\n" ] }, { "data": { "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "INTERNET\n", "0 0.389716\n", "7 0.197021\n", "1 0.090822\n", "2 0.083614\n", "3 0.072081\n", "5 0.069678\n", "4 0.049976\n", "6 0.047093\n", "Name: internet, dtype: float64\n", "-------------------------------------------\n", "\n" ] }, { "data": { "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for medium in ['radio','newspaper','tv','internet']:\n", " print(medium.upper())\n", " print(df[medium].value_counts(sort=True, normalize=True))\n", " print('-------------------------------------------\\n')\n", " #YOUR CODE HERE ...(kind='bar', figsize=(6,4))\n", " plt.savefig('{}.png'.format(medium), dpi=400)\n", " plt.show()\n", "\n", "#Now go check your folder and see if the image files have shown up.\n", "#Note that we have to use the curly brackets and .format(medium) to give \n", "#each image file the relevant name (radio.png, newspaper.png, ...). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plots grouped by variables\n", "\n", "You can also create comparison histograms, side-by-side, for different values of a variable. For example, let's look at the histogram of internet news use for men and women in this dataset.\n", "\n", "Here, we're using the \"by=[' ']\" argument to specify which grouping variable we want, and again specifying the bins and the figure size, both of which you can play around with." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([,\n", " ], dtype=object)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df.hist(column='internet', by=['gender'], bins=7, figsize=(10,5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Statistical tests and subsetting datasets\n", "\n", "Now, if we want to move on to statistical comparisons, we can run our normal, basic statistics here in python as well. There's no need to import your dataset into SPSS if, for example, you want to know whether a specific difference is significant.\n", "\n", "### T-tests\n", "\n", "Let's start with a t-test, comparing the mean internet news use for men and women that we just examined in the histograms. \n", "\n", "In order to do this, we have to create two new subsets out of our original dataframe - one for men, one for women.\n", "\n", "We are using the ability to filter a dataframe (e.g., `df[df['gender']==1]` to create a dataframe only for males; adding `['internet']` at the end selects only the column for internet). This can be handy to select only relevant data for your story out of a much larger dataset!" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "males_internet = df[df['gender']==1]['internet']\n", "females_internet = df[df['gender']==0]['internet']\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each of these new objects (pandas Series, because we selected a single column) can then be described and explored just like a pandas dataframe. Remember that `.describe()` gives us the mean score (handy for our t-test!)."
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 1002.000000\n", "mean 3.021956\n", "std 2.856809\n", "min 0.000000\n", "25% 0.000000\n", "50% 2.000000\n", "75% 6.000000\n", "max 7.000000\n", "Name: internet, dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "males_internet.describe()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 1079.000000\n", "mean 2.370714\n", "std 2.682610\n", "min 0.000000\n", "25% 0.000000\n", "50% 1.000000\n", "75% 5.000000\n", "max 7.000000\n", "Name: internet, dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "females_internet.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see the male mean is 3.02, and the female mean is 2.37. But we don't know if, based on the sample, this is a significant difference. We don't want to make misleading claims in our story! So, run a t-test. (Specifically, an independent samples t-test.)\n", "\n", "The results return the test statistic, p-value, and the degrees of freedom (in that order). " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5.363006854632657, 9.094061516694626e-08, 2079.0)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ttest_ind(males_internet,females_internet)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that males use the internet significantly more often than females (that e-08 means the p-value is REALLY tiny). 
\n", "\n", "We could also do some pretty-printing if we wanted to, to display this more nicely for ourselves.\n", "\n", "The \".3f\" specification sets the number of decimal places (here: three); the integer before the colon is the position of the value in the output of the t-test command (0 = test statistic, 1 = p-value, 2 = degrees of freedom).\n", "\n", "And again, here we see the use of \".format()\" as a method to insert values from the ongoing calculation into a string." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "t(2079) = 5.363, p = 0.000\n" ] } ], "source": [ "results = ttest_ind(males_internet,females_internet)\n", "print('t({2:.0f}) = {0:.3f}, p = {1:.3f}'.format(*results))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look into some continuous variables. First of all, let us create one: We make a subset of our dataframe that contains only the media variables, apply the `.mean()` method to it (`axis = 1` means that we want to apply it row-wise), and then we assign the result of this to a new column in the original dataframe.\n", "\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "df['meanmedia'] = df[['radio','internet','newspaper','tv']].mean(axis=1)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#We can then plot this mean media usage (for news) by age, using a scatterplot, e.g.\n", "#Feel free to play around with the color parameters, and remember to use help commands to \n", "#find out more about formatting these plots.\n", "\n", "df.plot.scatter(x='age', y='meanmedia', c='blue')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are obviously many more possibilities here, including running a correlation between age and mean media use, for example, or using ANOVAs if you had more than 2 groups to compare, etc. We don't have time to show all of this to you in class, but remember that there are plenty of resources online, so you should just search away to find what you need. If you have problems understanding specific modules or commands you find online, you can approach us during our open lab sessions with questions as to how to apply these techniques to your own data story." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Before we finish, let's play around with some more graphics\n", "\n", "The seaborn library (which we imported at the beginning) offers a lot of cool stuff." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we'll make a simple correlation matrix of the four media in this dataset." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "corrmatrix = df[['internet','tv','radio','newspaper']].corr()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " internet tv radio newspaper\n", "internet 1.000000 0.121192 0.098797 -0.005689\n", "tv 0.121192 1.000000 0.270031 0.350694\n", "radio 0.098797 0.270031 1.000000 0.230926\n", "newspaper -0.005689 0.350694 0.230926 1.000000" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corrmatrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But think of ways that are more useful to display this to audiences, who may not want to deal with a correlation matrix. Heatmaps are one way to do this:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sns.heatmap(corrmatrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks okay, but is a bit redundant, so it would be great if we could sort of 'white out' the unnecessary (replicated) top triangle of the chart, and use colors that are more intuitive - usually darker means a stronger relationship in a heat map, right?\n", "\n", "Here, note that even Damian (EVEN DAMIAN!) can't reproduce all of this out of his head. But if you look around online, or use what we show you here and adapt it, you can do a lot of amazing graphics stuff." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.set(style=\"white\")\n", "# Build a boolean mask that hides the upper triangle (including the diagonal).\n", "# Note: use the builtin bool; the np.bool alias is deprecated since NumPy 1.20.\n", "mask = np.zeros_like(corrmatrix, dtype=bool)\n", "mask[np.triu_indices_from(mask)] = True\n", "cmap = sns.light_palette(\"red\",as_cmap=True)\n", "sns.heatmap(corrmatrix,mask=mask,cmap=cmap,vmin=0,vmax=.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## So...\n", "\n", "there are lots of possibilities here. Remember: Google is your friend!\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More (non-graded) homework :)\n", "\n", "Using the Iris dataset from last Wednesday, try the following:\n", "1. Describe the dataset\n", "2. Find the value counts of the 'species' column\n", "3. Describe the dataset for each of the species separately.\n", "4. Transpose the output of the previous command.\n", "5. Create side-by-side histograms of petal length for each species.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regardless of whether you were able to do that, here's a really cool graphic to show you. In this case, we're plotting petal width by petal length, with a different color for each species. This also uses the seaborn library (indicated by sns). Because of the nature of this dataset and the values within it, it works quite well." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "species setosa versicolor virginica\n", "sepal_length count 50.000000 50.000000 50.000000\n", " mean 5.006000 5.936000 6.588000\n", " std 0.352490 0.516171 0.635880\n", " min 4.300000 4.900000 4.900000\n", " 25% 4.800000 5.600000 6.225000\n", " 50% 5.000000 5.900000 6.500000\n", " 75% 5.200000 6.300000 6.900000\n", " max 5.800000 7.000000 7.900000\n", "sepal_width count 50.000000 50.000000 50.000000\n", " mean 3.428000 2.770000 2.974000\n", " std 0.379064 0.313798 0.322497\n", " min 2.300000 2.000000 2.200000\n", " 25% 3.200000 2.525000 2.800000\n", " 50% 3.400000 2.800000 3.000000\n", " 75% 3.675000 3.000000 3.175000\n", " max 4.400000 3.400000 3.800000\n", "petal_length count 50.000000 50.000000 50.000000\n", " mean 1.462000 4.260000 5.552000\n", " std 0.173664 0.469911 0.551895\n", " min 1.000000 3.000000 4.500000\n", " 25% 1.400000 4.000000 5.100000\n", " 50% 1.500000 4.350000 5.550000\n", " 75% 1.575000 4.600000 5.875000\n", " max 1.900000 5.100000 6.900000\n", "petal_width count 50.000000 50.000000 50.000000\n", " mean 0.246000 1.326000 2.026000\n", " std 0.105386 0.197753 0.274650\n", " min 0.100000 1.000000 1.400000\n", " 25% 0.200000 1.200000 1.800000\n", " 50% 0.200000 1.300000 2.000000\n", " 75% 0.300000 1.500000 2.300000\n", " max 0.600000 1.800000 2.500000" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.groupby('species').describe().T" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[,\n", " ],\n", " [, ]],\n", " dtype=object)" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "iris.hist(column='petal_length', by=['species'], figsize=(10,5))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.scatterplot(x=\"petal_width\", y=\"petal_length\", hue=\"species\", data=iris)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "## Appendix: Multivariate statistical analysis\n", "\n", "For those who are interested, here's a brief bit on multivariate analyses. We return to the same comparison of internet news use between men and women, and check whether the difference holds when we control for political interest. \n", "\n", "Before we can do that, however, we have to bring in another dataset and join it. You can access this dataset and save it from the following URL: https://raw.githubusercontent.com/damian0604/bdaca/master/ipynb/intpol.csv\n", "\n", "We'll talk more about aggregating/merging datasets in a later session, so for now just go with it." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# intpol=pd.read_csv('intpol.csv') # if you stored it locally \n", "intpol=pd.read_csv('https://raw.githubusercontent.com/damian0604/bdaca/master/ipynb/intpol.csv') # if reading it directly from the website" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "combined = df.join(intpol)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>gender</th><th>age</th><th>education</th><th>radio</th><th>newspaper</th><th>tv</th><th>internet</th><th>meanmedia</th><th>intpol</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>71</td><td>4.0</td><td>5</td><td>6</td><td>5</td><td>0</td><td>4.00</td><td>4</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>40</td><td>2.0</td><td>6</td><td>0</td><td>0</td><td>0</td><td>1.50</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>41</td><td>2.0</td><td>4</td><td>3</td><td>7</td><td>3</td><td>4.25</td><td>4</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>65</td><td>5.0</td><td>0</td><td>0</td><td>5</td><td>0</td><td>1.25</td><td>4</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>39</td><td>2.0</td><td>0</td><td>1</td><td>7</td><td>7</td><td>3.75</td><td>1</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>2076</th><td>0</td><td>49</td><td>5.0</td><td>3</td><td>6</td><td>6</td><td>0</td><td>3.75</td><td>5</td></tr>\n",
"    <tr><th>2077</th><td>0</td><td>51</td><td>4.0</td><td>7</td><td>7</td><td>5</td><td>5</td><td>6.00</td><td>6</td></tr>\n",
"    <tr><th>2078</th><td>1</td><td>31</td><td>6.0</td><td>3</td><td>5</td><td>5</td><td>6</td><td>4.75</td><td>7</td></tr>\n",
"    <tr><th>2079</th><td>0</td><td>58</td><td>6.0</td><td>3</td><td>3</td><td>1</td><td>0</td><td>1.75</td><td>3</td></tr>\n",
"    <tr><th>2080</th><td>1</td><td>21</td><td>3.0</td><td>2</td><td>6</td><td>6</td><td>4</td><td>4.50</td><td>5</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>2081 rows × 9 columns</p>\n",
"</div>
" ], "text/plain": [ " gender age education radio newspaper tv internet meanmedia \\\n", "0 1 71 4.0 5 6 5 0 4.00 \n", "1 1 40 2.0 6 0 0 0 1.50 \n", "2 1 41 2.0 4 3 7 3 4.25 \n", "3 0 65 5.0 0 0 5 0 1.25 \n", "4 0 39 2.0 0 1 7 7 3.75 \n", "... ... ... ... ... ... .. ... ... \n", "2076 0 49 5.0 3 6 6 0 3.75 \n", "2077 0 51 4.0 7 7 5 5 6.00 \n", "2078 1 31 6.0 3 5 5 6 4.75 \n", "2079 0 58 6.0 3 3 1 0 1.75 \n", "2080 1 21 3.0 2 6 6 4 4.50 \n", "\n", " intpol \n", "0 4 \n", "1 1 \n", "2 4 \n", "3 4 \n", "4 1 \n", "... ... \n", "2076 5 \n", "2077 6 \n", "2078 7 \n", "2079 3 \n", "2080 5 \n", "\n", "[2081 rows x 9 columns]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do an OLS regression. In order to do so, we need to define a model and then run it. When defining the model, you create the equation in the following manner:\n", "* First you include your dependent variable, followed by the ~ sign\n", "* Then you include the independent variables (separated by the + sign)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "m1 = smf.ols(formula='internet ~ age + gender + education', data=combined).fit()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
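<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr><th>Dep. Variable:</th><td>internet</td><th>R-squared:</th><td>0.085</td></tr>\n",
"<tr><th>Model:</th><td>OLS</td><th>Adj. R-squared:</th><td>0.084</td></tr>\n",
"<tr><th>Method:</th><td>Least Squares</td><th>F-statistic:</th><td>63.91</td></tr>\n",
"<tr><th>Date:</th><td>Fri, 05 Nov 2021</td><th>Prob (F-statistic):</th><td>1.65e-39</td></tr>\n",
"<tr><th>Time:</th><td>11:19:47</td><th>Log-Likelihood:</th><td>-4951.8</td></tr>\n",
"<tr><th>No. Observations:</th><td>2065</td><th>AIC:</th><td>9912.</td></tr>\n",
"<tr><th>Df Residuals:</th><td>2061</td><th>BIC:</th><td>9934.</td></tr>\n",
"<tr><th>Df Model:</th><td>3</td><th></th><td></td></tr>\n",
"<tr><th>Covariance Type:</th><td>nonrobust</td><th></th><td></td></tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr><td></td><th>coef</th><th>std err</th><th>t</th><th>P&gt;|t|</th><th>[0.025</th><th>0.975]</th></tr>\n",
"<tr><th>Intercept</th><td>1.1512</td><td>0.233</td><td>4.941</td><td>0.000</td><td>0.694</td><td>1.608</td></tr>\n",
"<tr><th>age</th><td>-0.0119</td><td>0.003</td><td>-3.675</td><td>0.000</td><td>-0.018</td><td>-0.006</td></tr>\n",
"<tr><th>gender</th><td>0.6224</td><td>0.118</td><td>5.283</td><td>0.000</td><td>0.391</td><td>0.853</td></tr>\n",
"<tr><th>education</th><td>0.4175</td><td>0.035</td><td>11.763</td><td>0.000</td><td>0.348</td><td>0.487</td></tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr><th>Omnibus:</th><td>976.493</td><th>Durbin-Watson:</th><td>1.978</td></tr>\n",
"<tr><th>Prob(Omnibus):</th><td>0.000</td><th>Jarque-Bera (JB):</th><td>187.053</td></tr>\n",
"<tr><th>Skew:</th><td>0.481</td><th>Prob(JB):</th><td>2.41e-41</td></tr>\n",
"<tr><th>Kurtosis:</th><td>1.882</td><th>Cond. No.</th><td>199.</td></tr>\n",
"</table><br/><br/>Notes:<br/>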
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: internet R-squared: 0.085\n", "Model: OLS Adj. R-squared: 0.084\n", "Method: Least Squares F-statistic: 63.91\n", "Date: Fri, 05 Nov 2021 Prob (F-statistic): 1.65e-39\n", "Time: 11:19:47 Log-Likelihood: -4951.8\n", "No. Observations: 2065 AIC: 9912.\n", "Df Residuals: 2061 BIC: 9934.\n", "Df Model: 3 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 1.1512 0.233 4.941 0.000 0.694 1.608\n", "age -0.0119 0.003 -3.675 0.000 -0.018 -0.006\n", "gender 0.6224 0.118 5.283 0.000 0.391 0.853\n", "education 0.4175 0.035 11.763 0.000 0.348 0.487\n", "==============================================================================\n", "Omnibus: 976.493 Durbin-Watson: 1.978\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 187.053\n", "Skew: 0.481 Prob(JB): 2.41e-41\n", "Kurtosis: 1.882 Cond. No. 199.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m1.summary()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
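<table class=\"simpletable\">\n",
"<caption>OLS Regression Results</caption>\n",
"<tr><th>Dep. Variable:</th><td>internet</td><th>R-squared:</th><td>0.100</td></tr>\n",
"<tr><th>Model:</th><td>OLS</td><th>Adj. R-squared:</th><td>0.099</td></tr>\n",
"<tr><th>Method:</th><td>Least Squares</td><th>F-statistic:</th><td>57.45</td></tr>\n",
"<tr><th>Date:</th><td>Fri, 05 Nov 2021</td><th>Prob (F-statistic):</th><td>5.12e-46</td></tr>\n",
"<tr><th>Time:</th><td>11:19:47</td><th>Log-Likelihood:</th><td>-4934.5</td></tr>\n",
"<tr><th>No. Observations:</th><td>2065</td><th>AIC:</th><td>9879.</td></tr>\n",
"<tr><th>Df Residuals:</th><td>2060</td><th>BIC:</th><td>9907.</td></tr>\n",
"<tr><th>Df Model:</th><td>4</td><th></th><td></td></tr>\n",
"<tr><th>Covariance Type:</th><td>nonrobust</td><th></th><td></td></tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr><td></td><th>coef</th><th>std err</th><th>t</th><th>P&gt;|t|</th><th>[0.025</th><th>0.975]</th></tr>\n",
"<tr><th>Intercept</th><td>1.0389</td><td>0.232</td><td>4.481</td><td>0.000</td><td>0.584</td><td>1.494</td></tr>\n",
"<tr><th>age</th><td>-0.0196</td><td>0.003</td><td>-5.642</td><td>0.000</td><td>-0.026</td><td>-0.013</td></tr>\n",
"<tr><th>gender</th><td>0.5212</td><td>0.118</td><td>4.413</td><td>0.000</td><td>0.290</td><td>0.753</td></tr>\n",
"<tr><th>education</th><td>0.3447</td><td>0.037</td><td>9.240</td><td>0.000</td><td>0.272</td><td>0.418</td></tr>\n",
"<tr><th>intpol</th><td>0.2230</td><td>0.038</td><td>5.910</td><td>0.000</td><td>0.149</td><td>0.297</td></tr>\n",
"</table>\n",
"<table class=\"simpletable\">\n",
"<tr><th>Omnibus:</th><td>763.270</td><th>Durbin-Watson:</th><td>1.972</td></tr>\n",
"<tr><th>Prob(Omnibus):</th><td>0.000</td><th>Jarque-Bera (JB):</th><td>179.407</td></tr>\n",
"<tr><th>Skew:</th><td>0.483</td><th>Prob(JB):</th><td>1.10e-39</td></tr>\n",
"<tr><th>Kurtosis:</th><td>1.926</td><th>Cond. No.</th><td>200.</td></tr>\n",
"</table><br/><br/>Notes:<br/>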
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: internet R-squared: 0.100\n", "Model: OLS Adj. R-squared: 0.099\n", "Method: Least Squares F-statistic: 57.45\n", "Date: Fri, 05 Nov 2021 Prob (F-statistic): 5.12e-46\n", "Time: 11:19:47 Log-Likelihood: -4934.5\n", "No. Observations: 2065 AIC: 9879.\n", "Df Residuals: 2060 BIC: 9907.\n", "Df Model: 4 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 1.0389 0.232 4.481 0.000 0.584 1.494\n", "age -0.0196 0.003 -5.642 0.000 -0.026 -0.013\n", "gender 0.5212 0.118 4.413 0.000 0.290 0.753\n", "education 0.3447 0.037 9.240 0.000 0.272 0.418\n", "intpol 0.2230 0.038 5.910 0.000 0.149 0.297\n", "==============================================================================\n", "Omnibus: 763.270 Durbin-Watson: 1.972\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 179.407\n", "Skew: 0.483 Prob(JB): 1.10e-39\n", "Kurtosis: 1.926 Cond. No. 
200.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m2 = smf.ols(formula='internet ~ age + gender + education + intpol', data=combined).fit()\n", "m2.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also run a likelihood-ratio test to see whether M2 fits better than M1 (in this case, it does). The output contains the test statistic, its p-value, and the difference in degrees of freedom between the two models.\n", "(see also http://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLSResults.compare_lr_test.html?highlight=compare_lr_test )\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(34.71733114293056, 3.8122257183791596e-09, 1.0)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m2.compare_lr_test(m1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hexplots\n", "\n", "We have seen scatterplots at work above. Scatterplots are a cool way to show the relationship between two variables, but they mainly work well if both variables can take many different values (say, the money people earn in exact Euros rather than in categories, or the time people spend on Facebook in exact minutes). However, if we have only a few possible values (such as the integers from 0 to 7, as in our examples above), the dots in the scatterplot will overlap, and an observation that occurs only once looks exactly like an observation that occurs 1000 times.\n", "\n", "A hexplot is very much like a scatterplot, but *the more observations overlap at the same (hexagon-shaped) place in the graph, the darker it gets.*\n", "\n", "To make it even more informative, we add histograms of the two variables in the margin, so that you can immediately get an idea of the distributions. 
This, again, helps us to understand whether there are just a few (very old, very young) people who behave in some way (no media at all, media every day), or whether it's a general pattern." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# pass x and y as keyword arguments; newer seaborn versions no longer accept them positionally\n", "sns.jointplot(x=combined['age'], y=combined['meanmedia'],\n", "              kind=\"hex\", color=\"#4CB391\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 4 }