import numpy as np
import pandas as pd
4 Tabular Data and Basic Data Operations
4.1 Introduction
This document demonstrates the use of the pandas
library in Python to do basic data wrangling and summarization.
If you do not have the pandas
library installed then you will need to run
pip install pandas
in the Jupyter terminal to install. Remember: you only need to install once per machine (or Colab session, for packages that don’t come pre-installed).
4.2 Reading Tabular Data into Python
We’re going to be exploring pandas
in the context of the famous Titanic dataset. We’ll work with a subset of this dataset, but more information about it all can be found here.
We start by loading the numpy
and pandas
libraries. Most of our data wrangling work will happen with functions from the pandas
library, but the numpy
library will be useful for performing certain mathematical operations should we choose to transform any of our data.
= "https://dlsun.github.io/pods/data/"
data_dir = pd.read_csv(data_dir + "titanic.csv") df_titanic
Data files of many different types and shapes can be read into Python with similar functions, but we will focus on tabular data.
4.2.1 Tidy Data is Special Tabular Data
For most people, the image that comes to mind when thinking about data is indeed something tabular or spreadsheet-like in nature. Which is great!
Tabular data is a form preferred by MANY different data operations and work. However, we will want to take this one step further. In almost all data science work we want our data to be tidy
A dataset is tidy if it adheres to following three characteristics:
Every column is a variable
Every row is an observation
Every cell is a single value
You may have noticed that plotnine
(ggplot
) is basically built to take tidy data. Variables are specified in the aesthetics function to map them (i.e. columns) in our dataset to plot elements. This type of behavior is EXTREMELY common among functions that work with data in all languages, and so the importance of getting our data into a tidy format cannot be overstated.
In Python, there are at least two quick ways to view a dataset we’ve read in:
df_titanic
name | gender | age | class | embarked | country | ticketno | fare | survived | |
---|---|---|---|---|---|---|---|---|---|
0 | Abbing, Mr. Anthony | male | 42.0 | 3rd | S | United States | 5547.0 | 7.11 | 0 |
1 | Abbott, Mr. Eugene Joseph | male | 13.0 | 3rd | S | United States | 2673.0 | 20.05 | 0 |
2 | Abbott, Mr. Rossmore Edward | male | 16.0 | 3rd | S | United States | 2673.0 | 20.05 | 0 |
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | S | England | 2673.0 | 20.05 | 1 |
4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | S | Norway | 348125.0 | 7.13 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2202 | Wynn, Mr. Walter | male | 41.0 | deck crew | B | England | NaN | NaN | 1 |
2203 | Yearsley, Mr. Harry | male | 40.0 | victualling crew | S | England | NaN | NaN | 1 |
2204 | Young, Mr. Francis James | male | 32.0 | engineering crew | S | England | NaN | NaN | 0 |
2205 | Zanetti, Sig. Minio | male | 20.0 | restaurant staff | S | England | NaN | NaN | 0 |
2206 | Zarracchi, Sig. L. | male | 26.0 | restaurant staff | S | England | NaN | NaN | 0 |
2207 rows × 9 columns
df_titanic.head()
name | gender | age | class | embarked | country | ticketno | fare | survived | |
---|---|---|---|---|---|---|---|---|---|
0 | Abbing, Mr. Anthony | male | 42.0 | 3rd | S | United States | 5547.0 | 7.11 | 0 |
1 | Abbott, Mr. Eugene Joseph | male | 13.0 | 3rd | S | United States | 2673.0 | 20.05 | 0 |
2 | Abbott, Mr. Rossmore Edward | male | 16.0 | 3rd | S | United States | 2673.0 | 20.05 | 0 |
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | S | England | 2673.0 | 20.05 | 1 |
4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | S | Norway | 348125.0 | 7.13 | 1 |
The latter (.head()
) is usually preferred in case the dataset is large.
4.3 The “Big Five” Verbs of Data Wrangling
Data wrangling can involve a lot of different steps and operations to get data into a tidy format and ready for analysis and visualization. The vast majority of these fall under the umbrella one the following five operations:
Select columns/variables of interest
Filter rows/observations of interest
Arrange the rows of a dataset by column(s) of interest (i.e. order or sort)
Mutate the columns of a dataset (i.e. create or transform variables)
Summarize the rows of a dataset for column(s) of interest
4.3.1 Select Columns/Variables
Suppose we want to select the age
variable from the titanic DataFrame
. There are three ways to do this.
- Use
.loc
, specifying both the rows and columns. (The colon:
is Python shorthand for “all”.)
"age"] df_titanic.loc[:,
- Access the column as you would a key in a
dict
.
"age"] df_titanic[
- Access the column as an attribute of the
DataFrame
.
df_titanic.age
Method 3 (attribute access) is the most concise. However, it does not work if the variable name contains spaces or special characters, begins with a number, or matches an existing attribute of the DataFrame
. So, methods 1 and 2 are usually safer and preferred.
To select multiple columns, you would pass in a list of variable names, instead of a single variable name. For example, to select both age
and fare
, either of the two methods below would work (and produce the same result):
# Method 1
"age", "fare"]].head()
df_titanic.loc[:, [
# Method 2
"age", "fare"]].head() df_titanic[[
4.3.2 Filter Rows/Observations
4.3.2.1 Selecting Rows/Observations by Location
Before we see how to filter (i.e. subset) the rows of dataset based on some condition, let’s see how to select rows by explicitly identifying them.
We can select a row by its position using the .iloc
attribute. Keeping in mind that the first row is actually row 0, the fourth row could be extracted as:
3] df_titanic.iloc[
name Abbott, Mrs. Rhoda Mary 'Rosa'
gender female
age 39.0
class 3rd
embarked S
country England
ticketno 2673.0
fare 20.05
survived 1
Name: 3, dtype: object
Notice that a single row from a DataFrame
is no longer a DataFrame
but a different data structure, called a Series
.
We can also select multiple rows by passing a list of positions to .iloc
.
1, 3]] df_titanic.iloc[[
name | gender | age | class | embarked | country | ticketno | fare | survived | |
---|---|---|---|---|---|---|---|---|---|
1 | Abbott, Mr. Eugene Joseph | male | 13.0 | 3rd | S | United States | 2673.0 | 20.05 | 0 |
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | S | England | 2673.0 | 20.05 | 1 |
Notice that when we select multiple rows, we get a DataFrame
back.
So a Series
is used to store a single observation (across multiple variables), while a DataFrame
is used to store multiple observations (across multiple variables).
If selecting consecutive rows, we can use Python’s slice
notation. For example, the code below selects all rows from the fourth row, up to (but not including) the tenth row.
3:9] df_titanic.iloc[
name | gender | age | class | embarked | country | ticketno | fare | survived | |
---|---|---|---|---|---|---|---|---|---|
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | S | England | 2673.0 | 20.0500 | 1 |
4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | S | Norway | 348125.0 | 7.1300 | 1 |
5 | Abelseth, Mr. Olaus Jørgensen | male | 25.0 | 3rd | S | United States | 348122.0 | 7.1300 | 1 |
6 | Abelson, Mr. Samuel | male | 30.0 | 2nd | C | France | 3381.0 | 24.0000 | 0 |
7 | Abelson, Mrs. Hannah | female | 28.0 | 2nd | C | France | 3381.0 | 24.0000 | 1 |
8 | Abī-Al-Munà, Mr. Nāsīf Qāsim | male | 27.0 | 3rd | C | Lebanon | 2699.0 | 18.1509 | 1 |
4.3.2.2 Selecting Rows/Observations by Condition
We’ll often want to filter or subset the rows of a dataset based on some condition. To do this we’ll take advantage of vectorization and boolean masking.
Recall that we can compare the values of a variable/column to a particular value in the following way, and observe the result.
"age"] > 30 df_titanic[
0 True
1 False
2 False
3 True
4 False
...
2202 True
2203 True
2204 True
2205 False
2206 False
Name: age, Length: 2207, dtype: bool
We can use these True
and False
values to filter/subset the dataset! The following subsets the titanic dataset down to only those individuals (rows) with ages over 30.
"age"] > 30] df_titanic[df_titanic[
name | gender | age | class | embarked | country | ticketno | fare | survived | |
---|---|---|---|---|---|---|---|---|---|
0 | Abbing, Mr. Anthony | male | 42.0 | 3rd | S | United States | 5547.0 | 7.1100 | 0 |
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | S | England | 2673.0 | 20.0500 | 1 |
12 | Ahlin, Mrs. Johanna Persdotter | female | 40.0 | 3rd | S | Sweden | 7546.0 | 9.0906 | 0 |
15 | Aldworth, Mr. Augustus Henry | male | 35.0 | 2nd | S | England | 248744.0 | 13.0000 | 0 |
21 | Allen, Mr. William Henry | male | 39.0 | 3rd | S | England | 373450.0 | 8.0100 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2197 | Worthman, Mr. William Henry | male | 37.0 | engineering crew | S | England | NaN | NaN | 0 |
2200 | Wright, Mr. William | male | 40.0 | victualling crew | S | England | NaN | NaN | 1 |
2202 | Wynn, Mr. Walter | male | 41.0 | deck crew | B | England | NaN | NaN | 1 |
2203 | Yearsley, Mr. Harry | male | 40.0 | victualling crew | S | England | NaN | NaN | 1 |
2204 | Young, Mr. Francis James | male | 32.0 | engineering crew | S | England | NaN | NaN | 0 |
984 rows × 9 columns
We can combine multiple conditions using &
(and) and |
(or). The following subsets the titanic dataset down to females over 30 years of age.
"age"] > 30) & (df_titanic["gender"] == "female")] df_titanic[(df_titanic[
name | gender | age | class | embarked | country | ticketno | fare | survived | |
---|---|---|---|---|---|---|---|---|---|
3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | S | England | 2673.0 | 20.0500 | 1 |
12 | Ahlin, Mrs. Johanna Persdotter | female | 40.0 | 3rd | S | Sweden | 7546.0 | 9.0906 | 0 |
35 | Andersson, Miss. Ida Augusta Margareta | female | 38.0 | 3rd | S | Sweden | 347091.0 | 7.1506 | 0 |
40 | Andersson, Mrs. Alfrida Konstantia Brogren | female | 39.0 | 3rd | S | Sweden | 347082.0 | 31.0506 | 0 |
44 | Andrews, Miss. Kornelia Theodosia | female | 62.0 | 1st | C | United States | 13502.0 | 77.1902 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1997 | Robinson, Mrs. Annie | female | 41.0 | victualling crew | S | England | NaN | NaN | 1 |
2059 | Smith, Miss. Katherine Elizabeth | female | 45.0 | victualling crew | S | England | NaN | NaN | 1 |
2076 | Stap, Miss. Sarah Agnes | female | 47.0 | victualling crew | S | England | NaN | NaN | 1 |
2143 | Wallis, Mrs. Catherine Jane | female | 36.0 | victualling crew | S | England | NaN | NaN | 0 |
2145 | Walsh, Miss. Catherine | female | 32.0 | victualling crew | S | Ireland | NaN | NaN | 0 |
206 rows × 9 columns
4.3.3 Arrange Rows
As part of exploratory data analysis and some reporting efforts, we will want to sort a dataset or set of results by one or more variables of interest.
We can do this with .sort_values
in either ascending or descending order.
The following sorts the titanic dataset by age
in decreasing order.
= ["age"], ascending=False) df_titanic.sort_values(by
name | gender | age | class | embarked | country | ticketno | fare | survived | |
---|---|---|---|---|---|---|---|---|---|
1176 | Svensson, Mr. Johan | male | 74.000000 | 3rd | S | Sweden | 347060.0 | 7.1506 | 0 |
820 | Mitchell, Mr. Henry Michael | male | 72.000000 | 2nd | S | England | 24580.0 | 10.1000 | 0 |
53 | Artagaveytia, Mr. Ramon | male | 71.000000 | 1st | C | Argentina | 17609.0 | 49.1001 | 0 |
456 | Goldschmidt, Mr. George B. | male | 71.000000 | 1st | C | United States | 17754.0 | 34.1301 | 0 |
282 | Crosby, Captain. Edward Gifford | male | 70.000000 | 1st | S | United States | 5735.0 | 71.0000 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1182 | Tannūs, Master. As'ad | male | 0.416667 | 3rd | C | Lebanon | 2625.0 | 8.1004 | 1 |
296 | Danbom, Master. Gilbert Sigvard Emanuel | male | 0.333333 | 3rd | S | Sweden | 347080.0 | 14.0800 | 0 |
316 | Dean, Miss. Elizabeth Gladys 'Millvina' | female | 0.166667 | 3rd | S | England | 2315.0 | 20.1106 | 1 |
439 | Gheorgheff, Mr. Stanio | male | NaN | 3rd | C | Bulgaria | 349254.0 | 7.1711 | 0 |
677 | Kraeff, Mr. Theodor | male | NaN | 3rd | C | Bulgaria | 349253.0 | 7.1711 | 0 |
2207 rows × 9 columns
Notice that in these last few sections, we have not made any permanent changes to the df_titanic
object. We have only asked python do some selecting/filtering/sorting and then to print out the results, not save them.
If we wanted df_titanic
to become permanently sorted by age, we would re-assign the object:
= df_titanic.sort_values(by = ["age"], ascending=False) df_titanic
You will sometimes see object reassignment happen in a different way, using an inplace = True
argument, like this:
= ["age"], ascending=False, inplace=True) df_titanic.sort_values(by
We strongly recommend against this approach, for two reason:
When an object is “overwritten” via reassignment, that’s a major decision; you lose the old version of the object. It should be made deliberately and obviously. The
inplace
argument is easy to miss when copying/editing code, so it can lead to accidental overwriting that is hard to keep track of.Not all functions of DataFrames have an
inplace
option. It can be frustrating to get into the habit of using it, only to find out the hard way that it’s not available half the time!
4.3.4 Mutate Column(s)
The variables available to us in our original dataset contain all of the information we have access to, but the best insights may instead come from transformations of those variables.
4.3.4.1 Transforming Quantitative Variables
One of the simplest reasons to want to transform a quantitative variable is to change the measurement units.
Here we change the age
of passengers from a value in years to a value in decades.
"age"] = df_titanic["age"] / 10 df_titanic[
If we have a quantitative variable that is particularly skewed, then it might be a good idea to transform the values of that variable…like taking the log
of the values.
This was a strategy you saw employed with the GapMinder data!
Below is an example of taking the log
of the fare
variable. Notice that we’re making use of the numpy
here to take the log
.
"fare"] = np.log(df_titanic["fare"]) df_titanic[
Remember that we can take advantage of vectorization here too. The following operation wouldn’t really make physical sense, but it’s an example of creating a new variable out of existing variables.
"nonsense"] = df_titanic["fare"] / df_titanic["age"] df_titanic[
Note that we created the new variable, nonsense
, by specifying on the left side of the =
here and populating that column/variable via the expression on the right side of the =
.
We could want to create a new variable by categorizing (or discretizing) the values of a quantitative variable (i.e. convert a quantitative variable to a categorical variable). We can do so with cut
.
In the following, we create a new age_cat
variable which represents whether a person is a child or an adult.
"age_cat"] = pd.cut(df_titanic["age"],
df_titanic[= [0, 18, 100],
bins = ["child", "adult"]) labels
4.3.4.2 Transforming Categorical Variables
In some situations, especially later with modeling, we’ll need to convert categorical variables (stored as text) into quantitative (often coded) variables. Binary categorical variables can be converted into quantitative variables by coding one category as 1 and the other category as 0. (In fact, the survived column in the titanic dataset has already been coded this way.) The easiest way to do this is to create a boolean mask. For example, to convert gender
to a quantitative variable female
, which is 1 if the passenger was female and 0 otherwise, we can do the following:
"female"] = 1 * (df_titanic["gender"] == "female") df_titanic[
What do we do about a categorical variable with more than twwo categories, like embarked
, which has four categories? In general, a categorical variable with K categories can be converted into K separate 0/1 variables, or dummy variables. Each of the K dummy variables is an indicator for one of the K categories. That is, a dummy variable is 1 if the observation fell into its particular category and 0 otherwise.
Although it is not difficult to create dummy variables manually, the easiest way to create them is the get_dummies()
function in pandas
.
"embarked"]) pd.get_dummies(df_titanic[
B | C | Q | S | |
---|---|---|---|---|
1176 | False | False | False | True |
820 | False | False | False | True |
53 | False | True | False | False |
456 | False | True | False | False |
282 | False | False | False | True |
... | ... | ... | ... | ... |
1182 | False | True | False | False |
296 | False | False | False | True |
316 | False | False | False | True |
439 | False | True | False | False |
677 | False | True | False | False |
2207 rows × 4 columns
We may also want to change the levels of a categorical variable. A categorical variable can be transformed by mapping its levels to new levels. For example, we may only be interested in whether a person on the titanic was a passenger or a crew member. The variable class
is too detailed. We can create a new variable, type
, that is derived from the existing variable class
. Observations with a class
of “1st”, “2nd”, or “3rd” get a value of “passenger”, while observations with a class
of “victualling crew”, “engineering crew”, or “deck crew” get a value of “crew”.
"type"] = df_titanic["class"].map({
df_titanic["1st": "passenger",
"2nd": "passenger",
"3rd": "passenger",
"victualling crew": "crew",
"engineering crew": "crew",
"deck crew": "crew"
})
df_titanic
name | gender | age | class | embarked | country | ticketno | fare | survived | nonsense | age_cat | female | type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1176 | Svensson, Mr. Johan | male | 7.400000 | 3rd | S | Sweden | 347060.0 | 1.967196 | 0 | 0.265837 | child | 0 | passenger |
820 | Mitchell, Mr. Henry Michael | male | 7.200000 | 2nd | S | England | 24580.0 | 2.312535 | 0 | 0.321185 | child | 0 | passenger |
53 | Artagaveytia, Mr. Ramon | male | 7.100000 | 1st | C | Argentina | 17609.0 | 3.893861 | 0 | 0.548431 | child | 0 | passenger |
456 | Goldschmidt, Mr. George B. | male | 7.100000 | 1st | C | United States | 17754.0 | 3.530180 | 0 | 0.497208 | child | 0 | passenger |
282 | Crosby, Captain. Edward Gifford | male | 7.000000 | 1st | S | United States | 5735.0 | 4.262680 | 0 | 0.608954 | child | 0 | passenger |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1182 | Tannūs, Master. As'ad | male | 0.041667 | 3rd | C | Lebanon | 2625.0 | 2.091913 | 1 | 50.205923 | child | 0 | passenger |
296 | Danbom, Master. Gilbert Sigvard Emanuel | male | 0.033333 | 3rd | S | Sweden | 347080.0 | 2.644755 | 0 | 79.342661 | child | 0 | passenger |
316 | Dean, Miss. Elizabeth Gladys 'Millvina' | female | 0.016667 | 3rd | S | England | 2315.0 | 3.001247 | 1 | 180.074822 | child | 1 | passenger |
439 | Gheorgheff, Mr. Stanio | male | NaN | 3rd | C | Bulgaria | 349254.0 | 1.970059 | 0 | NaN | NaN | 0 | passenger |
677 | Kraeff, Mr. Theodor | male | NaN | 3rd | C | Bulgaria | 349253.0 | 1.970059 | 0 | NaN | NaN | 0 | passenger |
2207 rows × 13 columns
4.3.5 Summarizing Rows
Summarization of the rows of a dataset for column(s) of interest can take many different forms. This introduction will not be exhaustive, but certainly cover the basics.
4.3.5.1 Summarizing a Quantitative Variable
There are a few descriptive statistics that can be computed directly including, but not limited to, the mean and median.
"age"].mean()
df_titanic[
"age"].median()
df_titanic[
"age", "fare"]].mean() df_titanic[[
age 3.043673
fare 2.918311
dtype: float64
We can ask for a slightly more comprehensive description using .describe()
"age"].describe()
df_titanic[
df_titanic.describe()
age | ticketno | fare | survived | nonsense | female | |
---|---|---|---|---|---|---|
count | 2205.000000 | 1.316000e+03 | 1291.000000 | 2207.000000 | 1289.000000 | 2207.000000 |
mean | 3.043673 | 2.842157e+05 | 2.918311 | 0.322157 | 2.147877 | 0.221568 |
std | 1.215968 | 6.334726e+05 | 0.974452 | 0.467409 | 7.237694 | 0.415396 |
min | 0.016667 | 2.000000e+00 | 1.108728 | 0.000000 | 0.265837 | 0.000000 |
25% | 2.200000 | 1.426225e+04 | 1.971383 | 0.000000 | 0.742371 | 0.000000 |
50% | 2.900000 | 1.114265e+05 | 2.645480 | 0.000000 | 0.936833 | 0.000000 |
75% | 3.800000 | 3.470770e+05 | 3.435945 | 1.000000 | 1.260935 | 0.000000 |
max | 7.400000 | 3.101317e+06 | 6.238443 | 1.000000 | 180.074822 | 1.000000 |
Note that, by default, .describe()
provides descriptive statistics for only the quantitative variables in the dataset.
We can enhance numerical summaries with .groupby()
, which allows us to specify one or more variables that we’d like to group our work by.
"age", "survived"]].groupby("survived").mean() df_titanic[[
age | |
---|---|
survived | |
0 | 3.083194 |
1 | 2.960631 |
4.3.5.2 Summarizing a Categorical Variable
When it comes to categorical variables we’re most often interested in frequency distributions (counts), relative frequency distributions, and cross-tabulations.
"class"].unique()
df_titanic[
"class"].describe() df_titanic[
count 2207
unique 7
top 3rd
freq 709
Name: class, dtype: object
The .unique()
here allows us to see the unique values of the class
variable. Notice that the results of .describe()
on a categorical variable are much different.
To completely summarize a single categorical variable, we report the number of times each level appeared, or its frequency.
"class"].value_counts() df_titanic[
class
3rd 709
victualling crew 431
1st 324
engineering crew 324
2nd 284
restaurant staff 69
deck crew 66
Name: count, dtype: int64
Instead of reporting counts, we can also report proportions or probabilities, or the relative frequencies. We can calculate the relative frequencies by specifying normalize=True
in .value_counts()
.
"class"].value_counts(normalize=True) df_titanic[
class
3rd 0.321251
victualling crew 0.195288
1st 0.146806
engineering crew 0.146806
2nd 0.128681
restaurant staff 0.031264
deck crew 0.029905
Name: proportion, dtype: float64
Cross-tabulations are one way we can investigate possible relationships between categorical variables. For example, what can we say about the relationship between gender
and survival
on the Titanic?
This does not tell us how gender
interacts with survival
. To do that, we need to produce a cross-tabulation, or a “cross-tab” for short. (Statisticians tend to call this a contingency table or a two-way table.)
"survived"], df_titanic["gender"]) pd.crosstab(df_titanic[
gender | female | male |
---|---|---|
survived | ||
0 | 130 | 1366 |
1 | 359 | 352 |
A cross-tabulation of two categorical variables is a two-dimensional array, with the levels of one variable along the rows and the levels of the other variable along the columns. Each cell in this array contains the number of observations that had a particular combination of levels. So in the Titanic data set, there were 359 females who survived and 1366 males who died. From the cross-tabulation, we can see that there were more females who survived than not, while there were more males who died than not. Clearly, gender had a strong influence on survival because of the Titanic’s policy of “women and children first”.
To get probabilities instead of counts, we specify normalize=True
.
"survived"], df_titanic["gender"], normalize=True) pd.crosstab(df_titanic[
gender | female | male |
---|---|---|
survived | ||
0 | 0.058903 | 0.618940 |
1 | 0.162664 | 0.159493 |