4 Tabular Data and Basic Data Operations

4.1 Introduction

This document demonstrates the use of the pandas library in Python to do basic data wrangling and summarization.

Note

If you do not have the pandas library installed then you will need to run

pip install pandas

in the Jupyter terminal to install. Remember: you only need to install once per machine (or Colab session, for packages that don’t come pre-installed).

4.2 Reading Tabular Data into Python

We’re going to be exploring pandas in the context of the famous Titanic dataset. We’ll work with a subset of this dataset, but more information about it all can be found here.

We start by loading the numpy and pandas libraries. Most of our data wrangling work will happen with functions from the pandas library, but the numpy library will be useful for performing certain mathematical operations should we choose to transform any of our data.

import numpy as np
import pandas as pd

data_dir = "https://dlsun.github.io/pods/data/"
df_titanic = pd.read_csv(data_dir + "titanic.csv")

Example

We’ve already seen read_csv() used many times for importing CSV files into Python, but it bears repeating.

Data files of many different types and shapes can be read into Python with similar functions, but we will focus on tabular data.

4.2.1 Tidy Data is Special Tabular Data

For most people, the image that comes to mind when thinking about data is indeed something tabular or spreadsheet-like in nature. Which is great!

Tabular data is a form preferred by MANY different data operations and work. However, we will want to take this one step further. In almost all data science work we want our data to be tidy

Note

A dataset is tidy if it adheres to following three characteristics:

Every column is a variable
Every row is an observation
Every cell is a single value

Check In

With 2-3 people around you, navigate to the GapMinder Data site and download a single CSV file of your choice. Open it up in Excel or your application of choice. Is this dataset tidy? If not, then what would have to change to make it tidy?

Learn More

The term “tidy data” was first popularized in this paper by R developer Hadley Wickham.

You may have noticed that plotnine (ggplot) is basically built to take tidy data. Variables are specified in the aesthetics function to map them (i.e. columns) in our dataset to plot elements. This type of behavior is EXTREMELY common among functions that work with data in all languages, and so the importance of getting our data into a tidy format cannot be overstated.

In Python, there are at least two quick ways to view a dataset we’ve read in:

df_titanic

	name	gender	age	class	embarked	country	ticketno	fare	survived
0	Abbing, Mr. Anthony	male	42.0	3rd	S	United States	5547.0	7.11	0
1	Abbott, Mr. Eugene Joseph	male	13.0	3rd	S	United States	2673.0	20.05	0
2	Abbott, Mr. Rossmore Edward	male	16.0	3rd	S	United States	2673.0	20.05	0
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	S	England	2673.0	20.05	1
4	Abelseth, Miss. Karen Marie	female	16.0	3rd	S	Norway	348125.0	7.13	1
...	...	...	...	...	...	...	...	...	...
2202	Wynn, Mr. Walter	male	41.0	deck crew	B	England	NaN	NaN	1
2203	Yearsley, Mr. Harry	male	40.0	victualling crew	S	England	NaN	NaN	1
2204	Young, Mr. Francis James	male	32.0	engineering crew	S	England	NaN	NaN	0
2205	Zanetti, Sig. Minio	male	20.0	restaurant staff	S	England	NaN	NaN	0
2206	Zarracchi, Sig. L.	male	26.0	restaurant staff	S	England	NaN	NaN	0

2207 rows × 9 columns

df_titanic.head()

	name	gender	age	class	embarked	country	ticketno	fare	survived
0	Abbing, Mr. Anthony	male	42.0	3rd	S	United States	5547.0	7.11	0
1	Abbott, Mr. Eugene Joseph	male	13.0	3rd	S	United States	2673.0	20.05	0
2	Abbott, Mr. Rossmore Edward	male	16.0	3rd	S	United States	2673.0	20.05	0
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	S	England	2673.0	20.05	1
4	Abelseth, Miss. Karen Marie	female	16.0	3rd	S	Norway	348125.0	7.13	1

The latter (.head()) is usually preferred in case the dataset is large.

Check In

Does the titanic dataset appear to be in tidy format?

4.3 The “Big Five” Verbs of Data Wrangling

Data wrangling can involve a lot of different steps and operations to get data into a tidy format and ready for analysis and visualization. The vast majority of these fall under the umbrella one the following five operations:

Select columns/variables of interest
Filter rows/observations of interest
Arrange the rows of a dataset by column(s) of interest (i.e. order or sort)
Mutate the columns of a dataset (i.e. create or transform variables)
Summarize the rows of a dataset for column(s) of interest

4.3.1 Select Columns/Variables

Suppose we want to select the age variable from the titanic DataFrame. There are three ways to do this.

Use .loc, specifying both the rows and columns. (The colon : is Python shorthand for “all”.)

df_titanic.loc[:, "age"]

Access the column as you would a key in a dict.

df_titanic["age"]

Access the column as an attribute of the DataFrame.

df_titanic.age

Method 3 (attribute access) is the most concise. However, it does not work if the variable name contains spaces or special characters, begins with a number, or matches an existing attribute of the DataFrame. So, methods 1 and 2 are usually safer and preferred.

To select multiple columns, you would pass in a list of variable names, instead of a single variable name. For example, to select both age and fare, either of the two methods below would work (and produce the same result):

# Method 1
df_titanic.loc[:, ["age", "fare"]].head()

# Method 2
df_titanic[["age", "fare"]].head()

4.3.2 Filter Rows/Observations

4.3.2.1 Selecting Rows/Observations by Location

Before we see how to filter (i.e. subset) the rows of dataset based on some condition, let’s see how to select rows by explicitly identifying them.

We can select a row by its position using the .iloc attribute. Keeping in mind that the first row is actually row 0, the fourth row could be extracted as:

df_titanic.iloc[3]

name        Abbott, Mrs. Rhoda Mary 'Rosa'
gender                              female
age                                   39.0
class                                  3rd
embarked                                 S
country                            England
ticketno                            2673.0
fare                                 20.05
survived                                 1
Name: 3, dtype: object

Notice that a single row from a DataFrame is no longer a DataFrame but a different data structure, called a Series.

We can also select multiple rows by passing a list of positions to .iloc.

df_titanic.iloc[[1, 3]]

	name	gender	age	class	embarked	country	ticketno	fare	survived
1	Abbott, Mr. Eugene Joseph	male	13.0	3rd	S	United States	2673.0	20.05	0
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	S	England	2673.0	20.05	1

Notice that when we select multiple rows, we get a DataFrame back.

So a Series is used to store a single observation (across multiple variables), while a DataFrame is used to store multiple observations (across multiple variables).

If selecting consecutive rows, we can use Python’s slice notation. For example, the code below selects all rows from the fourth row, up to (but not including) the tenth row.

df_titanic.iloc[3:9]

	name	gender	age	class	embarked	country	ticketno	fare	survived
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	S	England	2673.0	20.0500	1
4	Abelseth, Miss. Karen Marie	female	16.0	3rd	S	Norway	348125.0	7.1300	1
5	Abelseth, Mr. Olaus Jørgensen	male	25.0	3rd	S	United States	348122.0	7.1300	1
6	Abelson, Mr. Samuel	male	30.0	2nd	C	France	3381.0	24.0000	0
7	Abelson, Mrs. Hannah	female	28.0	2nd	C	France	3381.0	24.0000	1
8	Abī-Al-Munà, Mr. Nāsīf Qāsim	male	27.0	3rd	C	Lebanon	2699.0	18.1509	1

4.3.2.2 Selecting Rows/Observations by Condition

We’ll often want to filter or subset the rows of a dataset based on some condition. To do this we’ll take advantage of vectorization and boolean masking.

Recall that we can compare the values of a variable/column to a particular value in the following way, and observe the result.

df_titanic["age"] > 30

0        True
1       False
2       False
3        True
4       False
        ...  
2202     True
2203     True
2204     True
2205    False
2206    False
Name: age, Length: 2207, dtype: bool

We can use these True and False values to filter/subset the dataset! The following subsets the titanic dataset down to only those individuals (rows) with ages over 30.

df_titanic[df_titanic["age"] > 30]

	name	gender	age	class	embarked	country	ticketno	fare	survived
0	Abbing, Mr. Anthony	male	42.0	3rd	S	United States	5547.0	7.1100	0
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	S	England	2673.0	20.0500	1
12	Ahlin, Mrs. Johanna Persdotter	female	40.0	3rd	S	Sweden	7546.0	9.0906	0
15	Aldworth, Mr. Augustus Henry	male	35.0	2nd	S	England	248744.0	13.0000	0
21	Allen, Mr. William Henry	male	39.0	3rd	S	England	373450.0	8.0100	0
...	...	...	...	...	...	...	...	...	...
2197	Worthman, Mr. William Henry	male	37.0	engineering crew	S	England	NaN	NaN	0
2200	Wright, Mr. William	male	40.0	victualling crew	S	England	NaN	NaN	1
2202	Wynn, Mr. Walter	male	41.0	deck crew	B	England	NaN	NaN	1
2203	Yearsley, Mr. Harry	male	40.0	victualling crew	S	England	NaN	NaN	1
2204	Young, Mr. Francis James	male	32.0	engineering crew	S	England	NaN	NaN	0

984 rows × 9 columns

We can combine multiple conditions using & (and) and | (or). The following subsets the titanic dataset down to females over 30 years of age.

df_titanic[(df_titanic["age"] > 30) & (df_titanic["gender"] == "female")]

	name	gender	age	class	embarked	country	ticketno	fare	survived
3	Abbott, Mrs. Rhoda Mary 'Rosa'	female	39.0	3rd	S	England	2673.0	20.0500	1
12	Ahlin, Mrs. Johanna Persdotter	female	40.0	3rd	S	Sweden	7546.0	9.0906	0
35	Andersson, Miss. Ida Augusta Margareta	female	38.0	3rd	S	Sweden	347091.0	7.1506	0
40	Andersson, Mrs. Alfrida Konstantia Brogren	female	39.0	3rd	S	Sweden	347082.0	31.0506	0
44	Andrews, Miss. Kornelia Theodosia	female	62.0	1st	C	United States	13502.0	77.1902	1
...	...	...	...	...	...	...	...	...	...
1997	Robinson, Mrs. Annie	female	41.0	victualling crew	S	England	NaN	NaN	1
2059	Smith, Miss. Katherine Elizabeth	female	45.0	victualling crew	S	England	NaN	NaN	1
2076	Stap, Miss. Sarah Agnes	female	47.0	victualling crew	S	England	NaN	NaN	1
2143	Wallis, Mrs. Catherine Jane	female	36.0	victualling crew	S	England	NaN	NaN	0
2145	Walsh, Miss. Catherine	female	32.0	victualling crew	S	Ireland	NaN	NaN	0

206 rows × 9 columns

Check In

With the 2-3 people around you, how would you find the just the names of the males under 20 years of age who survived (in the titanic dataset) with a single line of code?

4.3.3 Arrange Rows

As part of exploratory data analysis and some reporting efforts, we will want to sort a dataset or set of results by one or more variables of interest.

We can do this with .sort_values in either ascending or descending order.

The following sorts the titanic dataset by age in decreasing order.

df_titanic.sort_values(by = ["age"], ascending=False)

	name	gender	age	class	embarked	country	ticketno	fare	survived
1176	Svensson, Mr. Johan	male	74.000000	3rd	S	Sweden	347060.0	7.1506	0
820	Mitchell, Mr. Henry Michael	male	72.000000	2nd	S	England	24580.0	10.1000	0
53	Artagaveytia, Mr. Ramon	male	71.000000	1st	C	Argentina	17609.0	49.1001	0
456	Goldschmidt, Mr. George B.	male	71.000000	1st	C	United States	17754.0	34.1301	0
282	Crosby, Captain. Edward Gifford	male	70.000000	1st	S	United States	5735.0	71.0000	0
...	...	...	...	...	...	...	...	...	...
1182	Tannūs, Master. As'ad	male	0.416667	3rd	C	Lebanon	2625.0	8.1004	1
296	Danbom, Master. Gilbert Sigvard Emanuel	male	0.333333	3rd	S	Sweden	347080.0	14.0800	0
316	Dean, Miss. Elizabeth Gladys 'Millvina'	female	0.166667	3rd	S	England	2315.0	20.1106	1
439	Gheorgheff, Mr. Stanio	male	NaN	3rd	C	Bulgaria	349254.0	7.1711	0
677	Kraeff, Mr. Theodor	male	NaN	3rd	C	Bulgaria	349253.0	7.1711	0

2207 rows × 9 columns

Note

Notice that in these last few sections, we have not made any permanent changes to the df_titanic object. We have only asked python do some selecting/filtering/sorting and then to print out the results, not save them.

If we wanted df_titanic to become permanently sorted by age, we would re-assign the object:

df_titanic = df_titanic.sort_values(by = ["age"], ascending=False)

Warning

You will sometimes see object reassignment happen in a different way, using an inplace = True argument, like this:

df_titanic.sort_values(by = ["age"], ascending=False, inplace=True)

We strongly recommend against this approach, for two reason:

When an object is “overwritten” via reassignment, that’s a major decision; you lose the old version of the object. It should be made deliberately and obviously. The inplace argument is easy to miss when copying/editing code, so it can lead to accidental overwriting that is hard to keep track of.
Not all functions of DataFrames have an inplace option. It can be frustrating to get into the habit of using it, only to find out the hard way that it’s not available half the time!

4.3.4 Mutate Column(s)

The variables available to us in our original dataset contain all of the information we have access to, but the best insights may instead come from transformations of those variables.

4.3.4.1 Transforming Quantitative Variables

One of the simplest reasons to want to transform a quantitative variable is to change the measurement units.

Here we change the age of passengers from a value in years to a value in decades.

df_titanic["age"] = df_titanic["age"] / 10

If we have a quantitative variable that is particularly skewed, then it might be a good idea to transform the values of that variable…like taking the log of the values.

Note

This was a strategy you saw employed with the GapMinder data!

Below is an example of taking the log of the fare variable. Notice that we’re making use of the numpy here to take the log.

df_titanic["fare"] = np.log(df_titanic["fare"])

Remember that we can take advantage of vectorization here too. The following operation wouldn’t really make physical sense, but it’s an example of creating a new variable out of existing variables.

df_titanic["nonsense"] = df_titanic["fare"] / df_titanic["age"]

Note that we created the new variable, nonsense, by specifying on the left side of the = here and populating that column/variable via the expression on the right side of the =.

We could want to create a new variable by categorizing (or discretizing) the values of a quantitative variable (i.e. convert a quantitative variable to a categorical variable). We can do so with cut.

In the following, we create a new age_cat variable which represents whether a person is a child or an adult.

df_titanic["age_cat"] = pd.cut(df_titanic["age"],
                              bins = [0, 18, 100],
                              labels = ["child", "adult"])

Check In

Consider the four mutations we just performed. In which ones did we reassign a column of the dataset, thus replacing the old values with new ones? In which ones did we create a brand-new column, thus retaining the old column(s) that were involved in the calculation?

4.3.4.2 Transforming Categorical Variables

In some situations, especially later with modeling, we’ll need to convert categorical variables (stored as text) into quantitative (often coded) variables. Binary categorical variables can be converted into quantitative variables by coding one category as 1 and the other category as 0. (In fact, the survived column in the titanic dataset has already been coded this way.) The easiest way to do this is to create a boolean mask. For example, to convert gender to a quantitative variable female, which is 1 if the passenger was female and 0 otherwise, we can do the following:

df_titanic["female"] = 1 * (df_titanic["gender"] == "female")

What do we do about a categorical variable with more than twwo categories, like embarked, which has four categories? In general, a categorical variable with K categories can be converted into K separate 0/1 variables, or dummy variables. Each of the K dummy variables is an indicator for one of the K categories. That is, a dummy variable is 1 if the observation fell into its particular category and 0 otherwise.

Although it is not difficult to create dummy variables manually, the easiest way to create them is the get_dummies() function in pandas.

pd.get_dummies(df_titanic["embarked"])

	B	C	Q	S
1176	False	False	False	True
820	False	False	False	True
53	False	True	False	False
456	False	True	False	False
282	False	False	False	True
...	...	...	...	...
1182	False	True	False	False
296	False	False	False	True
316	False	False	False	True
439	False	True	False	False
677	False	True	False	False

2207 rows × 4 columns

We may also want to change the levels of a categorical variable. A categorical variable can be transformed by mapping its levels to new levels. For example, we may only be interested in whether a person on the titanic was a passenger or a crew member. The variable class is too detailed. We can create a new variable, type, that is derived from the existing variable class. Observations with a class of “1st”, “2nd”, or “3rd” get a value of “passenger”, while observations with a class of “victualling crew”, “engineering crew”, or “deck crew” get a value of “crew”.

df_titanic["type"] = df_titanic["class"].map({
    "1st": "passenger",
    "2nd": "passenger",
    "3rd": "passenger",
    "victualling crew": "crew",
    "engineering crew": "crew",
    "deck crew": "crew"
})

df_titanic

	name	gender	age	class	embarked	country	ticketno	fare	survived	nonsense	age_cat	female	type
1176	Svensson, Mr. Johan	male	7.400000	3rd	S	Sweden	347060.0	1.967196	0	0.265837	child	0	passenger
820	Mitchell, Mr. Henry Michael	male	7.200000	2nd	S	England	24580.0	2.312535	0	0.321185	child	0	passenger
53	Artagaveytia, Mr. Ramon	male	7.100000	1st	C	Argentina	17609.0	3.893861	0	0.548431	child	0	passenger
456	Goldschmidt, Mr. George B.	male	7.100000	1st	C	United States	17754.0	3.530180	0	0.497208	child	0	passenger
282	Crosby, Captain. Edward Gifford	male	7.000000	1st	S	United States	5735.0	4.262680	0	0.608954	child	0	passenger
...	...	...	...	...	...	...	...	...	...	...	...	...	...
1182	Tannūs, Master. As'ad	male	0.041667	3rd	C	Lebanon	2625.0	2.091913	1	50.205923	child	0	passenger
296	Danbom, Master. Gilbert Sigvard Emanuel	male	0.033333	3rd	S	Sweden	347080.0	2.644755	0	79.342661	child	0	passenger
316	Dean, Miss. Elizabeth Gladys 'Millvina'	female	0.016667	3rd	S	England	2315.0	3.001247	1	180.074822	child	1	passenger
439	Gheorgheff, Mr. Stanio	male	NaN	3rd	C	Bulgaria	349254.0	1.970059	0	NaN	NaN	0	passenger
677	Kraeff, Mr. Theodor	male	NaN	3rd	C	Bulgaria	349253.0	1.970059	0	NaN	NaN	0	passenger

2207 rows × 13 columns

4.3.5 Summarizing Rows

Summarization of the rows of a dataset for column(s) of interest can take many different forms. This introduction will not be exhaustive, but certainly cover the basics.

4.3.5.1 Summarizing a Quantitative Variable

There are a few descriptive statistics that can be computed directly including, but not limited to, the mean and median.

df_titanic["age"].mean()

df_titanic["age"].median()

df_titanic[["age", "fare"]].mean()

age     3.043673
fare    2.918311
dtype: float64

We can ask for a slightly more comprehensive description using .describe()

df_titanic["age"].describe()

df_titanic.describe()

	age	ticketno	fare	survived	nonsense	female
count	2205.000000	1.316000e+03	1291.000000	2207.000000	1289.000000	2207.000000
mean	3.043673	2.842157e+05	2.918311	0.322157	2.147877	0.221568
std	1.215968	6.334726e+05	0.974452	0.467409	7.237694	0.415396
min	0.016667	2.000000e+00	1.108728	0.000000	0.265837	0.000000
25%	2.200000	1.426225e+04	1.971383	0.000000	0.742371	0.000000
50%	2.900000	1.114265e+05	2.645480	0.000000	0.936833	0.000000
75%	3.800000	3.470770e+05	3.435945	1.000000	1.260935	0.000000
max	7.400000	3.101317e+06	6.238443	1.000000	180.074822	1.000000

Note that, by default, .describe() provides descriptive statistics for only the quantitative variables in the dataset.

We can enhance numerical summaries with .groupby(), which allows us to specify one or more variables that we’d like to group our work by.

df_titanic[["age", "survived"]].groupby("survived").mean()

	age
survived
0	3.083194
1	2.960631

Check In

With 2-3 people around you, look up how you would compute the correlation between two quantitative variables in Python. Compute the correlation between the age and fare variables in the titanic dataset.

4.3.5.2 Summarizing a Categorical Variable

When it comes to categorical variables we’re most often interested in frequency distributions (counts), relative frequency distributions, and cross-tabulations.

df_titanic["class"].unique()

df_titanic["class"].describe()

count     2207
unique       7
top        3rd
freq       709
Name: class, dtype: object

The .unique() here allows us to see the unique values of the class variable. Notice that the results of .describe() on a categorical variable are much different.

To completely summarize a single categorical variable, we report the number of times each level appeared, or its frequency.

df_titanic["class"].value_counts()

class
3rd                 709
victualling crew    431
1st                 324
engineering crew    324
2nd                 284
restaurant staff     69
deck crew            66
Name: count, dtype: int64

Instead of reporting counts, we can also report proportions or probabilities, or the relative frequencies. We can calculate the relative frequencies by specifying normalize=True in .value_counts().

df_titanic["class"].value_counts(normalize=True)

class
3rd                 0.321251
victualling crew    0.195288
1st                 0.146806
engineering crew    0.146806
2nd                 0.128681
restaurant staff    0.031264
deck crew           0.029905
Name: proportion, dtype: float64

Cross-tabulations are one way we can investigate possible relationships between categorical variables. For example, what can we say about the relationship between gender and survival on the Titanic?

Check In

Summarize gender and survival individually by computing the frequency distributions of each.

This does not tell us how gender interacts with survival. To do that, we need to produce a cross-tabulation, or a “cross-tab” for short. (Statisticians tend to call this a contingency table or a two-way table.)

pd.crosstab(df_titanic["survived"], df_titanic["gender"])

gender	female	male
survived
0	130	1366
1	359	352

A cross-tabulation of two categorical variables is a two-dimensional array, with the levels of one variable along the rows and the levels of the other variable along the columns. Each cell in this array contains the number of observations that had a particular combination of levels. So in the Titanic data set, there were 359 females who survived and 1366 males who died. From the cross-tabulation, we can see that there were more females who survived than not, while there were more males who died than not. Clearly, gender had a strong influence on survival because of the Titanic’s policy of “women and children first”.

To get probabilities instead of counts, we specify normalize=True.

pd.crosstab(df_titanic["survived"], df_titanic["gender"], normalize=True)

gender	female	male
survived
0	0.058903	0.618940
1	0.162664	0.159493

Check In

What about conditional proportions? With 2-3 people around you, discuss how you would compute the proportion of females that survived and the proportion of males that survived and then do it.

Note, there are multiple ways to do this.

Practice Activity

Open up this colab notebook and make a copy.

Fill out the sections where indicated, render it to html with Quarto, and push your final notebook and html document to a repository on GitHub (same one as Practice Activity 1.1 is good). Then share this repository link in the quiz question.