3  Data Visualization in Python

3.1 Introduction

This document demonstrates the use of the plotnine library in Python to visualize data via the grammar of graphics framework.

The functions in plotnine originate from the ggplot2 R package, which is the R implementation of the grammar of graphics.

3.2 Grammar of Graphics

The grammar of graphics is a framework for creating data visualizations.

A visualization consists of:

  • The aesthetic: Which variables dictate which plot elements.

  • The geometry: What shape of plot you are making.

For example, the plot below displays some of the data from the Palmer Penguins data set.

First, though, we need to load the Palmer Penguins dataset.

Note

If you do not have the pandas library installed then you will need to run

pip install pandas

in the Jupyter terminal to install it. The same goes for any other library you have not installed yet, such as plotnine or palmerpenguins.
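If you are not sure which libraries are already installed, a quick check like the one below can help. This is only a sketch: the list of names is an assumption based on the imports used in this chapter, so adjust it to match your own notebook.

```python
from importlib.util import find_spec

# Libraries imported in this chapter (assumed list; edit as needed).
needed = ["pandas", "plotnine", "palmerpenguins"]

# find_spec returns None when a top-level package cannot be found.
missing = [name for name in needed if find_spec(name) is None]

if missing:
    print("Run in the terminal: pip install " + " ".join(missing))
else:
    print("All libraries are installed.")
```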

import pandas as pd
from palmerpenguins import load_penguins
from plotnine import ggplot, geom_point, aes, geom_boxplot
penguins = load_penguins()

(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
)

The aesthetic is species on the x-axis, bill_length_mm on the y-axis, and fill color by species.

The geometry is a boxplot.

Check In

Take a look at the first page of the optional reading for plotnine. In groups of 3-4, discuss the differences between how they use plotnine and the way we used it in the code chunk above.

3.3 plotnine (i.e. ggplot)

The plotnine library implements the grammar of graphics in Python.

Code for the previous example:

(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
)

3.3.1 The aesthetic

(ggplot(penguins, 
aes(                      # (1)
  x = "species",          # (2)
  y = "bill_length_mm",
  fill = "species"))
+ geom_boxplot()
)

(1) The aes() function is the place to specify aesthetics.

(2) x, y, and fill are three of the aesthetics that can be specified; each maps a variable in our data set to a plot element.

3.3.2 The geometry

(ggplot(penguins, 
aes(
  x = "species",
  y = "bill_length_mm",
  fill = "species"))
+ geom_boxplot()          # (1)
)

(1) A variety of geom_* functions allow for different plotting shapes (e.g. boxplot, histogram, etc.).

3.3.3 Other optional elements:

  • The scales of the x- and y-axes.

  • The color of elements that are not mapped to aesthetics.

  • The theme of the plot.

…and many more!

3.3.4 Scales

(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
)

versus

from plotnine import scale_y_reverse
(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
+ scale_y_reverse()
)

3.3.5 Non-aesthetic colors

(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
)

versus

(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot(fill = "cornflowerblue")
)

Check In

What will this show?

(ggplot(penguins, 
aes(
  x = "species",
  y = "bill_length_mm",
  fill = "cornflowerblue"))
+ geom_boxplot()
)

3.3.6 Themes

(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
)

versus

from plotnine import theme_classic
(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
+ theme_classic()
)

Example

What are the differences between the two plots above? What did the theme change?

Check In

What are the aesthetics, geometry, scales, and other options in the cartoon plot below?

[Figure: An xkcd comic of time spent going up the down escalator]

3.4 Geometries: The “Big Five”

3.4.1 1. Bar Plots

Most often used for showing counts of a categorical variable:

from plotnine import geom_bar
(ggplot(penguins,
aes(
  x = "species"
))
+ geom_bar()
)

… or relationships between two categorical variables:

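# Drop penguins with a missing sex first: mixing text values with missing
# values (NaN) in the fill variable can raise a TypeError in plotnine.
(ggplot(penguins.dropna(subset = ["sex"]),
aes(
  x = "species",
  fill = "sex"
))
+ geom_bar()
)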
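Discrete aesthetics such as fill can fail when the mapped column mixes text with missing values (NaN), as the sex column in penguins does. A minimal pandas sketch of checking for, and dropping, missing values before plotting is shown here on a made-up data frame (toy and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing sex value.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["female", np.nan, "male", "female"],
})

# Count the missing values in the column we want to map to fill.
n_missing = toy["sex"].isna().sum()

# Keep only the rows where sex is recorded.
complete = toy.dropna(subset = ["sex"])

print(n_missing, len(complete))
```

Dropping rows changes the counts shown in the plot, so it is worth noting how many observations were removed.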

Would we rather see percents?

(ggplot(penguins,
aes(
  x = "species",
  fill = "island"
))
+ geom_bar(position = "fill")
)
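With position = "fill", each bar is rescaled so that its segments add up to 1. The numbers behind such a plot can be reproduced in pandas; the sketch below does so with pd.crosstab on a made-up data frame (toy and its values are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for the penguins data (made-up values).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Biscoe", "Dream", "Dream", "Torgersen", "Biscoe", "Biscoe"],
})

# Raw counts: what a stacked bar plot shows.
counts = pd.crosstab(toy["species"], toy["island"])

# Row-wise proportions: what a filled bar plot shows.
shares = pd.crosstab(toy["species"], toy["island"], normalize = "index")

print(counts)
print(shares)
```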

Or side-by-side?

(ggplot(penguins,
aes(
  x = "species",
  fill = "island"
))
+ geom_bar(position = "dodge")
)

Example

Compare and contrast the plots above. What information is lost or gained from one plot to the next?

3.4.2 2. Boxplots

(ggplot(penguins,
aes(
  x = "species",
  y = "bill_length_mm"
))
+ geom_boxplot()
)

Side-by-side using a categorical variable:

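# Drop penguins with a missing sex first: mixing text values with missing
# values (NaN) in the fill variable can raise a TypeError in plotnine.
(ggplot(penguins.dropna(subset = ["sex"]),
aes(
  x = "species",
  y = "bill_length_mm",
  fill = "sex"
))
+ geom_boxplot()
)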

3.4.3 3. Histograms

from plotnine import geom_histogram
(ggplot(penguins,
aes(
  x = "bill_length_mm"
))
+ geom_histogram()
)

(ggplot(penguins,
aes(
  x = "bill_length_mm"
))
+ geom_histogram(bins = 100)
)

(ggplot(penguins,
aes(
  x = "bill_length_mm"
))
+ geom_histogram(bins = 10)
)
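The bins argument controls how many intervals the data are split into, which changes how coarse or detailed the histogram looks. The sketch below uses numpy.histogram on made-up values to show the counts for two choices of bins. The exact bin edges that plotnine picks may differ from numpy's, so read this as an illustration of the idea rather than of plotnine's internals.

```python
import numpy as np

# Made-up bill lengths in mm.
lengths = np.array([36.2, 37.8, 38.9, 39.5, 41.1, 43.2, 45.5, 46.1, 47.6, 50.0])

# Compute the counts and bin edges for a coarse and a finer binning.
results = {}
for bins in (2, 5):
    counts, edges = np.histogram(lengths, bins = bins)
    results[bins] = (counts, edges)
    print(bins, "bins:", counts)
```

Every observation lands in exactly one bin, so the counts always add up to the number of observations, whichever value of bins is used.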

3.4.4 3.5 Densities

Suppose you want to compare histograms by category:

(ggplot(penguins,
aes(
  x = "bill_length_mm",
  fill = "species"
))
+ geom_histogram()
)

Cleaner: smoothed histogram, or density:

from plotnine import geom_density
(ggplot(penguins,
aes(
  x = "bill_length_mm",
  fill = "species"
))
+ geom_density()
)

Even cleaner: The alpha option:

(ggplot(penguins,
aes(
  x = "bill_length_mm",
  fill = "species"
))
+ geom_density(alpha = 0.5)
)

3.4.5 4. Scatterplots

(ggplot(penguins,
aes(
  x = "bill_length_mm",
  y = "bill_depth_mm"
))
+ geom_point()
)

Colors for extra information:

(ggplot(penguins,
aes(
  x = "bill_length_mm",
  y = "bill_depth_mm",
  color = "species"
))
+ geom_point()
)

3.4.6 5. Line Plots

from plotnine import geom_line
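# Average only the numeric columns, and keep species and sex as ordinary
# columns (not the index) so that aes() can find them.
penguins2 = (penguins
  .groupby(by = ["species", "sex"], as_index = False)
  .mean(numeric_only = True)
)

(ggplot(penguins2,
aes(
  x = "species",
  y = "bill_length_mm",
  color = "sex",
  group = "sex"  # connect the points within each sex across species
))
+ geom_point()
+ geom_line()
)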

3.5 Multiple Plots

3.5.1 Facet Wrapping

from plotnine import facet_wrap
(ggplot(penguins,
aes(
  x = "species",
  y = "bill_length_mm"
))
+ geom_boxplot()
+ facet_wrap("sex")
)

Practice Activity

Open up this colab notebook and make a copy.

Fill out the sections where indicated, render it to html with Quarto, and push your final notebook and html document to a repository on GitHub (same one as Practice Activity 1.1 is good). Then share this repository link in the quiz question.