import pandas as pd
from palmerpenguins import load_penguins
from plotnine import ggplot, geom_point, aes, geom_boxplot
3 Data Visualization in Python
3.1 Introduction
This document demonstrates the use of the plotnine library in Python to visualize data via the grammar of graphics framework.
The functions in plotnine originate from the ggplot2 R package, which is the R implementation of the grammar of graphics.
3.2 Grammar of Graphics
The grammar of graphics is a framework for creating data visualizations.
A visualization consists of:
The aesthetic: Which variables are dictating which plot elements.
The geometry: What shape of plot you are making.
For example, the plot below displays some of the data from the Palmer Penguins data set.
First, though, we need to load the Palmer Penguins dataset.
If you do not have the pandas library installed, you will need to run
pip install pandas
in the Jupyter terminal to install it. The same applies to any other library you have not installed yet.
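One way to find out which libraries still need installing is to ask Python directly. The following is a minimal sketch using only the standard library; the package list is simply taken from the imports at the top of this chapter.

```python
from importlib.util import find_spec

# Packages imported in this chapter
required = ["pandas", "palmerpenguins", "plotnine"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in required if find_spec(name) is None]

if missing:
    print("Run in the terminal: pip install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```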
penguins = load_penguins()
(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
)
The aesthetic is species on the x-axis, bill_length_mm on the y-axis, colored by species.
The geometry is a boxplot.
3.3 plotnine (i.e. ggplot)
The plotnine library implements the grammar of graphics in Python.
Code for the previous example:
(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
)
3.3.1 The aesthetic
3.3.2 The geometry
(ggplot(penguins,
        aes(
            x = "species",
            y = "bill_length_mm",
            fill = "species"))
+ geom_boxplot()
)
A variety of geom_* functions allow for different plotting shapes (e.g. boxplot, histogram, etc.).
3.3.3 Other optional elements:
The scales of the x- and y-axes.
The color of elements that are not mapped to aesthetics.
The theme of the plot.
…and many more!
3.3.4 Scales
(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
)
versus
from plotnine import scale_y_reverse
(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
+ scale_y_reverse()
)
3.3.5 Non-aesthetic colors
(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
)
versus
(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot(fill = "cornflowerblue")
)
(ggplot(penguins,
        aes(
            x = "species",
            y = "bill_length_mm",
            fill = "cornflowerblue"))
+ geom_boxplot()
)
3.3.6 Themes
(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
)
versus
from plotnine import theme_classic
(ggplot(penguins, aes(x = "species", y = "bill_length_mm", fill = "species"))
+ geom_boxplot()
+ theme_classic()
)
3.4 Geometries: The “Big Five”
3.4.1 1. Bar Plots
Most often used for showing counts of a categorical variable:
from plotnine import geom_bar
(ggplot(penguins,
        aes(
            x = "species"
        ))
+ geom_bar()
)
… or relationships between two categorical variables:
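# Rows with a missing sex are dropped first. On the full data this plot failed with:
# TypeError: '<' not supported between instances of 'str' and 'float'
(ggplot(penguins.dropna(subset = ["sex"]),
        aes(
            x = "species",
            fill = "sex"
        ))
+ geom_bar()
)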
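The TypeError reported for this plot most likely arises because the sex column contains missing values: they are stored as float NaN alongside the string labels, and Python cannot order a str against a float. The sketch below reproduces that failure on a made-up data frame (`toy` is hypothetical) and shows one possible workaround, dropping the incomplete rows before plotting.

```python
import pandas as pd

# Hypothetical toy data with one missing value in the sex column
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["male", float("nan"), "female", "male"],
})

# Sorting a mix of str and NaN values fails with a TypeError
try:
    sorted(toy["sex"].tolist())
    sort_failed = False
except TypeError:
    sort_failed = True

# One workaround: keep only the rows where sex is recorded
toy_complete = toy.dropna(subset = ["sex"])
print(sort_failed, len(toy_complete))
```

Whether dropping rows is appropriate depends on the analysis; relabeling the missing values with `fillna` would be another option.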
Would we rather see percents?
(ggplot(penguins,
        aes(
            x = "species",
            fill = "island"
        ))
+ geom_bar(position = "fill")
)
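With `position = "fill"`, every bar is rescaled to the same height, so the segments show proportions within each species rather than counts. The numbers behind such a plot can be sketched with pandas; the data frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: four Adelie and two Gentoo penguins
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Biscoe", "Dream", "Dream", "Torgersen", "Biscoe", "Biscoe"],
})

# Counts per species and island: what geom_bar() stacks by default
counts = pd.crosstab(toy["species"], toy["island"])

# Proportions within each species: what position = "fill" displays
proportions = pd.crosstab(toy["species"], toy["island"], normalize = "index")
print(proportions)
```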
Or side-by-side?
(ggplot(penguins,
        aes(
            x = "species",
            fill = "island"
        ))
+ geom_bar(position = "dodge")
)
3.4.2 2. Boxplots
(ggplot(penguins,
        aes(
            x = "species",
            y = "bill_length_mm"
        ))
+ geom_boxplot()
)
Side-by-side using a categorical variable:
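# Rows with a missing sex are dropped first. On the full data this plot failed with:
# TypeError: '<' not supported between instances of 'str' and 'float'
(ggplot(penguins.dropna(subset = ["sex"]),
        aes(
            x = "species",
            y = "bill_length_mm",
            fill = "sex"
        ))
+ geom_boxplot()
)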
3.4.3 3. Histograms
from plotnine import geom_histogram
(ggplot(penguins,
        aes(
            x = "bill_length_mm"
        ))
+ geom_histogram()
)
(ggplot(penguins,
        aes(
            x = "bill_length_mm"
        ))
+ geom_histogram(bins = 100)
)
(ggplot(penguins,
        aes(
            x = "bill_length_mm"
        ))
+ geom_histogram(bins = 10)
)
3.4.4 3.5 Densities
Suppose you want to compare histograms by category:
(ggplot(penguins,
        aes(
            x = "bill_length_mm",
            fill = "species"
        ))
+ geom_histogram()
)
Cleaner: smoothed histogram, or density:
from plotnine import geom_density
(ggplot(penguins,
        aes(
            x = "bill_length_mm",
            fill = "species"
        ))
+ geom_density()
)
Even cleaner: The alpha option:
(ggplot(penguins,
        aes(
            x = "bill_length_mm",
            fill = "species"
        ))
+ geom_density(alpha = 0.5)
)
3.4.5 4. Scatterplots
(ggplot(penguins,
        aes(
            x = "bill_length_mm",
            y = "bill_depth_mm"
        ))
+ geom_point()
)
Colors for extra information:
(ggplot(penguins,
        aes(
            x = "bill_length_mm",
            y = "bill_depth_mm",
            color = "species"
        ))
+ geom_point()
)
3.4.6 5. Line Plots
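from plotnine import geom_line
# Average only the numeric columns. A plain .mean() also tries the text columns and fails with:
# TypeError: agg function failed [how->mean,dtype->object]
# reset_index() turns the group keys (species, sex) back into ordinary columns for aes().
penguins2 = penguins.groupby(by = ["species", "sex"]).mean(numeric_only = True).reset_index()
# group = "sex" connects the points of each sex with one line across the species.
(ggplot(penguins2,
        aes(
            x = "species",
            y = "bill_length_mm",
            color = "sex",
            group = "sex"
        ))
+ geom_point()
+ geom_line()
)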
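The TypeError from `.mean()` arises because the mean is attempted on every remaining column, including text columns such as island, which cannot be averaged. A minimal sketch of one fix, on a made-up data frame (`toy` is hypothetical): restrict the mean to numeric columns, then turn the group keys back into ordinary columns so that they can be named in `aes()`.

```python
import pandas as pd

# Hypothetical toy data with one text column (island) that cannot be averaged
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Dream", "Biscoe", "Biscoe", "Biscoe"],
    "sex": ["male", "male", "female", "female"],
    "bill_length_mm": [39.0, 41.0, 45.0, 47.0],
})

toy_means = (
    toy.groupby(by = ["species", "sex"])
    .mean(numeric_only = True)  # skip non-numeric columns such as island
    .reset_index()              # species and sex become regular columns again
)
print(toy_means)
```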
3.5 Multiple Plots
3.5.1 Facet Wrapping
from plotnine import facet_wrap
(ggplot(penguins,
        aes(
            x = "species",
            y = "bill_length_mm"
        ))
+ geom_boxplot()
+ facet_wrap("sex")
)