6 Graph basics
Decline by Randall Munroe (xkcd.com) is licensed under CC BY-NC 2.5
6.1 Introduction
This tutorial is an introduction to ggplot2 adapted from Chapter 3 of [1]. If you already have R experience, you might still want to browse this section in case you find something new.
Prerequisites should be completed before proceeding. After that, the tutorial should take about an hour.
- As you work through the tutorial, type a line or chunk of code, then use File > Save and run the script.
- Confirm that your result matches the tutorial result.
- The exercises give you a chance to practice your new skills and learn by doing (but you knew that already)!
6.1.1 Download prepared data
Data sets used in the exercises have been prepared and saved in a .zip file in the workshop repository. The data are in .rds format, a native R format for single files that preserves variable types, including the order (if any) of factors.
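As a minimal sketch of what "preserves variable types" means in practice, the round trip below writes a small data frame to a temporary .rds file and reads it back. The data frame, its column names, and the level ordering are made up for illustration and are not part of the workshop data.

```r
# Hypothetical data: a factor whose levels are deliberately not alphabetical
original <- data.frame(
  size  = factor(c("High", "Low", "Medium"),
                 levels = c("Low", "Medium", "High")),
  count = c(3L, 1L, 2L)
)

# Write to a temporary file, then read it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(original, file = rds_path)
restored <- readRDS(rds_path)

# The level order and the integer column type survive the round trip
levels(restored$size)
#> [1] "Low"    "Medium" "High"
class(restored$count)
#> [1] "integer"
```

A plain-text format such as CSV would typically return the factor as character strings and lose the custom level order, which is why the prepared data are distributed as .rds files.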
Download the prepared_data.zip file from the workshop website with the following code.
- You can copy and paste the code to the Console; you only have to run this once.
- The destination file assumes you have a data directory in your project.
zip_url <- paste0("https://github.com/MIDFIELDR/2021-asee-workshop/",
                  "raw/main/data/prepared_data.zip")
download.file(zip_url, destfile = "data/prepared_data.zip")
If the download is unsuccessful
- Navigate to the prepared_data.zip repository.
- Click Download
Once the download is successful
- Extract the compressed files from the downloaded .zip file.
- Manually move the files into the top level of the workshop data folder.
Your workshop data directory should now contain:
data\
    hours_per_term.rds
    sat.rds
    stickiness.rds
    student_demogr.rds
6.1.2 Start a new script
Create a new script for this tutorial.
- See Create a script if you need a refresher on creating, saving, and running an R script.
- At the top of the script add a minimal header and install and load the packages indicated.
# Graph basics
# Name
# Date
# Packages used in this tutorial
library("midfieldr")
library("midfielddata")
library("data.table")
library("ggplot2")
library("gapminder")
# Optional code to control data.table printing
options(
  datatable.print.nrows = 10,
  datatable.print.topn = 5,
  datatable.print.class = TRUE
)
# Load midfielddata data sets to use later
data(student)
data(term)
If you get an error like this one after running the script,
Error in library("gapminder") : there is no package called 'gapminder'
then the package needs to be installed. If you need a refresher on installing packages, see Install CRAN packages. Once the missing package is installed, you can rerun the script.
6.2 Expected data structure
Data for analysis and graphing are often laid out in “block record” or “long” form, with every key variable and every response variable in its own column [2]. Database designers call this a “denormalized” form; many R users would recognize it as the so-called “tidy” form [3].
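To make the layout concrete, the sketch below stores the same three life-expectancy values twice in base R: once in a "wide" layout with one column per year, and once in block-record form with one row per observation. The values are Afghanistan's life expectancies for 1952 to 1962 from the gapminder data used in this tutorial; the object and column names (wide, long, y1952, and so on) are only illustrative.

```r
# Wide layout: one row per country, one column per year
wide <- data.frame(
  country = "Afghanistan",
  y1952 = 28.801,
  y1957 = 30.332,
  y1962 = 31.997
)

# Block-record (long) layout: one row per country-year observation,
# with the key variables (country, year) and the response (lifeExp)
# each in its own column
long <- data.frame(
  country = "Afghanistan",
  year    = c(1952L, 1957L, 1962L),
  lifeExp = c(wide$y1952, wide$y1957, wide$y1962)
)

dim(wide)
#> [1] 1 4
dim(long)
#> [1] 3 3
```

In the long layout, a column such as year can be mapped directly to an axis or a facet, which is not possible when the years are spread across column names.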
We use this form regularly for preparing data for graphing using the ggplot2 package. The gapminder data we’re using in this tutorial is in block-record form. To view its help page, run
library("gapminder")
?gapminder
# Convert the data frame to a data.table structure
gapminder <- data.table(gapminder)
Typing its name prints the first and last few rows. Under each column name, the class of the variable is shown: factor <fctr>, integer <int>, and double-precision <num>.
gapminder
#> country continent year lifeExp pop gdpPercap
#> <fctr> <fctr> <int> <num> <int> <num>
#> 1: Afghanistan Asia 1952 28.801 8425333 779.4453
#> 2: Afghanistan Asia 1957 30.332 9240934 820.8530
#> 3: Afghanistan Asia 1962 31.997 10267083 853.1007
#> 4: Afghanistan Asia 1967 34.020 11537966 836.1971
#> 5: Afghanistan Asia 1972 36.088 13079460 739.9811
#> ---
#> 1700: Zimbabwe Africa 1987 62.351 9216418 706.1573
#> 1701: Zimbabwe Africa 1992 60.377 10704340 693.4208
#> 1702: Zimbabwe Africa 1997 46.809 11404948 792.4500
#> 1703: Zimbabwe Africa 2002 39.989 11926563 672.0386
#> 1704: Zimbabwe Africa 2007 43.487 12311143 469.7093
x <- gapminder$continent
attributes(x)
#> $levels
#> [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
#>
#> $class
#> [1] "factor"
6.2.1 Exercise
- Examine the student data from midfielddata. (Type its name in the console.)
- How many variables? How many observations?
- How many of the variables are numeric? How many are character type?
- Is the data set in block-record form?
- Check your work by comparing your result to the student help page (link below).
Help pages for more information:
6.3 Anatomy of a graph
ggplot() is our basic plotting function. The data = ... argument assigns the data frame. The plot is empty because we haven’t mapped the data to coordinates yet.
ggplot(data = gapminder)
Next we use the mapping argument mapping = aes(...) to assign variables (column names) from the data frame to specific aesthetic properties of the graph such as the x-coordinate, the y-coordinate, color, fill, etc.
Here we map continent (a categorical variable) to x and life expectancy (a quantitative variable) to y. To reduce the number of times we repeat lines of code, we can assign a name (life_exp) to the empty graph, to which we can add layers later.
# Demonstrate aesthetic mapping
life_exp <- ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp))
To reduce typing, the first two arguments data and mapping are often used without naming them explicitly, e.g.,
# Demonstrate implicit data and mapping arguments
life_exp <- ggplot(gapminder, aes(x = continent, y = lifeExp))
If we print the graph by typing the name of the graph object (everything in R is an object), we get a graph with a range on each axis (from the mapping) but no data shown. We haven’t specified the type of visual encoding we want.
# Examine the result
life_exp
A box-and-whisker plot (or box plot) is designed for displaying the distribution of a single quantitative variable. The visual encoding is specified using the geom_boxplot() layer, where a “geom” is a geometric object. The geom_boxplot() function conventionally takes the quantitative variable assigned to y and the categorical variable (if any) to x.
# Demonstrate adding a geometric object
life_exp <- life_exp +
  geom_boxplot()
# Examine the result
life_exp
Notice that the default axis labels are the variable names from the data frame. We can edit those with another layer.
# Demonstrate editing axis labels
life_exp <- life_exp +
  labs(x = "Continent", y = "Life expectancy (years)")
# Examine the result
life_exp
Next, we often want the categorical variable ordered by the quantitative variable instead of alphabetically. Because continent is a factor, we can use the reorder() function inside the aes() call to order the boxplots by the median life expectancy per continent. For more information on ordering data, see Ordering factors.
# Demonstrate reordering a categorical variable
life_exp +
  aes(x = reorder(continent, lifeExp, median), y = lifeExp)
Summary. The basic steps for building up the layers of any graph:
- assign the data frame
- map variables (column names) to aesthetic properties
- choose geoms
- adjust scales, labels, ordering, etc.
Lastly, while we separate the layers as we work to focus on one specific layer at a time, the layers can always be written in a single code chunk, e.g.,
ggplot(gapminder, aes(x = reorder(continent, lifeExp, median), y = lifeExp)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Life expectancy (years)")
6.3.1 Exercise
- Examine the term data set from midfielddata.
- Create a boxplot of the hours per term quantity conditioned by the student level.
- What is the rationale for leaving the categorical variable in its native order?
- Check your work by comparing your result to the graph below.
Help pages for more information:
6.4 Layer: points
A two-dimensional scatterplot reveals the strength of the relationship between two quantitative variables. The ggplot geom for scatterplots is geom_point(). To illustrate a scatterplot, we graph life expectancy as a function of GDP per capita.
life_gdp <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  labs(x = "GDP per capita (USD)", y = "Life expectancy (years)")
life_gdp
Help pages for more information:
6.5 Layer: smooth fit
Suppose you wanted a smooth fit curve, not necessarily linear. Add a geom_smooth() layer. Loess (pronounced like the proper name Lois) is a nonparametric curve-fitting method based on local regression.
life_gdp +
  geom_smooth(method = "loess", se = FALSE)
The se argument controls whether or not the confidence interval is displayed. Setting se = TRUE yields,
life_gdp +
  geom_smooth(method = "loess", se = TRUE)
For a linear-fit layer, we change method to lm (short for linear model). The linear fit is not particularly good in this case, but now you know how to do one.
life_gdp +
  geom_smooth(method = "lm", se = TRUE)
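As a rough sketch of what the two method values do: in ggplot2, method = "loess" and method = "lm" fit the smooth with stats::loess() and stats::lm(), respectively. The example below calls both directly on R's built-in cars data set and compares their predictions. Using cars is an assumption made to keep the example self-contained; it is not part of the workshop data, and the object names are only illustrative.

```r
# Built-in data: stopping distance (dist) versus speed for 50 cars
fit_loess <- loess(dist ~ speed, data = cars)  # local regression
fit_lm    <- lm(dist ~ speed, data = cars)     # straight-line fit

# Predict stopping distance at a few speeds within the observed range
new_speeds <- data.frame(speed = c(5, 15, 25))
pred <- data.frame(
  speed = new_speeds$speed,
  loess = predict(fit_loess, newdata = new_speeds),
  lm    = predict(fit_lm, newdata = new_speeds)
)
pred
```

The lm column changes by a constant amount per unit of speed, while the loess column is free to bend with the data. That difference is what the two smooth layers show on the graph.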
Help pages for more information:
6.5.1 Exercise
A data set has been extracted from the midfielddata student table with a sample of 3000 student SAT scores. These data, sat.rds, are part of the prepared data from the .zip file you downloaded earlier. Here we read the data in using readRDS().
# Prepared data from the downloaded zip file
sat <- readRDS("data/sat.rds")
- Use the sat data and create a scatterplot of verbal scores sat_verbal as a function of math scores sat_math.
- Add a loess fit.
- Check your work by comparing your result to the graph below.
6.6 Layer: scale
The GDP per capita variable spans several orders of magnitude. To confirm, we can create a summary() of the gdpPercap variable. The output shows that the minimum is 241, the median 3532, and the maximum 113,523.
# statistical summary of one variable
summary(gapminder[, gdpPercap])
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 241.2 1202.1 3531.8 7215.3 9325.5 113523.1
In exploring a graph like this, it might be useful to add a layer that changes the horizontal scale to a log-base-10 scale.
life_gdp <- life_gdp +
  scale_x_continuous(trans = "log10")
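To see why the log scale helps, the sketch below takes the minimum, median, and maximum reported by summary() above and computes where the median falls along the axis before and after a log10 transform. The names gdp_summary and position are hypothetical helpers for this illustration only.

```r
# Values copied from the summary() output above
gdp_summary <- c(min = 241.2, median = 3531.8, max = 113523.1)

# Fraction of the axis length (0 = left edge, 1 = right edge)
# at which the median falls
position <- function(v) {
  unname((v["median"] - v["min"]) / (v["max"] - v["min"]))
}

position(gdp_summary)         # linear scale: median is crowded near the left edge
position(log10(gdp_summary))  # log10 scale: median sits much nearer the middle

# The transformed values themselves
round(unname(log10(gdp_summary)), 3)
#> [1] 2.382 3.548 5.055
```

On the linear scale, half of the observations are squeezed into a few percent of the axis. The log10 scale spreads them out so that patterns at low GDP per capita become visible.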
Update the axis labels,
life_gdp <- life_gdp +
  labs(x = "GDP per capita, USD (log10 scale)",
       y = "Life expectancy (years)")
# display the graph
life_gdp
In summary, all the layers could have been coded at once, for example,
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_continuous(trans = "log10") +
  labs(x = "GDP per capita, USD (log10 scale)",
       y = "Life expectancy (years)")
With all the layers in one place, we can see that we’ve coded all the basic steps, that is,
- assign the data frame
- map variables (column names) to aesthetic properties
- choose geoms
- adjust scales, labels, ordering, etc.
6.6.1 Exercise
The prepared data you downloaded earlier includes the file student_demogr.rds. The data in this file is a summary of the midfielddata student table with the number of students by race/ethnicity and sex, omitting “International” and “Other/Unknown” race values. Again, we read the data in using readRDS().
# Prepared data from the downloaded zip file
student_demogr <- readRDS("data/student_demogr.rds")
- Use the student_demogr data and reproduce the graph shown below.
- Use a log-base-2 scale.
- Omit the y-axis label by setting y = "" in the labs() argument.
Help pages for more information:
6.7 Mapping columns to aesthetics
Mappings in the aes() function of ggplot() can involve the names of variables (columns) only. So far, the only mappings we’ve used are from column names to an x or y aesthetic.
Another useful mapping is from a column name to the color aesthetic, which then separates the data by the values of the categorical variable selected and automatically creates the appropriate legend.
Here we map the continent column to the color aesthetic, adding a third data variable to the display.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_continuous(trans = "log10") +
  labs(x = "GDP per capita, USD (log10 scale)",
       y = "Life expectancy (years)")
6.7.1 Exercise
The prepared data you downloaded earlier includes the file hours_per_term.rds. Again, we read the data using readRDS().
hours_per_term <- readRDS("data/hours_per_term.rds")
- Use hours_per_term data to create a boxplot of hours per term as a function of level.
- Add a third column name to aes() to add sex by color to the graph.
- Swap the x, y mapping to obtain a horizontal boxplot.
- Check your work by comparing your result to the graph below.
Help pages for more information:
6.8 Layer: facets
In the earlier graph where we mapped continent to color, there was a lot of overprinting, making it difficult to compare the continents. Instead of using color to distinguish the continents, we can plot in different panels by continent.
The facet_wrap() layer separates the data into different panels (or facets). Like the aes() mapping, facet_wrap() is applied to a variable (column name) in the data frame.
life_gdp <- life_gdp +
  facet_wrap(facets = vars(continent))
# print the graph
life_gdp
Comparisons are facilitated by having the facets appear in one column, by using the ncol argument of facet_wrap().
life_gdp <- life_gdp +
  facet_wrap(facets = vars(continent), ncol = 1)
# print the graph
life_gdp
In a faceted display, all panels have identical scales (the default) to facilitate comparison. Again, all the layers could have been coded at once, for example,
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  facet_wrap(facets = vars(continent), ncol = 1) +
  geom_point() +
  scale_x_continuous(trans = "log10") +
  labs(x = "GDP per capita, USD (log10 scale)",
       y = "Life expectancy (years)")
6.8.1 Exercise
The prepared data you downloaded earlier includes the file stickiness.rds. Again, we read the data using readRDS().
stickiness <- readRDS("data/stickiness.rds")
- Use the stickiness data frame and plot stickiness (x-axis) against race/ethnicity/sex (y-axis), faceted by program.
- When that graph seems OK, add a third column name to aes() to add sex by color to the graph.
- Check your work by comparing your result to the graph below.
Help pages for more information:
6.9 Ordering factors
Earlier, we used reorder() to order life expectancy boxplots by the increasing median life expectancy per continent. In general, we use reorder() to order every categorical variable we intend to display graphically.
A factor is a special data structure in R for categorical variables. In a factor, the levels of the category (typically character strings) are known and fixed. However, factors are stored internally as integers, a critical design tool for meaningfully ordering the elements of a display involving categorical variables.
For example, reviewing again the gapminder data frame, we see that the first two columns are factors.
gapminder
#> country continent year lifeExp pop gdpPercap
#> <fctr> <fctr> <int> <num> <int> <num>
#> 1: Afghanistan Asia 1952 28.801 8425333 779.4453
#> 2: Afghanistan Asia 1957 30.332 9240934 820.8530
#> 3: Afghanistan Asia 1962 31.997 10267083 853.1007
#> 4: Afghanistan Asia 1967 34.020 11537966 836.1971
#> 5: Afghanistan Asia 1972 36.088 13079460 739.9811
#> ---
#> 1700: Zimbabwe Africa 1987 62.351 9216418 706.1573
#> 1701: Zimbabwe Africa 1992 60.377 10704340 693.4208
#> 1702: Zimbabwe Africa 1997 46.809 11404948 792.4500
#> 1703: Zimbabwe Africa 2002 39.989 11926563 672.0386
#> 1704: Zimbabwe Africa 2007 43.487 12311143 469.7093
A factor has at least two attributes, its class (all R objects have a class) and its levels.
# Examine a factor variable
x <- gapminder$continent
class(x)
#> [1] "factor"
levels(x)
#> [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
The levels are the character strings we see as data values in the data frame column. However, factors are stored in memory as type integer, as shown by the typeof() function.
# Factors are stored internally as integers
typeof(x)
#> [1] "integer"
If we unclass() the variable, we remove the class and reveal the hidden integers. For example, let’s first print out the first 64 values of continent,
# Data values are character strings
x[1:64]
#> [1] Asia Asia Asia Asia Asia Asia Asia Asia
#> [9] Asia Asia Asia Asia Europe Europe Europe Europe
#> [17] Europe Europe Europe Europe Europe Europe Europe Europe
#> [25] Africa Africa Africa Africa Africa Africa Africa Africa
#> [33] Africa Africa Africa Africa Africa Africa Africa Africa
#> [41] Africa Africa Africa Africa Africa Africa Africa Africa
#> [49] Americas Americas Americas Americas Americas Americas Americas Americas
#> [57] Americas Americas Americas Americas Oceania Oceania Oceania Oceania
#> Levels: Africa Americas Asia Europe Oceania
Then remove the class to reveal the hidden integers.
# But the levels are stored as integers
unclass(x[1:64])
#> [1] 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [39] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 5 5 5 5
#> attr(,"levels")
#> [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
When factors are graphed, the integers determine the increasing order in which the levels are plotted. When we use reorder(), we are shuffling the mapping of integers to levels.
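The effect is easiest to see on a toy example. The sketch below uses made-up values (grp and val are hypothetical names, not workshop data) and shows the integer codes before and after reordering by the median.

```r
# Hypothetical data: three groups with two values each
grp <- factor(c("a", "b", "c", "a", "b", "c"))
val <- c(5, 1, 3, 7, 2, 4)

# Group medians: a = 6, b = 1.5, c = 3.5
tapply(val, grp, median)

# Default (alphabetical) levels and their integer codes
levels(grp)
#> [1] "a" "b" "c"
as.integer(grp)
#> [1] 1 2 3 1 2 3

# After reordering by median, "b" becomes level 1 and "a" level 3
grp2 <- reorder(grp, val, median)
levels(grp2)
#> [1] "b" "c" "a"
as.integer(grp2)
#> [1] 3 1 2 3 1 2

# The data values themselves are unchanged
identical(as.character(grp), as.character(grp2))
#> [1] TRUE
```

Only the level-to-integer mapping changes, so any graph that plots the factor in increasing integer order now shows the groups in order of their medians.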
We can order the levels of continent by the median life expectancy per continent using,
x <- gapminder$continent
y <- gapminder$lifeExp
z <- reorder(x, y, median)
Examining the reordered factor z, notice the change in order of the levels in the printout. The levels are no longer in alphabetical order: the Americas and Asia have swapped positions. Asia is now mapped to the integer 2 where before it was mapped to 3.
levels(z)
#> [1] "Africa" "Asia" "Americas" "Europe" "Oceania"
z[1:64]
#> [1] Asia Asia Asia Asia Asia Asia Asia Asia
#> [9] Asia Asia Asia Asia Europe Europe Europe Europe
#> [17] Europe Europe Europe Europe Europe Europe Europe Europe
#> [25] Africa Africa Africa Africa Africa Africa Africa Africa
#> [33] Africa Africa Africa Africa Africa Africa Africa Africa
#> [41] Africa Africa Africa Africa Africa Africa Africa Africa
#> [49] Americas Americas Americas Americas Americas Americas Americas Americas
#> [57] Americas Americas Americas Americas Oceania Oceania Oceania Oceania
#> Levels: Africa Asia Americas Europe Oceania
unclass(z[1:64])
#> [1] 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [39] 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 5 5 5 5
#> attr(,"levels")
#> [1] "Africa" "Asia" "Americas" "Europe" "Oceania"
The purpose of this detailed examination is to lay the foundation for using reorder() to condition categorical variables for graphing. One of the most common applications is to order the panels and rows of a multi-facet graph. Panels (facets) by default are nearly always ordered alphabetically. In most cases, ordering the panels by the data improves the display.
In the next example, we again order the continents by the median life expectancy per continent, but here the ordering affects the order of the panels.
# Create a new memory location
dframe <- copy(gapminder)

# Order the levels of the factor
dframe[, continent := reorder(continent, lifeExp, median)]
We graph using much the same code chunk as before with one addition. We add the as.table = FALSE argument to the facet_wrap() function. “Table-order” of panels is increasing from top to bottom; what we want is “graph-order,” which increases (like a graph scale) from bottom to top.
ggplot(dframe, aes(x = gdpPercap, y = lifeExp)) +
  facet_wrap(facets = vars(continent), ncol = 1, as.table = FALSE) +
  geom_point() +
  scale_x_continuous(trans = "log10") +
  labs(x = "GDP per capita, USD (log10 scale)",
       y = "Life expectancy (years)")
6.9.1 Exercise
Continue using the stickiness data frame from the previous section.
- The race_sex and program columns are factors. Order both factors by the stickiness variable.
- Use as.table to impose “graph-order” on the panels.
- Check your work by comparing your result to the graph below.
Help pages for more information:
6.10 Next steps
That concludes our brief introduction to graph basics using ggplot2. To continue, select your preferred next step in your progression.