6 Graph basics

Decline by Randall Munroe (xkcd.com) is licensed under CC BY-NC 2.5


This tutorial is an introduction to ggplot2 adapted from Chapter 3 of (Healy, 2019b). If you already have R experience, you might still want to browse this section in case you find something new.

If the prerequisites have been met, the tutorial should take no longer than 50 minutes.

6.1 Prerequisites

Every tutorial assumes that

  • You completed the Getting started instructions
  • You launched midfield_institute.Rproj to start the R session

Use File > New File > R Script to create a new R script

  • Name the script 02-graph-basics.R
  • Save it in the scripts directory
  • Add a minimal header at the top of the script (if you wish)
  • Use library() to load the packages we will use
# workshop graph basics 
# name 
# date 

library("tidyverse")
library("gapminder")

Run the script by clicking the Source button. If you see an error like this one,

Error in library("pkg_name") : there is no package called 'pkg_name'

then you should install the missing package(s) and run the script again.

[To review how to install an R package]
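
As a hedged sketch of that install step, the helper below installs only the packages that are not yet present. The function name install_if_missing() is hypothetical (my own illustration, not part of the workshop instructions); it relies only on base R's requireNamespace() and install.packages().

```r
# hypothetical helper: install any packages in pkgs that are not yet installed
install_if_missing <- function(pkgs) {
        # TRUE for each package that is already installed
        installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
        missing_pkgs <- pkgs[!installed]
        if (length(missing_pkgs) > 0) {
                install.packages(missing_pkgs) # needs an internet connection
        }
        # return the names that had to be installed, without printing them
        invisible(missing_pkgs)
}
```

Call it once from the console, for example install_if_missing(c("tidyverse", "gapminder")), then Source the script again.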

Guidelines

  • As you work through the tutorial, type a line or chunk of code then File > Save and Source.
  • Confirm that your result matches the tutorial result.
  • Your turn exercises give you a chance to devise your own examples and check them out. You learn by doing (but you knew that already)!




6.2 Tidy data

If the data set is “tidy”, then every row is an observation and every column is a variable. The gapminder data frame is tidy. We use glimpse() to get a look at the structure.

glimpse(gapminder)
#> Observations: 1,704
#> Variables: 6
#> $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
#> $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
#> $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
#> $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

And we can just type its name to see the first few rows,

gapminder
#> # A tibble: 1,704 x 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # ... with 1,694 more rows

Read more about tidy data in (Wickham and Grolemund, 2017).
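
To make the definition concrete, here is a small sketch with made-up numbers (the data frame wide and its values are invented for illustration; they are not gapminder data). In the wide layout the years are spread across the column names, so a row is not a single observation. pivot_longer() from tidyr reshapes it so that each row is one country-year observation.

```r
library("tidyr")

# not tidy: the year variable is hidden in the column names (toy values)
wide <- data.frame(
        country = c("A", "B"),
        `1952`  = c(30.1, 45.2),
        `1957`  = c(31.5, 46.8),
        check.names = FALSE
)

# tidy: one row per country-year observation, one column per variable
tidy <- pivot_longer(
        wide,
        cols      = c("1952", "1957"),
        names_to  = "year",
        values_to = "lifeExp"
)

tidy
```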


Your turn. The ggplot2 package includes a dataset called mpg.

  • Use glimpse() to examine the data set.
  • How many variables? How many observations?
  • How many of the variables are numeric? How many are character type?
  • Is the data set tidy?

Check your work. There are 234 observations and 11 variables.




6.3 Anatomy of a graph

ggplot() is our basic plotting function. The data = ... argument assigns the data frame.

p <- ggplot(data = gapminder)

Next we use the mapping argument mapping = aes(...) to assign variables (column names) from the data frame to specific aesthetic properties of the graph such as the x-coordinate, the y-coordinate, color, fill, etc.

Here we will map the GDP per capita variable to x and the life expectancy variable to y.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

If we try to print the graph by typing the name of the graph object (everything in R is an object), we get an empty graph because we haven’t told ggplot what sort of a graph we want.

p

Because the graph will be a scatterplot, we add the geom_point() layer.

p <- p + geom_point()

p # display the graph

In ggplot2, “geoms” are geometric objects such as points, lines, bars, boxplots, contours, polygons, etc. You can browse the full list on the ggplot2 geom reference page.

We could also have simply added the layer to the original object,

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
        geom_point()

p # display the graph

Notice that the default axis labels are the variable names from the data frame. We can edit those with another layer,

p <- p + labs(x = "GDP per capita", y = "Life expectancy (years)")

p # display the graph

Or, with all layers shown in one code chunk,

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
        geom_point() +
        labs(x = "GDP per capita", y = "Life expectancy (years)")

p # display the graph

Summary. The basic steps for building up the layers of any graph are:

  • assign the data frame
  • map variables (column names) to aesthetic properties
  • choose geoms
  • adjust scales, labels, etc.



Your turn.

  • In the console, type ? mpg to see the data set help page. Skim the descriptions of the variables.
  • Create a scatterplot of highway miles per gallon as a function of engine displacement in liters.

Check your work:
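
One possible solution is sketched below. It assumes the mpg columns displ (engine displacement in liters) and hwy (highway miles per gallon) described on the help page; the object name p_mpg is my own.

```r
library("ggplot2")

# engine displacement on x, highway mileage on y
p_mpg <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
        geom_point()

p_mpg # display the graph
```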




6.4 Layer: smooth fit

Suppose you wanted a smooth fit curve, not necessarily linear. Add a geom_smooth() layer. Loess (pronounced like the proper name Lois) is a nonparametric curve-fitting method based on local regression.

# add the layer for display only; p itself is unchanged, so the fits do not pile up
p + geom_smooth(method = "loess", se = FALSE)

The se argument controls whether or not the confidence interval is displayed. Setting se = TRUE yields,

p + geom_smooth(method = "loess", se = TRUE)

For a linear-fit layer, we add a layer with method set to lm (short for linear model). The linear fit is not particularly good in this case, but now you know how to do one.

p + geom_smooth(method = "lm", se = TRUE)



Your turn. Continue to practice with mpg.

  • Add a loess curve fit with a confidence interval.

Check your work:
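
A possible solution, again assuming the displ and hwy columns of mpg and using my own object name p_mpg:

```r
library("ggplot2")

p_mpg <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
        geom_point() +
        geom_smooth(method = "loess", se = TRUE) # se = TRUE draws the confidence band

p_mpg # display the graph
```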




6.5 Layer: log scale

We have orders of magnitude differences in the GDP per capita variable. To confirm, we can create a summary() of the gdpPercap variable. The output shows that the minimum is 241, the median 3532, and the maximum 113523.

# extract one variable from the data frame 
this_variable <- gapminder["gdpPercap"]

# statistical summary of one variable 
summary(this_variable)
#>    gdpPercap       
#>  Min.   :   241.2  
#>  1st Qu.:  1202.1  
#>  Median :  3531.8  
#>  Mean   :  7215.3  
#>  3rd Qu.:  9325.5  
#>  Max.   :113523.1

The bracket notation I just used, gapminder["gdpPercap"], is one way to extract a variable from a data frame; it returns a one-column data frame rather than a plain vector.
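
Base R offers a few related extraction forms, sketched here with object names of my own. Single brackets keep the data frame structure, while double brackets and the dollar sign return the column itself as a vector.

```r
library("gapminder")

one_column  <- gapminder["gdpPercap"]   # single brackets: a one-column data frame
as_vector   <- gapminder[["gdpPercap"]] # double brackets: a numeric vector
also_vector <- gapminder$gdpPercap      # dollar sign: the same numeric vector

summary(as_vector)
```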

In exploring a graph like this, it might be useful to add a layer that changes the horizontal scale to a log-base-10 scale.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
        geom_point() +
        scale_x_log10() 

p # display the graph

The scales package allows us to change the GDP scale to dollars. Using the syntax thepackage::thefunction we can use the scales::dollar function without loading the scales package.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
        geom_point() +
        scale_x_log10(labels = scales::dollar)

p # display the graph
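
To see what the labels argument is doing, you can call scales::dollar() directly on a few numbers. This is a small sketch; the vector of values is my own example, not the breaks ggplot2 actually chooses.

```r
# format plain numbers as dollar labels, the way the axis labels are formatted
example_breaks <- c(1000, 10000)
example_labels <- scales::dollar(example_breaks)

example_labels
```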

In this case, a linear fit might work,

p <- p + geom_smooth(method = "lm", se = TRUE)

p # display the graph

Update the axis labels,

p <- p + labs(x = "GDP per capita (log10 scale)", y = "Life expectancy (years)")

p # display the graph

In summary, all the layers could have been coded at once, for example,

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
        geom_point() +
        geom_smooth(method = "lm", se = TRUE) +
        scale_x_log10(labels = scales::dollar) +
        labs(x = "GDP per capita (log10 scale)", y = "Life expectancy (years)")

With all the layers in one place, we can see that we’ve coded all the basic steps, that is,

  • assign the data frame
  • map variables (column names) to aesthetic properties
  • choose geoms
  • adjust scales, labels, etc.



Your turn. Continue to practice with mpg.

  • Edit the axis labels to include units.

Check your work:




6.6 Mapping aesthetics

So far, we have mapped variables only to the x-coordinate and y-coordinate aesthetics.

If we map a variable to the color aesthetic, the data symbols are automatically assigned different colors and a legend is created. In this example, we map the continent variable to color.

p <- ggplot(data = gapminder, 
                mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
        geom_point() +
        scale_x_log10(labels = scales::dollar) + 
        labs(x = "GDP per capita", y = "Life expectancy (years)")

p # print the graph


Your turn. Continue to practice with mpg.

  • Map vehicle class to color
  • Change the curve fit to linear

Check your work:




6.7 Setting properties

Because the colors overprint, we might try making the data symbols slightly transparent. In this case, we are not mapping a property to a variable; instead, we want all data symbols to be less opaque.

The alpha argument, with \(0 \leq \alpha \leq 1\), sets the transparency level. Because this change applies to all data points equally, we assign it in the geom, not aes().

p <- ggplot(data = gapminder, 
                mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
        geom_point(alpha = 0.3) +
        scale_x_log10(labels = scales::dollar) + 
        labs(x = "GDP per capita", y = "Life expectancy (years)")

p # print the graph

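If we add a linear fit to these data, a fit for each continent is generated. For a thinner line, I’ve added a size argument to the geom. (In ggplot2 3.4.0 and later, line widths are set with linewidth instead; size still works for lines but triggers a deprecation warning.)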

p <- p + geom_smooth(method = "lm", se = FALSE, size = 0.5)

p # print the graph

If we want all the data markers to share a single color other than the default, we don’t map color to a variable; we set it in the geom.

Here, I’ve omitted the aesthetic mapping to color and used a color assignment in the geom.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
        geom_point(alpha = 0.3, color = "purple1") +
        geom_smooth(method = "lm", se = FALSE, size = 0.5, color = "purple4") + 
        scale_x_log10(labels = scales::dollar) + 
        labs(x = "GDP per capita (log10 scale)", y = "Life expectancy (years)")

p # print the graph
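
A common slip is to put a constant inside aes(). The sketch below (the object names base, p_set, and p_map are mine) contrasts the two cases: setting the color in the geom paints every point purple1, while mapping the string "purple1" inside aes() treats it as a one-level categorical variable, so ggplot2 picks a color from its default palette and adds a legend.

```r
library("ggplot2")
library("gapminder")

base <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

# setting: a fixed color applied to all points, no legend
p_set <- base + geom_point(color = "purple1")

# mapping (usually a mistake for constants): "purple1" becomes a category label
p_map <- base + geom_point(mapping = aes(color = "purple1"))

p_set # display the graph
p_map # display the graph
```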


6.8 Layer: facets

In the earlier graph where we mapped continent to color, there was a lot of overprinting, making it difficult to compare the continents.

The facet_wrap() layer separates the data into different panels (or facets). Like the aes() mapping, facet_wrap() is applied to a variable (column name) in the data frame.

p <- p + facet_wrap(facets = vars(continent))

p # print the graph

Comparisons are easier when the facets appear in one column, which we get with the ncol argument of facet_wrap().

p <- p + facet_wrap(facets = vars(continent), ncol = 1)

p # print the graph

In a faceted display, all panels have identical scales (the default) to facilitate comparison.

Again, all the layers could have been coded at once, for example,

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
        geom_point(alpha = 0.3, color = "purple1") +
        geom_smooth(method = "lm", se = FALSE, size = 0.5, color = "purple4") + 
        facet_wrap(facets = vars(continent), ncol = 1) +
        scale_x_log10(labels = scales::dollar) + 
        labs(x = "GDP per capita (log10 scale)", y = "Life expectancy (years)") 



Your turn. Continue to practice with mpg.

  • Map drive type to color
  • Facet on vehicle class
  • Add some transparency to the data symbols
  • Omit the smooth fit

Check your work:




6.9 Ordering the panels

The default ordering of the panels in this example is alphabetical. In most cases, ordering the panels by the data (often the mean or the median) improves the display.

Here we have two quantitative variables, but the one that is the more interesting is life expectancy. Our goal then is to order the continent variable by the median of the lifeExp variable in each panel.

To do that, we require continent to be a factor, a type of variable specialized for creating ordered levels of a category. Using glimpse() we see that continent is already a factor (<fct>).

glimpse(gapminder)
#> Observations: 1,704
#> Variables: 6
#> $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
#> $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
#> $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
#> $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

Therefore all we have to do is tell R that we want the levels of continent ordered by the median of life expectancy using the fct_reorder() function.

gapminder <- gapminder %>%
        mutate(continent = fct_reorder(continent, lifeExp, median))

In doing so, I’ve overwritten the original gapminder dataset with my revised version.
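
To check what fct_reorder() did, it can help to compare the factor levels before and after. The sketch below works on a fresh copy taken from the package (the names original and reordered are mine), so it gives the same result whether or not you have already overwritten gapminder.

```r
library("forcats")

original <- gapminder::gapminder # untouched copy from the package

levels(original$continent) # default order: alphabetical

reordered <- fct_reorder(original$continent, original$lifeExp, .fun = median)
levels(reordered) # ordered from lowest to highest median life expectancy

# medians by continent, for comparison with the new level order
sort(tapply(original$lifeExp, original$continent, median))
```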

We set the as.table argument to FALSE so that, in this one-column layout, the panel with the highest median life expectancy is placed at the top.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
        geom_point(alpha = 0.3, color = "purple1") +
        geom_smooth(method = "lm", se = FALSE, size = 0.5, color = "purple4") + 
        facet_wrap(facets = vars(continent), ncol = 1, as.table = FALSE) + 
        scale_x_log10(labels = scales::dollar) + 
        labs(x = "GDP per capita (log10 scale)", y = "Life expectancy (years)") 



Your turn. Continue to practice with mpg.

  • Convert class to a factor ordered by the mean highway mileage
  • Same graph as before, but order the panels by mean fuel consumption

Check your work:




6.10 Beyond the basics

This section demonstrates how the basics can be built upon to create a more complex data graphic.

To wrap up this introduction, I’ll show you how we can use functions in various layers to show all the data in every panel, add a common overall loess smooth fit, and highlight the continent data in each panel, making it easier to compare each continent to the global data.

Because life expectancy has generally increased over time, I’m going to restrict this final graph to 2007, the most recent year in this dataset.

Typing this code in your script is optional. Without further explanation, here’s the code.

gapminder <- gapminder %>%
        filter(year == 2007)

ggplot(data = gapminder, mapping = aes(x  = gdpPercap / 1000, y = lifeExp)) +
        geom_point(data = select(gapminder, -continent),
                size  = 1.25,
                alpha = 0.5, 
                color = "#80cdc1") +
        geom_smooth(data = select(gapminder, -continent),
                method = "loess",
                se    = FALSE,
                size  = 0.7,
                color = "#80cdc1") +
        geom_point(mapping = aes(color = continent),
                size  = 1.25,
                color = "#01665e") +
        facet_wrap(vars(continent),
                ncol = 1,
                as.table = FALSE) +
        labs(x = "GDP per capita (thousands of dollars)",
                y = "Life expectancy (years)",
                title  = "Life expectancy by country, 2007",
                caption = "Source: Gapminder") +
        theme(legend.position = "none")
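
The key trick in that chunk is select(gapminder, -continent). A layer whose data lacks the faceting variable is drawn in every panel, which is how the pale global points appear behind each continent. Here is a pared-down sketch of just that idea, with object names (gm2007, background, p_bg) of my own choosing:

```r
library("dplyr")
library("ggplot2")

gm2007 <- filter(gapminder::gapminder, year == 2007)

# same rows, but without the faceting variable
background <- select(gm2007, -continent)

p_bg <- ggplot(data = gm2007, mapping = aes(x = gdpPercap / 1000, y = lifeExp)) +
        geom_point(data = background, color = "#80cdc1") + # repeated in every panel
        geom_point(color = "#01665e") +                    # only the panel's own continent
        facet_wrap(vars(continent), ncol = 1)

p_bg # display the graph
```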


6.11 Resize and write to file

For consistent control over the size and aspect ratio of your publication-ready graph, you should always conclude your design by saving the image and sizing it at the same time.

Here, we save the figure to the figures directory we set up earlier.

ggsave(filename = "figures/02-graph-basics-gapminder.png", 
        width  = 6.5,
        height = 10.5,
        units  = "in", 
        dpi    = 300)
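
By default ggsave() writes the last plot that was displayed. If you prefer to be explicit, you can pass a stored plot through the plot argument. The sketch below uses the mpg data and a temporary directory as a stand-in for the figures directory; the names p_mpg and out_file, the file name, and the figure size are my own choices.

```r
library("ggplot2")

p_mpg <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
        geom_point()

# stand-in location; in the workshop project this would be the figures directory
out_file <- file.path(tempdir(), "example-plot.png")

ggsave(filename = out_file,
        plot   = p_mpg, # save this object rather than the last plot displayed
        width  = 6.5,
        height = 4,
        units  = "in",
        dpi    = 300)
```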

And the final figure looks like this:

[Figure: figures/02-graph-basics-gapminder.png]


Your turn. Continue to practice with mpg.

  • Write your ggsave() code chunk immediately following the ggplot() code chunk of the graph you want to save.
  • Use ggsave() to write your graph to the figures directory with the name 02-graph-basics-mpg.png
  • Try a 6 in by 6 in figure size

Check your work: Navigate to your figures folder. The new png file should be there. Open it to confirm it is the figure you expect.

Feel free to take a break before starting the next tutorial.


