R data basics

Using data.table to manipulate data in R.

License. This material is adapted from Getting started in R: Tinyverse edition by Bashir and Eddelbuettel (2018) which was licensed under CC BY-SA by ilustat. This adaptation and extension, R data basics by Richard Layton, is licensed under CC BY-SA 2.0.

Preface

The rationale for manipulating data using the data.table package instead of base R or the dplyr package include speed, memory efficiency, and concise transforms via the []-indexing operator (e.g., the “by” grouping notation).

To mitigate the extra cognitive load for new R users of learning data.table syntax while also learning R, I use a “no-frills” approach as much as possible.

Conceptually, data.table operations can be viewed as

    dt[i, j, by]

with some intentional similarity to SQL. Here i operates on rows, j operates on columns, and by is the grouping operator.

Before starting. Our tutorials assume that you

  • Have completed the Before you arrive instructions
  • Have completed the R basics guide
  • Have completed the R chart basics guide
  • Start your R session by launching the RStudio project you created, e.g., midfield-institute-2022.Rproj

If you are in an RStudio project, the project name appears in the upper left corner of the RStudio window. Your project directory (folder) should look something like this:

midfield-institute-2022\
    data\
    results\
    scripts\
    midfield-institute-2022.Rproj

Getting started

Start a new script. I recommend that you start a new script for this segment.

  • Use File > New File > R Script to create a new R script
  • File > Save As… to name the file (we suggest 03-data-basics.R), then save it to the scripts directory
  • At the top of the script, add a minimal header, something like:
# R data basics
# your name  
# date 
  • Use library() to load packages used in the script.
# silently load packages
suppressPackageStartupMessages(library("data.table"))

# set data.table printout options
options(
  datatable.print.nrows = 20,
  datatable.print.topn = 5,
  datatable.print.class = TRUE
)

Reminder: In a code chunk like the one above, you can click on the “Copy to clipboard” icon in the upper right corner to enable quick copy and paste from this document to your script.

Use your script throughout the tutorial. When a new chunk of code is given,

  • Copy the line(s) of code into your script, save, and run.
  • Check your result by comparing it to the result in the tutorial.
  • Check what you’ve learned using the Your turn exercises.

Import data

Read the same data file we used in the previous tutorial. The data set is in block-record form, that is, variables are columns and observations are rows.

# import data
cw <- fread("data/chick_weight.csv")

# view
cw
#>      Chick  Diet  Time weight
#>      <int> <int> <int>  <int>
#>   1:    18     1     0     39
#>   2:    18     1     2     35
#>   3:    16     1     0     41
#>   4:    16     1     2     45
#>   5:    16     1     4     49
#>  ---                         
#> 574:    48     4    14    170
#> 575:    48     4    16    222
#> 576:    48     4    18    261
#> 577:    48     4    20    303
#> 578:    48     4    21    322

Changes in place

Rename variables. Edit column names “in place.”

# change column names in place
setnames(cw, 
         old = c("Chick", "Diet", "Time", "weight"), 
         new = c("chick_id", "group", "day", "weight_g"))

cw
#>      chick_id group   day weight_g
#>         <int> <int> <int>    <int>
#>   1:       18     1     0       39
#>   2:       18     1     2       35
#>   3:       16     1     0       41
#>   4:       16     1     2       45
#>   5:       16     1     4       49
#>  ---                              
#> 574:       48     4    14      170
#> 575:       48     4    16      222
#> 576:       48     4    18      261
#> 577:       48     4    20      303
#> 578:       48     4    21      322

By in place we mean that the change is made using data.table reference semantics (or by reference) in which changes are made without assigning the result back to a name, that is, we wrote

    setnames(cw, old, new)

not

    cw <- setnames(cw, old, new)

There’s much more to reference semantics than dropping the assignment operator—it’s really about speed and memory efficiency with large data sets. However, for our “basics” series, you need to recall only that the := operator and all data.table functions with names starting with set work by reference.

Reorder variables. Another example of an operation in-place. The columns listed are moved to the left-hand side of the data frame and other columns appear to the right in their original order.

# reorder column positions in place
setcolorder(cw, c("day", "group"))

# view
cw
#>        day group chick_id weight_g
#>      <int> <int>    <int>    <int>
#>   1:     0     1       18       39
#>   2:     2     1       18       35
#>   3:     0     1       16       41
#>   4:     2     1       16       45
#>   5:     4     1       16       49
#>  ---                              
#> 574:    14     4       48      170
#> 575:    16     4       48      222
#> 576:    18     4       48      261
#> 577:    20     4       48      303
#> 578:    21     4       48      322

If you want to make a change without affecting the source data frame, assign a copy of the data frame to a new name with the copy() function. Assignment to a new name is not enough; you have to deliberately use the copy() function.

# create an new object in memory
x <- copy(cw)

# edit in place
setcolorder(x, c("group", "day"))

# view
x
#>      group   day chick_id weight_g
#>      <int> <int>    <int>    <int>
#>   1:     1     0       18       39
#>   2:     1     2       18       35
#>   3:     1     0       16       41
#>   4:     1     2       16       45
#>   5:     1     4       16       49
#>  ---                              
#> 574:     4    14       48      170
#> 575:     4    16       48      222
#> 576:     4    18       48      261
#> 577:     4    20       48      303
#> 578:     4    21       48      322

# source remains unchanged
cw
#>        day group chick_id weight_g
#>      <int> <int>    <int>    <int>
#>   1:     0     1       18       39
#>   2:     2     1       18       35
#>   3:     0     1       16       41
#>   4:     2     1       16       45
#>   5:     4     1       16       49
#>  ---                              
#> 574:    14     4       48      170
#> 575:    16     4       48      222
#> 576:    18     4       48      261
#> 577:    20     4       48      303
#> 578:    21     4       48      322

Column operations

j to transform a variable. Use := to transform column values in place, here, converting an integer to a double-precision number.

# modify values in an existing column
cw[, weight_g := as.numeric(weight_g)]

# view
cw
#>        day group chick_id weight_g
#>      <int> <int>    <int>    <num>
#>   1:     0     1       18       39
#>   2:     2     1       18       35
#>   3:     0     1       16       41
#>   4:     2     1       16       45
#>   5:     4     1       16       49
#>  ---                              
#> 574:    14     4       48      170
#> 575:    16     4       48      222
#> 576:    18     4       48      261
#> 577:    20     4       48      303
#> 578:    21     4       48      322

Here we use paste0() to add a prefix to the group number, converting the variable to a character type at the same time.

# modify values in an existing column
cw[, group := paste0("diet ", group)]

# view
cw
#>        day  group chick_id weight_g
#>      <int> <char>    <int>    <num>
#>   1:     0 diet 1       18       39
#>   2:     2 diet 1       18       35
#>   3:     0 diet 1       16       41
#>   4:     2 diet 1       16       45
#>   5:     4 diet 1       16       49
#>  ---                               
#> 574:    14 diet 4       48      170
#> 575:    16 diet 4       48      222
#> 576:    18 diet 4       48      261
#> 577:    20 diet 4       48      303
#> 578:    21 diet 4       48      322

j to select variables. Variables not listed are dropped.

# select columns 
x <- cw[, .(day, group, weight_g)]

# view
x
#>        day  group weight_g
#>      <int> <char>    <num>
#>   1:     0 diet 1       39
#>   2:     2 diet 1       35
#>   3:     0 diet 1       41
#>   4:     2 diet 1       45
#>   5:     4 diet 1       49
#>  ---                      
#> 574:    14 diet 4      170
#> 575:    16 diet 4      222
#> 576:    18 diet 4      261
#> 577:    20 diet 4      303
#> 578:    21 diet 4      322

j to add a variable. The new variable is added to the right of the other columns.

# add a column
cw[, weight_kg := weight_g / 1000]

# view
cw
#>        day  group chick_id weight_g weight_kg
#>      <int> <char>    <int>    <num>     <num>
#>   1:     0 diet 1       18       39     0.039
#>   2:     2 diet 1       18       35     0.035
#>   3:     0 diet 1       16       41     0.041
#>   4:     2 diet 1       16       45     0.045
#>   5:     4 diet 1       16       49     0.049
#>  ---                                         
#> 574:    14 diet 4       48      170     0.170
#> 575:    16 diet 4       48      222     0.222
#> 576:    18 diet 4       48      261     0.261
#> 577:    20 diet 4       48      303     0.303
#> 578:    21 diet 4       48      322     0.322

j to delete a variable.

# delete a column
cw[, weight_kg := NULL]

# view
cw
#>        day  group chick_id weight_g
#>      <int> <char>    <int>    <num>
#>   1:     0 diet 1       18       39
#>   2:     2 diet 1       18       35
#>   3:     0 diet 1       16       41
#>   4:     2 diet 1       16       45
#>   5:     4 diet 1       16       49
#>  ---                               
#> 574:    14 diet 4       48      170
#> 575:    16 diet 4       48      222
#> 576:    18 diet 4       48      261
#> 577:    20 diet 4       48      303
#> 578:    21 diet 4       48      322

Row operations

i to filter by values.

# subset by observation
y <- cw[day == 21 & weight_g > 300]

# view
y
#>      day  group chick_id weight_g
#>    <int> <char>    <int>    <num>
#> 1:    21 diet 1        7      305
#> 2:    21 diet 2       29      309
#> 3:    21 diet 2       21      331
#> 4:    21 diet 3       32      305
#> 5:    21 diet 3       40      321
#> 6:    21 diet 3       34      341
#> 7:    21 diet 3       35      373
#> 8:    21 diet 4       48      322

For comparing values in vectors use: < (less than), > (greater than), <= (less than or equal to), >= (greater than or equal to), == (equal to), and != (not equal to). These can be combined logically using & (and) and | (or).

i to filter by index.

# subset by position
y <- cw[1:10]

# view
y
#>       day  group chick_id weight_g
#>     <int> <char>    <int>    <num>
#>  1:     0 diet 1       18       39
#>  2:     2 diet 1       18       35
#>  3:     0 diet 1       16       41
#>  4:     2 diet 1       16       45
#>  5:     4 diet 1       16       49
#>  6:     6 diet 1       16       51
#>  7:     8 diet 1       16       57
#>  8:    10 diet 1       16       51
#>  9:    12 diet 1       16       54
#> 10:     0 diet 1       15       41

i to arrange rows.

# order rows by values in columns
y <- cw[order(weight_g, day)]

# view
y
#>        day  group chick_id weight_g
#>      <int> <char>    <int>    <num>
#>   1:     2 diet 1       18       35
#>   2:     0 diet 1       18       39
#>   3:     0 diet 2       27       39
#>   4:     0 diet 2       28       39
#>   5:     0 diet 2       29       39
#>  ---                               
#> 574:    21 diet 2       21      331
#> 575:    18 diet 3       35      332
#> 576:    21 diet 3       34      341
#> 577:    20 diet 3       35      361
#> 578:    21 diet 3       35      373

Group and summarize

by to count. .N is a data.table function that yields the number of observations (rows) in a group identified with the by argument.

# count number of chicks by diet group
y <- cw[, .(N_chicks = .N), by = c("group")]

# view
y
#>     group N_chicks
#>    <char>    <int>
#> 1: diet 1      220
#> 2: diet 2      120
#> 3: diet 3      120
#> 4: diet 4      118

by to summarize. Yields three new variables for count, mean, and standard deviation by the two grouping variables group and day.

# compute summary statistics grouped by two variables
y <- cw[,                                         # i (empty)
        .(N_chicks = .N,                          # j
          mean_wt_g = round(mean(weight_g), 1),   # J (cont.)
          std_dev_wt_g = round(sd(weight_g), 1)), # j (cont.)
        by = c("group", "day")]                   # by

# view
y
#>      group   day N_chicks mean_wt_g std_dev_wt_g
#>     <char> <int>    <int>     <num>        <num>
#>  1: diet 1     0       20      41.4          1.0
#>  2: diet 1     2       20      47.2          4.3
#>  3: diet 1     4       19      56.5          4.1
#>  4: diet 1     6       19      66.8          7.8
#>  5: diet 1     8       19      79.7         13.8
#> ---                                             
#> 44: diet 4    14       10     161.8         15.7
#> 45: diet 4    16       10     182.0         25.3
#> 46: diet 4    18       10     202.9         33.6
#> 47: diet 4    20        9     233.9         37.6
#> 48: diet 4    21        9     238.6         43.3

Combine operations

[i, j, by] in combination. In this example, we use i to filter for two specific days, j to create new summary variables grouped by variables group and day. The results are the count N plus weight statistics: mean, standard deviation, median, minimum, and maximum.

# grouped summaries of specific observations
cw_stats <- cw[day %in% c(0, 21),                         # i
               .(N = .N,                                  # j
                 mean_wt_g    = round(mean(weight_g), 0), # j (cont.)
                 std_dev_wt_g = round(sd(weight_g), 1),   # j (cont.)
                 median_wt_g  = median(weight_g),         # j (cont.)
                 min_wt_g     = min(weight_g),            # j (cont.)
                 max_wt_g     = max(weight_g)),           # j (cont.)
               by = .(group, day)]                        # by

# view
cw_stats
#>     group   day     N mean_wt_g std_dev_wt_g median_wt_g min_wt_g max_wt_g
#>    <char> <int> <int>     <num>        <num>       <num>    <num>    <num>
#> 1: diet 1     0    20        41          1.0        41.0       39       43
#> 2: diet 1    21    16       178         58.7       166.0       96      305
#> 3: diet 2     0    10        41          1.5        40.5       39       43
#> 4: diet 2    21    10       215         78.1       212.5       74      331
#> 5: diet 3     0    10        41          1.0        41.0       39       42
#> 6: diet 3    21    10       270         71.6       281.0      147      373
#> 7: diet 4     0    10        41          1.1        41.0       39       42
#> 8: diet 4    21     9       239         43.3       237.0      196      322

Create a table

Finally, we can make the summaries “prettier” for a possible report or publication where we format the numeric values as text and edit the column names for readability just before printing.

# combine mean and std dev into a single character column
cw_stats[, mean_sd := paste0(mean_wt_g, " (", std_dev_wt_g, ")")]

# combine min and max into a range column
cw_stats[, range := paste0(min_wt_g, "--", max_wt_g)]

# retain specific columns
cw_stats <- cw_stats[ , .(group, day, N, mean_sd, median_wt_g, range)]

# order the rows
cw_stats <- cw_stats[order(group, day)]

# edit the column names for readability 
setnames(cw_stats,
         old = c("group", "day", "median_wt_g","mean_sd", "range"),
         new = c("Group", "Day", "Median", "Mean [SD]", "Range"))

# view
cw_stats
#>     Group   Day     N  Mean [SD] Median    Range
#>    <char> <int> <int>     <char>  <num>   <char>
#> 1: diet 1     0    20     41 (1)   41.0   39--43
#> 2: diet 1    21    16 178 (58.7)  166.0  96--305
#> 3: diet 2     0    10   41 (1.5)   40.5   39--43
#> 4: diet 2    21    10 215 (78.1)  212.5  74--331
#> 5: diet 3     0    10     41 (1)   41.0   39--42
#> 6: diet 3    21    10 270 (71.6)  281.0 147--373
#> 7: diet 4     0    10   41 (1.1)   41.0   39--42
#> 8: diet 4    21     9 239 (43.3)  237.0 196--322

Eventually you should be able to produce a publication-ready version such as the following table, using Rmarkdown and the knitr package. Full details are available in other sources.

Table 1. Chick weight summary statistics (grams).
Group Day N Mean [SD] Median Range
diet 1 0 20 41 (1) 41.0 39–43
diet 1 21 16 178 (58.7) 166.0 96–305
diet 2 0 10 41 (1.5) 40.5 39–43
diet 2 21 10 215 (78.1) 212.5 74–331
diet 3 0 10 41 (1) 41.0 39–42
diet 3 21 10 270 (71.6) 281.0 147–373
diet 4 0 10 41 (1.1) 41.0 39–42
diet 4 21 9 239 (43.3) 237.0 196–322

Interpretation. This summary table offers similar results to those revealed in the charts,

  • in day 21, diet 3 has the highest mean and median weights
  • in day 21, diet 4 has the least variability in weight

In addition, we note that by day 21, diet 1 lost 4 chicks of 20 that started and diet 4 lost 1 of 10. This could be a sign of unsafe conditions.

Conclusion

The R basics, R chart basics, and R data basics guides introduced you to some of the basic concepts for using R and the data.table and ggplot2 packages to manipulate data and create charts. It is only an introduction to what R can do but hopefully you have seen some of the potential of using R for data science.

Next steps. There are plenty of R courses, books, and online resources that you can learn from. Find things that work for you and don’t be afraid to make mistakes or ask questions.

During the MIDFIELD workshops, you will build on what you’ve learned here, and use the midfieldr package to study student unit records. You can preview that material on the midfieldr website.

References

Bashir, S., & Eddelbuettel, D. (2018). Getting started in R: Tinyverse edition. https://eddelbuettel.github.io/gsir-te/Getting-Started-in-R.pdf
Dowle, M., & Srinivasan, A. (2021). Data.table: Extension of ‘data.frame‘. https://CRAN.R-project.org/package=data.table