Shaping MIDFIELD data three ways

Image: *Workshop* by Daniel Mee is licensed under CC BY 2.0

Using base R, data.table, and dplyr/tidyr to transform MIDFIELD data as needed for your analysis.

In this series of articles, we hope to address the needs of users who would prefer to use base R or dplyr syntax in lieu of the data.table syntax that appears everywhere else in midfieldr tutorials. We illustrate common data shaping tasks three ways: using base R, data.table, and dplyr systems.

Sources for translating among the three systems include (atrebas, 2019; Gimond, 2019; Hajnala, 2018; Mercer, 2020; and Vaughan, 2023). In particular, our outline is roughly based on Chapter 5 of (Zumel & Mount, 2020).

Getting started

Project organization. If you are writing your own scripts to follow along, our tutorials assume that you:

Are working within an RStudio Project
Have set up a directory structure with folders for data and scripts.

Packages. The table lists the packages used in the data shaping articles overall. You may wish to install these packages before starting.

Table 1: Packages in the data shaping articles
packages	used for
midfielddata	student records practice data
data.table	manipulating data
dplyr	manipulating data
tidyr	manipulating data
readr	import data
wrapr	tools for writing and debugging R code
cdata	reshaping transforms
ggplot2	creating charts

We load the packages used in an article as one of the first code chunks, as shown below. You can copy any code chunk by clicking on the “Copy to clipboard” icon .

# packages used in this article
library("data.table")
library("dplyr", warn.conflicts = FALSE)
library("midfielddata")
library("readr")
library("wrapr", warn.conflicts = FALSE)

Data

Load the datasets from midfielddata. View data dictionaries via ?student, ?term, and ?degree.

data(student, term, degree)

Have a quick look at each one using dplyr glimpse().

glimpse(student)
#> Rows: 97,555
#> Columns: 13
#> $ mcid           <chr> "MCID3111142225", "MCID3111142283", "MCID3111142290", "…
#> $ institution    <chr> "Institution B", "Institution J", "Institution J", "Ins…
#> $ transfer       <chr> "First-Time Transfer", "First-Time Transfer", "First-Ti…
#> $ hours_transfer <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ race           <chr> "Asian", "Asian", "Asian", "Asian", "Asian", "Asian", "…
#> $ sex            <chr> "Male", "Female", "Male", "Male", "Male", "Male", "Male…
#> $ age_desc       <chr> "Under 25", "Under 25", "Under 25", "Under 25", "Under …
#> $ us_citizen     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",…
#> $ home_zip       <chr> NA, "22020", "23233", "20853", "22003", "23234", "24148…
#> $ high_school    <chr> NA, NA, "471872", NA, "470080", "471877", NA, NA, NA, N…
#> $ sat_math       <dbl> NA, 560, 510, 640, 600, 570, 480, NA, NA, NA, 520, 430,…
#> $ sat_verbal     <dbl> NA, 230, 380, 460, 500, 530, 530, NA, NA, NA, 490, 490,…
#> $ act_comp       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

glimpse(term)
#> Rows: 639,915
#> Columns: 13
#> $ mcid                <chr> "MCID3111142225", "MCID3111142283", "MCID311114228…
#> $ institution         <chr> "Institution B", "Institution J", "Institution J",…
#> $ term                <chr> "19881", "19881", "19883", "19885", "19891", "1989…
#> $ cip6                <chr> "140901", "240102", "240102", "190601", "190601", …
#> $ level               <chr> "01 First-year", "01 First-year", "01 First-year",…
#> $ standing            <chr> "Good Standing", "Academic Probation", "Academic P…
#> $ coop                <chr> "No", "No", "No", "No", "No", "No", "No", "No", "N…
#> $ hours_term          <dbl> 7, 6, 12, 6, 6, 6, 6, 18, 15, 14, 13, 3, 13, 3, 3,…
#> $ hours_term_attempt  <dbl> 7, 6, 12, 6, 6, 6, 6, 18, 18, 14, 13, 3, 13, 3, 3,…
#> $ hours_cumul         <dbl> 7, 6, 18, 24, 30, 36, 42, 63, 78, 14, 27, 60, 82, …
#> $ hours_cumul_attempt <dbl> 7, 6, 18, 24, 30, 36, 42, 63, 81, 14, 27, 60, 82, …
#> $ gpa_term            <dbl> 2.56, 1.85, 1.93, 2.15, 1.85, 1.20, 1.85, 2.33, 2.…
#> $ gpa_cumul           <dbl> 2.56, 1.85, 1.90, 1.96, 1.94, 1.82, 1.82, 1.98, 2.…

glimpse(degree)
#> Rows: 49,665
#> Columns: 5
#> $ mcid        <chr> "MCID3111142225", "MCID3111142290", "MCID3111142294", "MCI…
#> $ institution <chr> "Institution B", "Institution J", "Institution J", "Instit…
#> $ term_degree <chr> "19881", "19921", "19903", "19921", "19913", "19883", "199…
#> $ cip6        <chr> "141001", "141001", "141001", "141001", "090401", "141901"…
#> $ degree      <chr> "Bachelor of Science in Electrical Engineering", "Bachelor…

Three systems

We format the data frames used in our examples three ways, once for each system: base R, data.table, and dplyr. The table shows the naming convention we use to distinguish the different forms.

Table 2: Dataset naming convention
system	suffix	example	class
base R	_df	student_df	data.frame
data.table	_dt	student_dt	data.frame, data.table
dplyr	_tbl	student_tbl	data.frame, tbl_df, tbl

Code chunks that do the same task are grouped and color coded by system, as illustrated below. Here, we prepare three versions of the student table, one per system.

# base R
student_df <- data.frame(student)

# data.table
student_dt <- copy(student)

# dplyr
student_tbl <- tibble(student)

A group of code chunks is typically followed by a comment section. For example, commenting on the code chunks above:

Comments

base R. data.frame() converts the dataset to class data.frame, removing attributes associated with the default data.table format.
data.table. copy() creates a new memory address for the new object. In data.table, if we don’t “copy”, any subsequent operations on student_dt are applied “by reference” to the original student table.
dplyr. tibble() converts the dataset to class tbl_df, removing attributes associated with the default data.table format.

Method

We want to make each operation as transparent as possible for the R novice. Hence we generally write a line of code to do one thing, avoiding combined or nested operations and introducing intermediate or temporary variables that get overwritten before reaching a desired result, e.g., DF for an intermediate data.frame, DT for an intermediate data.table, vec for an intermediate vector, etc.
In the same vein, we sometimes use intermediate variables for indexing, e.g. rows_we_want, columns_we_want, order_index, etc.
In general, we show only one approach per task per system, offering what we think is the best choice for our audience. In R there are usually multiple approaches for the same task and resources for exploring other methods are readily available.
We follow the convention of chaining dplyr steps with the magrittr pipe %>% (though assigning temporary variables would work just as well). We do not generally use the piping or chaining operations available in base R and data.table.
In each example, results are manipulated such that base R, data.table, and dplyr yield the same form (usually a data frame) with the same variables and observations. In most cases, we compare data frame equivalence with wrapr::check_equiv_frames() which converts its arguments to data.frame class and reorders columns and rows.

check_equiv_frames(student_df, student_dt)
#> [1] TRUE
check_equiv_frames(student_dt, student_tbl)
#> [1] TRUE

In cases where re-ordering rows or columns is part of the task, we compare results by using base R all.equal() instead of check_equiv_frames(). By setting the check.attributes argument to FALSE, data frame attributes (other than column names) are ignored, for example, row.names (if any) and class (e.g., tbl or data.table).

all.equal(student_df, student_dt, check.attributes = FALSE)
#> [1] TRUE
all.equal(student_dt, student_tbl, check.attributes = FALSE)
#> [1] TRUE

Download and write CSV files

In addition to the datasets from the midfielddata package, we have some prepared data to download from the Institute repository (an Internet connection is required).

Download and write a CSV file to your RStudio project data directory.

# run me first
data_url <- "https://raw.githubusercontent.com/MIDFIELDR/2024-midfield-institute/main/data/case_blocs.csv"

One of these code chunks can be copied and run in your Console window as these lines only have to be run once. We write the data file to the data directory of your R project with file paths relative to the R project main directory.

# base R
case_blocs <- read.csv(data_url)
write.csv(case_blocs, file = "data/case_blocs.csv", row.names = FALSE)

# data.table
case_blocs <- fread(data_url)
fwrite(case_blocs, file = "data/case_blocs.csv")

# readr
case_blocs <- read_csv(data_url, show_col_types = FALSE)
write_csv(case_blocs, file = "data/case_blocs.csv")

Comments

base R. write.csv() set row.names to FALSE for consistency with other two systems. All three systems expect the same first two arguments, the data frame and the file path.
data.table. fwrite() Similar to write.csv() but faster. Default row.names is FALSE.
dplyr. write_csv() also faster than write.csv(). Default row.names is FALSE.

Read CSV files

Import the file you just saved.

# base R
case_blocs_df <- read.csv("data/case_blocs.csv")

# data.table
case_blocs_dt <- fread("data/case_blocs.csv")

# readr
case_blocs_tbl <- read_csv("data/case_blocs.csv", show_col_types = FALSE)

case_blocs_dt
#>                  mcid    program peer_group
#>                <char>     <char>     <char>
#>     1: MCID3111142965 Electrical   starters
#>     2: MCID3111145102 Electrical   starters
#>     3: MCID3111150194 Industrial   starters
#>     4: MCID3111156083 Electrical   starters
#>     5: MCID3111156325 Electrical   starters
#>    ---                                     
#> 12966: MCID3112618553 Mechanical  graduates
#> 12967: MCID3112618574 Mechanical  graduates
#> 12968: MCID3112618976 Mechanical  graduates
#> 12969: MCID3112619484 Electrical  graduates
#> 12970: MCID3112641535 Mechanical  graduates

check_equiv_frames(case_blocs_df, case_blocs_dt)
#> [1] TRUE
check_equiv_frames(case_blocs_dt, case_blocs_tbl)
#> [1] TRUE

Comments

read.csv(), fread(), and read_csv() have a large number of arguments for controlling the input. Some of the most commonly encountered are colClasses (base R and data.table) or col_types (dplyr) to specify the class of a column; skip a number of lines before reading data; and header (base R and data.table) or col_names (dplyr) to use the first row of input as column names.
read.csv(), fread(), and read_csv() were also used earlier with a URL as the argument to download the supplemental data.

References

atrebas. (2019). A data.table and dplyr tour. https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

Gimond, M. (2019). base vs. tidyverse vs. data.table. https://mgimond.github.io/rug_2019_12/Index.html

Hajnala, J. (2018). Data subsetting and manipulation with base R. https://jozef.io/r002-data-manipulation/

Mercer, J. (2020). Base R, the tidyverse, and data.table: A comparison of R dialects to wrangle your data. https://tinyurl.com/yy544udn

Vaughan, D. (2023). dplyr <=> base R. https://dplyr.tidyverse.org/articles/base.html

Zumel, N., & Mount, J. (2020). Practical Data Science with R (2nd ed.). Manning Publications Co.

Appendix

The R and package version numbers (as of the most recent update) are:

R.version$version.string
#> [1] "R version 4.4.1 (2024-06-14 ucrt)"
packageVersion("midfielddata")
#> [1] '0.2.1'
packageVersion("data.table")
#> [1] '1.16.0'
packageVersion("dplyr")
#> [1] '1.1.4'
packageVersion("tidyr")
#> [1] '1.3.1'
packageVersion("readr")
#> [1] '2.1.5'
packageVersion("wrapr")
#> [1] '2.1.0'
packageVersion("cdata")
#> [1] '1.2.1'
packageVersion("ggplot2")
#> [1] '3.5.1'