packages | used for |
---|---|
midfielddata | student records practice data |
data.table | manipulating data |
dplyr | manipulating data |
tidyr | manipulating data |
readr | import data |
wrapr | tools for writing and debugging R code |
cdata | reshaping transforms |
ggplot2 | creating charts |
Shaping MIDFIELD data three ways
Using base R, data.table, and dplyr/tidyr to transform MIDFIELD data as needed for your analysis.
In this series of articles, we hope to address the needs of users who would prefer to use base R or dplyr syntax in lieu of the data.table syntax that appears everywhere else in midfieldr tutorials. We illustrate common data shaping tasks three ways: using base R, data.table, and dplyr systems.
Sources for translating among the three systems include (atrebas, 2019; Gimond, 2019; Hajnala, 2018; Mercer, 2020; and Vaughan, 2023). In particular, our outline is roughly based on Chapter 5 of (Zumel & Mount, 2020).
Getting started
Project organization. If you are writing your own scripts to follow along, our tutorials assume that you:
- Are working within an RStudio Project
- Have set up a directory structure with folders for data and scripts.
Packages. The table lists the packages used in the data shaping articles overall. You may wish to install these packages before starting.
We load the packages used in an article as one of the first code chunks, as shown below. You can copy any code chunk by clicking on the “Copy to clipboard” icon .
# packages used in this article
library("data.table")
library("dplyr", warn.conflicts = FALSE)
library("midfielddata")
library("readr")
library("wrapr", warn.conflicts = FALSE)
Data
Load the datasets from midfielddata. View data dictionaries via ?student
, ?term
, and ?degree
.
data(student, term, degree)
Have a quick look at each one using dplyr glimpse()
.
glimpse(student)
#> Rows: 97,555
#> Columns: 13
#> $ mcid <chr> "MCID3111142225", "MCID3111142283", "MCID3111142290", "…
#> $ institution <chr> "Institution B", "Institution J", "Institution J", "Ins…
#> $ transfer <chr> "First-Time Transfer", "First-Time Transfer", "First-Ti…
#> $ hours_transfer <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ race <chr> "Asian", "Asian", "Asian", "Asian", "Asian", "Asian", "…
#> $ sex <chr> "Male", "Female", "Male", "Male", "Male", "Male", "Male…
#> $ age_desc <chr> "Under 25", "Under 25", "Under 25", "Under 25", "Under …
#> $ us_citizen <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",…
#> $ home_zip <chr> NA, "22020", "23233", "20853", "22003", "23234", "24148…
#> $ high_school <chr> NA, NA, "471872", NA, "470080", "471877", NA, NA, NA, N…
#> $ sat_math <dbl> NA, 560, 510, 640, 600, 570, 480, NA, NA, NA, 520, 430,…
#> $ sat_verbal <dbl> NA, 230, 380, 460, 500, 530, 530, NA, NA, NA, 490, 490,…
#> $ act_comp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
glimpse(term)
#> Rows: 639,915
#> Columns: 13
#> $ mcid <chr> "MCID3111142225", "MCID3111142283", "MCID311114228…
#> $ institution <chr> "Institution B", "Institution J", "Institution J",…
#> $ term <chr> "19881", "19881", "19883", "19885", "19891", "1989…
#> $ cip6 <chr> "140901", "240102", "240102", "190601", "190601", …
#> $ level <chr> "01 First-year", "01 First-year", "01 First-year",…
#> $ standing <chr> "Good Standing", "Academic Probation", "Academic P…
#> $ coop <chr> "No", "No", "No", "No", "No", "No", "No", "No", "N…
#> $ hours_term <dbl> 7, 6, 12, 6, 6, 6, 6, 18, 15, 14, 13, 3, 13, 3, 3,…
#> $ hours_term_attempt <dbl> 7, 6, 12, 6, 6, 6, 6, 18, 18, 14, 13, 3, 13, 3, 3,…
#> $ hours_cumul <dbl> 7, 6, 18, 24, 30, 36, 42, 63, 78, 14, 27, 60, 82, …
#> $ hours_cumul_attempt <dbl> 7, 6, 18, 24, 30, 36, 42, 63, 81, 14, 27, 60, 82, …
#> $ gpa_term <dbl> 2.56, 1.85, 1.93, 2.15, 1.85, 1.20, 1.85, 2.33, 2.…
#> $ gpa_cumul <dbl> 2.56, 1.85, 1.90, 1.96, 1.94, 1.82, 1.82, 1.98, 2.…
glimpse(degree)
#> Rows: 49,665
#> Columns: 5
#> $ mcid <chr> "MCID3111142225", "MCID3111142290", "MCID3111142294", "MCI…
#> $ institution <chr> "Institution B", "Institution J", "Institution J", "Instit…
#> $ term_degree <chr> "19881", "19921", "19903", "19921", "19913", "19883", "199…
#> $ cip6 <chr> "141001", "141001", "141001", "141001", "090401", "141901"…
#> $ degree <chr> "Bachelor of Science in Electrical Engineering", "Bachelor…
Three systems
We format the data frames used in our examples three ways, once for each system: base R, data.table, and dplyr. The table shows the naming convention we use to distinguish the different forms.
system | suffix | example | class |
---|---|---|---|
base R | _df | student_df | data.frame |
data.table | _dt | student_dt | data.frame, data.table |
dplyr | _tbl | student_tbl | data.frame, tbl_df, tbl |
Code chunks that do the same task are grouped and color coded by system, as illustrated below. Here, we prepare three versions of the student
table, one per system.
# base R
student_df <- data.frame(student)
# data.table
student_dt <- copy(student)
# dplyr
student_tbl <- tibble(student)
A group of code chunks is typically followed by a comment section. For example, commenting on the code chunks above:
Comments
base R.
data.frame()
converts the dataset to classdata.frame
, removing attributes associated with the defaultdata.table
format.data.table.
copy()
creates a new memory address for the new object. In data.table, if we don’t “copy”, any subsequent operations onstudent_dt
are applied “by reference” to the originalstudent
table.dplyr.
tibble()
converts the dataset to classtbl_df
, removing attributes associated with the defaultdata.table
format.
Method
We want to make each operation as transparent as possible for the R novice. Hence we generally write a line of code to do one thing, avoiding combined or nested operations and introducing intermediate or temporary variables that get overwritten before reaching a desired result, e.g.,
DF
for an intermediate data.frame,DT
for an intermediate data.table,vec
for an intermediate vector, etc.In the same vein, we sometimes use intermediate variables for indexing, e.g.
rows_we_want
,columns_we_want
,order_index
, etc.In general, we show only one approach per task per system, offering what we think is the best choice for our audience. In R there are usually multiple approaches for the same task and resources for exploring other methods are readily available.
We follow the convention of chaining dplyr steps with the magrittr pipe
%>%
(though assigning temporary variables would work just as well). We do not generally use the piping or chaining operations available in base R and data.table.In each example, results are manipulated such that base R, data.table, and dplyr yield the same form (usually a data frame) with the same variables and observations. In most cases, we compare data frame equivalence with
wrapr::check_equiv_frames()
which converts its arguments to data.frame class and reorders columns and rows.
check_equiv_frames(student_df, student_dt)
#> [1] TRUE
check_equiv_frames(student_dt, student_tbl)
#> [1] TRUE
- In cases where re-ordering rows or columns is part of the task, we compare results by using base R
all.equal()
instead ofcheck_equiv_frames()
. By setting thecheck.attributes
argument to FALSE, data frame attributes (other than column names) are ignored, for example, row.names (if any) and class (e.g.,tbl
ordata.table
).
Download and write CSV files
In addition to the datasets from the midfielddata package, we have some prepared data to download from the Institute repository (an Internet connection is required).
Download and write a CSV file to your RStudio project data directory.
# run me first
data_url <- "https://raw.githubusercontent.com/MIDFIELDR/2024-midfield-institute/main/data/case_blocs.csv"
One of these code chunks can be copied and run in your Console window as these lines only have to be run once. We write the data file to the data directory of your R project with file paths relative to the R project main directory.
Comments
base R.
write.csv()
setrow.names
to FALSE for consistency with other two systems. All three systems expect the same first two arguments, the data frame and the file path.data.table.
fwrite()
Similar towrite.csv()
but faster. Defaultrow.names
is FALSE.dplyr.
write_csv()
also faster thanwrite.csv()
. Defaultrow.names
is FALSE.
Read CSV files
Import the file you just saved.
# base R
case_blocs_df <- read.csv("data/case_blocs.csv")
# data.table
case_blocs_dt <- fread("data/case_blocs.csv")
# readr
case_blocs_tbl <- read_csv("data/case_blocs.csv", show_col_types = FALSE)
case_blocs_dt
#> mcid program peer_group
#> <char> <char> <char>
#> 1: MCID3111142965 Electrical starters
#> 2: MCID3111145102 Electrical starters
#> 3: MCID3111150194 Industrial starters
#> 4: MCID3111156083 Electrical starters
#> 5: MCID3111156325 Electrical starters
#> ---
#> 12966: MCID3112618553 Mechanical graduates
#> 12967: MCID3112618574 Mechanical graduates
#> 12968: MCID3112618976 Mechanical graduates
#> 12969: MCID3112619484 Electrical graduates
#> 12970: MCID3112641535 Mechanical graduates
check_equiv_frames(case_blocs_df, case_blocs_dt)
#> [1] TRUE
check_equiv_frames(case_blocs_dt, case_blocs_tbl)
#> [1] TRUE
Comments
read.csv()
,fread()
, andread_csv()
have a large number of arguments for controlling the input. Some of the most commonly encountered arecolClasses
(base R and data.table) orcol_types
(dplyr) to specify the class of a column;skip
a number of lines before reading data; andheader
(base R and data.table) orcol_names
(dplyr) to use the first row of input as column names.read.csv()
,fread()
, andread_csv()
were also used earlier with a URL as the argument to download the supplemental data.
References
Appendix
The R and package version numbers (as of the most recent update) are:
R.version$version.string
#> [1] "R version 4.4.1 (2024-06-14 ucrt)"
packageVersion("midfielddata")
#> [1] '0.2.1'
packageVersion("data.table")
#> [1] '1.16.0'
packageVersion("dplyr")
#> [1] '1.1.4'
packageVersion("tidyr")
#> [1] '1.3.1'
packageVersion("readr")
#> [1] '2.1.5'
packageVersion("wrapr")
#> [1] '2.1.0'
packageVersion("cdata")
#> [1] '1.2.1'
packageVersion("ggplot2")
#> [1] '3.5.1'