library("midfielddata")
library("data.table")
library("dplyr", warn.conflicts = FALSE)
library("wrapr")
Multi-table transforms
Using base R, data.table, and dplyr to transform MIDFIELD data as needed for your analysis.
In this section, we split a data frame into a family of data frames, we concatenate tables row-wise, and we join tables.
Setup
Load packages.
Format data as described in the Introduction.
# base R
student_df <- data.frame(student)
degree_df <- data.frame(degree)
cohort_df <- data.frame(cohort)
Splitting tables
Splitting is an operation that separates a data frame row-wise based on values in specified columns. Splitting is an inherent part of the grouping and summarizing operations described in Aggregating transforms: Summarizing such as base R by()
, data.table [, , by]
, and dplyr group_by()
.
However, splitting can be performed as a stand-alone operation as well. Base R split()
creates a list of data frames and wrapr unpack()
can be used to extract the data frames from the list and bind a variable name to each.
A quick look at the cohort data before splitting.
cohort
#> mcid program peer_group
#> <char> <char> <char>
#> 1: MCID3111142965 Electrical starters
#> 2: MCID3111145102 Electrical starters
#> 3: MCID3111150194 Industrial starters
#> 4: MCID3111156083 Electrical starters
#> 5: MCID3111156325 Electrical starters
#> ---
#> 12966: MCID3112618553 Mechanical graduates
#> 12967: MCID3112618574 Mechanical graduates
#> 12968: MCID3112618976 Mechanical graduates
#> 12969: MCID3112619484 Electrical graduates
#> 12970: MCID3112641535 Mechanical graduates
Task. Split the cohort
data frame by peer groups.
# base R
DF_split <- split(cohort_df, cohort_df$peer_group)
start_df <- DF_split$starters
ever_df <- DF_split$ever_enrolled
grad_df <- DF_split$graduates
# Display/check one representative result
start_dt
#> mcid program peer_group
#> <char> <char> <char>
#> 1: MCID3111142965 Electrical starters
#> 2: MCID3111145102 Electrical starters
#> 3: MCID3111150194 Industrial starters
#> 4: MCID3111156083 Electrical starters
#> 5: MCID3111156325 Electrical starters
#> ---
#> 4049: MCID3112618572 Mechanical starters
#> 4050: MCID3112618976 Mechanical starters
#> 4051: MCID3112619118 Electrical starters
#> 4052: MCID3112619484 Electrical starters
#> 4053: MCID3112619666 Mechanical starters
check_equiv_frames(start_df, start_dt)
#> [1] TRUE
check_equiv_frames(start_dt, start_tbl)
#> [1] TRUE
Comments
base R. The data frames are extracted from the
DF_split
list using$
.data.table.
split()
has a method for data.tables using theby
argument.dplyr does not have its own splitting function, though tidyr
nest()
is similar—but requires additional housekeeping to unpack the resulting data frames. It seems easier in this case to use base Rsplit()
and wraprunpack()
.
Confirm that the number of rows in the new tables adds up to the expected total.
Appending rows
Appending rows is the inverse of splitting a data frame.
If two or more data frames have the same number of columns, with the same names, in the same order, one set of rows is easily appended to the other. We confirm that both objects are data frames and then verify the column names and positions.
Task. Append the three tables from the previous example into a single data frame (row-wise).
# base R
result_df <- rbind(start_df, ever_df, grad_df)
# dplyr
result_tbl <- bind_rows(start_tbl, ever_tbl, grad_tbl)
result_dt
#> mcid program peer_group
#> <char> <char> <char>
#> 1: MCID3111142965 Electrical starters
#> 2: MCID3111145102 Electrical starters
#> 3: MCID3111150194 Industrial starters
#> 4: MCID3111156083 Electrical starters
#> 5: MCID3111156325 Electrical starters
#> ---
#> 12966: MCID3112618553 Mechanical graduates
#> 12967: MCID3112618574 Mechanical graduates
#> 12968: MCID3112618976 Mechanical graduates
#> 12969: MCID3112619484 Electrical graduates
#> 12970: MCID3112641535 Mechanical graduates
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R.
rbind()
acts on the data frames listed.data.table.
rbindlist()
, argument must be a list of data frames.dplyr.
bind_rows()
acts on the data frames listed. Can be a list of data frames.
Confirm that the reassembled data frame is equivalent to the original.
check_equiv_frames(cohort, result_dt)
#> [1] TRUE
Appending columns
Columns can be appended to a data frame using cbind()
and bind_cols()
—the column-wise equivalents to rbind()
and bind_rows()
—but we rarely use them. Instead, appending new columns is nearly always accomplished in our work using joins. The most common joins in midfieldr work are the left join and the inner join, described next.
Left join
A left join is a merge operation between two data frames that returns all observations (rows) of the “left” data frame x
and all the matching rows in the “right” data frame y
.
The general forms of the left join are:
where
-
x
is the “left” data frame, all rows returned -
y
is the “right” data frame, matching rows returned -
by
andon
are the vectors of shared column names to merge by -
j
(if used) selects columns from the joined data frame to retain
Task. Join student race and sex to the data frame of graduates by ID.
We use intermediate variables x
and y
in each system to help clarify the join operation.
# data.table
x <- grad_dt[, .(mcid, program)]
y <- student_dt[, .(mcid, race, sex)]
result_dt <- y[x, , on = c("mcid")]
result_dt
#> mcid race sex program
#> <char> <char> <char> <char>
#> 1: MCID3111142965 International Male Electrical
#> 2: MCID3111145102 White Male Electrical
#> 3: MCID3111146537 Asian Female Electrical
#> 4: MCID3111146674 Asian Male Electrical
#> 5: MCID3111150194 Black Male Industrial
#> ---
#> 3260: MCID3112618553 International Male Mechanical
#> 3261: MCID3112618574 International Male Mechanical
#> 3262: MCID3112618976 White Male Mechanical
#> 3263: MCID3112619484 White Male Electrical
#> 3264: MCID3112641535 White Male Mechanical
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R.
data.table.
dplyr.
Inner join
An inner join is a merge operation between two data frames that returns all observations (rows) from two data frames that match specified conditions in both.
The general forms of the inner join are:
# base R
merge(x, y, by, all = FALSE)
# data.table
y[x, j, on, nomatch = NULL]
# dplyr
inner_join(x, y, by)
where
-
x
is a data frame, matching rows returned -
y
is a data frame, matching rows returned -
by
andon
are the vectors of shared column names to merge by -
j
(if used) selects columns from the joined data frame to retain
Task. Join the CIP column from the degree
data to the ever-enrolled dataset by ID, retaining only those students who completed a degree, i.e., IDs common to both data sets (inner join).
# data.table
x <- ever_dt[, .(mcid)]
x <- unique(x)
y <- degree_dt[, .(mcid, degree_CIP = cip6)]
result_dt <- y[x, , on = "mcid", nomatch = NULL]
result_dt
#> mcid degree_CIP
#> <char> <char>
#> 1: MCID3111142965 141001
#> 2: MCID3111145102 141001
#> 3: MCID3111146537 141001
#> 4: MCID3111146674 141001
#> 5: MCID3111150194 143501
#> ---
#> 4194: MCID3112618976 141901
#> 4195: MCID3112619484 141001
#> 4196: MCID3112641399 270301
#> 4197: MCID3112641535 141901
#> 4198: MCID3112698681 110701
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R.
data.table.
dplyr.