Multi-table transforms

Image: *Merge* by Steve Walser is licensed under CC BY-NC-ND 2.0

Using base R, data.table, and dplyr to transform MIDFIELD data as needed for your analysis.

In this section, we split a data frame into a family of data frames, we concatenate tables row-wise, and we join tables.

Setup

Load packages.

library("midfielddata")
library("data.table")
library("dplyr", warn.conflicts = FALSE)
library("wrapr")

Format data as described in the Introduction.

data(student, degree)
cohort <- fread("data/three_blocs.csv")

# base R
student_df <- data.frame(student)
degree_df <- data.frame(degree)
cohort_df <- data.frame(cohort)

# data.table
student_dt <- copy(student)
degree_dt <- copy(degree)
cohort_dt <- copy(cohort)

# dplyr
student_tbl <- tibble(student)
degree_tbl <- tibble(degree)
cohort_tbl <- tibble(cohort)

Splitting tables

Splitting is an operation that separates a data frame row-wise based on values in specified columns. Splitting is an inherent part of the grouping and summarizing operations described in Aggregating transforms: Summarizing such as base R by(), data.table [, , by], and dplyr group_by().

However, splitting can be performed as a stand-alone operation as well. Base R split() creates a list of data frames and wrapr unpack() can be used to extract the data frames from the list and bind a variable name to each.

A quick look at the cohort data before splitting.

cohort
#>                  mcid    program peer_group
#>                <char>     <char>     <char>
#>     1: MCID3111142965 Electrical   starters
#>     2: MCID3111145102 Electrical   starters
#>     3: MCID3111150194 Industrial   starters
#>     4: MCID3111156083 Electrical   starters
#>     5: MCID3111156325 Electrical   starters
#>    ---                                     
#> 12966: MCID3112618553 Mechanical  graduates
#> 12967: MCID3112618574 Mechanical  graduates
#> 12968: MCID3112618976 Mechanical  graduates
#> 12969: MCID3112619484 Electrical  graduates
#> 12970: MCID3112641535 Mechanical  graduates

Task. Split the cohort data frame by peer groups.

# base R
DF_split <- split(cohort_df, cohort_df$peer_group)
start_df <- DF_split$starters
ever_df <- DF_split$ever_enrolled
grad_df <- DF_split$graduates

# data.table
DT_split <- split(cohort_dt, by = "peer_group")
wrapr::unpack(DT_split, 
              start_dt = starters,
              ever_dt = ever_enrolled,
              grad_dt = graduates)

# dplyr
TBL_split <- split(cohort_tbl, cohort_tbl$peer_group)
wrapr::unpack(TBL_split, 
              start_tbl = starters,
              ever_tbl = ever_enrolled,
              grad_tbl = graduates)

# Display/check one representative result
start_dt
#>                 mcid    program peer_group
#>               <char>     <char>     <char>
#>    1: MCID3111142965 Electrical   starters
#>    2: MCID3111145102 Electrical   starters
#>    3: MCID3111150194 Industrial   starters
#>    4: MCID3111156083 Electrical   starters
#>    5: MCID3111156325 Electrical   starters
#>   ---                                     
#> 4049: MCID3112618572 Mechanical   starters
#> 4050: MCID3112618976 Mechanical   starters
#> 4051: MCID3112619118 Electrical   starters
#> 4052: MCID3112619484 Electrical   starters
#> 4053: MCID3112619666 Mechanical   starters

check_equiv_frames(start_df, start_dt)
#> [1] TRUE
check_equiv_frames(start_dt, start_tbl)
#> [1] TRUE

Comments

base R. The data frames are extracted from the DF_split list using $.
data.table. split() has a method for data.tables using the by argument.
dplyr does not have its own splitting function, though tidyr nest() is similar—but requires additional housekeeping to unpack the resulting data frames. It seems easier in this case to use base R split() and wrapr unpack().

Confirm that the number of rows in the new tables adds up to the expected total.

(x <- nrow(start_df))
#> [1] 4053
(y <- nrow(ever_df))
#> [1] 5653
(z <- nrow(grad_df))
#> [1] 3264
sum(c(x, y, z))
#> [1] 12970
nrow(cohort)
#> [1] 12970

Appending rows

Appending rows is the inverse of splitting a data frame.

If two or more data frames have the same number of columns, with the same names, in the same order, one set of rows is easily appended to the other. We confirm that both objects are data frames and then verify the column names and positions.

Task. Append the three tables from the previous example into a single data frame (row-wise).

class(start_df)
#> [1] "data.frame"
class(ever_df)
#> [1] "data.frame"
class(grad_df)
#> [1] "data.frame"
colnames(start_df)
#> [1] "mcid"       "program"    "peer_group"
colnames(ever_df)
#> [1] "mcid"       "program"    "peer_group"
colnames(grad_df)
#> [1] "mcid"       "program"    "peer_group"

# base R
result_df <- rbind(start_df, ever_df, grad_df)

# data.table
result_dt <- rbindlist(list(start_df, ever_dt, grad_dt))

# dplyr
result_tbl <- bind_rows(start_tbl, ever_tbl, grad_tbl)

result_dt
#>                  mcid    program peer_group
#>                <char>     <char>     <char>
#>     1: MCID3111142965 Electrical   starters
#>     2: MCID3111145102 Electrical   starters
#>     3: MCID3111150194 Industrial   starters
#>     4: MCID3111156083 Electrical   starters
#>     5: MCID3111156325 Electrical   starters
#>    ---                                     
#> 12966: MCID3112618553 Mechanical  graduates
#> 12967: MCID3112618574 Mechanical  graduates
#> 12968: MCID3112618976 Mechanical  graduates
#> 12969: MCID3112619484 Electrical  graduates
#> 12970: MCID3112641535 Mechanical  graduates
 
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

base R. rbind() acts on the data frames listed.
data.table. rbindlist(), argument must be a list of data frames.
dplyr. bind_rows() acts on the data frames listed. Can be a list of data frames.

Confirm that the reassembled data frame is equivalent to the original.

check_equiv_frames(cohort, result_dt)
#> [1] TRUE

Appending columns

Columns can be appended to a data frame using cbind() and bind_cols()—the column-wise equivalents to rbind() and bind_rows()—but we rarely use them. Instead, appending new columns is nearly always accomplished in our work using joins. The most common joins in midfieldr work are the left join and the inner join, described next.

Left join

A left join is a merge operation between two data frames that returns all observations (rows) of the “left” data frame x and all the matching rows in the “right” data frame y.

The general forms of the left join are:

  # base R
  merge(x, y, by, all.x = TRUE)

  # data.table
  y[x, j, on]

  # dplyr
  left_join(x, y, by)

where

x is the “left” data frame, all rows returned
y is the “right” data frame, matching rows returned
by and on are the vectors of shared column names to merge by
j (if used) selects columns from the joined data frame to retain

Task. Join student race and sex to the data frame of graduates by ID.

We use intermediate variables x and y in each system to help clarify the join operation.

# base R
x <- grad_df[, c("mcid", "program"), drop = FALSE]
y <- student_df[, c("mcid", "race", "sex"), drop = FALSE]
result_df <- merge(x, y, by = c("mcid"), all.x = TRUE)

# data.table
x <- grad_dt[, .(mcid, program)]
y <- student_dt[, .(mcid, race, sex)]
result_dt <- y[x, , on = c("mcid")]

# dplyr
x <- select(grad_tbl, mcid, program)
y <- select(student_tbl, mcid, race, sex)
result_tbl <- left_join(x, y, by = c("mcid"))

result_dt
#>                 mcid          race    sex    program
#>               <char>        <char> <char>     <char>
#>    1: MCID3111142965 International   Male Electrical
#>    2: MCID3111145102         White   Male Electrical
#>    3: MCID3111146537         Asian Female Electrical
#>    4: MCID3111146674         Asian   Male Electrical
#>    5: MCID3111150194         Black   Male Industrial
#>   ---                                               
#> 3260: MCID3112618553 International   Male Mechanical
#> 3261: MCID3112618574 International   Male Mechanical
#> 3262: MCID3112618976         White   Male Mechanical
#> 3263: MCID3112619484         White   Male Electrical
#> 3264: MCID3112641535         White   Male Mechanical

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

base R.
data.table.
dplyr.

Inner join

An inner join is a merge operation between two data frames that returns all observations (rows) from two data frames that match specified conditions in both.

The general forms of the inner join are:

  # base R
  merge(x, y, by, all = FALSE)

  # data.table
  y[x, j, on, nomatch = NULL]

  # dplyr
  inner_join(x, y, by)

where

x is a data frame, matching rows returned
y is a data frame, matching rows returned
by and on are the vectors of shared column names to merge by
j (if used) selects columns from the joined data frame to retain

Task. Join the CIP column from the degree data to the ever-enrolled dataset by ID, retaining only those students who completed a degree, i.e., IDs common to both data sets (inner join).

# base R
x <- ever_df[, c("mcid"), drop = FALSE]
x <- unique(x)
y <- degree_df[, c("mcid", "cip6"), drop = FALSE]
names(y)[names(y) == "cip6"] <- "degree_CIP"
result_df <- merge(x, y, by = c("mcid"), all = FALSE)

# data.table
x <- ever_dt[, .(mcid)]
x <- unique(x)
y <- degree_dt[, .(mcid, degree_CIP = cip6)]
result_dt <- y[x, , on = "mcid", nomatch = NULL]

# dplyr
x <- select(ever_tbl, mcid) %>%
  distinct()
y <- select(degree_tbl, mcid, cip6) %>%
  rename(degree_CIP = cip6)
result_tbl <- inner_join(x, y, by = join_by(mcid))

result_dt
#>                 mcid degree_CIP
#>               <char>     <char>
#>    1: MCID3111142965     141001
#>    2: MCID3111145102     141001
#>    3: MCID3111146537     141001
#>    4: MCID3111146674     141001
#>    5: MCID3111150194     143501
#>   ---                          
#> 4194: MCID3112618976     141901
#> 4195: MCID3112619484     141001
#> 4196: MCID3112641399     270301
#> 4197: MCID3112641535     141901
#> 4198: MCID3112698681     110701

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

base R.
data.table.
dplyr.

Setup

Splitting tables

Appending rows

Appending columns

Left join

Inner join

References