Going further

Image: From the bottom uo by Jusitn Kern is licensed under CC BY-NC-ND 2.0

Using base R, data.table, and dplyr/tidyr to transform MIDFIELD data as needed for your analysis.

In this section we demonstrate …

Setup

Load packages.

Load datasets from midfielddata.

data(student, term, degree)

Recode strings in degree for output readability as described in Basic Transforms.

degree[, degree := gsub(" of", "", degree)]
degree[, degree := sub("Bachelor Science", "BS", degree)]
degree[, degree := sub("Bachelor Arts", "BA", degree)]
degree[, degree := sub("Bachelor Fine Arts", "BFA", degree)]

Format data as described in the Introduction.

# base R
student_df <- data.frame(student)
term_df <- data.frame(term)
degree_df <- data.frame(degree)
# data.table
student_dt <- copy(student)
term_dt <- copy(term)
degree_dt <- copy(degree)
# dplyr
student_tbl <- tibble(student)
term_tbl <- tibble(term)
degree_tbl <- tibble(degree)


Advanced row-subsetting by groups

When a specific instance yields a block of rows

When subsetting rows for a specific instance of one variable (first, last, etc.) by group or key variable, we may need to allow for the possible return of more than one row per key.

For example, from the degree data table, subsetting for the first degree-term for the following student should return two rows, one each for Electrical Engineering and Mechanical Engineering.

id_1 <- "MCID3112353622"
degree[mcid == id_1]
#>              mcid   institution term_degree   cip6                       degree
#>            <char>        <char>      <char> <char>                       <char>
#> 1: MCID3112353622 Institution C       20133 141001 BS in Electrical Engineering
#> 2: MCID3112353622 Institution C       20133 141901 BS in Mechanical Engineering

In contrast, subsetting for the first term by ID for this next student should return one row for their degree in Economics (Summer 2003) and drop the row for their degree in History (Fall 2006).

id_2 <- "MCID3111600469"
degree[mcid == id_2]
#>              mcid   institution term_degree   cip6          degree
#>            <char>        <char>      <char> <char>          <char>
#> 1: MCID3111600469 Institution C       20024 450601 BA in Economics
#> 2: MCID3111600469 Institution C       20061 540101   BA in History

Task. Retain all rows of the earliest term in which students earn one or more degrees.

key_columns <- c("mcid", "term_degree")
columns_we_want <- c(key_columns, "degree")
# base R
DF <- degree_df[, key_columns, drop = FALSE]
order_index <- with(DF, order(mcid, term_degree))
DF <- DF[order_index, , drop = FALSE]
DF_by <- by(DF, DF$mcid, function(x) 
            data.frame(mcid = unique(x$mcid),
                       term_degree = x$term_degree[1]))
DF_combine <- do.call(rbind, DF_by)
DF <- degree_df[, columns_we_want, drop = FALSE]
result_df <- merge(DF_combine, DF, by = key_columns, all.x = TRUE)
rownames(result_df) <- NULL
# data.table
DT <- degree_dt[, ..columns_we_want]
setkeyv(DT, key_columns)
result_dt <- DT[, .SD[term_degree == term_degree[1]], by = c("mcid")]
setkey(result_dt, NULL)
# dplyr
result_tbl <- degree_tbl %>%
  select(all_of(columns_we_want)) %>% 
  arrange(mcid, term_degree) %>%
  group_by(mcid) %>%
  filter(term_degree == term_degree[1]) %>%
  ungroup()
result_dt
#>                  mcid term_degree                                  degree
#>                <char>      <char>                                  <char>
#>     1: MCID3111142225       19881            BS in Electrical Engineering
#>     2: MCID3111142290       19921            BS in Electrical Engineering
#>     3: MCID3111142294       19903            BS in Electrical Engineering
#>     4: MCID3111142299       19921            BS in Electrical Engineering
#>     5: MCID3111142689       19913                        BA in Journalism
#>    ---                                                                   
#> 49614: MCID3112829602       20173  BA in Political Science and Government
#> 49615: MCID3112831015       20181                         BA in Geography
#> 49616: MCID3112839623       20181                       BS in Linguistics
#> 49617: MCID3112845220       20181                       BS in Mathematics
#> 49618: MCID3112845673       20174 BS in Speech Communication and Rhetoric

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

  • base R. Order rows by ID and term; use by() for a split-apply-combine strategy to create a data frame with two columns: ID and first degree-term; and left-join using merge() to recover the matching degree data, including rows with more than one degree per student ID.

  • data.table. setkeyv() orders the rows and .SD[...] subsets for the earliest term by ID, including any multiple rows.

  • dplyr. arrange() orders the rows, group_by() sets ID as the key, and filter() subsets for the earliest term, including any multiple rows.

Check that we subset the relevant rows. The result for id_1 (from the earlier example) correctly includes two degrees in the same term.

result_dt[mcid == id_1]
#>              mcid term_degree                       degree
#>            <char>      <char>                       <char>
#> 1: MCID3112353622       20133 BS in Electrical Engineering
#> 2: MCID3112353622       20133 BS in Mechanical Engineering

The result for id_2correctly includes their first degree in Economics and drops their subsequent degree in History.

result_dt[mcid == id_2]
#>              mcid term_degree          degree
#>            <char>      <char>          <char>
#> 1: MCID3111600469       20024 BA in Economics

When a specific instance yields one row

In contrast to the previous case, there are times when we expect subsetting for a specific instance by group to yield one row only.

For example, suppose we want to identify the first term in which a student’s cumulative GPA dropped below 2.0. By definition, the result should yield only one row (or none) per ID.

columns_we_want <- c("mcid", "term", "standing", "gpa_cumul")
key_columns <- c("mcid", "term")
# base R
rows_we_want <- term_df$gpa_cumul < 2.0
DF <- term_df[rows_we_want, columns_we_want, drop = FALSE]
order_index <- with(DF, order(mcid, term))
DF <- DF[order_index, , drop = FALSE]
DF_by <- by(DF, DF$mcid, function(x) 
            data.frame(mcid = x$mcid[1],
                       term = x$term[1],
                       standing = x$standing[1],
                       gpa_cumul = x$gpa_cumul[1]))
result_df <- do.call(rbind, DF_by)
rownames(result_df) <- NULL
# data.table
rows_we_want <- term_dt$gpa_cumul < 2.0
DT <- term_dt[rows_we_want, ..columns_we_want]
setkeyv(DT, key_columns)
result_dt <- DT[, .SD[1], by = c("mcid")]
setkey(result_dt, NULL)
# dplyr
result_tbl <- term_tbl %>%
  filter(gpa_cumul < 2.0) %>%
  select(all_of(columns_we_want)) %>%
  arrange(mcid, term) %>%
  group_by(mcid) %>%
  slice(1L) %>%
  ungroup()
result_dt
#>                  mcid   term           standing gpa_cumul
#>                <char> <char>             <char>     <num>
#>     1: MCID3111142283  19881 Academic Probation      1.85
#>     2: MCID3111142633  19883 Academic Probation      1.96
#>     3: MCID3111142784  19883 Academic Probation      1.81
#>     4: MCID3111142895  19881 Academic Probation      1.81
#>     5: MCID3111142982  19896 Academic Probation      1.97
#>    ---                                                   
#> 11675: MCID3112897319  20181 Academic Dismissal      1.77
#> 11676: MCID3112897359  20181 Academic Dismissal      1.97
#> 11677: MCID3112897470  20181 Academic Dismissal      1.36
#> 11678: MCID3112897505  20181 Academic Dismissal      1.57
#> 11679: MCID3112898068  20181 Academic Dismissal      1.45

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

  • base R. Subset for rows with GPA < 2.0. Like we did above, order rows by ID and term use by() to split-apply-combine operate on all four columns, but retain the first instance (one row) only by ID.

  • data.table. Subset rows and columns; setkeyv() orders the rows and .SD[1] subsets for the first instance only (one row).

  • dplyr. Subset rows and columns; order rows; use slice(1L) to subset the first row by ID.

Check that we have one row per ID as expected.

(n_mcid <- length(unique(result_dt$mcid)))
#> [1] 11679
(n_observation <- nrow(result_dt))
#> [1] 11679
all.equal(n_mcid, n_observation)
#> [1] TRUE

References

Back to top