library("midfielddata")
library("data.table")
library("dplyr", warn.conflicts = FALSE)
library("tidyr")
library("wrapr")
Going further
Using base R, data.table, and dplyr/tidyr to transform MIDFIELD data as needed for your analysis.
In this section we demonstrate …
Setup
Load packages.
Load datasets from midfielddata.
data(student, term, degree)
Recode strings in degree
for output readability as described in Basic Transforms.
Format data as described in the Introduction.
# base R
student_df <- data.frame(student)
term_df <- data.frame(term)
degree_df <- data.frame(degree)
Advanced row-subsetting by groups
When a specific instance yields a block of rows
When subsetting rows for a specific instance of one variable (first, last, etc.) by group or key variable, we may need to allow for the possible return of more than one row per key.
For example, from the degree
data table, subsetting for the first degree-term for the following student should return two rows, one each for Electrical Engineering and Mechanical Engineering.
id_1 <- "MCID3112353622"
degree[mcid == id_1]
#> mcid institution term_degree cip6 degree
#> <char> <char> <char> <char> <char>
#> 1: MCID3112353622 Institution C 20133 141001 BS in Electrical Engineering
#> 2: MCID3112353622 Institution C 20133 141901 BS in Mechanical Engineering
In contrast, subsetting for the first term by ID for this next student should return one row for their degree in Economics (Summer 2003) and drop the row for their degree in History (Fall 2006).
id_2 <- "MCID3111600469"
degree[mcid == id_2]
#> mcid institution term_degree cip6 degree
#> <char> <char> <char> <char> <char>
#> 1: MCID3111600469 Institution C 20024 450601 BA in Economics
#> 2: MCID3111600469 Institution C 20061 540101 BA in History
Task. Retain all rows of the earliest term in which students earn one or more degrees.
# base R
DF <- degree_df[, key_columns, drop = FALSE]
order_index <- with(DF, order(mcid, term_degree))
DF <- DF[order_index, , drop = FALSE]
DF_by <- by(DF, DF$mcid, function(x)
data.frame(mcid = unique(x$mcid),
term_degree = x$term_degree[1]))
DF_combine <- do.call(rbind, DF_by)
DF <- degree_df[, columns_we_want, drop = FALSE]
result_df <- merge(DF_combine, DF, by = key_columns, all.x = TRUE)
rownames(result_df) <- NULL
result_dt
#> mcid term_degree degree
#> <char> <char> <char>
#> 1: MCID3111142225 19881 BS in Electrical Engineering
#> 2: MCID3111142290 19921 BS in Electrical Engineering
#> 3: MCID3111142294 19903 BS in Electrical Engineering
#> 4: MCID3111142299 19921 BS in Electrical Engineering
#> 5: MCID3111142689 19913 BA in Journalism
#> ---
#> 49614: MCID3112829602 20173 BA in Political Science and Government
#> 49615: MCID3112831015 20181 BA in Geography
#> 49616: MCID3112839623 20181 BS in Linguistics
#> 49617: MCID3112845220 20181 BS in Mathematics
#> 49618: MCID3112845673 20174 BS in Speech Communication and Rhetoric
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R. Order rows by ID and term; use
by()
for a split-apply-combine strategy to create a data frame with two columns: ID and first degree-term; and left-join usingmerge()
to recover the matching degree data, including rows with more than one degree per student ID.data.table.
setkeyv()
orders the rows and.SD[...]
subsets for the earliest termby
ID, including any multiple rows.dplyr.
arrange()
orders the rows,group_by()
sets ID as the key, andfilter()
subsets for the earliest term, including any multiple rows.
Check that we subset the relevant rows. The result for id_1
(from the earlier example) correctly includes two degrees in the same term.
result_dt[mcid == id_1]
#> mcid term_degree degree
#> <char> <char> <char>
#> 1: MCID3112353622 20133 BS in Electrical Engineering
#> 2: MCID3112353622 20133 BS in Mechanical Engineering
The result for id_2
correctly includes their first degree in Economics and drops their subsequent degree in History.
result_dt[mcid == id_2]
#> mcid term_degree degree
#> <char> <char> <char>
#> 1: MCID3111600469 20024 BA in Economics
When a specific instance yields one row
In contrast to the previous case, there are times when we expect subsetting for a specific instance by group to yield one row only.
For example, suppose we want to identify the first term in which a student’s cumulative GPA dropped below 2.0. By definition, the result should yield only one row (or none) per ID.
# base R
rows_we_want <- term_df$gpa_cumul < 2.0
DF <- term_df[rows_we_want, columns_we_want, drop = FALSE]
order_index <- with(DF, order(mcid, term))
DF <- DF[order_index, , drop = FALSE]
DF_by <- by(DF, DF$mcid, function(x)
data.frame(mcid = x$mcid[1],
term = x$term[1],
standing = x$standing[1],
gpa_cumul = x$gpa_cumul[1]))
result_df <- do.call(rbind, DF_by)
rownames(result_df) <- NULL
result_dt
#> mcid term standing gpa_cumul
#> <char> <char> <char> <num>
#> 1: MCID3111142283 19881 Academic Probation 1.85
#> 2: MCID3111142633 19883 Academic Probation 1.96
#> 3: MCID3111142784 19883 Academic Probation 1.81
#> 4: MCID3111142895 19881 Academic Probation 1.81
#> 5: MCID3111142982 19896 Academic Probation 1.97
#> ---
#> 11675: MCID3112897319 20181 Academic Dismissal 1.77
#> 11676: MCID3112897359 20181 Academic Dismissal 1.97
#> 11677: MCID3112897470 20181 Academic Dismissal 1.36
#> 11678: MCID3112897505 20181 Academic Dismissal 1.57
#> 11679: MCID3112898068 20181 Academic Dismissal 1.45
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R. Subset for rows with GPA < 2.0. Like we did above, order rows by ID and term use
by()
to split-apply-combine operate on all four columns, but retain the first instance (one row) only by ID.data.table. Subset rows and columns;
setkeyv()
orders the rows and.SD[1]
subsets for the first instance only (one row).dplyr. Subset rows and columns; order rows; use
slice(1L)
to subset the first row by ID.
Check that we have one row per ID as expected.