library("midfielddata")
library("data.table")
library("dplyr", warn.conflicts = FALSE)
library("tidyr")
library("wrapr")
Data selection
Using base R, data.table, and dplyr/tidyr to transform MIDFIELD data as needed for your analysis.
In this section we demonstrate subsetting rows and columns in various configurations, removing records with incomplete data, and reordering rows and columns.
Setup
Load packages.
Format data as described in the Introduction.
data(student, term)
# base R
student_df <- data.frame(student)
term_df <- data.frame(term)
Subsetting rows and columns
A common task when working with data is selecting a subset of rows, columns, or both.
Indirect indexing
Using intermediate variables columns_we_want
and rows_we_want
that refer to the values used to subset rows and columns.
Task. Prepare data to summarize domestic students by race, sex, and institution.
Our first step might be to subset columns (institution, race, sex, and us_citizen) and subset rows such that US citizenship is true.
# character vector of column names
columns_we_want <- c("institution", "race", "sex", "us_citizen")
# base R
rows_we_want <- student_df$us_citizen == "Yes"
result_df <- student_df[rows_we_want, columns_we_want, drop = FALSE]
# data.table
rows_we_want <- student_dt$us_citizen == "Yes"
result_dt <- student_dt[rows_we_want, ..columns_we_want]
result_dt
#> institution race sex us_citizen
#> <char> <char> <char> <char>
#> 1: Institution B Asian Male Yes
#> 2: Institution J Asian Female Yes
#> 3: Institution J Asian Male Yes
#> 4: Institution J Asian Male Yes
#> 5: Institution J Asian Male Yes
#> ---
#> 90387: Institution B White Female Yes
#> 90388: Institution B White Female Yes
#> 90389: Institution B White Female Yes
#> 90390: Institution B White Female Yes
#> 90391: Institution B Other/Unknown Male Yes
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R. Base R’s default behavior when selecting a single column is to return a vector. Thus, when working with
[,]
always add a third argumentdrop = FALSE
to ensure a data frame is returned.data.table. The
..
notation indicates thatcolumns_we_want
refers to a character vector of column names and is not itself a column name.dplyr. The added notation
all_of()
in effect serves the same purpose in dplyr as..
does in data.table.
Direct indexing
Column and row selection that is typed directly into the subsetting operators [,]
or filter()
and select()
without using intermediate variables.
Task. Same as previous example.
# base R
result_df <- student_df[student_df$us_citizen == "Yes",
c("institution", "race", "sex", "us_citizen"),
drop = FALSE
]
# data.table
result_dt <- student_dt[us_citizen == "Yes", .(institution, race, sex, us_citizen)]
result_dt
#> institution race sex us_citizen
#> <char> <char> <char> <char>
#> 1: Institution B Asian Male Yes
#> 2: Institution J Asian Female Yes
#> 3: Institution J Asian Male Yes
#> 4: Institution J Asian Male Yes
#> 5: Institution J Asian Male Yes
#> ---
#> 90387: Institution B White Female Yes
#> 90388: Institution B White Female Yes
#> 90389: Institution B White Female Yes
#> 90390: Institution B White Female Yes
#> 90391: Institution B Other/Unknown Male Yes
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R. For direct indexing, intermediate variable names in
[,]
are replaced verbatim by their values.data.table. The notation
.()
subsets columns by name without quotation marks; rows are subset without the$
operator.dplyr. Similarly,
select()
operates on unquoted column names; rows are subset without the$
operator.
Subset rows with multiple conditions
Multiple logical conditions are combined with operators such as & (and), | (or), ! (not), etc.
Task. Find IDs of transfer students with 15 or more credit hours transferred.
When a variable is overwritten multiple times to get to a result, we will sometimes employ the variable DF
as an intermediate variable.
columns_we_want <- c("mcid", "transfer", "hours_transfer")
# data.table
result_dt <- student_dt[transfer == "First-Time Transfer" & hours_transfer >= 15, ..columns_we_want]
result_dt
#> mcid transfer hours_transfer
#> <char> <char> <num>
#> 1: MCID3111142840 First-Time Transfer 44
#> 2: MCID3111142895 First-Time Transfer 16
#> 3: MCID3111142910 First-Time Transfer 41
#> 4: MCID3111142914 First-Time Transfer 156
#> 5: MCID3111142983 First-Time Transfer 24
#> ---
#> 9458: MCID3112898091 First-Time Transfer 110
#> 9459: MCID3112898410 First-Time Transfer 71
#> 9460: MCID3112898411 First-Time Transfer 68
#> 9461: MCID3112898412 First-Time Transfer 30
#> 9462: MCID3112898801 First-Time Transfer 218
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R. Two steps required. Subset rows by the original logical conditions; then subset to remove rows with NA across all columns.
rowSums()
assigns 1 to TRUE and 0 to FALSE. If the number of NAs in a row is less than the number of columns, then that row contains non-NA data and we keep it.data.table. Multiple conditions separated by logical AND
&
. Does not generate all-NA rows as in base R.dplyr. Similar.
Subset rows to omit values
Sometimes the intention of a row subset is to keep all rows except those that contain a particular value—implemented with a logical NOT (!
).
Task. Retain all rows of the student
data except those with an “unknown” value (case-insensitive) in the sex
column.
# base R
rows_we_want <- !grepl("unknown", student_df$sex, ignore.case = TRUE)
result_df <- student_df[rows_we_want, , drop = FALSE]
# data.table
result_dt <- student_dt[!sex %ilike% "unknown"]
glimpse(result_dt)
#> Rows: 97,554
#> Columns: 13
#> $ mcid <chr> "MCID3111142225", "MCID3111142283", "MCID3111142290", "…
#> $ institution <chr> "Institution B", "Institution J", "Institution J", "Ins…
#> $ transfer <chr> "First-Time Transfer", "First-Time Transfer", "First-Ti…
#> $ hours_transfer <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ race <chr> "Asian", "Asian", "Asian", "Asian", "Asian", "Asian", "…
#> $ sex <chr> "Male", "Female", "Male", "Male", "Male", "Male", "Male…
#> $ age_desc <chr> "Under 25", "Under 25", "Under 25", "Under 25", "Under …
#> $ us_citizen <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",…
#> $ home_zip <chr> NA, "22020", "23233", "20853", "22003", "23234", "24148…
#> $ high_school <chr> NA, NA, "471872", NA, "470080", "471877", NA, NA, NA, N…
#> $ sat_math <dbl> NA, 560, 510, 640, 600, 570, 480, NA, NA, NA, 520, 430,…
#> $ sat_verbal <dbl> NA, 230, 380, 460, 500, 530, 530, NA, NA, NA, 490, 490,…
#> $ act_comp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R. Uses
grepl()
data.table.
%ilike%
is shorthand forgrepl()
withignore.case = TRUE
dplyr. Also uses
grepl()
Confirm that the correct rows have been deleted.
Subset one column as a vector
Task. Obtain a vector (vec
) of unique IDs in the term
dataset.
Because results are vectors, we compare using all.equal()
instead of check_equiv_frames()
.
# base R
vec_base <- term_df[, "mcid"]
vec_base <- unique(vec_base)
# data.table
vec_dtbl <- term_dt[, mcid]
vec_dtbl <- unique(vec_dtbl)
Comments
base R. For clarity, we deliberately write these steps in two lines of code. One could just as easily write the single line
result_df <- unique(term_df[, "mcid"])
.data.table.
dplyr.
pull(mcid)
is safe as long as themcid
column exists in the data frame from which it is pulled. Safer to usepull(.data$mcid)
as described in (Layton, 2024).
Removing rows with incomplete records
An important variation of subsetting data is removing rows of data that have missing values. If appropriate, limit the dropped rows to those with missing values in specific columns.
Task. We are asked to compare SAT scores by institution. Before doing any analysis, we first remove rows with missing SAT scores.
Delete rows with a missing value in any column.
# data.table
DT <- student_dt[, .(institution, sat_math, sat_verbal)]
result_dt <- na.omit(DT)
result_dt
#> institution sat_math sat_verbal
#> <char> <num> <num>
#> 1: Institution J 560 230
#> 2: Institution J 510 380
#> 3: Institution J 640 460
#> 4: Institution J 600 500
#> 5: Institution J 570 530
#> ---
#> 77096: Institution B 530 560
#> 77097: Institution B 570 580
#> 77098: Institution B 510 590
#> 77099: Institution B 420 590
#> 77100: Institution B 470 540
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R. Where before we used the name
columns_we_want
to reference a character vector of column names, here we use the character vector itself.data.table.
dplyr.
drop_na()
is from the tidyr package.
Delete rows with a missing value in specified colums.
Task.
result_dt
#> mcid institution transfer hours_transfer
#> <char> <char> <char> <num>
#> 1: MCID3111142283 Institution J First-Time Transfer NA
#> 2: MCID3111142290 Institution J First-Time Transfer NA
#> 3: MCID3111142294 Institution J First-Time Transfer NA
#> 4: MCID3111142299 Institution J First-Time Transfer NA
#> 5: MCID3111142303 Institution J First-Time Transfer NA
#> ---
#> 77096: MCID3112898886 Institution B First-Time in College NA
#> 77097: MCID3112898890 Institution B First-Time in College NA
#> 77098: MCID3112898894 Institution B First-Time in College NA
#> 77099: MCID3112898895 Institution B First-Time in College NA
#> 77100: MCID3112898940 Institution B First-Time in College NA
#> race sex age_desc us_citizen home_zip high_school sat_math
#> <char> <char> <char> <char> <char> <char> <num>
#> 1: Asian Female Under 25 Yes 22020 <NA> 560
#> 2: Asian Male Under 25 Yes 23233 471872 510
#> 3: Asian Male Under 25 Yes 20853 <NA> 640
#> 4: Asian Male Under 25 Yes 22003 470080 600
#> 5: Asian Male Under 25 Yes 23234 471877 570
#> ---
#> 77096: White Female Under 25 Yes 53711 501165 530
#> 77097: White Female Under 25 Yes 54956 501275 570
#> 77098: White Female Under 25 Yes 53716 501160 510
#> 77099: White Female Under 25 Yes 53029 500853 420
#> 77100: Other/Unknown Male Under 25 Yes 20016 090073 470
#> sat_verbal act_comp
#> <num> <num>
#> 1: 230 NA
#> 2: 380 NA
#> 3: 460 NA
#> 4: 500 NA
#> 5: 530 NA
#> ---
#> 77096: 560 29
#> 77097: 580 25
#> 77098: 590 24
#> 77099: 590 32
#> 77100: 540 32
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R. Using logical operators
!
NOT and|
OR to create a logical vector used as a row index.data.table. The same function name
na.omit()
as base R but with an argument to drops rows with missing values in specified columns. Retains all columns.dplyr. Similarly,
drop_na()
drops rows with missing values in specified columns, retaining all columns.
Ordering rows or columns
We sometimes using the intermediate variables order_index
(integer vector) or order_cols
(character vector) to emphasize the steps of the ordering process.
Ascending rows
In all three systems, the default sort order is increasing (ascending) from rows 1 to N.
Task. Suppose we want to know a student’s level and declared major in their first term at their institution. We use the term
dataset.
We decide to order the rows first by institution, then by ID, then by term. In each case, we select columns in a separate operation so we can highlight the row-ordering operation.
columns_we_want <- c("institution", "mcid", "term", "level", "cip6")
result_dt
#> institution mcid term level cip6
#> <char> <char> <char> <char> <char>
#> 1: Institution B MCID3111142225 19881 01 First-year 140901
#> 2: Institution B MCID3111142689 19883 01 First-year 090401
#> 3: Institution B MCID3111142729 19881 01 First-year 141901
#> 4: Institution B MCID3111142729 19883 01 First-year 141901
#> 5: Institution B MCID3111142819 19881 01 First-year 230101
#> ---
#> 639911: Institution J MCID3112447745 20096 01 First-year 040201
#> 639912: Institution J MCID3112447746 20096 01 First-year 190501
#> 639913: Institution J MCID3112447751 20096 01 First-year 240102
#> 639914: Institution J MCID3112447753 20096 01 First-year 520201
#> 639915: Institution J MCID3112447754 20096 01 First-year 240102
all.equal(result_df, result_dt, check.attributes = FALSE)
#> [1] TRUE
all.equal(result_dt, result_tbl, check.attributes = FALSE)
#> [1] TRUE
Comments
base R. Using
with()
, we can list the variables to order by without using the$
operator.data.table.
setorderv()
reorders data in place—hence no assignment operator<-
is required—and it operates on a character vector of column names.dplyr.
arrange()
orders the rows of a data frame by the values of selected columns.
Descending rows
When we want a decreasing (descending) row order, it is typically needed for some but not all of our columns.
Task. Sort students by institution and descending transfer hours.
columns_we_want <- c("institution", "mcid", "hours_transfer")
result_dt
#> institution mcid hours_transfer
#> <char> <char> <num>
#> 1: Institution B MCID3112694097 278
#> 2: Institution B MCID3112788655 271
#> 3: Institution B MCID3112693474 238
#> 4: Institution B MCID3112500991 229
#> 5: Institution B MCID3112879516 228
#> ---
#> 13925: Institution C MCID3112785098 1
#> 13926: Institution C MCID3112785144 1
#> 13927: Institution C MCID3112785478 1
#> 13928: Institution C MCID3112785480 1
#> 13929: Institution C MCID3112785484 1
all.equal(result_df, result_dt, check.attributes = FALSE)
#> [1] TRUE
all.equal(result_dt, result_tbl, check.attributes = FALSE)
#> [1] TRUE
Comments
base R. Base R does not provide a general way to sort individual variables in descending order, but a minus sign works with numerical variables. The
decreasing
argument, if used, applies to all variables.data.table. In
setorderv()
, add theorder
argument with1
for ascending and-1
for descending.dplyr. In
arrange()
applydesc()
to the descending variable.
Ordering columns
Whenever we’ve subset columns using indirect or direct indexing, the placement of column names implicitly orders those columns. For example, using
columns_we_want <- c("institution", "race", "sex", "us_citizen")
as a column index yields a data frame with the columns in the order given. Sometimes however we want to move one or more columns to the left in a data frame.
Task. Using the result from the previous example, move transfer hours to the leftmost column followed by ID.
# data.table
setcolorder(result_dt, c("hours_transfer", "mcid"))
result_dt
#> hours_transfer mcid institution
#> <num> <char> <char>
#> 1: 278 MCID3112694097 Institution B
#> 2: 271 MCID3112788655 Institution B
#> 3: 238 MCID3112693474 Institution B
#> 4: 229 MCID3112500991 Institution B
#> 5: 228 MCID3112879516 Institution B
#> ---
#> 13925: 1 MCID3112785098 Institution C
#> 13926: 1 MCID3112785144 Institution C
#> 13927: 1 MCID3112785478 Institution C
#> 13928: 1 MCID3112785480 Institution C
#> 13929: 1 MCID3112785484 Institution C
all.equal(result_df, result_dt, check.attributes = FALSE)
#> [1] TRUE
all.equal(result_dt, result_tbl, check.attributes = FALSE)
#> [1] TRUE
Comments
base R.
union()
is a set operation that discards duplicated values. The resulting character vector contains all the column names, reordered, and sorting is performed by[
.data.table.
setcolorder()
changes the position of columns in a data frame, by reference, moving the specified columns to the “front”.dplyr.
relocate()
changes the position of columns in a data frame moving the specified columns to the “front”.