Data selection

Image: *Select* by Mike Lawrence is licensed under CC BY 2.0

Using base R, data.table, and dplyr/tidyr to transform MIDFIELD data as needed for your analysis.

In this section we demonstrate subsetting rows and columns in various configurations, removing records with incomplete data, and reordering rows and columns.

Setup

Load packages.

library("midfielddata")
library("data.table")
library("dplyr", warn.conflicts = FALSE)
library("tidyr")
library("wrapr")

Format data as described in the Introduction.

data(student, term)

# base R
student_df <- data.frame(student)
term_df <- data.frame(term)

# data.table
student_dt <- copy(student)
term_dt <- copy(term)

# dplyr
student_tbl <- tibble(student)
term_tbl <- tibble(term)

Subsetting rows and columns

A common task when working with data is selecting a subset of rows, columns, or both.

Indirect indexing

Using intermediate variables columns_we_want and rows_we_want that refer to the values used to subset rows and columns.

Task. Prepare data to summarize domestic students by race, sex, and institution.

Our first step might be to subset columns (institution, race, sex, and us_citizen) and subset rows such that US citizenship is true.

# character vector of column names
columns_we_want <- c("institution", "race", "sex", "us_citizen")

# base R
rows_we_want <- student_df$us_citizen == "Yes"
result_df <- student_df[rows_we_want, columns_we_want, drop = FALSE]

# data.table
rows_we_want <- student_dt$us_citizen == "Yes"
result_dt <- student_dt[rows_we_want, ..columns_we_want]

# dplyr
rows_we_want <- student_tbl$us_citizen == "Yes"
result_tbl <- student_tbl %>%
  select(all_of(columns_we_want)) %>%
  filter(rows_we_want)

result_dt
#>          institution          race    sex us_citizen
#>               <char>        <char> <char>     <char>
#>     1: Institution B         Asian   Male        Yes
#>     2: Institution J         Asian Female        Yes
#>     3: Institution J         Asian   Male        Yes
#>     4: Institution J         Asian   Male        Yes
#>     5: Institution J         Asian   Male        Yes
#>    ---                                              
#> 90387: Institution B         White Female        Yes
#> 90388: Institution B         White Female        Yes
#> 90389: Institution B         White Female        Yes
#> 90390: Institution B         White Female        Yes
#> 90391: Institution B Other/Unknown   Male        Yes

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

base R. Base R’s default behavior when selecting a single column is to return a vector. Thus, when working with [,] always add a third argument drop = FALSE to ensure a data frame is returned.
data.table. The .. notation indicates that columns_we_want refers to a character vector of column names and is not itself a column name.
dplyr. The added notation all_of() in effect serves the same purpose in dplyr as .. does in data.table.

Direct indexing

Column and row selection that is typed directly into the subsetting operators [,] or filter() and select() without using intermediate variables.

Task. Same as previous example.

# base R
result_df <- student_df[student_df$us_citizen == "Yes",
  c("institution", "race", "sex", "us_citizen"),
  drop = FALSE
]

# data.table
result_dt <- student_dt[us_citizen == "Yes", .(institution, race, sex, us_citizen)]

# dplyr
result_tbl <- student_tbl %>%
  select(institution, race, sex, us_citizen) %>%
  filter(us_citizen == "Yes")

result_dt
#>          institution          race    sex us_citizen
#>               <char>        <char> <char>     <char>
#>     1: Institution B         Asian   Male        Yes
#>     2: Institution J         Asian Female        Yes
#>     3: Institution J         Asian   Male        Yes
#>     4: Institution J         Asian   Male        Yes
#>     5: Institution J         Asian   Male        Yes
#>    ---                                              
#> 90387: Institution B         White Female        Yes
#> 90388: Institution B         White Female        Yes
#> 90389: Institution B         White Female        Yes
#> 90390: Institution B         White Female        Yes
#> 90391: Institution B Other/Unknown   Male        Yes

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

base R. For direct indexing, intermediate variable names in [,] are replaced verbatim by their values.
data.table. The notation .() subsets columns by name without quotation marks; rows are subset without the $ operator.
dplyr. Similarly, select() operates on unquoted column names; rows are subset without the $ operator.

Subset rows with multiple conditions

Multiple logical conditions are combined with operators such as & (and), | (or), ! (not), etc.

Task. Find IDs of transfer students with 15 or more credit hours transferred.

When a variable is overwritten multiple times to get to a result, we will sometimes employ the variable DF as an intermediate variable.

columns_we_want <- c("mcid", "transfer", "hours_transfer")

# base R
rows_we_want <- student_df$transfer == "First-Time Transfer" & student_df$hours_transfer >= 15
DF <- student_df[rows_we_want, columns_we_want, drop = FALSE]

# remove rows with NA in every column
rows_we_want <- rowSums(is.na(DF)) < ncol(DF)
result_df <- DF[rows_we_want, , drop = FALSE]

# data.table
result_dt <- student_dt[transfer == "First-Time Transfer" & hours_transfer >= 15, ..columns_we_want]

# dplyr
result_tbl <- student_tbl %>%
  select(mcid, transfer, hours_transfer) %>%
  filter(transfer == "First-Time Transfer" & hours_transfer >= 15)

result_dt
#>                 mcid            transfer hours_transfer
#>               <char>              <char>          <num>
#>    1: MCID3111142840 First-Time Transfer             44
#>    2: MCID3111142895 First-Time Transfer             16
#>    3: MCID3111142910 First-Time Transfer             41
#>    4: MCID3111142914 First-Time Transfer            156
#>    5: MCID3111142983 First-Time Transfer             24
#>   ---                                                  
#> 9458: MCID3112898091 First-Time Transfer            110
#> 9459: MCID3112898410 First-Time Transfer             71
#> 9460: MCID3112898411 First-Time Transfer             68
#> 9461: MCID3112898412 First-Time Transfer             30
#> 9462: MCID3112898801 First-Time Transfer            218

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

base R. Two steps required. Subset rows by the original logical conditions; then subset to remove rows with NA across all columns. rowSums() assigns 1 to TRUE and 0 to FALSE. If the number of NAs in a row is less than the number of columns, then that row contains non-NA data and we keep it.
data.table. Multiple conditions separated by logical AND &. Does not generate all-NA rows as in base R.
dplyr. Similar.

Subset rows to omit values

Sometimes the intention of a row subset is to keep all rows except those that contain a particular value—implemented with a logical NOT (!).

Task. Retain all rows of the student data except those with an “unknown” value (case-insensitive) in the sex column.

# base R
rows_we_want <- !grepl("unknown", student_df$sex, ignore.case = TRUE)
result_df <- student_df[rows_we_want, , drop = FALSE]

# data.table
result_dt <- student_dt[!sex %ilike% "unknown"]

# dplyr
result_tbl <- student_tbl %>%
  filter(!grepl("unknown", sex, ignore.case = TRUE))

glimpse(result_dt)
#> Rows: 97,554
#> Columns: 13
#> $ mcid           <chr> "MCID3111142225", "MCID3111142283", "MCID3111142290", "…
#> $ institution    <chr> "Institution B", "Institution J", "Institution J", "Ins…
#> $ transfer       <chr> "First-Time Transfer", "First-Time Transfer", "First-Ti…
#> $ hours_transfer <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ race           <chr> "Asian", "Asian", "Asian", "Asian", "Asian", "Asian", "…
#> $ sex            <chr> "Male", "Female", "Male", "Male", "Male", "Male", "Male…
#> $ age_desc       <chr> "Under 25", "Under 25", "Under 25", "Under 25", "Under …
#> $ us_citizen     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",…
#> $ home_zip       <chr> NA, "22020", "23233", "20853", "22003", "23234", "24148…
#> $ high_school    <chr> NA, NA, "471872", NA, "470080", "471877", NA, NA, NA, N…
#> $ sat_math       <dbl> NA, 560, 510, 640, 600, 570, 480, NA, NA, NA, 520, 430,…
#> $ sat_verbal     <dbl> NA, 230, 380, 460, 500, 530, 530, NA, NA, NA, 490, 490,…
#> $ act_comp       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

base R. Uses grepl()
data.table. %ilike% is shorthand for grepl() with ignore.case = TRUE
dplyr. Also uses grepl()

Confirm that the correct rows have been deleted.

# Before deletions
sort(unique(student_dt$sex))
#> [1] "Female"  "Male"    "Unknown"
# After deletions 
sort(unique(result_dt$sex))
#> [1] "Female" "Male"

Subset one column as a vector

Task. Obtain a vector (vec) of unique IDs in the term dataset.

Because results are vectors, we compare using all.equal() instead of check_equiv_frames().

# base R
vec_base <- term_df[, "mcid"]
vec_base <- unique(vec_base)

# data.table
vec_dtbl <- term_dt[, mcid]
vec_dtbl <- unique(vec_dtbl)

# dplyr
vec_dplyr <- term_tbl %>%
  pull(mcid) %>%
  unique()

str(vec_dtbl)
#>  chr [1:97555] "MCID3111142225" "MCID3111142283" "MCID3111142290" ...

all.equal(vec_base, vec_dtbl)
#> [1] TRUE
all.equal(vec_dtbl, vec_dplyr)
#> [1] TRUE

Comments

base R. For clarity, we deliberately write these steps in two lines of code. One could just as easily write the single line result_df <- unique(term_df[, "mcid"]).
data.table.
dplyr. pull(mcid) is safe as long as the mcid column exists in the data frame from which it is pulled. Safer to use pull(.data$mcid) as described in (Layton, 2024).

Removing rows with incomplete records

An important variation of subsetting data is removing rows of data that have missing values. If appropriate, limit the dropped rows to those with missing values in specific columns.

Task. We are asked to compare SAT scores by institution. Before doing any analysis, we first remove rows with missing SAT scores.

Delete rows with a missing value in any column.

# base R
DF <- student_df[, c("institution", "sat_math", "sat_verbal"), drop = FALSE]
result_df <- na.omit(DF)

# data.table
DT <- student_dt[, .(institution, sat_math, sat_verbal)]
result_dt <- na.omit(DT)

# dplyr
result_tbl <- student_tbl %>%
  select(institution, sat_math, sat_verbal) %>%
  drop_na()

result_dt
#>          institution sat_math sat_verbal
#>               <char>    <num>      <num>
#>     1: Institution J      560        230
#>     2: Institution J      510        380
#>     3: Institution J      640        460
#>     4: Institution J      600        500
#>     5: Institution J      570        530
#>    ---                                  
#> 77096: Institution B      530        560
#> 77097: Institution B      570        580
#> 77098: Institution B      510        590
#> 77099: Institution B      420        590
#> 77100: Institution B      470        540

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

base R. Where before we used the name columns_we_want to reference a character vector of column names, here we use the character vector itself.
data.table.
dplyr. drop_na() is from the tidyr package.

Delete rows with a missing value in specified colums.

Task.

# base R
rows_we_want <- !(is.na(student_df$institution) |
  is.na(student_df$sat_math) |
  is.na(student_df$sat_verbal))
result_df <- student_df[rows_we_want, , drop = FALSE]

# data.table
result_dt <- na.omit(student_dt, cols = c("institution", "sat_math", "sat_verbal"))

# dplyr
result_tbl <- student_tbl %>%
  drop_na(institution, sat_math, sat_verbal)

result_dt
#>                  mcid   institution              transfer hours_transfer
#>                <char>        <char>                <char>          <num>
#>     1: MCID3111142283 Institution J   First-Time Transfer             NA
#>     2: MCID3111142290 Institution J   First-Time Transfer             NA
#>     3: MCID3111142294 Institution J   First-Time Transfer             NA
#>     4: MCID3111142299 Institution J   First-Time Transfer             NA
#>     5: MCID3111142303 Institution J   First-Time Transfer             NA
#>    ---                                                                  
#> 77096: MCID3112898886 Institution B First-Time in College             NA
#> 77097: MCID3112898890 Institution B First-Time in College             NA
#> 77098: MCID3112898894 Institution B First-Time in College             NA
#> 77099: MCID3112898895 Institution B First-Time in College             NA
#> 77100: MCID3112898940 Institution B First-Time in College             NA
#>                 race    sex age_desc us_citizen home_zip high_school sat_math
#>               <char> <char>   <char>     <char>   <char>      <char>    <num>
#>     1:         Asian Female Under 25        Yes    22020        <NA>      560
#>     2:         Asian   Male Under 25        Yes    23233      471872      510
#>     3:         Asian   Male Under 25        Yes    20853        <NA>      640
#>     4:         Asian   Male Under 25        Yes    22003      470080      600
#>     5:         Asian   Male Under 25        Yes    23234      471877      570
#>    ---                                                                       
#> 77096:         White Female Under 25        Yes    53711      501165      530
#> 77097:         White Female Under 25        Yes    54956      501275      570
#> 77098:         White Female Under 25        Yes    53716      501160      510
#> 77099:         White Female Under 25        Yes    53029      500853      420
#> 77100: Other/Unknown   Male Under 25        Yes    20016      090073      470
#>        sat_verbal act_comp
#>             <num>    <num>
#>     1:        230       NA
#>     2:        380       NA
#>     3:        460       NA
#>     4:        500       NA
#>     5:        530       NA
#>    ---                    
#> 77096:        560       29
#> 77097:        580       25
#> 77098:        590       24
#> 77099:        590       32
#> 77100:        540       32

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

base R. Using logical operators ! NOT and | OR to create a logical vector used as a row index.
data.table. The same function name na.omit() as base R but with an argument to drops rows with missing values in specified columns. Retains all columns.
dplyr. Similarly, drop_na() drops rows with missing values in specified columns, retaining all columns.

Ordering rows or columns

We sometimes using the intermediate variables order_index (integer vector) or order_cols (character vector) to emphasize the steps of the ordering process.

Ascending rows

In all three systems, the default sort order is increasing (ascending) from rows 1 to N.

Task. Suppose we want to know a student’s level and declared major in their first term at their institution. We use the term dataset.
We decide to order the rows first by institution, then by ID, then by term. In each case, we select columns in a separate operation so we can highlight the row-ordering operation.

columns_we_want <- c("institution", "mcid", "term", "level", "cip6")

# base R
DF <- term_df[, columns_we_want, drop = FALSE]
order_index <- with(DF, order(institution, mcid, term))
result_df <- DF[order_index, , drop = FALSE]

# data.table
result_dt <- term_dt[, ..columns_we_want]
order_cols <- c("institution", "mcid", "term")
setorderv(result_dt, order_cols)

# dplyr
result_tbl <- term_tbl %>%
  select(institution, mcid, term, level, cip6) %>%
  arrange(institution, mcid, term)

result_dt
#>           institution           mcid   term         level   cip6
#>                <char>         <char> <char>        <char> <char>
#>      1: Institution B MCID3111142225  19881 01 First-year 140901
#>      2: Institution B MCID3111142689  19883 01 First-year 090401
#>      3: Institution B MCID3111142729  19881 01 First-year 141901
#>      4: Institution B MCID3111142729  19883 01 First-year 141901
#>      5: Institution B MCID3111142819  19881 01 First-year 230101
#>     ---                                                         
#> 639911: Institution J MCID3112447745  20096 01 First-year 040201
#> 639912: Institution J MCID3112447746  20096 01 First-year 190501
#> 639913: Institution J MCID3112447751  20096 01 First-year 240102
#> 639914: Institution J MCID3112447753  20096 01 First-year 520201
#> 639915: Institution J MCID3112447754  20096 01 First-year 240102

all.equal(result_df, result_dt, check.attributes = FALSE)
#> [1] TRUE
all.equal(result_dt, result_tbl, check.attributes = FALSE)
#> [1] TRUE

Comments

base R. Using with(), we can list the variables to order by without using the $ operator.
data.table. setorderv() reorders data in place—hence no assignment operator <- is required—and it operates on a character vector of column names.
dplyr. arrange() orders the rows of a data frame by the values of selected columns.

Descending rows

When we want a decreasing (descending) row order, it is typically needed for some but not all of our columns.

Task. Sort students by institution and descending transfer hours.

columns_we_want <- c("institution", "mcid", "hours_transfer")

# base R
DF <- student_df[, columns_we_want, drop = FALSE]
DF <- na.omit(DF)
order_index <- with(DF, order(institution, -hours_transfer, mcid))
result_df <- DF[order_index, , drop = FALSE]

# data.table
DT <- student_dt[, ..columns_we_want]
result_dt <- na.omit(DT)
order_cols <- c("institution", "hours_transfer", "mcid")
setorderv(result_dt, order_cols, order = c(1, -1, 1))

# dplyr
result_tbl <- student_tbl %>%
  select(institution, mcid, hours_transfer) %>%
  drop_na() %>%
  arrange(institution, desc(hours_transfer), mcid)

result_dt
#>          institution           mcid hours_transfer
#>               <char>         <char>          <num>
#>     1: Institution B MCID3112694097            278
#>     2: Institution B MCID3112788655            271
#>     3: Institution B MCID3112693474            238
#>     4: Institution B MCID3112500991            229
#>     5: Institution B MCID3112879516            228
#>    ---                                            
#> 13925: Institution C MCID3112785098              1
#> 13926: Institution C MCID3112785144              1
#> 13927: Institution C MCID3112785478              1
#> 13928: Institution C MCID3112785480              1
#> 13929: Institution C MCID3112785484              1

all.equal(result_df, result_dt, check.attributes = FALSE)
#> [1] TRUE
all.equal(result_dt, result_tbl, check.attributes = FALSE)
#> [1] TRUE

Comments

base R. Base R does not provide a general way to sort individual variables in descending order, but a minus sign works with numerical variables. The decreasing argument, if used, applies to all variables.
data.table. In setorderv(), add the order argument with 1 for ascending and -1 for descending.
dplyr. In arrange() apply desc() to the descending variable.

Ordering columns

Whenever we’ve subset columns using indirect or direct indexing, the placement of column names implicitly orders those columns. For example, using

    columns_we_want <- c("institution", "race", "sex", "us_citizen")

as a column index yields a data frame with the columns in the order given. Sometimes however we want to move one or more columns to the left in a data frame.

Task. Using the result from the previous example, move transfer hours to the leftmost column followed by ID.

# base R
column_order <- union(c("hours_transfer", "mcid"), names(result_df))
result_df <- result_df[, column_order, drop = FALSE]

# data.table
setcolorder(result_dt, c("hours_transfer", "mcid"))

# dplyr
result_tbl <- result_tbl %>%
  relocate(hours_transfer, mcid)

result_dt
#>        hours_transfer           mcid   institution
#>                 <num>         <char>        <char>
#>     1:            278 MCID3112694097 Institution B
#>     2:            271 MCID3112788655 Institution B
#>     3:            238 MCID3112693474 Institution B
#>     4:            229 MCID3112500991 Institution B
#>     5:            228 MCID3112879516 Institution B
#>    ---                                            
#> 13925:              1 MCID3112785098 Institution C
#> 13926:              1 MCID3112785144 Institution C
#> 13927:              1 MCID3112785478 Institution C
#> 13928:              1 MCID3112785480 Institution C
#> 13929:              1 MCID3112785484 Institution C

all.equal(result_df, result_dt, check.attributes = FALSE)
#> [1] TRUE
all.equal(result_dt, result_tbl, check.attributes = FALSE)
#> [1] TRUE

Comments

base R. union() is a set operation that discards duplicated values. The resulting character vector contains all the column names, reordered, and sorting is performed by [.
data.table. setcolorder() changes the position of columns in a data frame, by reference, moving the specified columns to the “front”.
dplyr. relocate() changes the position of columns in a data frame moving the specified columns to the “front”.

References

Layton, R. (2024). A subtle flaw in pull(). Blog post. https://graphdr.github.io/data-stories/posts/2024-08-01-subtle-flaw-pull/