Basic transforms

Image: Transformers by danjo paluska is licensed under CC BY 2.0

Using base R, data.table, and dplyr/tidyr to transform MIDFIELD data as needed for your analysis.

In this section we demonstrate adding and removing columns, editing column names, creating two new columns by splitting a character column, converting the class of a column, and recoding the values in a column.

Setup

Load packages.

Format data as described in the Introduction.

data(student, term, degree)
# base R
student_df <- data.frame(student)
term_df <- data.frame(term)
degree_df <- data.frame(degree)
# data.table
student_dt <- copy(student)
term_dt <- copy(term)
degree_dt <- copy(degree)
# dplyr
student_tbl <- tibble(student)
term_tbl <- tibble(term)
degree_tbl <- tibble(degree)


Adding columns

Task. Add a column to the student table for total SAT score.

columns_we_want <- c("mcid", "sat_math", "sat_verbal", "act_comp")
# base R
result_df <- student_df[, columns_we_want, drop = FALSE]
result_df$sat_total <- result_df$sat_math + result_df$sat_verbal
# data.table
result_dt <- student_dt[, ..columns_we_want]
result_dt[, sat_total := sat_math + sat_verbal]
# dplyr
result_tbl <- student_tbl %>%
  select(all_of(columns_we_want)) %>%
  mutate(sat_total = sat_math + sat_verbal)
result_dt
#>                  mcid sat_math sat_verbal act_comp sat_total
#>                <char>    <num>      <num>    <num>     <num>
#>     1: MCID3111142225       NA         NA       NA        NA
#>     2: MCID3111142283      560        230       NA       790
#>     3: MCID3111142290      510        380       NA       890
#>     4: MCID3111142294      640        460       NA      1100
#>     5: MCID3111142299      600        500       NA      1100
#>    ---                                                      
#> 97551: MCID3112898886      530        560       29      1090
#> 97552: MCID3112898890      570        580       25      1150
#> 97553: MCID3112898894      510        590       24      1100
#> 97554: MCID3112898895      420        590       32      1010
#> 97555: MCID3112898940      470        540       32      1010

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

  • base R. df$new_col creates the new column.

  • data.table. := alters or creates columns “in place”. The data.table at hand gets altered instead of a new one being created.

  • dplyr. New columns are created with mutate().

Removing columns

In previous examples, when selecting some columns we have implicitly removed every column not selected (if any). Here we explicitly select the columns to be removed.

Task. Using the results from the previous example, remove the SAT math and verbal columns.

# base R
result_df[c("sat_math", "sat_verbal")] <- NULL
# data.table
result_dt[, c("sat_math", "sat_verbal") := NULL]
# dplyr
result_tbl <- result_tbl %>%
  select(-sat_math, -sat_verbal)
result_dt
#>                  mcid act_comp sat_total
#>                <char>    <num>     <num>
#>     1: MCID3111142225       NA        NA
#>     2: MCID3111142283       NA       790
#>     3: MCID3111142290       NA       890
#>     4: MCID3111142294       NA      1100
#>     5: MCID3111142299       NA      1100
#>    ---                                  
#> 97551: MCID3112898886       29      1090
#> 97552: MCID3112898890       25      1150
#> 97553: MCID3112898894       24      1100
#> 97554: MCID3112898895       32      1010
#> 97555: MCID3112898940       32      1010

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

  • base R.

  • data.table. Use a vector of column names and := to remove the columns “in place”.

  • dplyr. Use select() with minus signs.

Editing column names

Task. Using the results from the previous example, rename the columns.

In each case, we take care to explicitly match old and new names.

# base R
names(result_df)[names(result_df) == "mcid"] <- "Student"
names(result_df)[names(result_df) == "sat_total"] <- "SAT"
names(result_df)[names(result_df) == "act_comp"] <- "ACT"
# data.table
setnames(result_dt,
  old = c("mcid", "sat_total", "act_comp"),
  new = c("Student", "SAT", "ACT")
)
# dplyr
result_tbl <- result_tbl %>%
    rename(Student = mcid, SAT = sat_total, ACT = act_comp)
result_dt
#>               Student   ACT   SAT
#>                <char> <num> <num>
#>     1: MCID3111142225    NA    NA
#>     2: MCID3111142283    NA   790
#>     3: MCID3111142290    NA   890
#>     4: MCID3111142294    NA  1100
#>     5: MCID3111142299    NA  1100
#>    ---                           
#> 97551: MCID3112898886    29  1090
#> 97552: MCID3112898890    25  1150
#> 97553: MCID3112898894    24  1100
#> 97554: MCID3112898895    32  1010
#> 97555: MCID3112898940    32  1010

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

  • base R. While other approaches work too, this approach ensures that old and new names are explicitly matched.

  • data.table. setnames() edits the column names in place.

  • dplyr. rename() uses new_name = old_name syntax.

Split a character column

Task. In the term data, create new columns for year YYYY and term code T by splitting the character term column.

In each system, we use the base R substr() function to split the term character value into a 4-character year in one new column and a 1-character term-code in a new second column. We also drop rows with missing values (if any).

columns_we_want <- c("mcid", "term")
# base R
DF <- term_df[, columns_we_want, drop = FALSE]
DF$year <- substr(DF$term, 1, 4)
DF$term_code <- substr(DF$term, 5, 5)
result_df <- na.omit(DF)
# data.table
DT <- term_dt[, ..columns_we_want]
DT[, c("year", "term_code") := .(substr(term, 1, 4), substr(term, 5, 5))]
result_dt <- na.omit(DT)
# dplyr
result_tbl <- term_tbl %>%
  select(mcid, term) %>%
  mutate(year = substr(term, 1, 4), term_code = substr(term, 5, 5)) %>%
  drop_na()
options(datatable.print.topn = 5)
result_dt
#>                   mcid   term   year term_code
#>                 <char> <char> <char>    <char>
#>      1: MCID3111142225  19881   1988         1
#>      2: MCID3111142283  19881   1988         1
#>      3: MCID3111142283  19883   1988         3
#>      4: MCID3111142283  19885   1988         5
#>      5: MCID3111142283  19891   1989         1
#>     ---                                       
#> 639911: MCID3112898886  20181   2018         1
#> 639912: MCID3112898890  20181   2018         1
#> 639913: MCID3112898894  20181   2018         1
#> 639914: MCID3112898895  20181   2018         1
#> 639915: MCID3112898940  20181   2018         1

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

  • base R.

  • data.table.

  • dplyr.

Convert column class

Task. Using the results from the previous example, convert character year to numerical class.

# base R
result_df$year <- as.double(result_df$year)
# data.table
result_dt[, year := as.double(year)]
# dplyr
result_tbl <- result_tbl %>%
  mutate(year = as.double(year))
result_dt
#>                   mcid   term  year term_code
#>                 <char> <char> <num>    <char>
#>      1: MCID3111142225  19881  1988         1
#>      2: MCID3111142283  19881  1988         1
#>      3: MCID3111142283  19883  1988         3
#>      4: MCID3111142283  19885  1988         5
#>      5: MCID3111142283  19891  1989         1
#>     ---                                      
#> 639911: MCID3112898886  20181  2018         1
#> 639912: MCID3112898890  20181  2018         1
#> 639913: MCID3112898894  20181  2018         1
#> 639914: MCID3112898895  20181  2018         1
#> 639915: MCID3112898940  20181  2018         1

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

  • base R.

  • data.table.

  • dplyr.

Recode column values

When recoding the values of a column we have a choice: we can recode in place, eliminating the original values entirely, or we can add new column with the new coding. In this first example, we add a column.

Add a column with the recoded values

Task. Using the results from the previous example, add a column that re-codes the term-codes from numbers to academic terms “Fall”, “Winter”, etc. The existing column is term_code; the new column is term_name.

# base R
result_df <- within(result_df, {
    term_name <- NA
    term_name[term_code == "1"] <- "Fall"
    term_name[term_code == "2"] <- "Winter"
    term_name[term_code == "3"] <- "Spring"
    term_name[term_code == "4"] <- "Summer 1"
    term_name[term_code == "5"] <- "Summer 2"
    term_name[term_code == "6"] <- "Summer 3"
})
# data.table
result_dt[, term_name := fcase(
  term_code == "1", "Fall",
  term_code == "2", "Winter",
  term_code == "3", "Spring",
  term_code == "4", "Summer 1",
  term_code == "5", "Summer 2",
  term_code == "6", "Summer 3"
)]
# dplyr
result_tbl <- result_tbl %>%
  mutate(term_name = case_match(
    term_code,
    "1" ~ "Fall",
    "2" ~ "Winter",
    "3" ~ "Spring",
    "4" ~ "Summer 1",
    "5" ~ "Summer 2",
    "6" ~ "Summer 3"
  ))
result_dt
#>                   mcid   term  year term_code term_name
#>                 <char> <char> <num>    <char>    <char>
#>      1: MCID3111142225  19881  1988         1      Fall
#>      2: MCID3111142283  19881  1988         1      Fall
#>      3: MCID3111142283  19883  1988         3    Spring
#>      4: MCID3111142283  19885  1988         5  Summer 2
#>      5: MCID3111142283  19891  1989         1      Fall
#>     ---                                                
#> 639911: MCID3112898886  20181  2018         1      Fall
#> 639912: MCID3112898890  20181  2018         1      Fall
#> 639913: MCID3112898894  20181  2018         1      Fall
#> 639914: MCID3112898895  20181  2018         1      Fall
#> 639915: MCID3112898940  20181  2018         1      Fall

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

  • base R. within() is similar to with() but allows us to modify a data frame. We create the new column with term_name <- NA then systematically edit its values based on the values in term_code.

  • data.table. term_name is created using := then fcase() assigns its values based on logical expressions operating on term_code.

  • dplyr. term_name is created using mutate() then case_match() assigns its values by finding matches to the term_code values.

Recode strings in place

Here, we partially recode string values.

Task. In the degree data, shorten the values in the degree column by substituting “BA” for “Bachelor of Arts” and “BS” for Bachelor of Science” to make the printout more readable.

Some of the degree values are recorded as “Bachelor Science” instead of “Bachelor of Science”, so to simplify the process the first thing we do is remove all instances of the preposition “of”. We also select three columns to make the printout more readable.

columns_we_want <- c("mcid", "term_degree", "degree")
# base R
result_df <- degree_df[, columns_we_want, drop = FALSE]
result_df$degree <- gsub(" of", "", result_df$degree)
result_df$degree <- sub("Bachelor Science", "BS", result_df$degree)
result_df$degree <- sub("Bachelor Arts", "BA", result_df$degree)
result_df$degree <- sub("Bachelor Fine Arts", "BFA", result_df$degree)
# data.table
result_dt <- degree_dt[, ..columns_we_want]
result_dt[, degree := gsub(" of", "", degree)]
result_dt[, degree := sub("Bachelor Science", "BS", degree)]
result_dt[, degree := str_replace(degree, "Bachelor Arts", "BA")]
result_dt[, degree := str_replace(degree, "Bachelor Fine Arts", "BFA")]
# dplyr/stringr
result_tbl <- degree_tbl %>%
 select(all_of(columns_we_want)) %>%
 mutate(degree = str_replace_all(degree, " of", "")) %>% 
 mutate(degree = str_replace(degree, "Bachelor Science", "BS")) %>%
 mutate(degree = str_replace(degree, "Bachelor Arts", "BA")) %>%
 mutate(degree = str_replace(degree, "Bachelor Fine Arts", "BFA"))
result_dt
#>                  mcid term_degree                                  degree
#>                <char>      <char>                                  <char>
#>     1: MCID3111142225       19881            BS in Electrical Engineering
#>     2: MCID3111142290       19921            BS in Electrical Engineering
#>     3: MCID3111142294       19903            BS in Electrical Engineering
#>     4: MCID3111142299       19921            BS in Electrical Engineering
#>     5: MCID3111142689       19913                        BA in Journalism
#>    ---                                                                   
#> 49661: MCID3112829602       20173  BA in Political Science and Government
#> 49662: MCID3112831015       20181                         BA in Geography
#> 49663: MCID3112839623       20181                       BS in Linguistics
#> 49664: MCID3112845220       20181                       BS in Mathematics
#> 49665: MCID3112845673       20174 BS in Speech Communication and Rhetoric

check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE

Comments

  • base R. gsub() replaces all matches in a string; sub() replaces the first match only.

  • data.table. data.table does not have its own string manipulation functions, so here we show two operations using base R and two operations using stringr.

  • dplyr/stringr. str_replace_all() replaces all matches in a string; str_replace() replaces the first match only.

References

Back to top