Reshaping transforms

Image: SAREC field day 2018 by Univ Wyoming Extension CT is licensed under CC BY-NC-ND 2.0

Using cdata, data.table, and dplyr/tidyr to transform MIDFIELD data as needed for your analysis (base R pivoting functions are not covered in this section).

In this section, we cover reshaping data frames by moving data between rows and columns, often called pivoting.

Terminology

Pivoting operations go by a number of different names, as listed in Table 1 (Mount & Zumel, 2019).

Table 1: Reshaping operations
System Pivot "wider" Pivot "longer"
colloquially called "tall to wide" "wide to tall"
Microsoft Excel pivot un-pivot
databases pivot anti-pivot
databases crosstab shred
R unstack stack
reshape, reshape2 cast melt


Pivoting is possible in base R, but the functions provided by data.table, tidyr, and cdata are significantly better tools. Table 2 lists the reshaping functions we use in this work.

Table 2: Reshaping functions we use
R package Pivot "wider" Pivot "longer"
data.table dcast() melt()
tidyr pivot_wider() pivot_longer()
cdata pivot_to_rowrecs() unpivot_to_blocks()


The data set we use in this section is shown in Table 3, where “starters” is the number of students starting in an engineering program, “ever enrolled” is the number who were enrolled at least one term in the program, and “graduates” is the number graduating from that program.

Table 3. Row-records form
program starters ever_enrolled graduates
Civil 983 1504 936
Electrical 1080 1490 736
Industrial 285 353 238
Mechanical 1705 2306 1354


These data are in “wide” format, with one row per program. In this form, the “coordinates” of a measurement (a count of students) are the value of the primary key (program) and a column name.1

1 We refer again to Mount & Zumel (2019) for the concept of coordinatized data.

In Table 4, these data are transformed to “tall” format, with three rows per program. The names of three columns in Table 3 are now values in the peer_group column of Table 4, and all counts of students are collected in the N column. In this form, the coordinates of a measurement (N) are given by the composite key (program + peer_group).

Table 4. Blocks form
program peer_group N
Civil starters 983
Electrical starters 1080
Industrial starters 285
Mechanical starters 1705
Civil ever_enrolled 1504
Electrical ever_enrolled 1490
Industrial ever_enrolled 353
Mechanical ever_enrolled 2306
Civil graduates 936
Electrical graduates 736
Industrial graduates 238
Mechanical graduates 1354


You can see why “wide” and “tall” are commonly used to describe these two formats. Yet “wider” implies having more columns, and (while true here) that is not always the case.

What is always true is that the first format comprises row records, or one row per primary key, and the second format comprises blocks, or multiple rows per primary key (Mount & Zumel, 2021). As long as the data are properly “coordinatized”, the transformations are invertible.

In our opinion, blocks and row records describe these data formats more memorably than the alternatives, so we use them to define our transforms as well:

  • blocks to row-records (i.e., pivoting or “tall to wide”)

  • row-records to blocks (i.e., un-pivoting or “wide to tall”)

Why transform?

We transform data between blocks and row records to meet specific needs. Most data processing steps, for example, have a preferred format for how data is arranged in rows and columns. Some packages, like ggplot2, require blocks; others, like the likert package, require row records if the survey data is in summary form (frequencies counted) but require blocks if not summarized, i.e., one row per respondent and one column per question.

Row records are generally preferred for publishing a data table such as Table 3 and they support analytical tasks as well. For example, the academic program metric “stickiness” is the ratio of the number of graduates from a program to the number ever enrolled in the program. With a data frame df organized like the Table 3 data, computing stickiness requires one line of code operating on values in the same row, e.g.,

# base R
df$stickiness <- df$graduates / df$ever_enrolled
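For comparison, the same one-line computation in data.table and dplyr syntax (a sketch, assuming df is a data.table or tibble with the same columns as Table 3):

# data.table
df[, stickiness := graduates / ever_enrolled]

# dplyr
df <- df %>% 
  mutate(stickiness = graduates / ever_enrolled)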

Blocks, on the other hand, are convenient for grouping and summarizing operations and, as stated earlier, are required for making charts with ggplot2. For example, with a data frame blocks organized like the Table 4 data, the three column names are the variables supplied as arguments to the aes() function, appearing in the chart on the horizontal scale (N), the vertical scale (program), and the legend (peer_group).

library(ggplot2)
ggplot(blocks, aes(x = N, 
                   y = program, 
                   color = peer_group, 
                   shape = peer_group)) +
  geom_point(size = 3) +
  theme_light()

Setup

Load packages.
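The package-loading code is not reproduced here; a minimal sketch of the libraries this section appears to rely on is given below (midfielddata supplies the practice data used later in this section).

library("data.table")   # fread(), dcast(), melt(), setorder()
library("dplyr")        # tibble(), count(), mutate(), %>%
library("tidyr")        # pivot_wider(), pivot_longer()
library("cdata")        # pivot_to_rowrecs(), unpivot_to_blocks()
library("wrapr")        # checkColsFormUniqueKeys()
library("ggplot2")      # charts
library("midfielddata") # practice data: student, term, course, degree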

Format data as described in the Introduction. The case_blocs data set is taken from the midfieldr case study with students from four engineering programs organized in three blocs or peer-groups: starters, ever-enrolled, and graduates.

case_blocs <- fread("data/case_blocs.csv")
# base R
case_blocs_df <- data.frame(case_blocs)
# data.table
case_blocs_dt <- copy(case_blocs)
# dplyr
case_blocs_tbl <- tibble(case_blocs)
case_blocs_dt
#>                  mcid    program peer_group
#>                <char>     <char>     <char>
#>     1: MCID3111142965 Electrical   starters
#>     2: MCID3111145102 Electrical   starters
#>     3: MCID3111150194 Industrial   starters
#>     4: MCID3111156083 Electrical   starters
#>     5: MCID3111156325 Electrical   starters
#>    ---                                     
#> 12966: MCID3112618553 Mechanical  graduates
#> 12967: MCID3112618574 Mechanical  graduates
#> 12968: MCID3112618976 Mechanical  graduates
#> 12969: MCID3112619484 Electrical  graduates
#> 12970: MCID3112641535 Mechanical  graduates


Count by program and peer group as described in Aggregating transforms. The result is in block-records format.

key_columns <- c("program", "peer_group") 
# base R
freq_table <- table(case_blocs_df$program,
                    case_blocs_df$peer_group,
                    dnn = key_columns)
blocks_df <- as.data.frame(freq_table, 
                         responseName = "N",
                         stringsAsFactors = FALSE)
# data.table
blocks_dt <- case_blocs_dt[, .N, by = key_columns]
setorder(blocks_dt, program, -N)
# dplyr
blocks_tbl <- case_blocs_tbl %>%
 count(program, peer_group, name = "N")
blocks_dt
#>        program    peer_group     N
#>         <char>        <char> <int>
#>  1:      Civil ever_enrolled  1504
#>  2:      Civil      starters   983
#>  3:      Civil     graduates   936
#>  4: Electrical ever_enrolled  1490
#>  5: Electrical      starters  1080
#>  6: Electrical     graduates   736
#>  7: Industrial ever_enrolled   353
#>  8: Industrial      starters   285
#>  9: Industrial     graduates   238
#> 10: Mechanical ever_enrolled  2306
#> 11: Mechanical      starters  1705
#> 12: Mechanical     graduates  1354

check_equiv_frames(blocks_df, blocks_dt)
#> [1] TRUE
check_equiv_frames(blocks_dt, blocks_tbl)
#> [1] TRUE
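The check_equiv_frames() function is not exported by the packages loaded in Setup; we assume it is a convenience function defined earlier in this series. As a rough illustration of what such a check verifies (same content, ignoring class, row order, and column order), a minimal sketch under that assumption might look like:

# Sketch only: an order-insensitive content check (our assumption of what
# check_equiv_frames() verifies; the actual definition is in the series)
check_frames_sketch <- function(x, y) {
  x <- as.data.frame(x)
  y <- as.data.frame(y)
  x <- x[order(names(x))]                    # ignore column order
  y <- y[order(names(y))]
  x <- x[do.call(order, x), , drop = FALSE]  # ignore row order
  y <- y[do.call(order, y), , drop = FALSE]
  isTRUE(all.equal(x, y, check.attributes = FALSE))
}
check_frames_sketch(blocks_df, blocks_dt)    # expected: TRUE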

Blocks to row-records

In this transform, we group a set of rows that go together (matching on keys) and combine them into one row by adding columns. The result is in row-records form, with one row per program.

Task. Transform the blocks data frames (blocks_df, blocks_dt, blocks_tbl) from block-records to row-records format, using program as the primary key.

# data.table
rowrecs_dt <- dcast(blocks_dt, 
                    program ~ peer_group, 
                    value.var = "N")
setkey(rowrecs_dt, NULL)
# tidyr
rowrecs_tbl <- blocks_tbl %>%
  pivot_wider(id_cols = program, 
              names_from = peer_group, 
              values_from = N)
# cdata
rowrecs_df <- pivot_to_rowrecs(blocks_df,  
                               rowKeyColumns = "program", 
                               columnToTakeKeysFrom = "peer_group",
                               columnToTakeValuesFrom = "N")
rowrecs_dt
#>       program ever_enrolled graduates starters
#>        <char>         <int>     <int>    <int>
#> 1:      Civil          1504       936      983
#> 2: Electrical          1490       736     1080
#> 3: Industrial           353       238      285
#> 4: Mechanical          2306      1354     1705

check_equiv_frames(rowrecs_df, rowrecs_dt)
#> [1] TRUE
check_equiv_frames(rowrecs_dt, rowrecs_tbl)
#> [1] TRUE

Comments

  • In all cases, the same transformation occurs: column program remains a row key column; the values of the peer_group column (ever_enrolled, graduates, starters) become new key column names with values taken from the N column.

  • data.table. dcast() uses a formula of the form LHS ~ RHS, with the row key column(s) on the left-hand side and the column supplying the new column names on the right-hand side (see the sketch following this list).

  • tidyr. pivot_wider(), an updated approach to spread().

  • cdata. pivot_to_rowrecs(), a special case of the more general function blocks_to_rowrecs().
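To make the dcast() formula interface concrete, here is a minimal sketch with toy data (hypothetical columns and values, not from the case study) showing that more than one row-key column can appear on the left-hand side, joined by "+":

# toy blocks with a composite row key (program + sex); hypothetical values
toy_blocks <- data.table(
  program    = rep(c("Civil", "Civil", "Electrical", "Electrical"), times = 2),
  sex        = rep(c("Female", "Male"), times = 4),
  peer_group = rep(c("starters", "graduates"), each = 4),
  N          = c(10, 20, 30, 40, 5, 10, 15, 20)
)

# both key columns appear on the LHS of the formula, joined by "+"
dcast(toy_blocks, program + sex ~ peer_group, value.var = "N")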

Row-records to blocks

In this transform—the inverse of blocks to row-records—each row is disassembled and used to make a block of many rows. Each block of records in the new data frame has as many rows per record as there are columns we take measurements from.

Task. Transform the results from the previous example from row-records to block-records format, using program as the primary key and all other column names collected under a new secondary key column peer_group.

# data.table
blocks_dt_2 <- melt(
  rowrecs_dt, 
  measure.vars = c("ever_enrolled", "graduates", "starters"), 
  variable.name = "peer_group", 
  value.name = "N", 
  variable.factor = FALSE
)
setorderv(blocks_dt_2, c("program", "peer_group"))
# tidyr
blocks_tbl <- rowrecs_tbl %>%
  pivot_longer(
    cols = c("ever_enrolled", "graduates", "starters"),
    names_to = "peer_group", 
    values_to = "N"
  )
# cdata
blocks_df <- unpivot_to_blocks(
  rowrecs_df, 
  columnsToTakeFrom = c("ever_enrolled", "graduates", "starters"),
  nameForNewKeyColumn = "peer_group", 
  nameForNewValueColumn = "N"
)
blocks_dt_2
#>        program    peer_group     N
#>         <char>        <char> <int>
#>  1:      Civil ever_enrolled  1504
#>  2:      Civil     graduates   936
#>  3:      Civil      starters   983
#>  4: Electrical ever_enrolled  1490
#>  5: Electrical     graduates   736
#>  6: Electrical      starters  1080
#>  7: Industrial ever_enrolled   353
#>  8: Industrial     graduates   238
#>  9: Industrial      starters   285
#> 10: Mechanical ever_enrolled  2306
#> 11: Mechanical     graduates  1354
#> 12: Mechanical      starters  1705

check_equiv_frames(blocks_df, blocks_dt_2)
#> [1] TRUE
check_equiv_frames(blocks_dt_2, blocks_tbl)
#> [1] TRUE

Comments

  • In all cases, the same transformation occurs: column names are identified (ever_enrolled, graduates, starters) and collected in a new key column (peer_group); the values from these columns are collected in a new payload column (N); and column names not selected (program) are treated as keys, copied to each record row in the result.

  • data.table. melt() names the columns to stack with the measure.vars argument; columns not listed are carried along as ID variables.

  • tidyr. pivot_longer(), an updated approach to gather().

  • cdata. unpivot_to_blocks(), a special case of the more general function rowrecs_to_blocks().

Confirm that the result is equivalent to the original blocks form.

check_equiv_frames(blocks_dt, blocks_dt_2)
#> [1] TRUE

MIDFIELD data keys

The previous sections have focused our attention on the importance of keys, that is, variables that alone or in combination uniquely identify a row of data. Here we use wrapr::checkColsFormUniqueKeys() to check if rows of data are uniquely addressable by the columns named. For example, for the Table 3 data in row records form, program is such a key:

checkColsFormUniqueKeys(rowrecs_dt, c("program"))
#> [1] TRUE

For the Table 4 data in blocks form, two columns are required:

checkColsFormUniqueKeys(blocks_dt, c("program", "peer_group"))
#> [1] TRUE
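Conversely, a check with program alone should fail for the blocks form, because each program spans three rows (a sketch; the expected result is noted in the comment):

checkColsFormUniqueKeys(blocks_dt, c("program"))
# expected: FALSE, because program alone does not uniquely identify a row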

For the midfielddata practice data, the keys are:

data(student)
checkColsFormUniqueKeys(student, c("mcid"))
#> [1] TRUE

data(term)
checkColsFormUniqueKeys(term, c("mcid", "term"))
#> [1] TRUE

data(course)
checkColsFormUniqueKeys(course, c("mcid", "term_course", "abbrev", "number"))
#> [1] TRUE

data(degree)
checkColsFormUniqueKeys(degree, c("mcid", "degree"))
#> [1] TRUE

Next steps

Reshaping transforms can also be applied to more than one column at a time, though we don't cover that use case here. For those interested, see the documentation for the reshaping functions listed in Table 2.

References

Mount, J., & Zumel, N. (2019). Coordinatized data: A fluid data specification. Win Vector LLC. http://winvector.github.io/FluidData/RowsAndColumns.html
Mount, J., & Zumel, N. (2021). Block records and row records. Win Vector LLC. https://winvector.github.io/cdata/articles/blocksrecs.html