library("midfielddata")
library("data.table")
library("dplyr", warn.conflicts = FALSE)
library("tidyr")
library("wrapr")
Aggregating transforms
Using base R, data.table, and dplyr/tidyr to transform MIDFIELD data as needed for your analysis.
In this section we demonstrate aggregating many observations of a variable into a set of summary observations, typically grouped by one or more grouping variables. Summaries include counts as well as means, medians, quartiles, etc.
Setup
Load packages.
Format data as described in the Introduction.
data(student, term)
# base R
student_df <- data.frame(student)
term_df <- data.frame(term)
Counting
Base R offers several approaches to the task of counting by groups, for example, by()
, aggregate()
, table()
, tapply()
, or a split-apply-combine strategy. For counting, I think table()
requires the least amount of cognitive overhead when the desired result is a data frame.
With one grouping variable
Task. Count the number of students in the student
dataset grouped by race.
# base R
freq_table <- table(student_df$race, dnn = "race")
result_df <- as.data.frame(freq_table,
responseName = "N",
stringsAsFactors = FALSE)
result_dt
#> race N
#> <char> <int>
#> 1: Asian 4193
#> 2: Black 1860
#> 3: International 7354
#> 4: Latine 5386
#> 5: Native American 403
#> 6: Other/Unknown 4509
#> 7: White 73850
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R.
table()
creates a contingency table—counts of every combination of the grouping variables—that we convert to a data frame. Column names are assigned using thednn
andresponseName
arguments.data.table. The
.N
function is an integer containing the number of rows in each group, with grouping variables assigned using theby
argument.setorderv()
is optional.dplyr.
count()
counts the number of unique values in a group. The response column name is assigned using thename
argument (default is lower casen
).
With two or more grouping variables
Data.table and dplyr count the frequency of each grouping-variable combination present in the data; base R counts the frequencies of all combinations, inserting counts of zero as needed.
Task. Count the number of students in the student
dataset grouped by institution, sex, and transfer status.
grouping_variables <- c("institution", "sex", "transfer")
# base R
freq_table <- table(student_df$institution,
student_df$sex,
student_df$transfer,
dnn = grouping_variables)
result_df <- as.data.frame(freq_table,
responseName = "N",
stringsAsFactors = FALSE)
result_df <- result_df[result_df$N > 0, , drop = FALSE]
result_dt
#> institution sex transfer N
#> <char> <char> <char> <int>
#> 1: Institution B Male First-Time in College 16998
#> 2: Institution B Female First-Time in College 15689
#> 3: Institution B Male First-Time Transfer 6946
#> 4: Institution B Female First-Time Transfer 6026
#> 5: Institution B Unknown First-Time in College 1
#> 6: Institution C Female First-Time in College 9991
#> 7: Institution C Male First-Time in College 8779
#> 8: Institution C Female First-Time Transfer 3978
#> 9: Institution C Male First-Time Transfer 3964
#> 10: Institution J Male First-Time in College 12051
#> 11: Institution J Female First-Time in College 9180
#> 12: Institution J Male First-Time Transfer 2413
#> 13: Institution J Female First-Time Transfer 1539
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R. Similar to the previous case with more grouping variables plus dropping rows with a count of zero to match the other systems’ outputs.
data.table. Similar to the previous case with more grouping variables plus ordering rows by ascending institution (
order = 1
) and descending N (order = -1
).dplyr. Similar to the previous case with more grouping variables plus arranging rows with ascending institution (default) and descending N using
desc()
.
Summarizing
Statistical summaries are commonly encountered to help describe trends in data, e.g., mean, median, standard deviation, minima, maxima, etc.
Of the several summarizing options available in base R, by()
seems the simplest to provide here. It generalizes easily to multiple and different summaries and the only cognitive overhead is the need to know about do.call()
, rbind()
, and anonymous functions.
With one grouping variable
Task. Using the term
data table, determine the 10% quantile, median, and 90% quantile of term GPA by academic standing (good standing, academic warning, etc.) plus the number of observations used in the summary. In all systems, we use the base R quanitile()
function to determine the 10% and 90% quantiles.
# base R
DF <- term_df[!is.na(term_df$gpa_term), , drop = FALSE]
DF_by <- by(DF, list(DF$standing), function(df) {
data.frame(standing = unique(df$standing),
N = nrow(df),
quant_10 = quantile(df$gpa_term, 0.10),
median_gpa = median(df$gpa_term),
quant_90 = quantile(df$gpa_term, 0.90))
})
result_df <- do.call(rbind, DF_by)
result_dt
#> standing N quant_10 median_gpa quant_90
#> <char> <int> <num> <num> <num>
#> 1: Good Standing 589883 2.33 3.14 3.90
#> 2: Academic Probation 18489 1.27 1.79 2.30
#> 3: Academic Warning 25961 1.14 1.72 1.93
#> 4: Academic Dismissal 2208 1.00 1.53 1.89
check_equiv_frames(result_df, result_dt)
#> [1] TRUE
check_equiv_frames(result_dt, result_tbl)
#> [1] TRUE
Comments
base R.
by()
is an implementation ofsplit() + lapply()
(the first two parts of the well-known split-apply-combine method) that yields a list of data frames, one per each combination the grouping variables.rbind()
combines those results into a single data frame.function(df)
is an anonymous function, that is, a function used without assigning it a name in the workspace.data.table.
dplyr.
With two or more grouping variables
Task. Same as the previous example except grouped by academic standing and level.
# base R
DF <- term_df[!is.na(term_df$gpa_term), , drop = FALSE]
DF_by <- by(DF, list(DF$standing, DF$level), FUN = function(df) {
data.frame(standing = unique(df$standing),
level = unique(df$level),
N = nrow(df),
quant_10 = quantile(df$gpa_term, 0.10),
median_gpa = median(df$gpa_term),
quant_90 = quantile(df$gpa_term, 0.90))
})
result_df <- do.call(rbind, DF_by)
order_index <- order(-result_df$median_gpa, -result_df$quant_90)
result_df <- result_df[order_index, , drop = FALSE]
result_dt
#> standing level N quant_10 median_gpa quant_90
#> <char> <char> <int> <num> <num> <num>
#> 1: Good Standing 05 Fifth-year Plus 38498 2.500 3.35 4.000
#> 2: Good Standing 04 Fourth-year 105146 2.440 3.28 3.940
#> 3: Good Standing 03 Third-year 140593 2.380 3.18 3.910
#> 4: Good Standing 02 Second-year 161786 2.310 3.08 3.850
#> 5: Good Standing 01 First-year 143860 2.250 3.00 3.820
#> 6: Academic Probation 03 Third-year 1030 1.309 2.00 2.670
#> 7: Academic Probation 04 Fourth-year 226 1.370 1.98 2.870
#> 8: Academic Probation 02 Second-year 4318 1.300 1.90 2.560
#> 9: Academic Probation 05 Fifth-year Plus 37 1.442 1.80 3.000
#> 10: Academic Probation 01 First-year 12878 1.250 1.75 2.080
#> 11: Academic Warning 01 First-year 3931 1.260 1.75 1.940
#> 12: Academic Warning 02 Second-year 9911 1.200 1.74 1.940
#> 13: Academic Warning 03 Third-year 6831 1.030 1.70 1.930
#> 14: Academic Warning 04 Fourth-year 4078 1.000 1.70 1.930
#> 15: Academic Warning 05 Fifth-year Plus 1210 1.000 1.70 1.920
#> 16: Academic Dismissal 01 First-year 1671 1.000 1.54 1.900
#> 17: Academic Dismissal 03 Third-year 75 1.000 1.53 1.866
#> 18: Academic Dismissal 02 Second-year 440 1.000 1.50 1.860
#> 19: Academic Dismissal 04 Fourth-year 20 0.700 1.42 1.692
#> 20: Academic Dismissal 05 Fifth-year Plus 2 1.062 1.31 1.558
#> standing level N quant_10 median_gpa quant_90
all.equal(result_df, result_dt, check.attributes = FALSE)
#> [1] TRUE
all.equal(result_dt, result_tbl, check.attributes = FALSE)
#> [1] TRUE
Comments
base R. Add the second grouping variable
level
in two places.data.table. Add a second grouping variable to the
by
argument. Order by descending median GPA with the 90% quantile used to break ties.dplyr. Add a second grouping variable to
group_by()
.