midfieldr is an R package that provides tools and methods for studying undergraduate student-level records from the MIDFIELD database.
Introduction
midfieldr provides tools and demonstrates methods for working with individual undergraduate student-level records (registrar’s data) in R. Tools include filters for program codes, data sufficiency, and timely completion. Methods include gathering blocs of records, computing quantitative metrics such as graduation rate, and creating charts to visualize comparisons. midfieldr is designed to work with data from the MIDFIELD research database, a sample of which is available in the midfielddata data package.
midfieldr provides these functions for manipulating student-level data:
- add_completion_status(): Determine completion status for every student
- add_data_sufficiency(): Determine data sufficiency for every student
- add_timely_term(): Calculate a timely completion term for every student
- filter_cip(): Filter CIP data to match search strings
- prep_fye_mice(): Prepare FYE data for multiple imputation
- select_required(): Select required midfieldr variables
Additional functions for processing intermediate results:
- order_multiway(): Order categorical variables of multiway data
R packages in examples and vignettes
- Data preparation. We use the data.table system and some base R for data manipulation (Dowle and Srinivasan 2022). To assist users who might prefer other systems, the MIDFIELD Institute website (Lord et al. 2024) includes tutorials providing side-by-side base R, data.table, and dplyr solutions to common data shaping tasks using MIDFIELD practice data.
- Charts. Our preferred package for charts is ggplot2 (Wickham 2016). The lattice package (Sarkar 2008) also offers users comprehensive control over graphical elements (though our lattice experience is no longer current).
Usage
In this example, we gather all students ever enrolled in Engineering and summarize their graduation status (in any major), grouping by race/ethnicity and sex. If you are writing your own script to follow along, load the midfieldr, midfielddata, and data.table packages first: the functions, practice data, and table syntax used below come from them.
Load the practice data. Reduce initial dimensions of data tables.
# Load the practice data
data(student, term, degree)
# Reduce dimensions of source data tables
student <- select_required(student)
term <- select_required(term)
degree <- select_required(degree)
# View example result
term
#> mcid institution term cip6 level
#> <char> <char> <char> <char> <char>
#> 1: MCID3111142225 Institution B 19881 140901 01 First-year
#> 2: MCID3111142283 Institution J 19881 240102 01 First-year
#> 3: MCID3111142283 Institution J 19883 240102 01 First-year
#> ---
#> 639913: MCID3112898894 Institution B 20181 451001 01 First-year
#> 639914: MCID3112898895 Institution B 20181 302001 01 First-year
#> 639915: MCID3112898940 Institution B 20181 050103 01 First-year
Filter for data sufficiency.
# Initialize the working data frame
DT <- term[, .(mcid, cip6)]
# Filter observations for data sufficiency
DT <- add_timely_term(DT, term)
DT <- add_data_sufficiency(DT, term)
DT <- DT[data_sufficiency == "include"]
DT
#> mcid cip6 level_i adj_span timely_term term_i
#> <char> <char> <char> <num> <char> <char>
#> 1: MCID3111142689 090401 01 First-year 6 19941 19883
#> 2: MCID3111142782 260101 01 First-year 6 19941 19883
#> 3: MCID3111142782 260101 01 First-year 6 19941 19883
#> ---
#> 531417: MCID3112870009 240102 01 First-year 6 20003 19951
#> 531418: MCID3112870009 240102 01 First-year 6 20003 19951
#> 531419: MCID3112870009 240102 01 First-year 6 20003 19951
#> lower_limit upper_limit data_sufficiency
#> <char> <char> <char>
#> 1: 19881 20181 include
#> 2: 19881 20096 include
#> 3: 19881 20096 include
#> ---
#> 531417: 19881 20181 include
#> 531418: 19881 20181 include
#> 531419: 19881 20181 include
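The rows printed above suggest how term_i, adj_span, and timely_term relate. The base R sketch below reproduces only that visible pattern. It assumes five-character term codes (a four-digit year followed by a one-digit term) and handles only the term digits 1 and 3 that appear in the output. The helper name timely_term_sketch is hypothetical, and this is not the implementation of add_timely_term().

```r
# Hypothetical helper: reproduces the pattern visible in the printed rows only.
# It is NOT the midfieldr implementation of add_timely_term().
timely_term_sketch <- function(term_i, adj_span) {
  year <- as.integer(substr(term_i, 1, 4))
  digit <- substr(term_i, 5, 5)
  # Assumption: only term digits 1 and 3 are handled here
  stopifnot(all(digit %in% c("1", "3")))
  # Pattern read off the printed rows:
  #   digit 1 -> year + span - 1, ending in digit 3
  #   digit 3 -> year + span,     ending in digit 1
  end_year <- ifelse(digit == "1", year + adj_span - 1L, year + adj_span)
  end_digit <- ifelse(digit == "1", "3", "1")
  paste0(end_year, end_digit)
}

# term_i and adj_span values copied from rows shown in this README
timely_term_sketch(c("19883", "19951", "20121"), c(6, 6, 6))
```

Because term codes are character strings of equal length, comparing two of them as strings orders them chronologically, which is convenient when checking a term against a limit.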
Filter for degree-seeking students ever enrolled in Engineering.
# Inner join to filter observations for degree-seeking
cols_we_want <- student[, .(mcid)]
DT <- cols_we_want[DT, on = c("mcid"), nomatch = NULL]
# Filter observations for engineering programs
DT <- DT[cip6 %like% "^14"]
# Filter observations for unique students (first instance)
DT <- DT[, .SD[1], by = c("mcid")]
DT
#> mcid cip6 level_i adj_span timely_term term_i
#> <char> <char> <char> <num> <char> <char>
#> 1: MCID3111142965 140102 01 First-year 6 19941 19883
#> 2: MCID3111145102 140102 01 First-year 6 19941 19883
#> 3: MCID3111146537 141001 02 Second-year 5 19931 19883
#> ---
#> 10399: MCID3112641399 141901 01 First-year 6 20181 20123
#> 10400: MCID3112641535 141901 01 First-year 6 20173 20121
#> 10401: MCID3112698681 141901 01 First-year 6 20171 20113
#> lower_limit upper_limit data_sufficiency
#> <char> <char> <char>
#> 1: 19881 20096 include
#> 2: 19881 20096 include
#> 3: 19881 20096 include
#> ---
#> 10399: 19881 20181 include
#> 10400: 19881 20181 include
#> 10401: 19881 20181 include
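For readers who prefer base R, the two filters above (program codes starting with "14", then the first row per student) can be sketched as follows. The toy IDs and codes are made up for illustration and are not MIDFIELD data. grepl() stands in for the data.table %like% operator, and duplicated() stands in for .SD[1] by group.

```r
# Toy records (hypothetical IDs and codes, not MIDFIELD data)
toy <- data.frame(
  mcid = c("S1", "S1", "S2", "S3", "S3"),
  cip6 = c("140102", "141001", "240102", "270101", "141901"),
  stringsAsFactors = FALSE
)

# Keep rows whose CIP code starts with "14"
eng <- toy[grepl("^14", toy$cip6), ]

# Keep the first remaining row for each student
eng <- eng[!duplicated(eng$mcid), ]
eng
```

The order of the two filters matters: filtering for the program first and then taking the first instance keeps a student's first engineering record, even when their first record overall is in another program (student S3 here).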
Determine completion status.
# Add completion status variable
DT <- add_completion_status(DT, degree)
DT
#> mcid cip6 level_i adj_span timely_term term_i
#> <char> <char> <char> <num> <char> <char>
#> 1: MCID3111142965 140102 01 First-year 6 19941 19883
#> 2: MCID3111145102 140102 01 First-year 6 19941 19883
#> 3: MCID3111146537 141001 02 Second-year 5 19931 19883
#> ---
#> 10399: MCID3112641399 141901 01 First-year 6 20181 20123
#> 10400: MCID3112641535 141901 01 First-year 6 20173 20121
#> 10401: MCID3112698681 141901 01 First-year 6 20171 20113
#> lower_limit upper_limit data_sufficiency term_degree completion_status
#> <char> <char> <char> <char> <char>
#> 1: 19881 20096 include 19901 timely
#> 2: 19881 20096 include 19893 timely
#> 3: 19881 20096 include 19913 timely
#> ---
#> 10399: 19881 20181 include 20163 timely
#> 10400: 19881 20181 include 20143 timely
#> 10401: 19881 20181 include 20181 late
Aggregate observations by groupings.
# Left join to add race/ethnicity and sex variables (omit unknowns)
cols_we_want <- student[, .(mcid, race, sex)]
DT <- cols_we_want[DT, on = c("mcid")]
DT <- DT[!(race %ilike% "unknown" | sex %ilike% "unknown")]
# Create a variable combining race/ethnicity and sex
DT[, people := paste(race, sex)]
# Aggregate observations by groupings
DT_display <- DT[, .N, by = c("completion_status", "people")]
setorderv(DT_display, c("completion_status", "people"))
DT_display
#> completion_status people N
#> <char> <char> <int>
#> 1: <NA> Asian Female 43
#> 2: <NA> Asian Male 163
#> 3: <NA> Black Female 39
#> ---
#> 33: timely Native American Male 13
#> 34: timely White Female 985
#> 35: timely White Male 4100
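The same kind of grouped count can be sketched in base R. The example below uses a made-up toy data frame, not MIDFIELD data. One detail worth noting is that table() drops missing values unless told otherwise, so useNA = "ifany" is needed to keep students with no completion status as their own group.

```r
# Toy observations (hypothetical, not MIDFIELD data)
toy <- data.frame(
  completion_status = c("timely", "timely", "late", NA, NA, "timely"),
  people = c("A Female", "A Female", "A Male", "A Male", "A Male", "A Male"),
  stringsAsFactors = FALSE
)

# Count rows per combination; useNA keeps the missing status as a group
counts <- as.data.frame(
  table(
    completion_status = toy$completion_status,
    people = toy$people,
    useNA = "ifany"
  ),
  responseName = "N",
  stringsAsFactors = FALSE
)

# table() also lists combinations that never occur; drop those zero counts
counts <- counts[counts$N > 0, ]
counts
```

Unlike the data.table result above, table() enumerates every combination of the grouping variables, which is why the zero counts are removed in the last step.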
Reshape and display results.
# Transform to row-record form
DT_display <- dcast(DT_display, people ~ completion_status, value.var = "N", fill = 0)
# Prepare the table for display
setcolorder(DT_display, c("people", "timely", "late"))
setkeyv(DT_display, c("people"))
setnames(DT_display,
old = c("people", "timely", "late", "NA"),
new = c("People", "Timely completion", "Late completion", "Did not complete")
)
People | Timely completion | Late completion | Did not complete |
---|---|---|---|
Asian Female | 87 | 4 | 43 |
Asian Male | 315 | 19 | 163 |
Black Female | 26 | 3 | 39 |
Black Male | 80 | 5 | 84 |
Hispanic Female | 36 | 3 | 31 |
Hispanic Male | 181 | 19 | 102 |
International Female | 110 | 9 | 51 |
International Male | 501 | 41 | 280 |
Native American Female | 2 | 0 | 2 |
Native American Male | 13 | 3 | 6 |
White Female | 985 | 51 | 386 |
White Male | 4100 | 269 | 2034 |
“Timely completion” is the count of graduates completing their programs in no more than 6 years; “Late completion” is the count of those graduating in more than 6 years; “Did not complete” is the count of non-graduates.
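As a worked example of turning these counts into a metric, the base R sketch below computes a graduation rate for three rows of the table. It assumes one particular definition: timely completers divided by everyone in the bloc, with late completers counted in the denominator only. Other definitions are possible, and the column names are chosen for illustration.

```r
# Counts copied from three rows of the table above
results <- data.frame(
  people = c("Asian Female", "Black Male", "White Male"),
  timely = c(87, 80, 4100),
  late = c(4, 5, 269),
  did_not_complete = c(43, 84, 2034)
)

# Everyone in the bloc, regardless of outcome
results$ever_enrolled <- results$timely + results$late + results$did_not_complete

# Assumed definition: percentage of the bloc completing on time
results$grad_rate_pct <- round(100 * results$timely / results$ever_enrolled, 1)
results
```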
Reminder. midfielddata is suitable for learning to work with student-level data but not for drawing inferences about program attributes or student experiences. midfielddata supplies practice data, not research data.
Installation
Install from CRAN with:
install.packages("midfieldr")
Install the latest development version from GitHub with:
install.packages("pak")
pak::pkg_install("MIDFIELDR/midfieldr")
midfieldr interacts with practice data provided in the midfielddata data package. midfielddata is too large for CRAN, so we deploy it to a drat package repository. Install midfielddata using:
install.packages("midfielddata",
repos = "https://MIDFIELDR.github.io/drat/",
type = "source"
)
The installed size of midfielddata is about 24 MB, so the installation takes some time.
More information
MIDFIELD. A database of anonymized student-level records for approximately 2.4M undergraduates at 21 US institutions from 1987-2022. Access to this database requires a confidentiality agreement and Institutional Review Board (IRB) approval for human subjects research.
midfielddata. An R data package that supplies anonymized student-level records for 98,000 undergraduates at three US institutions from 1988-2018. A sample of the MIDFIELD database, midfielddata provides practice data for the tools and methods in the midfieldr package.
Acknowledgments
This work was supported by the US National Science Foundation through grant numbers 1545667 and 2142087.