midfieldr is an R package that provides tools and methods for studying undergraduate student-level records from the MIDFIELD database.

Introduction

midfieldr provides tools and demonstrates methods for working with individual undergraduate student-level records (registrar’s data) in R. Tools include filters for program codes, data sufficiency, and timely completion. Methods include gathering blocs of records, computing quantitative metrics such as graduation rate, and creating charts to visualize comparisons. midfieldr is designed to work with data from the MIDFIELD research database, a sample of which is available in the midfielddata data package.

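midfieldr provides functions for processing student-level data, including select_required(), add_timely_term(), add_data_sufficiency(), and add_completion_status(), all of which appear in the usage example below.

Additional functions are provided for processing intermediate results.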

Notes on syntax.   Throughout this work, we use the data.table package for data manipulation (Dowle and Srinivasan 2022) and the ggplot2 package for charts (Wickham 2016). Some users may prefer base R or dplyr for data manipulation (Wickham et al. 2022), or lattice for charts (Sarkar 2008). Each system has its strengths, and users are welcome to translate our examples to their preferred syntax.

Usage

In this example, we gather all students ever enrolled in Engineering and summarize their graduation status (in any major), grouping by race/ethnicity and sex. If you are writing your own script to follow along, note that this example uses the midfieldr, midfielddata, and data.table packages.
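A minimal setup sketch, assuming the packages needed are the three this example relies on: midfieldr for select_required() and the add_*() functions, midfielddata for the practice tables, and data.table for the bracket syntax. The list is inferred from the calls that follow rather than taken from the package documentation.

```r
# Packages assumed for this example (inferred from the calls that follow)
library(midfieldr)
library(midfielddata)
library(data.table)
```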

Load the practice data. Reduce initial dimensions of data tables.

# Load the practice data
data(student, term, degree)

# Reduce dimensions of source data tables
student <- select_required(student)
term <- select_required(term)
degree <- select_required(degree)

# View example result
term
#>                   mcid   institution  term   cip6         level
#>      1: MCID3111142225 Institution B 19881 140901 01 First-year
#>      2: MCID3111142283 Institution J 19881 240102 01 First-year
#>      3: MCID3111142283 Institution J 19883 240102 01 First-year
#>     ---                                                        
#> 639913: MCID3112898894 Institution B 20181 451001 01 First-year
#> 639914: MCID3112898895 Institution B 20181 302001 01 First-year
#> 639915: MCID3112898940 Institution B 20181 050103 01 First-year

Filter for data sufficiency.

# Initialize the working data frame
DT <- term[, .(mcid, cip6)]

# Filter observations for data sufficiency
DT <- add_timely_term(DT, term)
DT <- add_data_sufficiency(DT, term)
DT <- DT[data_sufficiency == "include"]
DT
#>                   mcid   cip6       level_i adj_span timely_term term_i
#>      1: MCID3111142689 090401 01 First-year        6       19941  19883
#>      2: MCID3111142782 260101 01 First-year        6       19941  19883
#>      3: MCID3111142782 260101 01 First-year        6       19941  19883
#>     ---                                                                
#> 531417: MCID3112870009 240102 01 First-year        6       20003  19951
#> 531418: MCID3112870009 240102 01 First-year        6       20003  19951
#> 531419: MCID3112870009 240102 01 First-year        6       20003  19951
#>         lower_limit upper_limit data_sufficiency
#>      1:       19881       20181          include
#>      2:       19881       20096          include
#>      3:       19881       20096          include
#>     ---                                         
#> 531417:       19881       20181          include
#> 531418:       19881       20181          include
#> 531419:       19881       20181          include
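The printed columns hint at how the filter decides. As a rough sketch (an assumption inferred from the output above, not midfieldr's documented rule), a record is kept when the first term `term_i` falls after the institution's `lower_limit` and `timely_term` does not exceed `upper_limit`. The toy values, the `sufficiency` column, and the labels other than "include" are hypothetical.

```r
library(data.table)

# Toy records; terms are stored as integers (YYYYT) to keep comparisons simple
toy <- data.table(
  mcid        = c("A", "B", "C"),
  term_i      = c(19883L, 19881L, 20141L),
  timely_term = c(19941L, 19933L, 20193L),
  lower_limit = c(19881L, 19881L, 19881L),
  upper_limit = c(20181L, 20181L, 20181L)
)

# Assumed rule: the institution's data range must cover the span
# from the student's first term through the timely-completion term
toy[, sufficiency := fcase(
  term_i <= lower_limit,     "exclude-lower",
  timely_term > upper_limit, "exclude-upper",
  default = "include"
)]

# Keep only the records with sufficient data
kept <- toy[sufficiency == "include"]
kept
```

Under this reading, only student A is retained: B starts in the first term on record, so earlier enrollment cannot be ruled out, and C's timely term falls after the last term on record.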

Filter for degree-seeking students ever enrolled in Engineering.

# Inner join to filter observations for degree-seeking
cols_we_want <- student[, .(mcid)]
DT <- cols_we_want[DT, on = c("mcid"), nomatch = NULL]

# Filter observations for engineering programs
DT <- DT[cip6 %like% "^14"]

# Filter observations for unique students (first instance)
DT <- DT[, .SD[1], by = c("mcid")]
DT
#>                  mcid   cip6        level_i adj_span timely_term term_i
#>     1: MCID3111142965 140102  01 First-year        6       19941  19883
#>     2: MCID3111145102 140102  01 First-year        6       19941  19883
#>     3: MCID3111146537 141001 02 Second-year        5       19931  19883
#>    ---                                                                 
#> 10399: MCID3112641399 141901  01 First-year        6       20181  20123
#> 10400: MCID3112641535 141901  01 First-year        6       20173  20121
#> 10401: MCID3112698681 141901  01 First-year        6       20171  20113
#>        lower_limit upper_limit data_sufficiency
#>     1:       19881       20096          include
#>     2:       19881       20096          include
#>     3:       19881       20096          include
#>    ---                                         
#> 10399:       19881       20181          include
#> 10400:       19881       20181          include
#> 10401:       19881       20181          include
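The three data.table idioms used in this step can be tried on a small scale. The sketch below uses hypothetical stand-in tables (`student_toy`, `term_toy`) rather than the practice data: an inner join with `nomatch = NULL`, a pattern filter with `%like%`, and a first-row-per-group selection with `.SD[1]`.

```r
library(data.table)

# Hypothetical stand-ins for the student and term tables
student_toy <- data.table(mcid = c("S1", "S2"))
term_toy <- data.table(
  mcid = c("S1", "S1", "S2", "S3"),
  cip6 = c("140102", "141001", "240102", "140901")
)

# Inner join: X[Y, on, nomatch = NULL] keeps only the rows of Y
# that have a matching key in X, so S3 is dropped
joined <- student_toy[term_toy, on = "mcid", nomatch = NULL]

# Keep programs whose 6-digit code starts with "14"
eng <- joined[cip6 %like% "^14"]

# Keep the first row for each student
first_eng <- eng[, .SD[1], by = "mcid"]
first_eng
```

Here S3 drops out at the join, S2 drops out at the program filter, and S1 keeps only its first engineering record.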

Determine completion status.

# Add completion status variable
DT <- add_completion_status(DT, degree)
DT
#>                  mcid   cip6        level_i adj_span timely_term term_i
#>     1: MCID3111142965 140102  01 First-year        6       19941  19883
#>     2: MCID3111145102 140102  01 First-year        6       19941  19883
#>     3: MCID3111146537 141001 02 Second-year        5       19931  19883
#>    ---                                                                 
#> 10399: MCID3112641399 141901  01 First-year        6       20181  20123
#> 10400: MCID3112641535 141901  01 First-year        6       20173  20121
#> 10401: MCID3112698681 141901  01 First-year        6       20171  20113
#>        lower_limit upper_limit data_sufficiency term_degree completion_status
#>     1:       19881       20096          include       19901            timely
#>     2:       19881       20096          include       19893            timely
#>     3:       19881       20096          include       19913            timely
#>    ---                                                                       
#> 10399:       19881       20181          include       20163            timely
#> 10400:       19881       20181          include       20143            timely
#> 10401:       19881       20181          include       20181              late
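The rows above are consistent with a simple comparison of `term_degree` against `timely_term`. The sketch below illustrates that reading on toy data; it is an assumption drawn from the printed output, not a description of how add_completion_status() is implemented, and the `status` column name is hypothetical.

```r
library(data.table)

# Toy records; NA in term_degree marks a student with no degree on record.
# Students A and B reuse the term values of the first and last rows above.
toy <- data.table(
  mcid        = c("A", "B", "C"),
  timely_term = c(19941L, 20171L, 20181L),
  term_degree = c(19901L, 20181L, NA)
)

# Assumed rule: timely if the degree term is no later than the timely term,
# late otherwise; a missing degree term yields a missing status
toy[, status := fifelse(term_degree <= timely_term, "timely", "late")]
toy
```

Student A graduates before the timely term, B one term after it, and C has no degree, so the three statuses are timely, late, and NA.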

Aggregate observations by groupings.

# Left join to add race/ethnicity and sex variables (omit unknowns)
cols_we_want <- student[, .(mcid, race, sex)]
DT <- cols_we_want[DT, on = c("mcid")]
DT <- DT[!(race %ilike% "unknown" | sex %ilike% "unknown")]

# Create a variable combining race/ethnicity and sex
DT[, people := paste(race, sex)]

# Aggregate observations by groupings
DT_display <- DT[, .N, by = c("completion_status", "people")]
setorderv(DT_display, c("completion_status", "people"))
DT_display
#>     completion_status               people    N
#>  1:              <NA>         Asian Female   43
#>  2:              <NA>           Asian Male  163
#>  3:              <NA>         Black Female   39
#> ---                                            
#> 33:            timely Native American Male   13
#> 34:            timely         White Female  985
#> 35:            timely           White Male 4100

Reshape and display results.

# Transform to row-record form
DT_display <- dcast(DT_display, people ~ completion_status, value.var = "N", fill = 0)

# Prepare the table for display
setcolorder(DT_display, c("people", "timely", "late"))
setkeyv(DT_display, c("people"))
setnames(DT_display,
  old = c("people", "timely", "late", "NA"),
  new = c("People", "Timely completion", "Late completion", "Did not complete")
)

Table 1: Completion status of engineering undergraduates in the practice data

People                  Timely completion  Late completion  Did not complete
Asian Female                           87                4                43
Asian Male                            315               19               163
Black Female                           26                3                39
Black Male                             80                5                84
International Female                  110                9                51
International Male                    501               41               280
Latine Female                          36                3                31
Latine Male                           181               19               102
Native American Female                  2                0                 2
Native American Male                   13                3                 6
White Female                          985               51               386
White Male                           4100              269              2034

“Timely completion” is the count of graduates completing their programs in no more than 6 years; “Late completion” is the count of those graduating in more than 6 years; “Did not complete” is the count of non-graduates.
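To see the reshaping step in isolation, and how a rate follows from the counts, here is a small sketch using two rows copied from Table 1. Recoding the missing status as "none" before casting is a choice made for this sketch only, so that the wide table has no column named "NA". The `timely_share` column is a hypothetical illustration, not a metric computed by midfieldr.

```r
library(data.table)

# Counts for two groups, copied from Table 1, in long form
long <- data.table(
  people            = rep(c("Asian Female", "Black Male"), each = 3),
  completion_status = rep(c("timely", "late", NA), times = 2),
  N                 = c(87, 4, 43, 80, 5, 84)
)

# Label the non-completers so the cast does not create a column named "NA"
long[is.na(completion_status), completion_status := "none"]

# One row per group, one column per completion status
wide <- dcast(long, people ~ completion_status, value.var = "N", fill = 0)

# Share of each group that completed on time
wide[, timely_share := round(timely / (timely + late + none), 3)]
wide[, .(people, timely_share)]
```

With these counts the timely shares are about 0.649 for Asian Female (87 of 134) and 0.473 for Black Male (80 of 169). As the reminder below notes, such figures from the practice data are for learning the workflow only.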

Reminder.   midfielddata is suitable for learning to work with student-level data but not for drawing inferences about program attributes or student experiences. midfielddata supplies practice data, not research data.

Installation

Install with:

install.packages("midfieldr")

Latest development version:

install.packages("pak")
pak::pkg_install("MIDFIELDR/midfieldr")

midfieldr interacts with practice data provided in the midfielddata data package. Install midfielddata using:

install.packages("midfielddata",
  repos = "https://MIDFIELDR.github.io/drat/",
  type = "source"
)

The installed size of midfielddata is about 24 MB, so the installation takes some time.

More information

MIDFIELD.   A database of anonymized student-level records for approximately 2.4 million undergraduates at 21 US institutions from 1987 to 2022. Access to this database requires a confidentiality agreement and Institutional Review Board (IRB) approval for human subjects research.

midfielddata.   An R data package that supplies anonymized student-level records for 98,000 undergraduates at three US institutions from 1988 to 2018. A sample of the MIDFIELD database, midfielddata provides practice data for the tools and methods in the midfieldr package.

MIDFIELD Institute.   Materials from the 2023 workshop, including an introduction to R for beginners, chart basics with ggplot2, and data basics with data.table.

Acknowledgments

This work is supported by the US National Science Foundation through grant numbers 1545667 and 2142087.

References

Dowle, Matt, and Arun Srinivasan. 2022. data.table: Extension of ‘data.frame’. R package version 1.14.6. https://CRAN.R-project.org/package=data.table.
Sarkar, Deepayan. 2008. lattice: Multivariate Data Visualization with R. New York: Springer. http://lmdvr.r-forge.r-project.org.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag. ISBN 978-3-319-24277-4. https://ggplot2.tidyverse.org.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. dplyr: A Grammar of Data Manipulation. R package version 1.0.10. https://CRAN.R-project.org/package=dplyr.