Skip to contents


midfielddata is an R data package that supplies anonymized student-level records for 98,000 undergraduates from the MIDFIELD database. Provides practice data for the tools and methods of midfieldr.

Introduction

Data at the “student-level” refers to information collected by undergraduate institutions on individual students, including:

  • course information, e.g., course name & number, credit hours, and student grades
  • term information, e.g., program, academic standing, and grade point average
  • student demographic information, e.g., age, sex, and race/ethnicity
  • degree information, e.g., institution, program, term, and baccalaureate degree

midfielddata provides anonymized student-level records for 98,000 undergraduates at three US institutions from 1988 through 2018, collected in four data tables keyed by student ID.

Table 1. Practice datasets in midfielddata.
Dataset Each row is Students Rows Columns Memory
course one student per course 97,555 3,289,532 12 324.3 MB
term one student per term 97,555 639,915 13 72.8 MB
student one student 97,555 97,555 13 17.3 MB
degree one student per degree 49,543 49,665 5 5.2 MB

The data in midfielddata are a proportionate stratified sample of the MIDFIELD database, but are not suitable for drawing inferences about program attributes or student experiences—midfielddata are for practice, not research.

Notes on syntax.   We use data.table for data manipulation. Some users may prefer base R or dplyr. Each system has its strengths—users are welcome to translate our examples to their preferred syntax.

format(Sys.Date(), "%Y-%m-%d") # Today's date
#> [1] "2024-05-16"
packageVersion("midfielddata") # Student-level records practice data
#> [1] '0.2.1'
packageVersion("data.table") # For data manipulation
#> [1] '1.15.4'

Usage

Start.   If you are writing your own script to follow along, we use these packages in this vignette:

Load data tables.   Data tables can be loaded individually or collectively as needed.

# Load one table as needed
data(student)

# Or load multiple tables
data(course, term, degree)

We display the records for one specific student, using their ID to subset each dataset.

# One student ID
id_we_want <- "MCID3112192438"

Student.   As expected, student yields one row per student.

# Observations for a selected ID
student[mcid == id_we_want]
#>              mcid   institution              transfer hours_transfer  race
#> 1: MCID3112192438 Institution C First-Time in College             NA White
#>       sex age_desc us_citizen home_zip high_school sat_math sat_verbal act_comp
#> 1: Female Under 25        Yes    80521        <NA>      580        390       27

Course.   For this student, the records span 47 rows, one row per course.

# Observations for a selected ID
course[mcid == id_we_want]
#>               mcid   institution term_course                         course
#>  1: MCID3112192438 Institution C       20051 Key Academic Community Seminar
#>  2: MCID3112192438 Institution C       20051       Humans and Other Animals
#>  3: MCID3112192438 Institution C       20051            Health and Wellness
#> ---                                                                        
#> 45: MCID3112192438 Institution C       20093            Health and the Mind
#> 46: MCID3112192438 Institution C       20093   Social Psychology Laboratory
#> 47: MCID3112192438 Institution C       20093                    Group Study
#>     abbrev number section         type              faculty_rank hours_course
#>  1:     KA    192     009         <NA>                Instructor            3
#>  2:   BZCC    101     002         <NA>       Assistant Professor            3
#>  3:   EXCC    145     004         <NA> Non-Academic Professional            3
#> ---                                                                          
#> 45:    PSY    121     001 Face-to-Face Non-Academic Professional            1
#> 46:    PSY    317     L02 Face-to-Face        Graduate Assistant            2
#> 47:    PSY    496     004 Face-to-Face                Instructor            3
#>     grade                        discipline_midfield
#>  1:     A                           Academic Support
#>  2:     B Biological and Biomedical Sciences: Botany
#>  3:     A           Education: Physical and Coaching
#> ---                                                 
#> 45:    A+                                 Psychology
#> 46:     A                                 Psychology
#> 47:    A+                                 Psychology

Term.   Here, the records span 10 rows, one row per term.

# Observations for a selected ID
term[mcid == id_we_want]
#>               mcid   institution  term   cip6              level      standing
#>  1: MCID3112192438 Institution C 20051 451101      01 First-year Good Standing
#>  2: MCID3112192438 Institution C 20053 190701      01 First-year Good Standing
#>  3: MCID3112192438 Institution C 20061 451101     02 Second-year Good Standing
#>  4: MCID3112192438 Institution C 20063 451101     02 Second-year Good Standing
#>  5: MCID3112192438 Institution C 20071 451101      03 Third-year Good Standing
#>  6: MCID3112192438 Institution C 20073 451101      03 Third-year Good Standing
#>  7: MCID3112192438 Institution C 20081 451101      03 Third-year Good Standing
#>  8: MCID3112192438 Institution C 20083 451101     04 Fourth-year Good Standing
#>  9: MCID3112192438 Institution C 20091 451101     04 Fourth-year Good Standing
#> 10: MCID3112192438 Institution C 20093 451101 05 Fifth-year Plus Good Standing
#>     coop hours_term hours_term_attempt hours_cumul hours_cumul_attempt gpa_term
#>  1:   No         15                 15          15                  15     3.80
#>  2:   No         11                 11          26                  26     3.40
#>  3:   No         16                 16          42                  42     3.25
#>  4:   No          8                  8          50                  50     3.81
#>  5:   No         12                 12          62                  62     3.75
#>  6:   No         13                 13          75                  75     3.38
#>  7:  Yes         14                 14          89                  89     3.79
#>  8:   No         16                 16         105                 105     3.75
#>  9:   No         13                 13         118                 118     4.00
#> 10:   No         12                 12         130                 130     4.00
#>     gpa_cumul
#>  1:      3.80
#>  2:      3.63
#>  3:      3.49
#>  4:      3.54
#>  5:      3.58
#>  6:      3.54
#>  7:      3.58
#>  8:      3.61
#>  9:      3.65
#> 10:      3.68

Degree.   In this example, the records span 2 rows, one row per degree. The degrees were earned in the same term, Spring 2009.

# Observations for a selected ID
degree[mcid == id_we_want]
#>              mcid   institution term_degree   cip6
#> 1: MCID3112192438 Institution C       20093 420101
#> 2: MCID3112192438 Institution C       20093 451101
#>                               degree
#> 1: Bachelor of Science in Psychology
#> 2:     Bachelor of Arts in Sociology

Not all students with more than one degree earn them in the same term. For example, the next student earned a degree in 1996 and a second degree in 1999. In most analyses, only the first baccalaureate degree would be used.

# Observations for a different ID
degree[mcid == "MCID3111315508"]
#>              mcid   institution term_degree   cip6
#> 1: MCID3111315508 Institution C       19961 260101
#> 2: MCID3111315508 Institution C       19994 260701
#>                                        degree
#> 1: Bachelor of Science in Biological Sciences
#> 2:      Bachelor of Science in Animal Biology

Installation

Install with:

install.packages("midfielddata",
  repos = "https://MIDFIELDR.github.io/drat/",
  type = "source"
)

The installed size of midfielddata is about 24 Mb, so installation will take longer than that of a conventional CRAN package. Also because of its size, the package is not hosted on CRAN (with its 5 MB size limit)—instead, we host it on the MIDFIELDR drat repository as indicated above.

Link to installation instructions for midfieldr below.

More information

midfieldr
A companion R package that provides tools and methods for studying undergraduate student-level records from the MIDFIELD database.

MIDFIELD
A database of anonymized student-level records for approximately 2.4M undergraduates at 21 US institutions from 1987-2022. Access to this database requires a confidentiality agreement and Institutional Review Board (IRB) approval for human subjects research. For a detailed description of the database, see (Ohland & Long, 2016).

Acknowledgments

This work was supported by the US National Science Foundation through grant numbers 1545667 and 2142087.

References

Ohland, M. W., & Long, R. A. (2016). The Multiple-Institution Database for Investigating Engineering Longitudinal Development: An experiential case study of data sharing and reuse. Advances in Engineering Education, 5(2). https://advances.asee.org/wp-content/uploads/vol05/issue02/Papers/AEE-18-Ohland.pdf