Graduation rate is a widely used, though flawed, measure of academic achievement.
The American Council on Education estimates that the conventional definition of graduation rate may exclude up to 60% of students at 4-year institutions (Cook and Hartle 2011). Nevertheless, as Cook and Hartle explain,
… in the eyes of the public, policy makers, and the media, graduation rate is a clear, simple, and logical—if often misleading—number.
Recognizing that graduation rate is a popular metric, we propose a definition of graduation rate that includes all conventionally excluded students except migrators. You can skip the FYE content in this vignette if your study includes no FYE-style Engineering programs.
This vignette in the MIDFIELD workflow.
- Planning
- Initial processing
- Blocs
- Groupings
-
Metrics
-
Graduation rate
- Stickiness
-
Graduation rate
- Displays
Definitions
- graduation rate
-
Graduation rate is the ratio of the number of program “starter-graduates” (i.e., graduates from the program in which they started) to the number of program starters .
- bloc
-
A grouping of student-level data dealt with as a unit, for example, starters, students ever-enrolled, graduates, transfer students, traditional and non-traditional students, migrators, etc.
- starters
-
Bloc of degree-seeking students in their initial terms enrolled in degree-granting programs.
- starter-graduates
-
Subset of the starters bloc who are graduates (timely completers) from their starting programs.
- timely completion criterion
-
Completing a program in no more than a specified span of years, in many cases, within 6 years after admission (150% of the “normal” 4-year span), or possibly less for some transfer students.
- migrators
-
Bloc of students who leave one program to enroll in another. Also called switchers.
- undecided/unspecified
-
The MIDFIELD taxonomy includes the non-IPEDS code (CIP 999999) for Undecided or Unspecified indicating instances in which a student has not declared a major or an institution had not recorded a program.
Starters and migrators
As they pertain to the graduation rate metric, relationships among starters, migrators, and graduates (timely completers) of a given program P are illustrated in Figure 1.
The overall rectangle represents the set of students ever enrolled in program P.
The interior rectangle represents the set of graduates (timely completers) of program P.
Region 1 (shaded) represents the graduation rate denominator , the set of starters in program P.
Region 2 (shaded) represents the graduation rate numerator , the subset of starters who are also graduates of program P.
Region 3 (unshaded) represents the set of students excluded from the graduation rate metric, depending on how “program” is defined as discussed below.
When calculating graduation rate, whether migrator-graduates are
included in the count of graduates depends how a program is defined in
terms of CIP codes.
Institution level. Graduation rate computed at the institution level includes all migrators within the institution. For example, starters in Engineering (CIP 14) who graduate in Business (CIP 52) are both starters and timely completers at the institution level. IPEDS defines this rate as the institution completion rate.
2-digit CIP. Graduation rate includes migrator graduates within the same 2-digit CIP. For example, starters in Engineering (CIP 14) graduating in Business (CIP 52) are excluded from the count of Business graduates, but migrators within Engineering (all 6-digit CIP codes starting with 14) are both starters and timely completers in Engineering.
4-digit CIP. Similar to the 2-digit case. For example, starters in Electrical Engineering (CIP 1410) graduating in Mechanical Engineering (CIP 1419) are excluded from the count of Mechanical Engineering graduates, but migrators within Electrical Engineering (all 6-digit CIP codes starting with 1410) are both starters and timely completers in Electrical Engineering.
6-digit CIP. Rarely used. Graduation rate at this CIP level excludes all migrators from the count of graduates.
Multiple CIPs. In some cases, a single program or major includes different 4-digit CIPs. For example, migrators between Systems Engineering (CIP 1427), Industrial Engineering (CIP 1435), Manufacturing Engineering (CIP 1436), and Operations Research (CIP 1437) might be considered both starters and timely completers in a general program of Industrial & Systems Engineering.
Who is a starter?
In the US, the predominant definition of graduation rate is that established by the US Department of Education, Integrated Postsecondary Education Data System (IPEDS). The IPEDS definition underlies the finding cited earlier that a graduation rate metric may exclude up to 60% of students.
Many of the IPEDS exclusions relate to how starters are defined. By expanding the starters definition, MIDFIELD proposes a graduation rate definition that includes all conventionally excluded students except migrators.
- graduation rate (IPEDS)
-
The fraction of a cohort of full-time, first-time, degree-seeking undergraduates who complete their program within a percentage (100%, 150%, or 200%) of the “normal” time (typically 4 years) as defined by the institution. IPEDS excludes students who attend college part-time, who transfer between institutions, and who start in Winter or Spring terms (NCES-IPEDS 2020).
- graduation rate (MIDFIELD)
-
The fraction of a cohort of degree-seeking undergraduates who complete their program in a timely manner (typically 6 years). MIDFIELD includes students who attend college part-time, who transfer between institutions, and who start in any term. Table 1 summarizes the comparison between the IPEDS and MIDFIELD graduation rate definitions.
Item | IPEDS | MIDFIELD | MIDFIELD notes |
---|---|---|---|
completion span: | 4, 6, or 8 years | 4, 6, or 8 years | Typical usage is 6 years |
students admitted in: | Summer/Fall only | any term | |
part-time students are: | excluded | included | Timely completion same as full-time students |
transfer students are: | excluded | included | Timely completion span adjusted for level at entry |
- First-Year Engineering (FYE) starters
-
We estimate the degree-granting engineering program in which an FYE student would have enrolled had they not been required to enroll in FYE. The FYE proxy, a 6-digit CIP code, denotes the program of which the FYE student can be considered a starter. For additional details, see the vignette FYE proxies.
Method
Demonstrating the following elements of a MIDFIELD workflow.
Planning. The metric is graduation rate. Required blocs are starters and the subset of starters who graduate in their starting major. Grouping variables are program, race/ethnicity, and sex. Programs are the four Engineering programs used throughout.
Initial processing. Filter the student-level records for data sufficiency and degree-seeking.
Blocs. Gather starters, filter by program. Gather graduates, filter by program, filter by starters’ IDs and programs.
Groupings. Add grouping variables.
Metrics Summarize by grouping variables and compute graduation rate.
Displays Create multiway chart and results table.
Reminder. midfielddata datasets are for practice, not research.
Load data
Start. If you are writing your own script to follow along, we use these packages in this article:
Load. Practice datasets. View data dictionaries via
?student
, ?term
, ?degree
.
# Load practice data
data(student, term, degree)
Loads with midfieldr. Prepared data. View data
dictionaries via ?study_programs
,
?baseline_mcid
, ?fye_proxy
.
study_programs
(derived in Programs).baseline_mcid
(derived in Blocs).fye_proxy
(derived in FYE proxies).
Initial processing
Select (optional). Reduce the number of columns. Code reproduced from Getting started.
# Optional. Copy of source files with all variables
source_student <- copy(student)
source_term <- copy(term)
source_degree <- copy(degree)
# Optional. Select variables required by midfieldr functions
student <- select_required(source_student)
term <- select_required(source_term)
degree <- select_required(source_degree)
Initialize. Use the term
and
student
data tables to obtain a data frame of student IDs
meeting the data sufficiency and degree-seeking criteria. Appled to the
practice data, this procedure yields the baseline_mcid
data
frame derived in Blocs and included with
midfieldr.
# Working data frame
DT <- copy(baseline_mcid)
DT
#> mcid
#> <char>
#> 1: MCID3111142689
#> 2: MCID3111142782
#> 3: MCID3111142881
#> ---
#> 76873: MCID3112785480
#> 76874: MCID3112800920
#> 76875: MCID3112870009
Starters
Starters. The summary code chunk from Starters
# Isolate starting term
DT <- term[DT, .(mcid, term, cip6), on = c("mcid")]
DT <- DT[!cip6 %like% "999999"]
setorderv(DT, cols = c("mcid", "term"))
DT <- DT[, .SD[which.min(term)], by = "mcid"]
DT <- DT[, .(mcid, cip6)]
DT <- unique(DT)
# Continue for starters with FYE
DT <- fye_proxy[DT, .(mcid, cip6, proxy), on = c("mcid")]
DT[, start := fcase(
cip6 == "140102", proxy,
cip6 != "140102", cip6
)]
DT <- DT[, .(mcid, start)]
# Filter by program on start
join_labels <- copy(study_programs)
join_labels <- join_labels[, .(program, start = cip6)]
DT <- join_labels[DT, on = c("start"), nomatch = NULL]
DT[, start := NULL]
DT <- unique(DT)
Copy. To prepare for joining with graduates.
# Prepare for joining
setcolorder(DT, c("mcid"))
starters <- copy(DT)
starters
#> mcid program
#> <char> <char>
#> 1: MCID3111142965 EE
#> 2: MCID3111145102 EE
#> 3: MCID3111150194 ISE
#> ---
#> 4051: MCID3112619118 EE
#> 4052: MCID3112619484 EE
#> 4053: MCID3112619666 ME
Graduates
Initialize. The data frame of baseline IDs is the intake for this section.
# Working data frame
DT <- copy(baseline_mcid)
Graduates The summary code chunk from Graduates
# Gather graduates, degree CIPs and terms
DT <- add_timely_term(DT, term)
DT <- add_completion_status(DT, degree)
DT <- DT[completion_status == "timely"]
DT <- degree[DT, .(mcid, term_degree, cip6), on = c("mcid")]
# Filter by programs and first degree terms
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT <- DT[, .SD[which.min(term_degree)], by = "mcid"]
DT[, c("cip6", "term_degree") := NULL]
DT <- unique(DT)
DT
#> mcid program
#> <char> <char>
#> 1: MCID3111142965 EE
#> 2: MCID3111145102 EE
#> 3: MCID3111146537 EE
#> ---
#> 3262: MCID3112618976 ME
#> 3263: MCID3112619484 EE
#> 3264: MCID3112641535 ME
Starter-graduates
This section introduces new material—not adapted from the reusable code sections of other vignettes.
For a graduation rate metric, a timely completer is counted among the graduates only if they start and complete the same program.
Filter. Use an inner join to filter the graduates by ID and program to match the IDs and programs of starters.
# Starter-graduates
DT <- starters[DT, on = c("mcid", "program"), nomatch = NULL]
Copy. To prepare for joining with starters.
# Prepare for joining
setcolorder(DT, c("mcid"))
graduates <- copy(DT)
graduates
#> mcid program
#> <char> <char>
#> 1: MCID3111142965 EE
#> 2: MCID3111145102 EE
#> 3: MCID3111150194 ISE
#> ---
#> 1787: MCID3112617717 ME
#> 1788: MCID3112618976 ME
#> 1789: MCID3112619484 EE
Closer look
Examining the records of selected students in detail.
Example 1. The student is a starter and a timely completer in Industrial/Systems Engineering (ISE). They appear in both blocs.
# Same ID in different blocs
mcid_we_want <- "MCID3111150194"
starters[mcid == mcid_we_want]
#> mcid program
#> <char> <char>
#> 1: MCID3111150194 ISE
graduates[mcid == mcid_we_want]
#> mcid program
#> <char> <char>
#> 1: MCID3111150194 ISE
Example 2. The student is a starter in Electrical
Engineering (EE). They are excluded from the graduation rate
starter-graduate bloc because they did not complete EE. From
degree
we find that they completed CIP 143501 (ISE), one of
the study programs. They are also excluded from a count of ISE graduates
because they weren’t a ISE starter.
# Same ID in different blocs
mcid_we_want <- "MCID3111235261"
starters[mcid == mcid_we_want]
#> mcid program
#> <char> <char>
#> 1: MCID3111235261 EE
graduates[mcid == mcid_we_want]
#> Empty data.table (0 rows and 2 cols): mcid,program
degree[mcid == mcid_we_want, .(mcid, cip6)]
#> mcid cip6
#> <char> <char>
#> 1: MCID3111235261 143501
Example 3. The student is a starter in Civil Engineering
(CE). They are excluded from the graduation rate starter-graduate bloc
because they did not complete CE. From degree
we find that
they completed CIP 521401 (Marketing). They would also be excluded from
a count of Marketing graduates because they weren’t a Marketing
starter.
# Same ID in different blocs
mcid_we_want <- "MCID3111158691"
starters[mcid == mcid_we_want]
#> mcid program
#> <char> <char>
#> 1: MCID3111158691 CE
graduates[mcid == mcid_we_want]
#> Empty data.table (0 rows and 2 cols): mcid,program
degree[mcid == mcid_we_want, .(mcid, cip6)]
#> mcid cip6
#> <char> <char>
#> 1: MCID3111158691 521401
Groupings
One of our grouping variables (program
) is already
included in the data frames. The next grouping variable is
bloc
to distinguish starters from graduates when the two
data frames are combined.
Add a variable. Label starters and graduates.
# For grouping by bloc
starters[, bloc := "starters"]
graduates[, bloc := "graduates"]
Join. Combine the two blocs to prepare for summarizing. A student starting and graduating in the same program now has two observations in these data: one as a starter and one as a graduate.
# Prepare for summarizing
DT <- rbindlist(list(starters, graduates))
DT
#> mcid program bloc
#> <char> <char> <char>
#> 1: MCID3111142965 EE starters
#> 2: MCID3111145102 EE starters
#> 3: MCID3111150194 ISE starters
#> ---
#> 5840: MCID3112617717 ME graduates
#> 5841: MCID3112618976 ME graduates
#> 5842: MCID3112619484 EE graduates
Add variables. Demographics from Groupings
# Join race/ethnicity and sex
cols_we_want <- student[, .(mcid, race, sex)]
DT <- cols_we_want[DT, on = c("mcid")]
DT
#> mcid race sex program bloc
#> <char> <char> <char> <char> <char>
#> 1: MCID3111142965 International Male EE starters
#> 2: MCID3111145102 White Male EE starters
#> 3: MCID3111150194 Black Male ISE starters
#> ---
#> 5840: MCID3112617717 International Male ME graduates
#> 5841: MCID3112618976 White Male ME graduates
#> 5842: MCID3112619484 White Male EE graduates
Note. MIDFIELD research findings are regularly grouped by program, race/ethnicity, and sex. However, applied to the practice data these groupings produce several groups with totals below the threshold we impose to preserve anonymity, introducing a number of NA values in the resulting charts and tables. These NAs are largely an artifact of applying these groupings to practice data.
Graduation rate
Summarize. Count the numbers of observations for each combination of the grouping variables.
# Count observations by group
grouping_variables <- c("bloc", "program", "race", "sex")
DT <- DT[, .N, by = grouping_variables]
setorderv(DT, grouping_variables)
DT
#> bloc program race sex N
#> <char> <char> <char> <char> <int>
#> 1: graduates CE Asian Female 4
#> 2: graduates CE Asian Male 9
#> 3: graduates CE Black Female 1
#> ---
#> 94: starters ME Other/Unknown Male 53
#> 95: starters ME White Female 146
#> 96: starters ME White Male 1225
Reshape. Transform to row-record form to set up the graduation rate calculation. Transform the N column into two columns, one for starters and one for graduates.
# Prepare to compute metric
DT <- dcast(DT, program + race + sex ~ bloc, value.var = "N", fill = 0)
DT
#> Key: <program, race, sex>
#> program race sex graduates starters
#> <char> <char> <char> <int> <int>
#> 1: CE Asian Female 4 7
#> 2: CE Asian Male 9 17
#> 3: CE Black Female 1 2
#> ---
#> 49: ME Other/Unknown Male 24 53
#> 50: ME White Female 71 146
#> 51: ME White Male 566 1225
Create a variable. Compute the metric.
# Compute metric
DT[, rate := round(100 * graduates / starters, 1)]
DT
#> Key: <program, race, sex>
#> program race sex graduates starters rate
#> <char> <char> <char> <int> <int> <num>
#> 1: CE Asian Female 4 7 57.1
#> 2: CE Asian Male 9 17 52.9
#> 3: CE Black Female 1 2 50.0
#> ---
#> 49: ME Other/Unknown Male 24 53 45.3
#> 50: ME White Female 71 146 48.6
#> 51: ME White Male 566 1225 46.2
Prepare for dissemination
Filter. To preserve the anonymity of the people involved,
we remove observations with fewer than N_threshold
graduates. With the research data, we typically set this threshold to
10; with the practice data, we demonstrate the procedure using a
threshold of 5.
# Preserve anonymity
N_threshold <- 5 # 10 for research data
DT <- DT[graduates >= N_threshold]
DT
#> Key: <program, race, sex>
#> program race sex graduates starters rate
#> <char> <char> <char> <int> <int> <num>
#> 1: CE Asian Male 9 17 52.9
#> 2: CE Hispanic Male 10 36 27.8
#> 3: CE International Female 6 12 50.0
#> ---
#> 25: ME Other/Unknown Male 24 53 45.3
#> 26: ME White Female 71 146 48.6
#> 27: ME White Male 566 1225 46.2
Recode. Readers can more readily interpret our charts and tables if the programs are unabbreviated.
# Recode values for chart and table readability
DT[, program := fcase(
program %like% "CE", "Civil",
program %like% "EE", "Electrical",
program %like% "ME", "Mechanical",
program %like% "ISE", "Industrial/Systems"
)]
DT
#> program race sex graduates starters rate
#> <char> <char> <char> <int> <int> <num>
#> 1: Civil Asian Male 9 17 52.9
#> 2: Civil Hispanic Male 10 36 27.8
#> 3: Civil International Female 6 12 50.0
#> ---
#> 25: Mechanical Other/Unknown Male 24 53 45.3
#> 26: Mechanical White Female 71 146 48.6
#> 27: Mechanical White Male 566 1225 46.2
Add a variable. We combine race/ethnicity and sex to create a combined grouping variable.
# Create a combined category
DT[, people := paste(race, sex)]
DT[, `:=`(race = NULL, sex = NULL)]
setcolorder(DT, c("program", "people"))
DT
#> program people graduates starters rate
#> <char> <char> <int> <int> <num>
#> 1: Civil Asian Male 9 17 52.9
#> 2: Civil Hispanic Male 10 36 27.8
#> 3: Civil International Female 6 12 50.0
#> ---
#> 25: Mechanical Other/Unknown Male 24 53 45.3
#> 26: Mechanical White Female 71 146 48.6
#> 27: Mechanical White Male 566 1225 46.2
Chart
Order factors. Order the levels of the categories. Code adapted from Multiway data and charts.
# Order the categories
DT <- order_multiway(DT,
quantity = "rate",
categories = c("program", "people"),
method = "percent",
ratio_of = c("graduates", "starters")
)
DT
#> program people graduates starters rate program_rate
#> <fctr> <fctr> <num> <num> <num> <num>
#> 1: Civil Asian Male 9 17 52.9 46.9
#> 2: Civil Hispanic Male 10 36 27.8 46.9
#> 3: Civil International Female 6 12 50.0 46.9
#> ---
#> 25: Mechanical Other/Unknown Male 24 53 45.3 46.2
#> 26: Mechanical White Female 71 146 48.6 46.2
#> 27: Mechanical White Male 566 1225 46.2 46.2
#> people_rate
#> <num>
#> 1: 48.8
#> 2: 32.8
#> 3: 50.0
#> ---
#> 25: 37.5
#> 26: 46.6
#> 27: 45.6
Multiway chart. Code adapted from Multiway data and charts.
The vertical reference line is the aggregate graduation rate of the program, independent of race/ethnicity and sex. A missing data marker or missing group indicates the number of graduates was below the threshold set to preserve anonymity—largely an artifact of applying these groupings to practice data.
ggplot(DT, aes(x = rate, y = people)) +
facet_wrap(vars(program), ncol = 1, as.table = FALSE) +
geom_vline(aes(xintercept = program_rate), linetype = 2, color = "gray60") +
geom_point() +
labs(x = "Graduation rate (%)", y = "") +
scale_x_continuous(limits = c(20, 90), breaks = seq(0, 100, 10))
Table
Results table. Code adapted from Multiway data and charts.
# Select variables and remove factors
display_table <- copy(DT)
display_table <- display_table[, .(program, people, rate)]
display_table[, people := as.character(people)]
display_table[, program := as.character(program)]
# Construct table
display_table <- dcast(display_table, people ~ program, value.var = "rate")
setnames(display_table,
old = c("people"),
new = c("People"),
skip_absent = TRUE
)
display_table
#> Key: <People>
#> People Civil Electrical Industrial/Systems Mechanical
#> <char> <num> <num> <num> <num>
#> 1: Asian Female NA 33.3 NA NA
#> 2: Asian Male 52.9 41.2 42.9 59.3
#> 3: Black Male NA 20.7 NA 33.3
#> 4: Hispanic Female NA NA NA 71.4
#> 5: Hispanic Male 27.8 34.3 NA 35.4
#> 6: International Female 50.0 NA NA NA
#> 7: International Male 50.0 39.7 47.1 42.3
#> 8: Other/Unknown Male 27.3 31.0 NA 45.3
#> 9: White Female 46.5 41.9 47.7 48.6
#> 10: White Male 48.3 41.1 49.3 46.2
(Optional) Format the table nearer to publication quality. Here I use the ‘gt’ package.
library(gt)
display_table |>
gt() |>
tab_caption("Table 2: Graduation rates (%) of four Engineering majors") |>
tab_options(table.font.size = "small") |>
opt_stylize(style = 1, color = "gray") |>
tab_style(
style = list(cell_fill(color = "#c7eae5")),
locations = cells_column_labels(columns = everything())
)
People | Civil | Electrical | Industrial/Systems | Mechanical |
---|---|---|---|---|
Asian Female | NA | 33.3 | NA | NA |
Asian Male | 52.9 | 41.2 | 42.9 | 59.3 |
Black Male | NA | 20.7 | NA | 33.3 |
Hispanic Female | NA | NA | NA | 71.4 |
Hispanic Male | 27.8 | 34.3 | NA | 35.4 |
International Female | 50.0 | NA | NA | NA |
International Male | 50.0 | 39.7 | 47.1 | 42.3 |
Other/Unknown Male | 27.3 | 31.0 | NA | 45.3 |
White Female | 46.5 | 41.9 | 47.7 | 48.6 |
White Male | 48.3 | 41.1 | 49.3 | 46.2 |
A value of NA indicates a group removed because the number of graduates was below the threshold set to preserve anonymity. As noted earlier, these are largely an artifact of applying these groupings to practice data.