We add grouping variables from the MIDFIELD data tables to our blocs in progress. We select these variables to provide the aggregating categories we want for a particular metric. Program labels and student demographics are two of the most common sets of grouping variables we use.
This vignette in the MIDFIELD workflow.
- Planning
- Initial processing
- Blocs
-
Groupings
- Program labels
- Demographics
-
Other variables
- Metrics
- Displays
Definitions
- bloc
-
A grouping of student-level data dealt with as a unit, for example, starters, students ever-enrolled, graduates, transfer students, traditional and non-traditional students, migrators, etc.
- grouping variables
-
Detailed information in the student-level data that further characterize a bloc of records, typically used to create bloc subsets for comparison, for example, program, race/ethnicity, sex, age, grade level, grades, etc.
Method
We join grouping variables to a bloc after initial processing (data sufficiency and degree seeking) and any other subsetting criteria that define a bloc. The two most common join operations to add grouping variables are:
Program labels using an inner join on CIPs.
Demographics using a left join on IDs.
Other variables too can be usefully joined for grouping operations.
We include examples from student
, term
, and
degree
.
Reminder. midfielddata datasets are for practice, not research.
Load data
Start. If you are writing your own script to follow along, we use these packages in this article:
Load. Practice datasets. View data dictionaries via
?student
, ?term
, ?degree
.
# Load practice data
data(student, term, degree)
Loads with midfieldr. Prepared data. View data
dictionaries via ?study_programs
,
?baseline_mcid
.
Initial processing
Select (optional). Reduce the number of columns. Code reproduced from Getting started.
# Optional. Copy of source files with all variables
source_student <- copy(student)
source_term <- copy(term)
source_degree <- copy(degree)
# Optional. Select variables required by midfieldr functions
student <- select_required(source_student)
term <- select_required(source_term)
degree <- select_required(source_degree)
Initialize. Use the term
and
student
data tables to obtain a data frame of student IDs
meeting the data sufficiency and degree-seeking criteria. Appled to the
practice data, this procedure yields the baseline_mcid
data
frame derived in Blocs and included with
midfieldr.
# Working data frame
DT <- copy(baseline_mcid)
We join a CIP variable for examples that join by CIP.
# Reusable starting state with CIP
baseline_cip <- term[DT, .(mcid, cip6), on = c("mcid")]
baseline_cip <- unique(baseline_cip)
baseline_cip
#> mcid cip6
#> <char> <char>
#> 1: MCID3111142689 090401
#> 2: MCID3111142782 260101
#> 3: MCID3111142881 450601
#> ---
#> 127347: MCID3112800920 240102
#> 127348: MCID3112800920 240199
#> 127349: MCID3112870009 240102
Program labels
At this point in a typical workflow, we have a bloc of student-level
records in progress and a data frame of program labels (similar to
study_programs
). Both data frames have a 6-digit CIP
variable to join by.
Program labels serve two main functions:
Filtering variable to finalize a bloc. For example, “starters” or “graduates” usually mean starters or graduates in specific programs. Thus a bloc procedure typically concludes with a program filter as in Ever-enrolled, Starters, or Graduates.
Grouping variable for summarizing data. Having filtered a bloc to retain records in specific programs, the program label is retained and used with other grouping variables such as race/ethnicity and sex when computing and comparing metrics. Because of its role in groupings, the program label join is developed in detail in this vignette.
Rationale for the inner join. An inner join accomplishes two tasks: adds a column of program labels to the bloc; and filters the bloc to retain only those observations with CIPs matching the desired programs.
Reset The CIP baseline data frame we preserved earlier is the intake for this section.
# Reusable starting state
DT <- copy(baseline_cip)
Filter. An inner join adds the program label and filters for matching CIPs.
# Join program labels via inner join
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT
#> cip6 program mcid
#> <char> <char> <char>
#> 1: 141001 EE MCID3111142965
#> 2: 141001 EE MCID3111145102
#> 3: 141001 EE MCID3111146537
#> ---
#> 5655: 141901 ME MCID3112641399
#> 5656: 141901 ME MCID3112641535
#> 5657: 141901 ME MCID3112698681
We can see the filtering effect by noting that the baseline data frame had 127,349 observations while the inner join on the selected programs returned 5657 observations. We can also verify the selected programs, e.g.,
# Verify program labels
sort(unique(DT$program))
#> [1] "CE" "EE" "ISE" "ME"
# Verify program CIP codes
sort(unique(DT$cip6))
#> [1] "140801" "141001" "141901" "142701" "143501"
Students can migrate between majors having different 6-digit CIP codes, but those codes might be grouped under a single program label. A common example in Engineering is the “Industrial/Systems Engineering” label we assign to following CIP codes:
- 142701 Systems Engineering
- 143501 Industrial Engineering
- 143601 Manufacturing Engineering
- 143701 Operations Research
A student migrating among these CIPs would appear in multiple rows in the current bloc, yet we would not consider their change of CIP a change of major. The next step addresses this anomaly.
Select. Drop the CIP code.
# Prepare to filter
DT[, cip6 := NULL]
DT
#> program mcid
#> <char> <char>
#> 1: EE MCID3111142965
#> 2: EE MCID3111145102
#> 3: EE MCID3111146537
#> ---
#> 5655: ME MCID3112641399
#> 5656: ME MCID3112641535
#> 5657: ME MCID3112698681
Filter. Filter for unique observations.
# Case study ever enrolled
DT <- unique(DT)
DT
#> program mcid
#> <char> <char>
#> 1: EE MCID3111142965
#> 2: EE MCID3111145102
#> 3: EE MCID3111146537
#> ---
#> 5651: ME MCID3112641399
#> 5652: ME MCID3112641535
#> 5653: ME MCID3112698681
The difference in the number of observations indicates that this example includes one student in ISE with two CIPs. We examine that student’s record below.
Closer look
Examining the records of selected students in detail.
Example 1. This student’s term
record
includes CIP 142701 (Systems Engineering) and CIP 143501 (Industrial
Engineering), both of which are majors in our combined
Industrial/Systems Engineering (ISE) major, illustrating our rationale
for filtering for unique observations by ID/program pairs and not ID/CIP
pairs.
# All terms, one ID
x <- term[mcid == "MCID3111251565", .(mcid, cip6)]
# Join case study program labels
x <- study_programs[x, on = c("cip6"), nomatch = NULL]
# Unique CIPs for this student
unique(x)
#> cip6 program mcid
#> <char> <char> <char>
#> 1: 143501 ISE MCID3111251565
#> 2: 142701 ISE MCID3111251565
Demographics
Demographic variables (race/ethnicity and sex) are regularly left-joined to blocs for grouping and summarizing.
Reset The data frame of baseline IDs is the intake for this section.
# Reusable starting state
DT <- copy(baseline_mcid)
DT
#> mcid
#> <char>
#> 1: MCID3111142689
#> 2: MCID3111142782
#> 3: MCID3111142881
#> ---
#> 76873: MCID3112785480
#> 76874: MCID3112800920
#> 76875: MCID3112870009
Select. From student
, select the join-by
variable (student ID) and the variables we want to add. By selecting
columns here, we don’t have to select columns in the join operation to
follow.
# Extract desired columns
cols_we_want <- student[, .(mcid, race, sex)]
cols_we_want
#> mcid race sex
#> <char> <char> <char>
#> 1: MCID3111142225 Asian Male
#> 2: MCID3111142283 Asian Female
#> 3: MCID3111142290 Asian Male
#> ---
#> 97553: MCID3112898894 White Female
#> 97554: MCID3112898895 White Female
#> 97555: MCID3112898940 Other/Unknown Male
Add variables. Left join two data frames, retaining all variables from both.
# Add demographics
DT <- cols_we_want[DT, on = c("mcid")]
DT
#> mcid race sex
#> <char> <char> <char>
#> 1: MCID3111142689 Hispanic Female
#> 2: MCID3111142782 Hispanic Female
#> 3: MCID3111142881 International Male
#> ---
#> 76873: MCID3112785480 White Male
#> 76874: MCID3112800920 White Female
#> 76875: MCID3112870009 White Male
Unknown race
or sex
We often want to remove records for which race/ethnicity or sex are “unknown”.
# Display values
unique(DT$race)
#> [1] "Hispanic" "International" "White" "Asian"
#> [5] "Black" "Native American" "Other/Unknown"
unique(DT$sex)
#> [1] "Female" "Male" "Unknown"
Filter. In data.table syntax, we can use
x %ilike% pattern
as a case-insensitive wrapper around
grepl()
to find matches and partial matches.
# Remove records with unknown sex, if any
x <- copy(DT)
x <- x[!sex %ilike% "unknown"]
x
#> mcid race sex
#> <char> <char> <char>
#> 1: MCID3111142689 Hispanic Female
#> 2: MCID3111142782 Hispanic Female
#> 3: MCID3111142881 International Male
#> ---
#> 76872: MCID3112785480 White Male
#> 76873: MCID3112800920 White Female
#> 76874: MCID3112870009 White Male
Removing unknown race observations is similar.
# Remove records with unknown sex, if any
x <- x[!race %ilike% "unknown"]
x
#> mcid race sex
#> <char> <char> <char>
#> 1: MCID3111142689 Hispanic Female
#> 2: MCID3111142782 Hispanic Female
#> 3: MCID3111142881 International Male
#> ---
#> 73774: MCID3112785480 White Male
#> 73775: MCID3112800920 White Female
#> 73776: MCID3112870009 White Male
Alternatively, these statements can be combined.
# Remove unknowns in either of two columns
DT <- DT[!(sex %ilike% "unknown" | race %ilike% "unknown")]
# Verify equivalence
check_equiv_frames(x, DT)
#> [1] TRUE
With “unknowns” removed, the race
and sex
values are:
Add origin
origin
is a demographic variable we use to distinguish
“domestic” students from “international” students. The variable is a
recoding of the race
variable.
Add a variable. Assuming that race/ethnicity “unknown” have been removed, we use a conditional assignment to create the “origin” variable.
# Two values for origin
x <- copy(DT)
x <- x[, origin := fifelse(race == "International", "International", "Domestic")]
x[]
#> mcid race sex origin
#> <char> <char> <char> <char>
#> 1: MCID3111142689 Hispanic Female Domestic
#> 2: MCID3111142782 Hispanic Female Domestic
#> 3: MCID3111142881 International Male International
#> ---
#> 73774: MCID3112785480 White Male Domestic
#> 73775: MCID3112800920 White Female Domestic
#> 73776: MCID3112870009 White Male Domestic
With “unknowns” removed, the origin
values are:
Add people
people
is a demographic variable we use in many of our
summaries. The variable combines the race
and
sex
variables.
Add a variable. We combine race/ethnicity and sex to create a grouping variable.
With “unknowns” removed, the people
values are:
Add people
by origin
Combining the two ideas above, again assuming that the observations on unknown race/ethnicity and sex have been removed,
# Two values for origin
x <- copy(DT)
x <- x[, origin := fifelse(race == "International", "International", "Domestic")]
# Combine with sex
x[, people := paste(origin, sex)]
# Omit unnecessary variables
x <- x[, .(mcid, people)]
x
#> mcid people
#> <char> <char>
#> 1: MCID3111142689 Domestic Female
#> 2: MCID3111142782 Domestic Female
#> 3: MCID3111142881 International Male
#> ---
#> 73774: MCID3112785480 Domestic Male
#> 73775: MCID3112800920 Domestic Female
#> 73776: MCID3112870009 Domestic Male
The possible people
values are:
Other variables
Depending on one’s research question, any number of MIDFIELD
variables might be used for grouping records. In this section we
illustrate joining other variables from student
,
term
, and degree
to a working data frame.
We use the original source files copied earlier because some
variables we want to use were removed when we applied
select_required()
.
From student
Reset Reset the working data frame.
# Reusable starting state
DT <- copy(baseline_mcid)
DT
#> mcid
#> <char>
#> 1: MCID3111142689
#> 2: MCID3111142782
#> 3: MCID3111142881
#> ---
#> 76873: MCID3112785480
#> 76874: MCID3112800920
#> 76875: MCID3112870009
The available variables in the source student
data
are:
# Variables in the practice data set
names(source_student)
#> [1] "mcid" "institution" "transfer" "hours_transfer"
#> [5] "race" "sex" "age_desc" "us_citizen"
#> [9] "home_zip" "high_school" "sat_math" "sat_verbal"
#> [13] "act_comp"
Select. Select our variables and the key (ID).
# Extract desired columns
cols_we_want <- source_student[, .(mcid, transfer, hours_transfer)]
Add variables. Left join to add new columns.
# Add desired columns
cols_we_want[DT, on = c("mcid")]
#> mcid transfer hours_transfer
#> <char> <char> <num>
#> 1: MCID3111142689 First-Time Transfer NA
#> 2: MCID3111142782 First-Time Transfer NA
#> 3: MCID3111142881 First-Time Transfer NA
#> ---
#> 76873: MCID3112785480 First-Time Transfer 1
#> 76874: MCID3112800920 First-Time Transfer 15
#> 76875: MCID3112870009 First-Time Transfer 80
From term
Reset Reset the working data frame.
# Reusable starting state
DT <- copy(baseline_mcid)
The available variables in the source term
data are:
# Variables in the practice data set
names(source_term)
#> [1] "mcid" "institution" "term"
#> [4] "cip6" "level" "standing"
#> [7] "coop" "hours_term" "hours_term_attempt"
#> [10] "hours_cumul" "hours_cumul_attempt" "gpa_term"
#> [13] "gpa_cumul"
Select. Select our variables and the key (ID).
# Extract desired columns
cols_we_want <- source_term[, .(mcid, term, hours_term, gpa_term)]
Add variables. Left join to add new columns.
# Add desired columns
cols_we_want[DT, on = c("mcid")]
#> mcid term hours_term gpa_term
#> <char> <char> <num> <num>
#> 1: MCID3111142689 19883 9 3.33
#> 2: MCID3111142782 19883 16 2.80
#> 3: MCID3111142782 19885 4 3.00
#> ---
#> 531417: MCID3112870009 19953 12 3.57
#> 531418: MCID3112870009 19954 1 4.00
#> 531419: MCID3112870009 19983 7 4.00
Rows in X with multiple matches in Y. Consistent with the left join matching rules, students in enrolled in multiple terms will have multiple rows in the joined data frame. Thus this result has 531,419 observations of 76,875 unique students.
From degree
Reset Reset the working data frame.
# Reusable starting state
DT <- copy(baseline_mcid)
DT
#> mcid
#> <char>
#> 1: MCID3111142689
#> 2: MCID3111142782
#> 3: MCID3111142881
#> ---
#> 76873: MCID3112785480
#> 76874: MCID3112800920
#> 76875: MCID3112870009
The available variables in the source degree
data
are:
# Variables in the practice data set
names(source_degree)
#> [1] "mcid" "institution" "term_degree" "cip6" "degree"
Select. Select two variables and the key (ID).
# Extract desired columns
cols_we_want <- source_degree[, .(mcid, cip6, degree)]
Add variables. Left join to add new columns.
# Add desired columns
cols_we_want[DT, on = c("mcid")]
#> mcid cip6 degree
#> <char> <char> <char>
#> 1: MCID3111142689 090401 Bachelor of Arts in Journalism
#> 2: MCID3111142782 260101 Bachelor of Science in Biological Sciences
#> 3: MCID3111142881 450601 Bachelor of Arts in Economics
#> ---
#> 76990: MCID3112785480 <NA> <NA>
#> 76991: MCID3112800920 <NA> <NA>
#> 76992: MCID3112870009 <NA> <NA>
Rows in X with no match in Y. Consistent with the left
join matching rules, students in DT
who do not graduate
will have NA values in the term_degree
and
cip6
columns of the joined data tables. Thus this result
has 76,875 unique students of whom 43,786 earned degrees.
Reusable code
Program labels preparation. The CIP baseline data frame we preserved earlier is the intake for this section.
DT <- copy(baseline_cip)
Program labels. A summary code chunk for ready reference.
In gathering a bloc of starters, the join-by variable might be
start
instead of cip6
.
# Filter by program
DT <- study_programs[DT, on = c("cip6"), nomatch = NULL]
DT[, cip6 := NULL]
DT <- unique(DT)
Demographics preparation. The data frame of baseline IDs is the intake for this section.
DT <- copy(baseline_mcid)
Demographics. A summary code chunk for ready reference.
# Join race/ethnicity and sex
cols_we_want <- student[, .(mcid, race, sex)]
DT <- cols_we_want[DT, on = c("mcid")]