In this study we present a complete case, from metric to charts, emphasizing the process of manipulating longitudinal data rather than the code, functions, or arguments. Subsequent articles deal with those details.
Outline
In any study, the data handling process depends on how you structure your data and how you define your metrics, population, blocs, programs, and groupings. In this study, we select the following:
Data. The midfieldr program codes in cip and
the midfielddata practice data tables student,
term, and degree.
Programs. Civil, Electrical, Industrial, and Mechanical Engineering.
Metric. Program stickiness: the ratio \small (S) of the number of graduates of a program \small (N_\textrm{grad}) to the number ever enrolled in the program \small (N_\textrm{ever}). This metric includes part-time students, transfers, students admitted in any term, and migrators (Ohland et al. 2012).
\small S = \frac{\small N_\textrm{grad}}{\small N_\textrm{ever}} = \frac{\small\mathrm{number\ of\ graduates\ of\ a\ program}}{\small\mathrm{number\ ever\ enrolled\ in\ the\ program}}
Records. Exclude records later than a student’s first degree term.
Population. Degree-seeking students for whom the data sufficiency constraint is satisfied. No exclusions due to part-time status, transfer status, admission term, or starting program.
Blocs. The stickiness metric requires two blocs: students ever enrolled in the programs; and timely graduates of the programs without regard to their starting programs.
Groupings. Program, race/ethnicity, and sex will be used for grouping and summarizing. Groups too small to preserve anonymity will be excluded.
Results. To group and summarize, we require a data frame with columns for bloc, program, race/ethnicity, and sex, keyed by student ID. After summarizing and computing the metric we should have columns for the metric (stickiness) and for each grouping variable (program, race/ethnicity, sex).
If you are writing your own script to follow along, we use these packages in this article:
Programs
One can start an analysis with program data or with student record
data—the choice is arbitrary. We start with programs and set the results
aside until needed. Our goal is to search the CIP data table for the
6-digit codes for our programs. The cip dataset loads with
midfieldr, and is documented in ?cip.
Finding the program codes entails some trial and error. Let’s start with Civil Engineering.
filter_cip("civil engineering", cip = cip)
#> cip6name cip6
#> <char> <char>
#> 1: Civil Engineering, General 140801
#> 2: Geotechnical Engineering 140802
#> 3: Structural Engineering 140803
#> 4: Transportation and Highway Engineering 140804
#> 5: Water Resources Engineering 140805
#> 6: Civil Engineering, Other 140899
#> 7: Civil Engineering Technology, Technician 150201
#> 8: Civil Drafting and Civil Engineering CAD, CADD 151304
#> cip4name cip4
#> <char> <char>
#> 1: Civil Engineering 1408
#> 2: Civil Engineering 1408
#> 3: Civil Engineering 1408
#> 4: Civil Engineering 1408
#> 5: Civil Engineering 1408
#> 6: Civil Engineering 1408
#> 7: Civil Engineering Technologies, Technicians 1502
#> 8: Drafting, Design Engineering Technologies, Technicians 1513
#> cip2name cip2
#> <char> <char>
#> 1: Engineering 14
#> 2: Engineering 14
#> 3: Engineering 14
#> 4: Engineering 14
#> 5: Engineering 14
#> 6: Engineering 14
#> 7: Engineering Technology 15
#> 8: Engineering Technology 15We note two things:
- The 2-digit code for Engineering is “14”
- Civil Engineering (not “technology”) has 4-digit code of “1408”
We take a first pass searching for “^14” (regular expression for “line that starts with 14”) and use that result as the cip-argument in the next pass.
engr_cip <- filter_cip("^14", cip = cip)
filter_cip("civil", cip = engr_cip)
#> cip6name cip6 cip4name cip4
#> <char> <char> <char> <char>
#> 1: Civil Engineering, General 140801 Civil Engineering 1408
#> 2: Geotechnical Engineering 140802 Civil Engineering 1408
#> 3: Structural Engineering 140803 Civil Engineering 1408
#> 4: Transportation and Highway Engineering 140804 Civil Engineering 1408
#> 5: Water Resources Engineering 140805 Civil Engineering 1408
#> 6: Civil Engineering, Other 140899 Civil Engineering 1408
#> cip2name cip2
#> <char> <char>
#> 1: Engineering 14
#> 2: Engineering 14
#> 3: Engineering 14
#> 4: Engineering 14
#> 5: Engineering 14
#> 6: Engineering 14In Engineering (at least), the 4-digit code seems to correspond to
department-level, degree-granting programs. Again using
engr_cip, we search for “electrical” and find the 4-digit
code of “1410.”
filter_cip("electrical", cip = engr_cip)
#> cip6name cip6
#> <char> <char>
#> 1: Electrical, Electronics and Communications Engineering 141001
#> 2: Laser and Optical Engineering 141003
#> 3: Telecommunications Engineering 141004
#> 4: Electrical, Electronics and Communications Engineering, Other 141099
#> cip4name cip4 cip2name
#> <char> <char> <char>
#> 1: Electrical, Electronics and Communications Engineering 1410 Engineering
#> 2: Electrical, Electronics and Communications Engineering 1410 Engineering
#> 3: Electrical, Electronics and Communications Engineering 1410 Engineering
#> 4: Electrical, Electronics and Communications Engineering 1410 Engineering
#> cip2
#> <char>
#> 1: 14
#> 2: 14
#> 3: 14
#> 4: 14Continuing in a similar fashion, we find that our programs have the following 4-digit codes:
- Civil Engineering 1408
- Electrical Engineering 1410
- Mechanical Engineering 1419
- Industrial/Systems Engineering 1427, 1435, 1436, and 1437.
We create a search string of 4-digit codes and search. We drop all
columns except the 6-digit names and 6-digit codes: cip6 is
the link from these data to the student records.
codes_we_want <- c("^1408", "^1410", "^1419", "^1427", "^1435", "^1436", "^1437")
programs <- filter_cip(codes_we_want, cip = engr_cip)
programs <- programs[, .(cip6name, cip6)]
programs
#> cip6name cip6
#> <char> <char>
#> 1: Civil Engineering, General 140801
#> 2: Geotechnical Engineering 140802
#> 3: Structural Engineering 140803
#> 4: Transportation and Highway Engineering 140804
#> 5: Water Resources Engineering 140805
#> 6: Civil Engineering, Other 140899
#> 7: Electrical, Electronics and Communications Engineering 141001
#> 8: Laser and Optical Engineering 141003
#> 9: Telecommunications Engineering 141004
#> 10: Electrical, Electronics and Communications Engineering, Other 141099
#> 11: Mechanical Engineering 141901
#> 12: Systems Engineering 142701
#> 13: Industrial Engineering 143501
#> 14: Manufacturing Engineering 143601
#> 15: Operations Research 143701The program names in cip are usually too long for
effective use—user-defined names are nearly always required. So we add a
program variable with values “CE” (Civil Engineering), “EE”
(electrical), “ME” (Mechanical), and “ISE” (Industrial/Systems
Engineering). We also abbreviate a couple of terms for a slightly more
compact display.
programs[, program := fcase(
cip6 %like% "^1408", "CE",
cip6 %like% "^1410", "EE",
cip6 %like% "^1419", "ME",
cip6 %like% c("^1427|^1435|^1436|^1437"), "ISE",
default = NA_character_
)]
programs[, cip6name := gsub("Engineering", "Engng", cip6name)]
programs[, cip6name := gsub("Communications", "Commn", cip6name)]
programs
#> cip6name cip6 program
#> <char> <char> <char>
#> 1: Civil Engng, General 140801 CE
#> 2: Geotechnical Engng 140802 CE
#> 3: Structural Engng 140803 CE
#> 4: Transportation and Highway Engng 140804 CE
#> 5: Water Resources Engng 140805 CE
#> 6: Civil Engng, Other 140899 CE
#> 7: Electrical, Electronics and Commn Engng 141001 EE
#> 8: Laser and Optical Engng 141003 EE
#> 9: Telecommunications Engng 141004 EE
#> 10: Electrical, Electronics and Commn Engng, Other 141099 EE
#> 11: Mechanical Engng 141901 ME
#> 12: Systems Engng 142701 ISE
#> 13: Industrial Engng 143501 ISE
#> 14: Manufacturing Engng 143601 ISE
#> 15: Operations Research 143701 ISEOur programs data frame is complete: 15 six-digit codes are encoded
using 4 program names The original 6-digit names act as a check on the
abbreviations we assigned. Later, as we prepare the blocs required for
the metric, we will join the program values to the blocs by
matching on cip6.
This data frame can sit in memory (or written to file) until we’re ready to filter the blocs by program.
Records
Load the data. We load three of the midfielddata data tables.
data(student, term, degree)Copy the source data. We usually copy the source data,
giving them new names (and new locations in memory), to keep them intact
while we use the original names — student,
term, and degree — to do our work. Otherwise,
the source data would be altered “by reference” as we work.
Having student, term, and
degree in our computational environment is convenient
because these names appear as the default values for arguments in some
midfieldr functions. For example, if the object degree
exists in the environment, then we can write
add_post_bacc(dframe) instead of
add_post_bacc(dframe, midfield_degree = degree).
Choose columns required by midfieldr functions. Optional, but convenient for viewing data frames at intermediate stages. We subset our data to select the minimum set of columns required by midfieldr functions.
student <- select_basic_cols(student)
term <- select_basic_cols(term)
degree <- select_basic_cols(degree)
look_at(student)
#> Classes 'data.table' and 'data.frame': 97555 obs. of 4 variables:
#> $ mcid : chr "MCID3111142225" "MCID3111142283" "MCID3111142290" "MCID"..
#> $ institution: chr "Institution B" "Institution J" "Institution J" "Institu"..
#> $ race : chr "Asian" "Asian" "Asian" "Asian" ...
#> $ sex : chr "Male" "Female" "Male" "Male" ...
look_at(term)
#> Classes 'data.table' and 'data.frame': 639915 obs. of 5 variables:
#> $ mcid : chr "MCID3111142225" "MCID3111142283" "MCID3111142283" "MCID"..
#> $ institution: chr "Institution B" "Institution J" "Institution J" "Institu"..
#> $ term : chr "19881" "19881" "19883" "19885" ...
#> $ cip6 : chr "140901" "240102" "240102" "190601" ...
#> $ level : chr "01 First-year" "01 First-year" "01 First-year" "01 Firs"..
look_at(degree)
#> Classes 'data.table' and 'data.frame': 49665 obs. of 4 variables:
#> $ mcid : chr "MCID3111142225" "MCID3111142290" "MCID3111142294" "MCID"..
#> $ institution: chr "Institution B" "Institution J" "Institution J" "Institu"..
#> $ term_degree: chr "19881" "19921" "19903" "19921" ...
#> $ cip6 : chr "141001" "141001" "141001" "141001" ...Identify rows of post-baccalaureate terms to exclude. This
step identifies terms later than the first baccalaureate, if any. We are
not generally interested in terms beyond the first degree term, so we
use the results to exclude post-first-degree terms. This step does not
apply to the student table because it contains no term
information.
This step adds two columns (first_degree_term and
term_status) to a data frame:
term <- add_post_bacc(term, midfield_degree = degree)
degree <- add_post_bacc(degree, midfield_degree = degree)
look_at(term)
#> Classes 'data.table' and 'data.frame': 639915 obs. of 7 variables:
#> $ mcid : chr "MCID3111142225" "MCID3111142283" "MCID3111142283""..
#> $ institution : chr "Institution B" "Institution J" "Institution J" "I"..
#> $ cip6 : chr "140901" "240102" "240102" "190601" ...
#> $ level : chr "01 First-year" "01 First-year" "01 First-year" "0"..
#> $ term : chr "19881" "19881" "19883" "19885" ...
#> $ first_degree_term: chr "19881" NA NA NA ...
#> $ term_status : chr "first-degree" "pre-bacc" "pre-bacc" "pre-bacc" ...
look_at(degree)
#> Classes 'data.table' and 'data.frame': 49665 obs. of 6 variables:
#> $ mcid : chr "MCID3111142225" "MCID3111142290" "MCID3111142294""..
#> $ institution : chr "Institution B" "Institution J" "Institution J" "I"..
#> $ cip6 : chr "141001" "141001" "141001" "141001" ...
#> $ term_degree : chr "19881" "19921" "19903" "19921" ...
#> $ first_degree_term: chr "19881" "19921" "19903" "19921" ...
#> $ term_status : chr "first-degree" "first-degree" "first-degree" "firs"..Summarizing by term status, we have a total of 6998
post-baccalaureate terms in term and 47 in degree.
term[, .N, by = c("term_status")]
#> term_status N
#> <char> <int>
#> 1: first-degree 34440
#> 2: pre-bacc 598477
#> 3: post-first-degree 6998
degree[, .N, by = c("term_status")]
#> term_status N
#> <char> <int>
#> 1: first-degree 49618
#> 2: post-first-degree 47We exclude rows with post-baccalaureate terms.
term <- term[term_status != "post-first-degree"]
degree <- degree[term_status != "post-first-degree"]Dropping the two temporary columns and ensuring rows are unique yields the baseline records in their final configuration.
cols_to_drop <- c("first_degree_term", "term_status")
term[, (cols_to_drop) := NULL]
degree[, (cols_to_drop) := NULL]
student <- unique(student)
term <- unique(term)
degree <- unique(degree)These three data frames are the basis of all further analysis.
look_at(student)
#> Classes 'data.table' and 'data.frame': 97555 obs. of 4 variables:
#> $ mcid : chr "MCID3111142225" "MCID3111142283" "MCID3111142290" "MCID"..
#> $ institution: chr "Institution B" "Institution J" "Institution J" "Institu"..
#> $ race : chr "Asian" "Asian" "Asian" "Asian" ...
#> $ sex : chr "Male" "Female" "Male" "Male" ...
look_at(term)
#> Classes 'data.table' and 'data.frame': 632917 obs. of 5 variables:
#> $ mcid : chr "MCID3111142225" "MCID3111142283" "MCID3111142283" "MCID"..
#> $ institution: chr "Institution B" "Institution J" "Institution J" "Institu"..
#> $ cip6 : chr "140901" "240102" "240102" "190601" ...
#> $ level : chr "01 First-year" "01 First-year" "01 First-year" "01 Firs"..
#> $ term : chr "19881" "19881" "19883" "19885" ...
look_at(degree)
#> Classes 'data.table' and 'data.frame': 49618 obs. of 4 variables:
#> $ mcid : chr "MCID3111142225" "MCID3111142290" "MCID3111142294" "MCID"..
#> $ institution: chr "Institution B" "Institution J" "Institution J" "Institu"..
#> $ cip6 : chr "141001" "141001" "141001" "141001" ...
#> $ term_degree: chr "19881" "19921" "19903" "19921" ...Population
Our goal is developing a data frame with one column of IDs that
satisfies our constraints of data sufficiency and degree-seeking. We
start with the unique IDs from the term data table we just
derived.
DT <- copy(term)
DT <- DT[, .(mcid)]
DT <- unique(DT)
DT
#> mcid
#> <char>
#> 1: MCID3111142225
#> 2: MCID3111142283
#> 3: MCID3111142290
#> 4: MCID3111142294
#> 5: MCID3111142299
#> ---
#> 97532: MCID3112898886
#> 97533: MCID3112898890
#> 97534: MCID3112898894
#> 97535: MCID3112898895
#> 97536: MCID3112898940Filter rows for data sufficiency. Data sufficiency requires the timely completion term estimate.
DT <- add_timely_term(DT, midfield_term = term)
DT
#> mcid term_i level_i adj_span timely_term
#> <char> <char> <char> <num> <char>
#> 1: MCID3111142225 19881 01 First-year 6 19933
#> 2: MCID3111142283 19881 01 First-year 6 19933
#> 3: MCID3111142290 19881 01 First-year 6 19933
#> 4: MCID3111142294 19881 01 First-year 6 19933
#> 5: MCID3111142299 19881 01 First-year 6 19933
#> ---
#> 97532: MCID3112898886 20181 01 First-year 6 20233
#> 97533: MCID3112898890 20181 01 First-year 6 20233
#> 97534: MCID3112898894 20181 01 First-year 6 20233
#> 97535: MCID3112898895 20181 01 First-year 6 20233
#> 97536: MCID3112898940 20181 01 First-year 6 20233We can drop the columns of supporting variables and add the data sufficiency columns.
cols_to_drop <- c("level_i", "adj_span")
DT[, (cols_to_drop) := NULL]
DT <- add_data_sufficiency(DT, midfield_term = term)
DT
#> mcid timely_term term_i lower_limit upper_limit
#> <char> <char> <char> <char> <char>
#> 1: MCID3111142225 19933 19881 19881 20181
#> 2: MCID3111142283 19933 19881 19881 20096
#> 3: MCID3111142290 19933 19881 19881 20096
#> 4: MCID3111142294 19933 19881 19881 20096
#> 5: MCID3111142299 19933 19881 19881 20096
#> ---
#> 97532: MCID3112898886 20233 20181 19881 20181
#> 97533: MCID3112898890 20233 20181 19881 20181
#> 97534: MCID3112898894 20233 20181 19881 20181
#> 97535: MCID3112898895 20233 20181 19881 20181
#> 97536: MCID3112898940 20233 20181 19881 20181
#> data_sufficiency
#> <char>
#> 1: exclude-lower
#> 2: exclude-lower
#> 3: exclude-lower
#> 4: exclude-lower
#> 5: exclude-lower
#> ---
#> 97532: exclude-upper
#> 97533: exclude-upper
#> 97534: exclude-upper
#> 97535: exclude-upper
#> 97536: exclude-upperSummarizing by data sufficiency, we have a total of 20,671 rows to exclude due to insufficient data, leaving a population of 76,865.
DT[, .N, by = c("data_sufficiency")]
#> data_sufficiency N
#> <char> <int>
#> 1: exclude-lower 2746
#> 2: include 76865
#> 3: exclude-upper 17925Filtering for the “include” label and dropping all but the ID column,
DT <- DT[data_sufficiency == "include", .(mcid)]
DT
#> mcid
#> <char>
#> 1: MCID3111142689
#> 2: MCID3111142782
#> 3: MCID3111142881
#> 4: MCID3111142884
#> 5: MCID3111142893
#> ---
#> 76861: MCID3112727985
#> 76862: MCID3112730841
#> 76863: MCID3112785480
#> 76864: MCID3112800920
#> 76865: MCID3112870009Filter rows for degree seeking. Using this result, we
apply an inner join with the student data table, which contains
degree-seeking students only. The result contains rows that are common
to both by ID—in effect, removing rows from DT of any
non-degree-seeking students.
DT <- student[DT, on = "mcid", nomatch = NULL]
DT
#> mcid institution race sex
#> <char> <char> <char> <char>
#> 1: MCID3111142689 Institution B Hispanic Female
#> 2: MCID3111142782 Institution J Hispanic Female
#> 3: MCID3111142881 Institution B International Male
#> 4: MCID3111142884 Institution B International Male
#> 5: MCID3111142893 Institution B International Male
#> ---
#> 76861: MCID3112727985 Institution B White Female
#> 76862: MCID3112730841 Institution B Hispanic Female
#> 76863: MCID3112785480 Institution C White Male
#> 76864: MCID3112800920 Institution B White Female
#> 76865: MCID3112870009 Institution B White MaleIt happens that all students in the practice data are degree-seeking, so this step did not reduce the size of our population. (We include the step to illustrate our complete process.)
Extracting the column of IDs yields our baseline population.
population <- DT[, .(mcid)]
population
#> mcid
#> <char>
#> 1: MCID3111142689
#> 2: MCID3111142782
#> 3: MCID3111142881
#> 4: MCID3111142884
#> 5: MCID3111142893
#> ---
#> 76861: MCID3112727985
#> 76862: MCID3112730841
#> 76863: MCID3112785480
#> 76864: MCID3112800920
#> 76865: MCID3112870009Blocs and groupings
Before computing our metric, we must group and summarize our final population. We have two blocs (from the definition of the metric):
- ever-enrolled
- graduates
and three grouping variables
- program
- race/ethnicity
- sex
We have a lot of flexibility in the order in which we construct our blocs and groupings, so what follows is only one of several effective solutions.
Timely-graduates bloc
We start with the baseline population. Like we did with the original
source data files, we copy it to protect population from
“by reference” changes.
graduates <- copy(population)Because we plan on this becoming a bloc of timely graduates, we add a
bloc variable with the value “grad”.
graduates[, bloc := "grad"]
graduates
#> mcid bloc
#> <char> <char>
#> 1: MCID3111142689 grad
#> 2: MCID3111142782 grad
#> 3: MCID3111142881 grad
#> 4: MCID3111142884 grad
#> 5: MCID3111142893 grad
#> ---
#> 76861: MCID3112727985 grad
#> 76862: MCID3112730841 grad
#> 76863: MCID3112785480 grad
#> 76864: MCID3112800920 grad
#> 76865: MCID3112870009 gradInner join with degree excludes rows of
non-completers.
graduates <- degree[graduates, .(mcid, bloc, cip6, term_degree),
on = "mcid", nomatch = NULL
]
graduates
#> mcid bloc cip6 term_degree
#> <char> <char> <char> <char>
#> 1: MCID3111142689 grad 090401 19913
#> 2: MCID3111142782 grad 260101 19903
#> 3: MCID3111142881 grad 450601 19894
#> 4: MCID3111142965 grad 141001 19901
#> 5: MCID3111143066 grad 090401 19883
#> ---
#> 43843: MCID3112693003 grad 260406 20171
#> 43844: MCID3112693979 grad 450601 20181
#> 43845: MCID3112694738 grad 230101 20143
#> 43846: MCID3112698681 grad 110701 20181
#> 43847: MCID3112730841 grad 040401 20164Filter rows for timely completion. Determining a student’s completion status requires the timely term variable.
graduates <- add_timely_term(graduates, midfield_term = term)
graduates
#> mcid bloc cip6 term_degree term_i level_i adj_span
#> <char> <char> <char> <char> <char> <char> <num>
#> 1: MCID3111142689 grad 090401 19913 19883 01 First-year 6
#> 2: MCID3111142782 grad 260101 19903 19883 01 First-year 6
#> 3: MCID3111142881 grad 450601 19894 19893 01 First-year 6
#> 4: MCID3111142965 grad 141001 19901 19883 01 First-year 6
#> 5: MCID3111143066 grad 090401 19883 19883 01 First-year 6
#> ---
#> 43843: MCID3112693003 grad 260406 20171 20093 01 First-year 6
#> 43844: MCID3112693979 grad 450601 20181 20114 01 First-year 6
#> 43845: MCID3112694738 grad 230101 20143 20103 01 First-year 6
#> 43846: MCID3112698681 grad 110701 20181 20113 01 First-year 6
#> 43847: MCID3112730841 grad 040401 20164 20121 01 First-year 6
#> timely_term
#> <char>
#> 1: 19941
#> 2: 19941
#> 3: 19951
#> 4: 19941
#> 5: 19941
#> ---
#> 43843: 20151
#> 43844: 20173
#> 43845: 20161
#> 43846: 20171
#> 43847: 20173We can drop the columns of supporting variables and add the completion status columns.
cols_to_drop <- c("term_i", "level_i", "adj_span")
graduates[, (cols_to_drop) := NULL]
graduates <- add_completion_status(graduates, midfield_degree = degree)
graduates
#> mcid bloc cip6 timely_term term_degree completion_status
#> <char> <char> <char> <char> <char> <char>
#> 1: MCID3111142689 grad 090401 19941 19913 timely
#> 2: MCID3111142782 grad 260101 19941 19903 timely
#> 3: MCID3111142881 grad 450601 19951 19894 timely
#> 4: MCID3111142965 grad 141001 19941 19901 timely
#> 5: MCID3111143066 grad 090401 19941 19883 timely
#> ---
#> 43843: MCID3112693003 grad 260406 20151 20171 late
#> 43844: MCID3112693979 grad 450601 20173 20181 late
#> 43845: MCID3112694738 grad 230101 20161 20143 timely
#> 43846: MCID3112698681 grad 110701 20171 20181 late
#> 43847: MCID3112730841 grad 040401 20173 20164 timelySummarizing by completion status, we find 40,490 rows with timely completion.
graduates[, .N, by = "completion_status"]
#> completion_status N
#> <char> <int>
#> 1: timely 40490
#> 2: late 3357Filtering for the “timely” label and keeping the ID, bloc, and CIP columns,
graduates <- graduates[completion_status == "timely", .(mcid, bloc, cip6)]
graduates
#> mcid bloc cip6
#> <char> <char> <char>
#> 1: MCID3111142689 grad 090401
#> 2: MCID3111142782 grad 260101
#> 3: MCID3111142881 grad 450601
#> 4: MCID3111142965 grad 141001
#> 5: MCID3111143066 grad 090401
#> ---
#> 40486: MCID3112675459 grad 261310
#> 40487: MCID3112675472 grad 500703
#> 40488: MCID3112692944 grad 090101
#> 40489: MCID3112694738 grad 230101
#> 40490: MCID3112730841 grad 040401We can complete this stage by filtering for programs.
graduates <- programs[graduates, .(mcid, bloc, program, cip6), on = "cip6"]
graduates
#> mcid bloc program cip6
#> <char> <char> <char> <char>
#> 1: MCID3111142689 grad <NA> 090401
#> 2: MCID3111142782 grad <NA> 260101
#> 3: MCID3111142881 grad <NA> 450601
#> 4: MCID3111142965 grad EE 141001
#> 5: MCID3111143066 grad <NA> 090401
#> ---
#> 40486: MCID3112675459 grad <NA> 261310
#> 40487: MCID3112675472 grad <NA> 500703
#> 40488: MCID3112692944 grad <NA> 090101
#> 40489: MCID3112694738 grad <NA> 230101
#> 40490: MCID3112730841 grad <NA> 040401Then drop all rows with an NA program value
graduates <- graduates[!is.na(program)]
graduates
#> mcid bloc program cip6
#> <char> <char> <char> <char>
#> 1: MCID3111142965 grad EE 141001
#> 2: MCID3111145102 grad EE 141001
#> 3: MCID3111146537 grad EE 141001
#> 4: MCID3111146674 grad EE 141001
#> 5: MCID3111150194 grad ISE 143501
#> ---
#> 3259: MCID3112618553 grad ME 141901
#> 3260: MCID3112618574 grad ME 141901
#> 3261: MCID3112618976 grad ME 141901
#> 3262: MCID3112619484 grad EE 141001
#> 3263: MCID3112641535 grad ME 141901Drop the CIP code
graduates[, cip6 := NULL]
graduates <- unique(graduates)
graduates
#> mcid bloc program
#> <char> <char> <char>
#> 1: MCID3111142965 grad EE
#> 2: MCID3111145102 grad EE
#> 3: MCID3111146537 grad EE
#> 4: MCID3111146674 grad EE
#> 5: MCID3111150194 grad ISE
#> ---
#> 3259: MCID3112618553 grad ME
#> 3260: MCID3112618574 grad ME
#> 3261: MCID3112618976 grad ME
#> 3262: MCID3112619484 grad EE
#> 3263: MCID3112641535 grad MESummarize to check results so far
graduates[, .N, by = "program"]
#> program N
#> <char> <int>
#> 1: EE 736
#> 2: ISE 238
#> 3: ME 1353
#> 4: CE 936Ever-enrolled group
Again, we start with the baseline set of IDs and we add a
bloc variable with the value “ever”.
ever_enrolled <- copy(population)
ever_enrolled[, bloc := "ever"]
ever_enrolled
#> mcid bloc
#> <char> <char>
#> 1: MCID3111142689 ever
#> 2: MCID3111142782 ever
#> 3: MCID3111142881 ever
#> 4: MCID3111142884 ever
#> 5: MCID3111142893 ever
#> ---
#> 76861: MCID3112727985 ever
#> 76862: MCID3112730841 ever
#> 76863: MCID3112785480 ever
#> 76864: MCID3112800920 ever
#> 76865: MCID3112870009 everLeft-join adds all terms
ever_enrolled <- term[ever_enrolled, .(mcid, bloc, cip6), on = "mcid"]
ever_enrolled
#> mcid bloc cip6
#> <char> <char> <char>
#> 1: MCID3111142689 ever 090401
#> 2: MCID3111142782 ever 260101
#> 3: MCID3111142782 ever 260101
#> 4: MCID3111142782 ever 260101
#> 5: MCID3111142782 ever 260101
#> ---
#> 525442: MCID3112800920 ever 240199
#> 525443: MCID3112870009 ever 240102
#> 525444: MCID3112870009 ever 240102
#> 525445: MCID3112870009 ever 240102
#> 525446: MCID3112870009 ever 240102Filter by programs
ever_enrolled <- programs[ever_enrolled, .(mcid, bloc, program, cip6), on = "cip6"]
ever_enrolled <- ever_enrolled[!is.na(program)]
ever_enrolled <- unique(ever_enrolled)
ever_enrolled
#> mcid bloc program cip6
#> <char> <char> <char> <char>
#> 1: MCID3111142965 ever EE 141001
#> 2: MCID3111145102 ever EE 141001
#> 3: MCID3111146537 ever EE 141001
#> 4: MCID3111146674 ever EE 141001
#> 5: MCID3111150194 ever ISE 143501
#> ---
#> 5579: MCID3112619484 ever EE 141001
#> 5580: MCID3112619666 ever ME 141901
#> 5581: MCID3112641399 ever ME 141901
#> 5582: MCID3112641535 ever ME 141901
#> 5583: MCID3112698681 ever ME 141901