We examine the structure of the four data tables in midfielddata: number of observations, number and class of variables, representative values, and database keys.
| Table | Each row is | N students | N rows | N columns | Mb memory |
|---|---|---|---|---|---|
| course | one student per term & course | 97,555 | 3,289,532 | 12 | 324.3 |
| term | one student per term | 97,555 | 639,915 | 13 | 72.8 |
| student | one student | 97,555 | 97,555 | 13 | 17.3 |
| degree | one student per degree | 49,543 | 49,665 | 5 | 5.2 |
Load data
Start. If you are writing your own script to follow along, we use these packages in this article:
Load data tables. Data tables can be loaded individually or collectively as needed.
student
Student-level demographic information for 97,555 degree-seeking undergraduate students. Data at the “student-level” refers to information collected by undergraduate institutions about individual students at matriculation, for example, age, sex, and race/ethnicity.
The student data table has 97,555 observations of the 13
variables listed below. View the data dictionary using
?student.
Variable Class
1 mcid character
2 institution character
3 transfer character
4 hours_transfer numeric
5 race character
6 sex character
7 age_desc character
8 us_citizen character
9 home_zip character
10 high_school character
11 sat_math numeric
12 sat_verbal numeric
13 act_comp numeric
Student data are structured in row-record form, that is, information associated with a particular ID occupies a single row—one record per student.
# First few rows and last few rows
student
#> mcid institution transfer hours_transfer
#> <char> <char> <char> <num>
#> 1: MCID3111142225 Institution B First-Time Transfer NA
#> 2: MCID3111142283 Institution J First-Time Transfer NA
#> 3: MCID3111142290 Institution J First-Time Transfer NA
#> ---
#> 97553: MCID3112898894 Institution B First-Time in College NA
#> 97554: MCID3112898895 Institution B First-Time in College NA
#> 97555: MCID3112898940 Institution B First-Time in College NA
#> race sex age_desc us_citizen home_zip high_school sat_math
#> <char> <char> <char> <char> <char> <char> <num>
#> 1: Asian Male Under 25 Yes <NA> <NA> NA
#> 2: Asian Female Under 25 Yes 22020 <NA> 560
#> 3: Asian Male Under 25 Yes 23233 471872 510
#> ---
#> 97553: White Female Under 25 Yes 53716 501160 510
#> 97554: White Female Under 25 Yes 53029 500853 420
#> 97555: Other/Unknown Male Under 25 Yes 20016 090073 470
#> sat_verbal act_comp
#> <num> <num>
#> 1: NA NA
#> 2: 230 NA
#> 3: 380 NA
#> ---
#> 97553: 590 24
#> 97554: 590 32
#> 97555: 540 32Here, the ID variable is the primary key, as well as being the key common to all four data sets. We can verify that this column provides a unique key as follows:
# Check that the indicated column forms a primary key
key_col_names <- c("mcid")
wrapr::checkColsFormUniqueKeys(student, key_col_names)
#> [1] TRUE
# Primary key variable should have no missing values
sum(is.na(student$mcid))
#> [1] 0Student IDs and institution names have been anonymized to remove identifiable information.
# Anonymized IDs
sample(student$mcid, 8)
#> [1] "MCID3111312879" "MCID3112197606" "MCID3111735491" "MCID3112243871"
#> [5] "MCID3112849521" "MCID3112669477" "MCID3111754188" "MCID3112290710"
# Anonymized institutions
sort(unique(student$institution))
#> [1] "Institution B" "Institution C" "Institution J"Race/ethnicity and sex are often used as grouping variables.
# Possible values
sort(unique(student$race))
#> [1] "Asian" "Black" "Hispanic" "International"
#> [5] "Native American" "Other/Unknown" "White"
# Possible values
sort(unique(student$sex))
#> [1] "Female" "Male" "Unknown"Counts in each category.
# N by institution
student[order(institution), .N, by = "institution"]
#> institution N
#> <char> <int>
#> 1: Institution B 45660
#> 2: Institution C 26712
#> 3: Institution J 25183
# N by race
student[order(race), .N, by = "race"]
#> race N
#> <char> <int>
#> 1: Asian 4193
#> 2: Black 1860
#> 3: Hispanic 5386
#> 4: International 7354
#> 5: Native American 403
#> 6: Other/Unknown 4509
#> 7: White 73850
# N by sex
student[order(sex), .N, by = "sex"]
#> sex N
#> <char> <int>
#> 1: Female 46403
#> 2: Male 51151
#> 3: Unknown 1
course
Student-level course information for 97,555 undergraduates including course name and number, credit hours, and student grades for individual students in each term and course.
The course data table has 3,289,532 observations of the
12 variables listed below. View the data dictionary using
?course.
Variable Class
1 mcid character
2 institution character
3 term_course character
4 course character
5 abbrev character
6 number character
7 section character
8 type character
9 faculty_rank character
10 hours_course numeric
11 grade character
12 discipline_midfield character
Course data are structured in block-record form, that is, records associated with a particular ID can span multiple rows—one record per student per term per course.
# First few rows and last few rows
course
#> mcid institution term_course course
#> <char> <char> <char> <char>
#> 1: MCID3111142225 Institution B 19881 Microprocessor Lab
#> 2: MCID3111142225 Institution B 19881 Neural Signals
#> 3: MCID3111142225 Institution B 19881 Engineering Economy
#> ---
#> 3289530: MCID3112898940 Institution B 20181 Beginning Japanese 1
#> 3289531: MCID3112898940 Institution B 20181 Precalculus Mathematics
#> 3289532: MCID3112898940 Institution B 20181 Deviance In U S Society
#> abbrev number section type faculty_rank hours_course grade
#> <char> <char> <char> <char> <char> <num> <char>
#> 1: ECEN 2230 005 <NA> <NA> 1 C
#> 2: ECEN 4811 001 <NA> <NA> 3 C
#> 3: MCEN 4147 001 <NA> <NA> 3 B+
#> ---
#> 3289530: JPNS 1010 009 Face-to-Face Lecturer 5 C
#> 3289531: MATH 1150 012 Face-to-Face Lecturer 4 C-
#> 3289532: SOCY 1004 100 Face-to-Face Instructor 3 B
#> discipline_midfield
#> <char>
#> 1: Engineering: Electrical and Computer
#> 2: Engineering: Electrical and Computer
#> 3: Engineering: Mechanical
#> ---
#> 3289530: Language and Literature: Japanese
#> 3289531: Mathematics
#> 3289532: Social Sciences: SociologyHere, the key is a composite of four variables: student ID, course abbreviation and number, and term.
# Check that the indicated column forms a primary key
key_col_names <- c("mcid", "term_course", "abbrev", "number")
wrapr::checkColsFormUniqueKeys(course, key_col_names)
#> [1] TRUE
# Composite key variables should have no missing values
sum(is.na(course$mcid))
#> [1] 0
sum(is.na(course$term_course))
#> [1] 0
sum(is.na(course$abbrev))
#> [1] 0
sum(is.na(course$number))
#> [1] 0Sample of the course names. The course names do not contribute to the composite key, thus NAs are possible.
# Sample course names
sample(course$course, 10)
#> [1] "Sociology Of Aging"
#> [2] NA
#> [3] "War/Peace/Strat Defense"
#> [4] "Campaigns And Revolutions"
#> [5] NA
#> [6] "College Composition"
#> [7] "Ethics"
#> [8] NA
#> [9] "Shakespeare/Nonmajors"
#> [10] "Global Issues And International Affairs"Possible values of the faculty rank variable.
# Possible values
sort(unique(course$faculty_rank))
#> [1] "Academic Professional" "Adjunct Faculty"
#> [3] "Administrative Assistant" "Assistant Professor"
#> [5] "Associate Professor" "Clinical Faculty"
#> [7] "Graduate Assistant" "Instructor"
#> [9] "Lecturer" "Non-Academic Professional"
#> [11] "Other" "Pre/Post Doctoral"
#> [13] "Professor" "Research Faculty"
#> [15] "Senior Administrator" "Technician/Specialist"
#> [17] "Undergraduate Assistant" "Visiting Faculty"Using the course abbreviation and number keys, we can count the number of unique courses in these data.
term
Student-level term information for 135 undergraduates including program code, academic standing, and grade point average for individual students in each term.
The term data table has 639,915 observations of the 13
variables listed below. View the data dictionary using
?term.
Variable Class
1 mcid character
2 institution character
3 term character
4 cip6 character
5 level character
6 standing character
7 coop character
8 hours_term numeric
9 hours_term_attempt numeric
10 hours_cumul numeric
11 hours_cumul_attempt numeric
12 gpa_term numeric
13 gpa_cumul numeric
Term data are structured in block-record form, that is, records associated with a particular ID can span multiple rows—one record per student per term.
# First few rows and last few rows
term
#> mcid institution term cip6 level
#> <char> <char> <char> <char> <char>
#> 1: MCID3111142225 Institution B 19881 140901 01 First-year
#> 2: MCID3111142283 Institution J 19881 240102 01 First-year
#> 3: MCID3111142283 Institution J 19883 240102 01 First-year
#> ---
#> 639913: MCID3112898894 Institution B 20181 451001 01 First-year
#> 639914: MCID3112898895 Institution B 20181 302001 01 First-year
#> 639915: MCID3112898940 Institution B 20181 050103 01 First-year
#> standing coop hours_term hours_term_attempt hours_cumul
#> <char> <char> <num> <num> <num>
#> 1: Good Standing No 7 7 7
#> 2: Academic Probation No 6 6 6
#> 3: Academic Probation No 12 12 18
#> ---
#> 639913: Good Standing No 13 13 13
#> 639914: Good Standing No 18 18 18
#> 639915: Good Standing No 15 15 15
#> hours_cumul_attempt gpa_term gpa_cumul
#> <num> <num> <num>
#> 1: 7 2.56 2.56
#> 2: 6 1.85 1.85
#> 3: 18 1.93 1.90
#> ---
#> 639913: 13 3.52 3.52
#> 639914: 18 3.50 3.50
#> 639915: 15 2.18 2.18Here, the key is a composite of two variables: student ID and term.
# Check that the indicated column forms a primary key
key_col_names <- c("mcid", "term")
wrapr::checkColsFormUniqueKeys(term, key_col_names)
#> [1] TRUE
# Composite key variables should have no missing values
sum(is.na(term$mcid))
#> [1] 0
sum(is.na(term$term))
#> [1] 0Terms are encoded YYYYT, where
-
YYYYis the year at the start of the academic year, and
-
Tencodes the semester or quarter—Fall (1), Winter (2), Spring (3), and Summer (4,5, and6)—within an academic year
For example, for academic year 1995–96,
-
19951encodes Fall 95–96 -
19953encodes Spring 95–96 -
19954encodes Summer 95–96 (first session)
Different institutions supply data over different time spans.
# Range of data by institution
term[, .(min_term = min(term), max_term = max(term)), by = "institution"]
#> institution min_term max_term
#> <char> <char> <char>
#> 1: Institution B 19881 20181
#> 2: Institution J 19881 20096
#> 3: Institution C 19901 20154Programs are encoded in the cip6 variable, a 6-digit
character based on the 2010 Classification of Instructional Programs
(CIP) (NCES
2010).
# A sample of cip6 values
sort(unique(sample(term$cip6, 8)))
#> [1] "030103" "140201" "141201" "240199" "260901" "450701" "540101"Student level is used when determining timely completion terms of transfer students.
degree
Student-level degree information for 49,543 undergraduates including institution, program, term, and baccalaureate degree for each student.
The degree data table has 49,665 observations of the 5
variables listed below. View the data dictionary using
?degree.
Variable Class
1 mcid character
2 institution character
3 term_degree character
4 cip6 character
5 degree character
Degree data are structured in block-record form, that is, records associated with a particular ID can span multiple rows—one record per student per degree. Multiple degrees can occur in the same term or in different terms.
# First few rows and last few rows
degree
#> mcid institution term_degree cip6
#> <char> <char> <char> <char>
#> 1: MCID3111142225 Institution B 19881 141001
#> 2: MCID3111142290 Institution J 19921 141001
#> 3: MCID3111142294 Institution J 19903 141001
#> ---
#> 49663: MCID3112839623 Institution B 20181 160102
#> 49664: MCID3112845220 Institution B 20181 270101
#> 49665: MCID3112845673 Institution B 20174 090101
#> degree
#> <char>
#> 1: Bachelor of Science in Electrical Engineering
#> 2: Bachelor of Science in Electrical Engineering
#> 3: Bachelor of Science in Electrical Engineering
#> ---
#> 49663: Bachelor of Science in Linguistics
#> 49664: Bachelor of Science in Mathematics
#> 49665: Bachelor of Science in Speech Communication and RhetoricHere, the key is a composite of two variables: student ID and the CIP code.
# Check that the indicated column forms a primary key
key_col_names <- c("mcid", "cip6")
wrapr::checkColsFormUniqueKeys(degree, key_col_names)
#> [1] TRUE
# Composite key variables should have no missing values
sum(is.na(degree$mcid))
#> [1] 0
sum(is.na(degree$degree))
#> [1] 0This dataset contains records for graduates only, thus
degree has fewer rows that student. The
term_degree variable indicates when the student graduated
and the and cip6 indicates from which program.
Sample of the degree names.
sample(degree$degree, 10)
#> [1] "Bachelor of Science in Aerospace Engineering"
#> [2] "Bachelor of Science in Business Administration and Management"
#> [3] "Bachelor of Fine Arts in Theatre Arts"
#> [4] "Bachelor of International and Global Studies"
#> [5] "Bachelor of Science in Molecular Biology"
#> [6] "Bachelor of Arts in Political Science and Government"
#> [7] "Bachelor of Liberal Arts and Sciences"
#> [8] "Bachelor of Science in Mechanical Engineering"
#> [9] "Bachelor of Musical Arts"
#> [10] "Bachelor of Arts in Geography"The number of unique degrees in these data.
# Count the unique degrees
length(unique(degree$degree))
#> [1] 153
# Count the number of unique program codes
length(unique(degree$cip6))
#> [1] 153Number of degrees earned per student. In these practice data, 122 students earned two degrees.
# Count number of degrees by student
by_id <- degree[, .(degree_count = .N), by = "mcid"]
# Count the number of students by the degree count
by_id[, .(N_students = .N), by = "degree_count"]
#> degree_count N_students
#> <int> <int>
#> 1: 1 49421
#> 2: 2 122