7 Data basics
Big Data Higgs by KamiPhuc is licensed under CC BY 2.0
7.1 Introduction
This tutorial is an introduction to data preparation using the data.table package for importing, exploring, and structuring data. Our goal is to transform, reshape, and join data frames to obtain the form we need for analysis and graphs.
Prerequisites should be completed before proceeding. After that, the tutorial should take about an hour.
- As you work through the tutorial, type a line or chunk of code then File > Save and run the script.
- Confirm that your result matches the tutorial result.
- The exercises give you chance to practice your new skills to learn by doing (but you knew that already)!
7.2 Start a new script
Create a new script for this tutorial.
- See Create a script if you need a refresher on creating, saving, and running an R script.
- At the top of the script add a minimal header and install and load the packages indicated.
# Data basics
# Name
# Date
# Packages used in this tutorial
library("midfieldr")
library("midfielddata")
library("data.table")
library("VIM")
# Optional code to control data.table printing
options(
datatable.print.nrows = 10,
datatable.print.topn = 5,
datatable.print.class = TRUE
)
# Load midfielddata data sets to use later
data(student)
data(term)
data(degree)
If you get an error like this one after running the script,
Error in library("VIM") : there is no package called 'VIM'
then the package needs to be installed. If you need a refresher on installing packages, see Install CRAN packages. Once the missing package is installed, you can rerun the script.
7.3 Data sets in R
Practice data sets are included with the basic R installation and with some R packages. To list the practice data sets available in R, type in the Console,
# Type in the Console
data()
which yields
#> Data sets in package ‘datasets’:
#>
#> AirPassengers Monthly Airline Passenger Numbers
#> BJsales Sales Data with Leading Indicator
#> BOD Biochemical Oxygen Demand
#> CO2 Carbon Dioxide Uptake in Grass Plants
#> Formaldehyde Determination of Formaldehyde
etc.
We use the data()
function to list practice datasets included in a package (if any). For example, to determine what packages are bundled with the midfieldr package, type in the Console,
# Type in the Console
data(package = "midfieldr")
which yields
#> Data sets in package ‘midfieldr’:
#>
#> cip Table of academic programs
#> fye_start Starting programs imputed for FYE students
#> study_grad_rate Case-study graduation rate results
#> study_program Case-study programs
#> study_stickiness Case-study stickiness results
#> study_student Case-study students
#> toy_course Course data for examples
#> toy_degree Degree data for examples
#> toy_student Student data for examples
#> toy_term Term data for examples
Every data set in base R and in R packages has a help page that describes the data format and variable names. The data help page can be accessed using help()
, for example,
# Type in the Console
help(cip, package = "midfieldr")
Alternatively, if the package is loaded, you may run the ? item-name
syntax in the Console,
# type in the Console
library("midfieldr")
? cip
yields
The midfieldr “toy” data sets are small subsets by row and column of the MIDFIELD tables that we use later in the vignette examples. You can view their help pages by running, e.g., ? toy_student
, or by clicking the links to the same information online.
7.3.1 Exercises
Determine the names of the datasets available in the midfielddata package.
Determine the variables in the midfieldr
study_stickiness
data.Determine the variables in midfielddata
degree
data.
7.4 Data structure
When we encounter a data set, the first step is to characterize its structure including,
- class
- The class of the R data object, for example, vector, matrix, data frame, time series, list, etc. The R data frame is the structure we use most often, typically in the enhanced
data.table
form. For data frames, we are also interested in the class of each column. - layout
- How tables, rows, and columns are organized and keyed with unique and invariant coordinates to data cells. Most data processing steps have a preferred format, so we have to be able to identify and transform layouts as needed.
The most common data layout we use is the denormalized form (what many R users recognize as the “tidy” form) with variables in columns, observations in rows, and a clear taxonomy of column roles: some columns are keys (i.e., coordinates that uniquely identify the row), one or more columns are data (called “payload” in SQL), and some columns may be derived (functions of other columns).
For example, if we examine the first 5 rows of the term
data (ordered by ID and term),
order(mcid, term)][1:5]
term[#> mcid institution term cip6 level standing coop
#> <char> <char> <char> <char> <char> <char> <char>
#> 1: MID25783135 Institution M 19911 520101 01 Freshman Good Standing No
#> 2: MID25783135 Institution M 19913 520101 02 Sophomore Good Standing No
#> 3: MID25783147 Institution M 20041 131202 01 Freshman Good Standing No
#> 4: MID25783147 Institution M 20043 131210 02 Sophomore Good Standing No
#> 5: MID25783147 Institution M 20051 131210 02 Sophomore Good Standing No
#> hours_term hours_term_attempt hours_cumul hours_cumul_attempt gpa_term
#> <num> <num> <num> <num> <num>
#> 1: 19 19 19 19 2.47
#> 2: 14 14 33 33 3.21
#> 3: 17 13 17 13 3.76
#> 4: 18 18 35 31 2.67
#> 5: 15 15 50 46 3.20
#> gpa_cumul
#> <num>
#> 1: 2.47
#> 2: 2.79
#> 3: 3.76
#> 4: 3.13
#> 5: 3.15
- Keys are
mcid
andterm
.
- Column
institution
is derived (assuming the student is affiliated with one institution only) because it is a function of student ID. - The remaining variables are data.
- Column
gpa_cumul
is class numeric (double-precision); all other columns are class character (strings).
Applying a similar procedure to the aggregated study_stickiness
data set loaded with midfieldr,
study_stickiness#> program race sex ever grad stick
#> <char> <char> <char> <int> <int> <num>
#> 1: Civil Asian Female 17 12 70.6
#> 2: Civil Black Female 54 28 51.9
#> 3: Civil White Female 329 232 70.5
#> 4: Civil Asian Male 37 24 64.9
#> 5: Civil Black Male 98 43 43.9
#> ---
#> 34: Mechanical Hispanic/Latinx Male 76 47 61.8
#> 35: Mechanical International Male 37 19 51.4
#> 36: Mechanical Native American Male 14 8 57.1
#> 37: Mechanical Other/Unknown Male 48 28 58.3
#> 38: Mechanical White Male 1940 1265 65.2
- Keys are
program
,race
, andsex
. - Number enrolled (
ever
) and number graduating (grad
) are data. - Program stickiness (
stick
) is derived (the ratio ofgrad
toever
). - The key columns are class character;
ever
andgrad
are class integer; andstick
is class numeric (double-precision).
7.4.1 Exercises
The following midfieldr/midfielddata data frames are all in denormalized form. For each, identify their key(s), data, and derived (if any) columns.
cip
study_student
student
course
degree
7.5 What is a data.table?
In our tutorials and in our package functions, we manipulate data using the data.table package, primarily for its speed with large tables but also for its concise syntax and its zero dependencies. Thus, when you use a midfieldr function to operate on a data frame, the returned data frame is also a data.table object.
For R beginners, we try to keep our use of data.table in tutorials to an introductory level, though we will occasionally use a more complex syntax when the benefit is worth the cognitive load.
What is a data.table? The data.table cheat sheet defines data.table this way:
data.table is an extremely fast and memory efficient package for
transforming data in R. The basics of working with data.tables are:DT[i, j, by]
Here i
can subset or reorder rows, j
is used to select, summarize, compute with, or create or delete columns, and by
is the grouping operator.
A data.table
is also a data.frame
—functions that work with a data.frame
therefore also work with a data.table
.
7.6 Subset rows
Summary: In a data.table DT[i, j, by]
, using i
to keep or drop rows.
Typically we create a Boolean statement to subset rows. A row is retained if the statement yields TRUE for that row and dropped in the statement yields FALSE. Thus the basic idea for row subsetting is to create a vector of TRUE and FALSE values the same length as there are rows in the data frame.
7.6.1 Logical operators
The commonly used Boolean operators are equal to (==
), not equal to (!=
), and (&
), or (|
), less than (<
), greater than (>
), less than or equal to (<=
), and greater than or equal to (>=
).
For example, if we wanted to subset toy_course
for all rows for a particular student, we write (in data.table fashion)
# Subset course data for one student by their ID
<- toy_course[mcid == "MID25783939"]
DT
# Examine the result
DT#> mcid institution term abbrev number grade
#> <char> <char> <char> <char> <char> <char>
#> 1: MID25783939 Institution M 19991 AVS 202 B
#> 2: MID25783939 Institution M 19991 AVS 100 A
#> 3: MID25783939 Institution M 19991 ENGL 101 B
#> 4: MID25783939 Institution M 19991 CH 101 B
#> 5: MID25783939 Institution M 19991 BIOL 103 B
#> ---
#> 48: MID25783939 Institution M 20025 SPCH 250 A
#> 49: MID25783939 Institution M 20031 SOC 235 B
#> 50: MID25783939 Institution M 20031 AP EC 351 A
#> 51: MID25783939 Institution M 20031 AVS 406 A
#> 52: MID25783939 Institution M 20031 PHYS 208 B
We often construct the Boolean statement on its own line. To illustrate, the following lines produce equivalent results,
# The subsetting line we used above
<- toy_course[mcid == "MID25783939"]
DT
# Equivalent form
<- term$mcid == "MID25783939"
rows_we_want <- toy_course[rows_we_want] DT
Note that in the second case we must use the df$col
syntax, where df
is the name of the data.frame and col
is the name of the column in the data frame. This approach can be useful when the logical statement has multiple clauses, e.g., subset rows with CIP code 140102 at Institution M,
# Create a TRUE/FALSE vector with as many entries as term has rows
<- toy_term$cip6 == "140102" &
rows_we_want $institution == "Institution M"
toy_term
# Examine the result
str(rows_we_want)
#> logi [1:169] FALSE FALSE FALSE FALSE FALSE FALSE ...
# Confirm the TRUE/FALSE vector has as many entries as term has rows
nrow(toy_term)
#> [1] 169
A row is retained if the logical statement yields TRUE for that row and dropped in the statement yields FALSE.
# Retain rows for which rows_we_want is TRUE
<- toy_term[rows_we_want]
DT
# examine the result
DT#> mcid institution term cip6 level hours_term
#> <char> <char> <char> <char> <char> <num>
#> 1: MID25828870 Institution M 19881 140102 01 Freshman 17
#> 2: MID25828870 Institution M 19883 140102 02 Sophomore 14
#> 3: MID25828870 Institution M 19891 140102 02 Sophomore 14
#> 4: MID25846316 Institution M 19911 140102 01 Freshman 16
#> 5: MID25846316 Institution M 19913 140102 02 Sophomore 14
#> ---
#> 8: MID25847220 Institution M 19896 140102 02 Sophomore 6
#> 9: MID25848589 Institution M 19901 140102 01 Freshman 17
#> 10: MID25848589 Institution M 19903 140102 02 Sophomore 18
#> 11: MID25852023 Institution M 20051 140102 01 Freshman 16
#> 12: MID25852023 Institution M 20053 140102 02 Sophomore 17
Logical expressions can have multiple clauses, for example, suppose we want all degree data for Institution A from 2000 through 2002.
<- degree$institution == "Institution A" &
rows_we_want $term >= "20001" &
degree$term <= "20026"
degree
degree[rows_we_want]#> mcid institution term cip6 degree
#> <char> <char> <char> <char> <char>
#> 1: MID26134374 Institution A 20013 513801 Bachelor's Degree
#> 2: MID26135199 Institution A 20004 510701 Bachelor's Degree
#> 3: MID26135592 Institution A 20004 510701 Bachelor's Degree
#> 4: MID26136876 Institution A 20023 451001 Bachelor's Degree
#> 5: MID26136991 Institution A 20011 520201 Bachelor's Degree
#> ---
#> 308: MID26353580 Institution A 20021 520201 Bachelor's Degree
#> 309: MID26353683 Institution A 20023 400508 Bachelor's Degree
#> 310: MID26355018 Institution A 20021 451001 Bachelor's Degree
#> 311: MID26355315 Institution A 20003 540101 Bachelor's Degree
#> 312: MID26355582 Institution A 20013 140901 Bachelor's Degree
7.6.2 Character matching
We regularly use the data.table functions %ilike%
and %chin%
to create logical expressions for subsetting rows.
%ilike%
is a convenience function for calling grep()
, the base R function for character pattern matching. The i
in %ilike%
is the version that ignores case. View its help page by running,
`%ilike%` ?
For example, we often subset student
data to omit populations with ambiguous race/ethnicity values such as International and Other/Unknown, as follows.
<- toy_student
DT
# Examine the unique values of race/ethnicity
sort(unique(DT$race))
#> [1] "Asian" "Black" "Hispanic/Latinx" "International"
#> [5] "Native American" "Other/Unknown" "White"
To identify the rows with “Other/Unknown” or “International,” we construct a logical statement in which the search terms are in a single quoted character string separated by the logical OR |
operator. Because %ilike%
is based on grep()
, partial matching is OK, so we don’t have to spell out the full strings.
# Logical vector that returns TRUE for the two values
<- DT$race %ilike% "Other|Internat"
rows_we_omit
# Examine the logical vector
rows_we_omit#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [13] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
#> [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [37] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [97] FALSE TRUE FALSE FALSE
# Return the rows for Other and International
DT[rows_we_omit]#> mcid institution transfer hours_transfer race
#> <char> <char> <char> <num> <char>
#> 1: MID25841465 Institution M First-Time in College NA Other/Unknown
#> 2: MID25853332 Institution B First-Time in College NA International
#> 3: MID25887008 Institution B First-Time in College NA Other/Unknown
#> 4: MID26062203 Institution C First-Time Transfer 7 Other/Unknown
#> 5: MID26663803 Institution L First-Time in College NA International
#> sex
#> <char>
#> 1: Male
#> 2: Male
#> 3: Male
#> 4: Female
#> 5: Male
We see that in the toy student data, we have five students recorded as International or Other/Unknown. However, we want to omit these students from the data. To get the rows we want, we simply negate the rows_we_omit
logical vector using the !
operator.
# TRUE becomes FALSE; FALSE becomes TRUE
<- !rows_we_omit
rows_we_want
# Examine the new logical vector
rows_we_want#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
#> [13] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
#> [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [37] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [97] TRUE FALSE TRUE TRUE
# Equivalently
<- !DT$race %ilike% "Other|Internat"
rows_we_want
# Return the rows we want
DT[rows_we_want]#> mcid institution transfer hours_transfer
#> <char> <char> <char> <num>
#> 1: MID25783939 Institution M First-Time in College NA
#> 2: MID25784402 Institution M First-Time in College NA
#> 3: MID25805538 Institution M First-Time in College NA
#> 4: MID25808099 Institution M First-Time in College NA
#> 5: MID25816437 Institution M First-Time in College NA
#> ---
#> 91: MID26655230 Institution L First-Time in College NA
#> 92: MID26656134 Institution L First-Time in College NA
#> 93: MID26656367 Institution L First-Time in College NA
#> 94: MID26678321 Institution L First-Time in College NA
#> 95: MID26692008 Institution L First-Time in College NA
#> race sex
#> <char> <char>
#> 1: White Female
#> 2: White Male
#> 3: White Female
#> 4: White Female
#> 5: White Male
#> ---
#> 91: White Female
#> 92: Native American Male
#> 93: Hispanic/Latinx Male
#> 94: White Female
#> 95: White Male
Of course, the logical statement for row subsetting can be placed within the brackets, and we replace the DT$race
form with the column name race
alone, yielding the same result.
# Equivalent statement inside the []
!race %ilike% "Other|Internat"] DT[
Another character matching function we use regularly for subsetting rows is the data.table function %chin%
, a fast versions of match()
, %in%
and order()
, optimized for character vectors. %chin%
operates on two character vectors, returning a vector of the positions of (first) matches of its first argument in its second. View its help page by running,
`%chin%` ?
To illustrate %chin%
usage, we do the same subset by row that we did above with %ilike%
. The first character vector is the race
column in DT
. The second vector is the vector of terms we are matching to, “International and”Other/Unknown”, that must be written in full (no partial matching) and are in a character vector (elements separated by commas).
<- toy_student
DT
# Subset rows using %chin%
<- DT$race %chin% c("International", "Other/Unknown")
rows_we_omit
# use the T/F vector to subset the data frame by row
!rows_we_omit]
DT[#> mcid institution transfer hours_transfer
#> <char> <char> <char> <num>
#> 1: MID25783939 Institution M First-Time in College NA
#> 2: MID25784402 Institution M First-Time in College NA
#> 3: MID25805538 Institution M First-Time in College NA
#> 4: MID25808099 Institution M First-Time in College NA
#> 5: MID25816437 Institution M First-Time in College NA
#> ---
#> 91: MID26655230 Institution L First-Time in College NA
#> 92: MID26656134 Institution L First-Time in College NA
#> 93: MID26656367 Institution L First-Time in College NA
#> 94: MID26678321 Institution L First-Time in College NA
#> 95: MID26692008 Institution L First-Time in College NA
#> race sex
#> <char> <char>
#> 1: White Female
#> 2: White Male
#> 3: White Female
#> 4: White Female
#> 5: White Male
#> ---
#> 91: White Female
#> 92: Native American Male
#> 93: Hispanic/Latinx Male
#> 94: White Female
#> 95: White Male
Of course, the logical operation can be inside the []
as well.
# Equivalent statement inside the []
!race %chin% c("International", "Other/Unknown")] DT[
In addition, it is sometimes useful to have the second vector named on its own, e.g.,
# Another equivalent statement
<- c("International", "Other/Unknown")
values_to_drop !race %chin% values_to_drop] DT[
7.7 Select columns
Summary: In a data.table DT[i, j, by]
, using j
to keep, drop or reorder columns.
# Print out the names of the columns
names(toy_student)
#> [1] "mcid" "institution" "transfer" "hours_transfer"
#> [5] "race" "sex"
# Select columns to keep
list(mcid, race, sex)]
toy_student[, #> mcid race sex
#> <char> <char> <char>
#> 1: MID25783939 White Female
#> 2: MID25784402 White Male
#> 3: MID25805538 White Female
#> 4: MID25808099 White Female
#> 5: MID25816437 White Male
#> ---
#> 96: MID26656134 Native American Male
#> 97: MID26656367 Hispanic/Latinx Male
#> 98: MID26663803 International Male
#> 99: MID26678321 White Female
#> 100: MID26692008 White Male
# Equivalent
toy_student[, .(mcid, race, sex)]#> mcid race sex
#> <char> <char> <char>
#> 1: MID25783939 White Female
#> 2: MID25784402 White Male
#> 3: MID25805538 White Female
#> 4: MID25808099 White Female
#> 5: MID25816437 White Male
#> ---
#> 96: MID26656134 Native American Male
#> 97: MID26656367 Hispanic/Latinx Male
#> 98: MID26663803 International Male
#> 99: MID26678321 White Female
#> 100: MID26692008 White Male
When used for a data.table column operation, the notation .()
is a shorthand substitute for list()
. A third option that we use regularly is placing the quoted column names in a character vector, e.g.,
<- c("mcid", "race", "sex") cols_we_want
and using the “double-dot” notation shown below (..cols_we_want
) to subset the columns,
toy_student[, ..cols_we_want]#> mcid race sex
#> <char> <char> <char>
#> 1: MID25783939 White Female
#> 2: MID25784402 White Male
#> 3: MID25805538 White Female
#> 4: MID25808099 White Female
#> 5: MID25816437 White Male
#> ---
#> 96: MID26656134 Native American Male
#> 97: MID26656367 Hispanic/Latinx Male
#> 98: MID26663803 International Male
#> 99: MID26678321 White Female
#> 100: MID26692008 White Male
this last approach is particularly useful when we have a set of columns used several times for subsetting, grouping, and joining.
7.8 Operate on columns
- delete
- create
- apply function
7.9 Grouping
- by
- .SD
7.10 Other row operations
7.10.1 Unique rows
Both the term
and course
data tables have multiple rows for the same student. When subsetting data like these and we want to eliminate duplicate rows, we use unique()
.
na.omit()
- order
- unique()