7 Data basics

logo
Big Data Higgs by KamiPhuc is licensed under CC BY 2.0

7.1 Introduction

This tutorial is an introduction to data preparation using the data.table package for importing, exploring, and structuring data. Our goal is to transform, reshape, and join data frames to obtain the form we need for analysis and graphs.

Prerequisites should be completed before proceeding. After that, the tutorial should take about an hour.

As you work through the tutorial, type a line or chunk of code then File > Save and run the script.
Confirm that your result matches the tutorial result.
The exercises give you chance to practice your new skills to learn by doing (but you knew that already)!

7.2 Start a new script

Create a new script for this tutorial.

See Create a script if you need a refresher on creating, saving, and running an R script.
At the top of the script add a minimal header and install and load the packages indicated.

# Data basics 
# Name 
# Date 

# Packages used in this tutorial
library("midfieldr")
library("midfielddata")
library("data.table")
library("VIM")

# Optional code to control data.table printing
options(
  datatable.print.nrows = 10,
  datatable.print.topn = 5,
  datatable.print.class = TRUE
)

# Load midfielddata data sets to use later
data(student)
data(term) 
data(degree)

If you get an error like this one after running the script,

    Error in library("VIM") : there is no package called 'VIM'

then the package needs to be installed. If you need a refresher on installing packages, see Install CRAN packages. Once the missing package is installed, you can rerun the script.

7.3 Data sets in R

Practice data sets are included with the basic R installation and with some R packages. To list the practice data sets available in R, type in the Console,

# Type in the Console 
data()

which yields

 #> Data sets in package ‘datasets’:
 #>    
 #> AirPassengers  Monthly Airline Passenger Numbers
 #> BJsales        Sales Data with Leading Indicator
 #> BOD            Biochemical Oxygen Demand
 #> CO2            Carbon Dioxide Uptake in Grass Plants
 #> Formaldehyde   Determination of Formaldehyde
 etc.

We use the data() function to list practice datasets included in a package (if any). For example, to determine what packages are bundled with the midfieldr package, type in the Console,

# Type in the Console 
data(package = "midfieldr")

which yields

 #> Data sets in package ‘midfieldr’:
 #>   
 #> cip                   Table of academic programs
 #> fye_start             Starting programs imputed for FYE students
 #> study_grad_rate       Case-study graduation rate results
 #> study_program         Case-study programs
 #> study_stickiness      Case-study stickiness results
 #> study_student         Case-study students
 #> toy_course            Course data for examples
 #> toy_degree            Degree data for examples
 #> toy_student           Student data for examples
 #> toy_term              Term data for examples

Every data set in base R and in R packages has a help page that describes the data format and variable names. The data help page can be accessed using help(), for example,

# Type in the Console 
help(cip, package = "midfieldr")

Alternatively, if the package is loaded, you may run the ? item-name syntax in the Console,

# type in the Console
library("midfieldr")
? cip

yields

starwars help page

The midfieldr “toy” data sets are small subsets by row and column of the MIDFIELD tables that we use later in the vignette examples. You can view their help pages by running, e.g., ? toy_student, or by clicking the links to the same information online.

7.3.1 Exercises

Determine the names of the datasets available in the midfielddata package.

Check your work
Determine the variables in the midfieldr study_stickiness data.

Check your work
Determine the variables in midfielddata degree data.

Check your work

7.4 Data structure

When we encounter a data set, the first step is to characterize its structure including,

class: The class of the R data object, for example, vector, matrix, data frame, time series, list, etc. The R data frame is the structure we use most often, typically in the enhanced data.table form. For data frames, we are also interested in the class of each column.
layout: How tables, rows, and columns are organized and keyed with unique and invariant coordinates to data cells. Most data processing steps have a preferred format, so we have to be able to identify and transform layouts as needed.

The most common data layout we use is the denormalized form (what many R users recognize as the “tidy” form) with variables in columns, observations in rows, and a clear taxonomy of column roles: some columns are keys (i.e., coordinates that uniquely identify the row), one or more columns are data (called “payload” in SQL), and some columns may be derived (functions of other columns).

For example, if we examine the first 5 rows of the term data (ordered by ID and term),

term[order(mcid, term)][1:5]
#>           mcid   institution   term   cip6        level      standing   coop
#>         <char>        <char> <char> <char>       <char>        <char> <char>
#> 1: MID25783135 Institution M  19911 520101  01 Freshman Good Standing     No
#> 2: MID25783135 Institution M  19913 520101 02 Sophomore Good Standing     No
#> 3: MID25783147 Institution M  20041 131202  01 Freshman Good Standing     No
#> 4: MID25783147 Institution M  20043 131210 02 Sophomore Good Standing     No
#> 5: MID25783147 Institution M  20051 131210 02 Sophomore Good Standing     No
#>    hours_term hours_term_attempt hours_cumul hours_cumul_attempt gpa_term
#>         <num>              <num>       <num>               <num>    <num>
#> 1:         19                 19          19                  19     2.47
#> 2:         14                 14          33                  33     3.21
#> 3:         17                 13          17                  13     3.76
#> 4:         18                 18          35                  31     2.67
#> 5:         15                 15          50                  46     3.20
#>    gpa_cumul
#>        <num>
#> 1:      2.47
#> 2:      2.79
#> 3:      3.76
#> 4:      3.13
#> 5:      3.15

Keys are mcid and term.
Column institution is derived (assuming the student is affiliated with one institution only) because it is a function of student ID.
The remaining variables are data.
Column gpa_cumul is class numeric (double-precision); all other columns are class character (strings).

Applying a similar procedure to the aggregated study_stickiness data set loaded with midfieldr,

study_stickiness
#>        program            race    sex  ever  grad stick
#>         <char>          <char> <char> <int> <int> <num>
#>  1:      Civil           Asian Female    17    12  70.6
#>  2:      Civil           Black Female    54    28  51.9
#>  3:      Civil           White Female   329   232  70.5
#>  4:      Civil           Asian   Male    37    24  64.9
#>  5:      Civil           Black   Male    98    43  43.9
#> ---                                                    
#> 34: Mechanical Hispanic/Latinx   Male    76    47  61.8
#> 35: Mechanical   International   Male    37    19  51.4
#> 36: Mechanical Native American   Male    14     8  57.1
#> 37: Mechanical   Other/Unknown   Male    48    28  58.3
#> 38: Mechanical           White   Male  1940  1265  65.2

Keys are program, race, and sex.
Number enrolled (ever) and number graduating (grad) are data.
Program stickiness (stick) is derived (the ratio of grad to ever).
The key columns are class character; ever and grad are class integer; and stick is class numeric (double-precision).

7.4.1 Exercises

The following midfieldr/midfielddata data frames are all in denormalized form. For each, identify their key(s), data, and derived (if any) columns.

cip
study_student
student
course
degree

7.5 What is a data.table?

In our tutorials and in our package functions, we manipulate data using the data.table package, primarily for its speed with large tables but also for its concise syntax and its zero dependencies. Thus, when you use a midfieldr function to operate on a data frame, the returned data frame is also a data.table object.

For R beginners, we try to keep our use of data.table in tutorials to an introductory level, though we will occasionally use a more complex syntax when the benefit is worth the cognitive load.

What is a data.table? The data.table cheat sheet defines data.table this way:

data.table is an extremely fast and memory efficient package for
transforming data in R. The basics of working with data.tables are:
    DT[i,  j,  by]    

Here i can subset or reorder rows, j is used to select, summarize, compute with, or create or delete columns, and by is the grouping operator.

A data.table is also a data.frame—functions that work with a data.frame therefore also work with a data.table.

7.6 Subset rows

Summary: In a data.table DT[i, j, by], using i to keep or drop rows.

Typically we create a Boolean statement to subset rows. A row is retained if the statement yields TRUE for that row and dropped in the statement yields FALSE. Thus the basic idea for row subsetting is to create a vector of TRUE and FALSE values the same length as there are rows in the data frame.

7.6.1 Logical operators

The commonly used Boolean operators are equal to (==), not equal to (!=), and (&), or (|), less than (<), greater than (>), less than or equal to (<=), and greater than or equal to (>=).

For example, if we wanted to subset toy_course for all rows for a particular student, we write (in data.table fashion)

# Subset course data for one student by their ID
DT <- toy_course[mcid == "MID25783939"]

# Examine the result
DT
#>            mcid   institution   term abbrev number  grade
#>          <char>        <char> <char> <char> <char> <char>
#>  1: MID25783939 Institution M  19991    AVS    202      B
#>  2: MID25783939 Institution M  19991    AVS    100      A
#>  3: MID25783939 Institution M  19991   ENGL    101      B
#>  4: MID25783939 Institution M  19991     CH    101      B
#>  5: MID25783939 Institution M  19991   BIOL    103      B
#> ---                                                      
#> 48: MID25783939 Institution M  20025   SPCH    250      A
#> 49: MID25783939 Institution M  20031    SOC    235      B
#> 50: MID25783939 Institution M  20031  AP EC    351      A
#> 51: MID25783939 Institution M  20031    AVS    406      A
#> 52: MID25783939 Institution M  20031   PHYS    208      B

We often construct the Boolean statement on its own line. To illustrate, the following lines produce equivalent results,

# The subsetting line we used above
DT <- toy_course[mcid == "MID25783939"]

# Equivalent form
rows_we_want <- term$mcid == "MID25783939"
DT <- toy_course[rows_we_want]

Note that in the second case we must use the df$col syntax, where df is the name of the data.frame and col is the name of the column in the data frame. This approach can be useful when the logical statement has multiple clauses, e.g., subset rows with CIP code 140102 at Institution M,

# Create a TRUE/FALSE vector with as many entries as term has rows 
rows_we_want <- toy_term$cip6 == "140102" & 
  toy_term$institution == "Institution M"

# Examine the result
str(rows_we_want)
#>  logi [1:169] FALSE FALSE FALSE FALSE FALSE FALSE ...

# Confirm the TRUE/FALSE vector has as many entries as term has rows 
nrow(toy_term)
#> [1] 169

A row is retained if the logical statement yields TRUE for that row and dropped in the statement yields FALSE.

# Retain rows for which rows_we_want is TRUE
DT <- toy_term[rows_we_want]

# examine the result
DT
#>            mcid   institution   term   cip6        level hours_term
#>          <char>        <char> <char> <char>       <char>      <num>
#>  1: MID25828870 Institution M  19881 140102  01 Freshman         17
#>  2: MID25828870 Institution M  19883 140102 02 Sophomore         14
#>  3: MID25828870 Institution M  19891 140102 02 Sophomore         14
#>  4: MID25846316 Institution M  19911 140102  01 Freshman         16
#>  5: MID25846316 Institution M  19913 140102 02 Sophomore         14
#> ---                                                                
#>  8: MID25847220 Institution M  19896 140102 02 Sophomore          6
#>  9: MID25848589 Institution M  19901 140102  01 Freshman         17
#> 10: MID25848589 Institution M  19903 140102 02 Sophomore         18
#> 11: MID25852023 Institution M  20051 140102  01 Freshman         16
#> 12: MID25852023 Institution M  20053 140102 02 Sophomore         17

Logical expressions can have multiple clauses, for example, suppose we want all degree data for Institution A from 2000 through 2002.

rows_we_want <- degree$institution == "Institution A" & 
  degree$term >= "20001" &
  degree$term <= "20026"
degree[rows_we_want]
#>             mcid   institution   term   cip6            degree
#>           <char>        <char> <char> <char>            <char>
#>   1: MID26134374 Institution A  20013 513801 Bachelor's Degree
#>   2: MID26135199 Institution A  20004 510701 Bachelor's Degree
#>   3: MID26135592 Institution A  20004 510701 Bachelor's Degree
#>   4: MID26136876 Institution A  20023 451001 Bachelor's Degree
#>   5: MID26136991 Institution A  20011 520201 Bachelor's Degree
#>  ---                                                          
#> 308: MID26353580 Institution A  20021 520201 Bachelor's Degree
#> 309: MID26353683 Institution A  20023 400508 Bachelor's Degree
#> 310: MID26355018 Institution A  20021 451001 Bachelor's Degree
#> 311: MID26355315 Institution A  20003 540101 Bachelor's Degree
#> 312: MID26355582 Institution A  20013 140901 Bachelor's Degree

7.6.2 Character matching

We regularly use the data.table functions %ilike% and %chin% to create logical expressions for subsetting rows.

%ilike% is a convenience function for calling grep(), the base R function for character pattern matching. The i in %ilike% is the version that ignores case. View its help page by running,

? `%ilike%`

For example, we often subset student data to omit populations with ambiguous race/ethnicity values such as International and Other/Unknown, as follows.

DT <- toy_student

# Examine the unique values of race/ethnicity
sort(unique(DT$race))
#> [1] "Asian"           "Black"           "Hispanic/Latinx" "International"  
#> [5] "Native American" "Other/Unknown"   "White"

To identify the rows with “Other/Unknown” or “International,” we construct a logical statement in which the search terms are in a single quoted character string separated by the logical OR | operator. Because %ilike% is based on grep(), partial matching is OK, so we don’t have to spell out the full strings.

# Logical vector that returns TRUE for the two values
rows_we_omit <- DT$race %ilike% "Other|Internat"

# Examine the logical vector 
rows_we_omit
#>   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [13] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [37]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [97] FALSE  TRUE FALSE FALSE

# Return the rows for Other and International 
DT[rows_we_omit]
#>           mcid   institution              transfer hours_transfer          race
#>         <char>        <char>                <char>          <num>        <char>
#> 1: MID25841465 Institution M First-Time in College             NA Other/Unknown
#> 2: MID25853332 Institution B First-Time in College             NA International
#> 3: MID25887008 Institution B First-Time in College             NA Other/Unknown
#> 4: MID26062203 Institution C   First-Time Transfer              7 Other/Unknown
#> 5: MID26663803 Institution L First-Time in College             NA International
#>       sex
#>    <char>
#> 1:   Male
#> 2:   Male
#> 3:   Male
#> 4: Female
#> 5:   Male

We see that in the toy student data, we have five students recorded as International or Other/Unknown. However, we want to omit these students from the data. To get the rows we want, we simply negate the rows_we_omit logical vector using the ! operator.

# TRUE becomes FALSE; FALSE becomes TRUE
rows_we_want <- !rows_we_omit

# Examine the new logical vector 
rows_we_want
#>   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
#>  [13]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
#>  [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [37] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [97]  TRUE FALSE  TRUE  TRUE

# Equivalently
rows_we_want <- !DT$race %ilike% "Other|Internat"

# Return the rows we want
DT[rows_we_want]
#>            mcid   institution              transfer hours_transfer
#>          <char>        <char>                <char>          <num>
#>  1: MID25783939 Institution M First-Time in College             NA
#>  2: MID25784402 Institution M First-Time in College             NA
#>  3: MID25805538 Institution M First-Time in College             NA
#>  4: MID25808099 Institution M First-Time in College             NA
#>  5: MID25816437 Institution M First-Time in College             NA
#> ---                                                               
#> 91: MID26655230 Institution L First-Time in College             NA
#> 92: MID26656134 Institution L First-Time in College             NA
#> 93: MID26656367 Institution L First-Time in College             NA
#> 94: MID26678321 Institution L First-Time in College             NA
#> 95: MID26692008 Institution L First-Time in College             NA
#>                race    sex
#>              <char> <char>
#>  1:           White Female
#>  2:           White   Male
#>  3:           White Female
#>  4:           White Female
#>  5:           White   Male
#> ---                       
#> 91:           White Female
#> 92: Native American   Male
#> 93: Hispanic/Latinx   Male
#> 94:           White Female
#> 95:           White   Male

Of course, the logical statement for row subsetting can be placed within the brackets, and we replace the DT$race form with the column name race alone, yielding the same result.

# Equivalent statement inside the []
DT[!race %ilike% "Other|Internat"]

Another character matching function we use regularly for subsetting rows is the data.table function %chin%, a fast versions of match(), %in% and order(), optimized for character vectors. %chin% operates on two character vectors, returning a vector of the positions of (first) matches of its first argument in its second. View its help page by running,

? `%chin%`

To illustrate %chin% usage, we do the same subset by row that we did above with %ilike%. The first character vector is the race column in DT. The second vector is the vector of terms we are matching to, “International and”Other/Unknown”, that must be written in full (no partial matching) and are in a character vector (elements separated by commas).

DT <- toy_student

# Subset rows using %chin% 
rows_we_omit <- DT$race %chin% c("International", "Other/Unknown")

# use the T/F vector to subset the data frame by row
DT[!rows_we_omit]
#>            mcid   institution              transfer hours_transfer
#>          <char>        <char>                <char>          <num>
#>  1: MID25783939 Institution M First-Time in College             NA
#>  2: MID25784402 Institution M First-Time in College             NA
#>  3: MID25805538 Institution M First-Time in College             NA
#>  4: MID25808099 Institution M First-Time in College             NA
#>  5: MID25816437 Institution M First-Time in College             NA
#> ---                                                               
#> 91: MID26655230 Institution L First-Time in College             NA
#> 92: MID26656134 Institution L First-Time in College             NA
#> 93: MID26656367 Institution L First-Time in College             NA
#> 94: MID26678321 Institution L First-Time in College             NA
#> 95: MID26692008 Institution L First-Time in College             NA
#>                race    sex
#>              <char> <char>
#>  1:           White Female
#>  2:           White   Male
#>  3:           White Female
#>  4:           White Female
#>  5:           White   Male
#> ---                       
#> 91:           White Female
#> 92: Native American   Male
#> 93: Hispanic/Latinx   Male
#> 94:           White Female
#> 95:           White   Male

Of course, the logical operation can be inside the [] as well.

# Equivalent statement inside the []
DT[!race %chin% c("International", "Other/Unknown")]

In addition, it is sometimes useful to have the second vector named on its own, e.g.,

# Another equivalent statement
values_to_drop <- c("International", "Other/Unknown")
DT[!race %chin% values_to_drop]

7.7 Select columns

Summary: In a data.table DT[i, j, by], using j to keep, drop or reorder columns.

# Print out the names of the columns
names(toy_student)
#> [1] "mcid"           "institution"    "transfer"       "hours_transfer"
#> [5] "race"           "sex"

# Select columns to keep
toy_student[, list(mcid, race, sex)]
#>             mcid            race    sex
#>           <char>          <char> <char>
#>   1: MID25783939           White Female
#>   2: MID25784402           White   Male
#>   3: MID25805538           White Female
#>   4: MID25808099           White Female
#>   5: MID25816437           White   Male
#>  ---                                   
#>  96: MID26656134 Native American   Male
#>  97: MID26656367 Hispanic/Latinx   Male
#>  98: MID26663803   International   Male
#>  99: MID26678321           White Female
#> 100: MID26692008           White   Male

# Equivalent
toy_student[, .(mcid, race, sex)]
#>             mcid            race    sex
#>           <char>          <char> <char>
#>   1: MID25783939           White Female
#>   2: MID25784402           White   Male
#>   3: MID25805538           White Female
#>   4: MID25808099           White Female
#>   5: MID25816437           White   Male
#>  ---                                   
#>  96: MID26656134 Native American   Male
#>  97: MID26656367 Hispanic/Latinx   Male
#>  98: MID26663803   International   Male
#>  99: MID26678321           White Female
#> 100: MID26692008           White   Male

When used for a data.table column operation, the notation .() is a shorthand substitute for list(). A third option that we use regularly is placing the quoted column names in a character vector, e.g.,

cols_we_want <- c("mcid", "race", "sex")

and using the “double-dot” notation shown below (..cols_we_want) to subset the columns,

toy_student[, ..cols_we_want]
#>             mcid            race    sex
#>           <char>          <char> <char>
#>   1: MID25783939           White Female
#>   2: MID25784402           White   Male
#>   3: MID25805538           White Female
#>   4: MID25808099           White Female
#>   5: MID25816437           White   Male
#>  ---                                   
#>  96: MID26656134 Native American   Male
#>  97: MID26656367 Hispanic/Latinx   Male
#>  98: MID26663803   International   Male
#>  99: MID26678321           White Female
#> 100: MID26692008           White   Male

this last approach is particularly useful when we have a set of columns used several times for subsetting, grouping, and joining.

7.8 Operate on columns

delete
create
apply function

7.9 Grouping

7.10 Other row operations

7.10.1 Unique rows

Both the term and course data tables have multiple rows for the same student. When subsetting data like these and we want to eliminate duplicate rows, we use unique().

na.omit()
order
unique()

7.11 Reference semantics

7.12 Reshaping

▲ top of page