This post covers the content and exercises for Ch 5: Data Transformation from R for Data Science. The chapter teaches how to transform data with dplyr.

5.1 Introduction

5.1.1 Prerequisites

Use data from the nycflights13 package

library(nycflights13)
library(tidyverse)
library(lubridate)

5.1.2 nycflights13

Will be performing data manipulation on nycflights13::flights

flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

5.1.3 dplyr Basics

Will use the 5 common functions of dplyr

  • filter() for picking observations based on their values
  • arrange() for reordering rows
  • select() for picking variables by their name
  • mutate for creating new variables based off of existing variables
  • summarise() for collapsing many values to a single summary

Can all be used with group_by() which changes scope of each function from entire dataset to going group by group

5.2 Filter Rows with filter()

filter() allows you to subset observations based on their values.

5.2.4 Exercises

  1. Find all flights that

Had an arrival delay of two or more hours

filter(flights, arr_delay >= 120)
## # A tibble: 10,200 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      811            630       101     1047
##  2  2013     1     1      848           1835       853     1001
##  3  2013     1     1      957            733       144     1056
##  4  2013     1     1     1114            900       134     1447
##  5  2013     1     1     1505           1310       115     1638
##  6  2013     1     1     1525           1340       105     1831
##  7  2013     1     1     1549           1445        64     1912
##  8  2013     1     1     1558           1359       119     1718
##  9  2013     1     1     1732           1630        62     2028
## 10  2013     1     1     1803           1620       103     2008
## # ... with 10,190 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Flew to Houston (IAH or HOU)

filter(flights, dest %in% c("IAH", "HOU"))
## # A tibble: 9,313 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      623            627        -4      933
##  4  2013     1     1      728            732        -4     1041
##  5  2013     1     1      739            739         0     1104
##  6  2013     1     1      908            908         0     1228
##  7  2013     1     1     1028           1026         2     1350
##  8  2013     1     1     1044           1045        -1     1352
##  9  2013     1     1     1114            900       134     1447
## 10  2013     1     1     1205           1200         5     1503
## # ... with 9,303 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Were operated by United, American, or Delta

airlines
## # A tibble: 16 x 2
##    carrier                        name
##      <chr>                       <chr>
##  1      9E           Endeavor Air Inc.
##  2      AA      American Airlines Inc.
##  3      AS        Alaska Airlines Inc.
##  4      B6             JetBlue Airways
##  5      DL        Delta Air Lines Inc.
##  6      EV    ExpressJet Airlines Inc.
##  7      F9      Frontier Airlines Inc.
##  8      FL AirTran Airways Corporation
##  9      HA      Hawaiian Airlines Inc.
## 10      MQ                   Envoy Air
## 11      OO       SkyWest Airlines Inc.
## 12      UA       United Air Lines Inc.
## 13      US             US Airways Inc.
## 14      VX              Virgin America
## 15      WN      Southwest Airlines Co.
## 16      YV          Mesa Airlines Inc.
filter(flights, carrier %in% c("AA", "DL", "UA"))
## # A tibble: 139,504 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      554            600        -6      812
##  5  2013     1     1      554            558        -4      740
##  6  2013     1     1      558            600        -2      753
##  7  2013     1     1      558            600        -2      924
##  8  2013     1     1      558            600        -2      923
##  9  2013     1     1      559            600        -1      941
## 10  2013     1     1      559            600        -1      854
## # ... with 139,494 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Departed in summer (July, August, and September)

filter(flights, month %in% c(7, 8, 9))
## # A tibble: 86,326 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     7     1        1           2029       212      236
##  2  2013     7     1        2           2359         3      344
##  3  2013     7     1       29           2245       104      151
##  4  2013     7     1       43           2130       193      322
##  5  2013     7     1       44           2150       174      300
##  6  2013     7     1       46           2051       235      304
##  7  2013     7     1       48           2001       287      308
##  8  2013     7     1       58           2155       183      335
##  9  2013     7     1      100           2146       194      327
## 10  2013     7     1      100           2245       135      337
## # ... with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Arrived more than two hours late, but didn’t leave late

filter(flights, arr_delay >120 & dep_delay <= 0)
## # A tibble: 29 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1    27     1419           1420        -1     1754
##  2  2013    10     7     1350           1350         0     1736
##  3  2013    10     7     1357           1359        -2     1858
##  4  2013    10    16      657            700        -3     1258
##  5  2013    11     1      658            700        -2     1329
##  6  2013     3    18     1844           1847        -3       39
##  7  2013     4    17     1635           1640        -5     2049
##  8  2013     4    18      558            600        -2     1149
##  9  2013     4    18      655            700        -5     1213
## 10  2013     5    22     1827           1830        -3     2217
## # ... with 19 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Were delayed by at least an hour, but made up over 30 minutes in flight

filter(flights, dep_delay >= 60 & arr_delay < 30)
## # A tibble: 206 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     3     1850           1745        65     2148
##  2  2013     1     3     1950           1845        65     2228
##  3  2013     1     3     2015           1915        60     2135
##  4  2013     1     6     1019            900        79     1558
##  5  2013     1     7     1543           1430        73     1758
##  6  2013     1    11     1020            920        60     1311
##  7  2013     1    12     1706           1600        66     1949
##  8  2013     1    12     1953           1845        68     2154
##  9  2013     1    19     1456           1355        61     1636
## 10  2013     1    21     1531           1430        61     1843
## # ... with 196 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Departed between midnight and 6am (inclusive)

filter(flights, dep_time == 2400 | dep_time <= 0600)
## # A tibble: 9,373 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 9,363 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
  1. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

between() is a shortcut for x >= left & x <= right

flights[between(flights$dep_time, 0, 600), ]
## # A tibble: 17,599 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 17,589 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
  1. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
filter(flights, is.na(dep_time))
## # A tibble: 8,255 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1       NA           1630        NA       NA
##  2  2013     1     1       NA           1935        NA       NA
##  3  2013     1     1       NA           1500        NA       NA
##  4  2013     1     1       NA            600        NA       NA
##  5  2013     1     2       NA           1540        NA       NA
##  6  2013     1     2       NA           1620        NA       NA
##  7  2013     1     2       NA           1355        NA       NA
##  8  2013     1     2       NA           1420        NA       NA
##  9  2013     1     2       NA           1321        NA       NA
## 10  2013     1     2       NA           1545        NA       NA
## # ... with 8,245 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Also missing dep_delay, arr_time, arr_delay, air_time, they could represent canceled flights

  1. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
NA ^ 0 # any value ^0 is 1
## [1] 1
NA | TRUE # Logical is true since  one side is true
## [1] TRUE
FALSE & NA # Logical is false since one side is false
## [1] FALSE
NA * 0 # ???
## [1] NA

Generally, operations including NA will be NA unless it would return a specific value no matter what the NA was

5.3 Arrange Rows with arrange()

Arranges a data frame by columns specified, additional columns are used for tie breakers. desc() for descending, and NAs are always pushed to the end

5.3.1 Exercises

  1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
arrange(flights, desc(is.na(dep_time)))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1       NA           1630        NA       NA
##  2  2013     1     1       NA           1935        NA       NA
##  3  2013     1     1       NA           1500        NA       NA
##  4  2013     1     1       NA            600        NA       NA
##  5  2013     1     2       NA           1540        NA       NA
##  6  2013     1     2       NA           1620        NA       NA
##  7  2013     1     2       NA           1355        NA       NA
##  8  2013     1     2       NA           1420        NA       NA
##  9  2013     1     2       NA           1321        NA       NA
## 10  2013     1     2       NA           1545        NA       NA
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
  1. Sort flights to find the most delayed flights. Find the flights that left earliest.
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     6    27      959           1900       899     1236
##  9  2013     7    22     2257            759       898      121
## 10  2013    12     5      756           1700       896     1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
arrange(flights, dep_time)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1    13        1           2249        72      108
##  2  2013     1    31        1           2100       181      124
##  3  2013    11    13        1           2359         2      442
##  4  2013    12    16        1           2359         2      447
##  5  2013    12    20        1           2359         2      430
##  6  2013    12    26        1           2359         2      437
##  7  2013    12    30        1           2359         2      441
##  8  2013     2    11        1           2100       181      111
##  9  2013     2    24        1           2245        76      121
## 10  2013     3     8        1           2355         6      431
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
  1. Sort flights to find the fastest flights.
arrange(flights, air_time)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1    16     1355           1315        40     1442
##  2  2013     4    13      537            527        10      622
##  3  2013    12     6      922            851        31     1021
##  4  2013     2     3     2153           2129        24     2247
##  5  2013     2     5     1303           1315       -12     1342
##  6  2013     2    12     2123           2130        -7     2211
##  7  2013     3     2     1450           1500       -10     1547
##  8  2013     3     8     2026           1935        51     2131
##  9  2013     3    18     1456           1329        87     1533
## 10  2013     3    19     2226           2145        41     2305
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
  1. Which flights travelled the longest? Which travelled the shortest?
arrange(flights, desc(distance))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      857            900        -3     1516
##  2  2013     1     2      909            900         9     1525
##  3  2013     1     3      914            900        14     1504
##  4  2013     1     4      900            900         0     1516
##  5  2013     1     5      858            900        -2     1519
##  6  2013     1     6     1019            900        79     1558
##  7  2013     1     7     1042            900       102     1620
##  8  2013     1     8      901            900         1     1504
##  9  2013     1     9      641            900      1301     1242
## 10  2013     1    10      859            900        -1     1449
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
arrange(flights, distance)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     7    27       NA            106        NA       NA
##  2  2013     1     3     2127           2129        -2     2222
##  3  2013     1     4     1240           1200        40     1333
##  4  2013     1     4     1829           1615       134     1937
##  5  2013     1     4     2128           2129        -1     2218
##  6  2013     1     5     1155           1200        -5     1241
##  7  2013     1     6     2125           2129        -4     2224
##  8  2013     1     7     2124           2129        -5     2212
##  9  2013     1     8     2127           2130        -3     2304
## 10  2013     1     9     2126           2129        -3     2217
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

5.4 Select Columns with select()

Helper functions:
* starts_with("abc"): matches names that begin with “abc”.

  • ends_with("xyz"): matches names that end with “xyz”.

  • contains("ijk"): matches names that contain “ijk”.

  • matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.

  • num_range("x", 1:3): matches x1, x2 and x3.

5.4.1 Exercises

  1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 x 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ... with 336,766 more rows
  1. What happens if you include the name of a variable multiple times in a select() call?
select(flights, dep_time, dep_time)
## # A tibble: 336,776 x 1
##    dep_time
##       <int>
##  1      517
##  2      533
##  3      542
##  4      544
##  5      554
##  6      554
##  7      555
##  8      557
##  9      557
## 10      558
## # ... with 336,766 more rows
  1. What does the one_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # ... with 336,766 more rows
  1. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
  • By default the helpers ignore case, this can be changed by setting ignore.case = FALSE

select(flights, contains("TIME"))

select(flights, contains("TIME"))
## # A tibble: 336,776 x 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time
##       <int>          <int>    <int>          <int>    <dbl>
##  1      517            515      830            819      227
##  2      533            529      850            830      227
##  3      542            540      923            850      160
##  4      544            545     1004           1022      183
##  5      554            600      812            837      116
##  6      554            558      740            728      150
##  7      555            600      913            854      158
##  8      557            600      709            723       53
##  9      557            600      838            846      140
## 10      558            600      753            745      138
## # ... with 336,766 more rows, and 1 more variables: time_hour <dttm>

5.5 Add new variables with mutate()

It’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

5.5.2 Exercises

  1. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
flights_midnight <- flights %>% 
  mutate(dep_time_hour = dep_time %/% 100, 
       dep_time_min = dep_time %% 100) %>% 
  mutate(sched_dep_time_hour = sched_dep_time %/% 100, 
         sched_dep_time_min = sched_dep_time %% 100) %>% 
  mutate(dep_time_min_mid = dep_time_hour * 60 + dep_time_min,
         sched_dep_time_min_mid = sched_dep_time_hour * 60 + sched_dep_time_min) %>% 
  select(-c(dep_time_hour:sched_dep_time_min))
summary(flights_midnight$sched_dep_time_min_mid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      66     546     839     817    1049    1439
summary(flights_midnight$dep_time_min_mid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0   547.0   841.0   822.2  1064.0  1440.0    8255
  1. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
flights_midnight %>% 
  mutate(air_time_calc = arr_time - dep_time) %>% 
  select(air_time, air_time_calc)
## # A tibble: 336,776 x 2
##    air_time air_time_calc
##       <dbl>         <int>
##  1      227           313
##  2      227           317
##  3      160           381
##  4      183           460
##  5      116           258
##  6      150           186
##  7      158           358
##  8       53           152
##  9      140           281
## 10      138           195
## # ... with 336,766 more rows

These should be the same, the issue is that arr_time and dep_time are not currently continuous values so they cannot be subtracted. Fix by using time from midnight for both arr and dep

flights_midnight <- flights_midnight %>% 
  mutate(arr_time_hour = arr_time %/% 100, 
       arr_time_min = arr_time %% 100) %>% 
  mutate(sched_arr_time_hour = sched_arr_time %/% 100, 
         sched_arr_time_min = sched_arr_time %% 100) %>% 
  mutate(arr_time_min_mid = arr_time_hour * 60 + arr_time_min,
         sched_arr_time_min_mid = sched_arr_time_hour * 60 + sched_arr_time_min) %>% 
  select(-c(arr_time_hour:sched_arr_time_min))
summary(flights_midnight$arr_time_min_mid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1     664     935     913    1180    1440    8713
summary(flights_midnight$sched_arr_time_min_mid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   684.0   956.0   933.4  1185.0  1439.0
flights_midnight %>% 
  mutate(air_time_calc = arr_time_min_mid - dep_time_min_mid) %>% 
  select(air_time, air_time_calc, arr_time_min_mid, dep_time_min_mid) %>% 
  filter(air_time_calc > 0, !is.na(air_time_calc), !is.na(air_time)) %>% 
  sample_n(10000) %>% 
  ggplot() + geom_point(aes(x = air_time, y = air_time_calc), alpha = 0.1)

  • Can’t explain the offsets
  1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
  • dep_time should be sched_dep_time + dep_delay after converting to continuous
flights_midnight %>% 
  mutate(dep_time_calc = sched_dep_time_min_mid + dep_delay) %>% 
  select(dep_time_min_mid, dep_time_calc) %>% 
  filter(dep_time_calc < 1440) %>% 
  sample_n(5000) %>% 
  ggplot() + geom_point(aes(x = dep_time_min_mid, y = dep_time_calc))

  1. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
  • I’m using min_rank so ties are returned with the rank equal to the lowest rank value (as in sports)
flights_midnight %>% 
  select(dep_delay, tailnum) %>% 
  mutate(rank = min_rank(desc(dep_delay))) %>% 
  arrange(min_rank(desc(dep_delay))) %>% 
  slice(1:10)
## # A tibble: 10 x 3
##    dep_delay tailnum  rank
##        <dbl>   <chr> <int>
##  1      1301  N384HA     1
##  2      1137  N504MQ     2
##  3      1126  N517MQ     3
##  4      1014  N338AA     4
##  5      1005  N665MQ     5
##  6       960  N959DL     6
##  7       911  N927DA     7
##  8       899  N3762Y     8
##  9       898  N6716C     9
## 10       896  N5DMAA    10
  1. What does 1:3 + 1:10 return? Why?
1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
## object length
##  [1]  2  4  6  5  7  9  8 10 12 11
  • The first 3 values are added however the remaining values are unchanged and a warning is given since the smaller vector is not a multiple of the longer vector
  1. What trigonometric functions does R provide?
#?Trig

5.6 Grouped Summaries with summarise()

summarise() collapses a data frame into a single row

5.6.7 Exercises

  1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

    • A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
flights %>%
  group_by(flight) %>%
  summarize(early_15_min = sum(arr_delay <= -15, na.rm = TRUE) / n(),
            late_15_min = sum(arr_delay >= 15, na.rm = TRUE) / n(),
            n = n()) %>%
  filter(early_15_min == 0.5,
         late_15_min == 0.5)
## # A tibble: 18 x 4
##    flight early_15_min late_15_min     n
##     <int>        <dbl>       <dbl> <int>
##  1    107          0.5         0.5     2
##  2   2072          0.5         0.5     2
##  3   2366          0.5         0.5     2
##  4   2500          0.5         0.5     2
##  5   2552          0.5         0.5     2
##  6   3495          0.5         0.5     2
##  7   3518          0.5         0.5     2
##  8   3544          0.5         0.5     2
##  9   3651          0.5         0.5     2
## 10   3705          0.5         0.5     2
## 11   3916          0.5         0.5     2
## 12   3951          0.5         0.5     2
## 13   4273          0.5         0.5     2
## 14   4313          0.5         0.5     2
## 15   5297          0.5         0.5     2
## 16   5322          0.5         0.5     2
## 17   5388          0.5         0.5     2
## 18   5505          0.5         0.5     4
* A flight is always 10 minutes late.
flights %>% 
  group_by(flight) %>% 
  summarise(prop.same.late = n_distinct(arr_delay, na.rm = TRUE) / n(), 
            mean.arr.delay = mean(arr_delay, na.rm = TRUE),
            n = n()) %>%
  filter(prop.same.late == 1 & mean.arr.delay == 10)
## # A tibble: 4 x 4
##   flight prop.same.late mean.arr.delay     n
##    <int>          <dbl>          <dbl> <int>
## 1   2254              1             10     1
## 2   3656              1             10     1
## 3   3880              1             10     1
## 4   5854              1             10     1
* A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
flights %>% 
  group_by(flight) %>% 
  summarise(early.30.prop = sum(arr_delay <= -30, na.rm = TRUE) / n(),
            late.30.prop = sum(arr_delay >= 30, na.rm = TRUE) / n(),
            n = n()) %>% 
  filter(early.30.prop == .5 & late.30.prop == .5)
## # A tibble: 3 x 4
##   flight early.30.prop late.30.prop     n
##    <int>         <dbl>        <dbl> <int>
## 1   3651           0.5          0.5     2
## 2   3916           0.5          0.5     2
## 3   3951           0.5          0.5     2
* 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
flights %>% 
  group_by(flight) %>% 
  summarise(early.prop = sum(arr_delay <= 0, na.rm = TRUE) / n(),
            late.prop = sum(arr_delay >= 120, na.rm = TRUE) / n(),
            n = n()) %>% 
  filter(early.prop == .99 & late.prop == .01 )
## # A tibble: 0 x 4
## # ... with 4 variables: flight <int>, early.prop <dbl>, late.prop <dbl>,
## #   n <int>

Which is more important: arrival delay or departure delay?
* Arrival delay is more important since it affects making connections and schedules. Departure delay only affect wait time in the ariport

  1. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
not_cancelled <- flights %>% 
        filter(!is.na(dep_time))

not_cancelled %>% count(dest)
## # A tibble: 104 x 2
##     dest     n
##    <chr> <int>
##  1   ABQ   254
##  2   ACK   265
##  3   ALB   419
##  4   ANC     8
##  5   ATL 16898
##  6   AUS  2418
##  7   AVL   263
##  8   BDL   412
##  9   BGR   360
## 10   BHM   272
## # ... with 94 more rows
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(n = n())
## # A tibble: 104 x 2
##     dest     n
##    <chr> <int>
##  1   ABQ   254
##  2   ACK   265
##  3   ALB   419
##  4   ANC     8
##  5   ATL 16898
##  6   AUS  2418
##  7   AVL   263
##  8   BDL   412
##  9   BGR   360
## 10   BHM   272
## # ... with 94 more rows
not_cancelled %>% count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
##    tailnum      n
##      <chr>  <dbl>
##  1  D942DN   3418
##  2  N0EGMQ 240626
##  3  N10156 110389
##  4  N102UW  25722
##  5  N103US  24619
##  6  N104UW  25157
##  7  N10575 141475
##  8  N105UW  23618
##  9  N107US  21677
## 10  N108UW  32070
## # ... with 4,027 more rows
not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(n = sum(distance))
## # A tibble: 4,037 x 2
##    tailnum      n
##      <chr>  <dbl>
##  1  D942DN   3418
##  2  N0EGMQ 240626
##  3  N10156 110389
##  4  N102UW  25722
##  5  N103US  24619
##  6  N104UW  25157
##  7  N10575 141475
##  8  N105UW  23618
##  9  N107US  21677
## 10  N108UW  32070
## # ... with 4,027 more rows
  1. Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?
flights %>% 
  filter(is.na(dep_delay) & !is.na(arr_delay))
## # A tibble: 0 x 19
## # ... with 19 variables: year <int>, month <int>, day <int>,
## #   dep_time <int>, sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

There are no flights which had a NA departure and an arrival time, so we can simplify to only filter for non-NA departures

  1. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
flights %>% 
  mutate(date = ymd(paste(year, month, day))) %>% 
  group_by(year, month, day) %>% 
  summarise(canceled = sum(is.na(dep_time)),
            date = first(date),
            avg.delay = mean(arr_delay, na.rm = TRUE)) %>% 
  ggplot() +
  geom_line(aes(x = date, y = canceled)) +
  geom_line(aes(x = date, y = avg.delay), color = "red")

flights %>% 
  mutate(date = ymd(paste(year, month, day))) %>% 
  group_by(year, month, day) %>% 
  summarise(canceled = sum(is.na(dep_time)),
            date = first(date),
            avg.delay = mean(arr_delay, na.rm = TRUE)) %>% 
  ggplot(aes(x = log(canceled), y = avg.delay)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

  • Yes, there appears to be a relationship between the number of canceled flights and the average delay
  1. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
flights %>% 
  group_by(carrier) %>% 
  summarise(avg.delay = mean(arr_delay, na.rm = TRUE)) %>% 
  arrange(desc(avg.delay))
## # A tibble: 16 x 2
##    carrier  avg.delay
##      <chr>      <dbl>
##  1      F9 21.9207048
##  2      FL 20.1159055
##  3      EV 15.7964311
##  4      YV 15.5569853
##  5      OO 11.9310345
##  6      MQ 10.7747334
##  7      WN  9.6491199
##  8      B6  9.4579733
##  9      9E  7.3796692
## 10      UA  3.5580111
## 11      US  2.1295951
## 12      VX  1.7644644
## 13      DL  1.6443409
## 14      AA  0.3642909
## 15      HA -6.9152047
## 16      AS -9.9308886
flights %>% 
  group_by(dest) %>% 
  summarise(avg.delay = mean(arr_delay, na.rm = TRUE),
            n = n()) %>% 
  arrange(desc(avg.delay))
## # A tibble: 105 x 3
##     dest avg.delay     n
##    <chr>     <dbl> <int>
##  1   CAE  41.76415   116
##  2   TUL  33.65986   315
##  3   OKC  30.61905   346
##  4   JAC  28.09524    25
##  5   TYS  24.06920   631
##  6   MSN  20.19604   572
##  7   RIC  20.11125  2454
##  8   CAK  19.69834   864
##  9   DSM  19.00574   569
## 10   GRR  18.18956   765
## # ... with 95 more rows
  • Average delays vary depending on the destination airport. I will attempt to separate airport effects by considering only destinations where each airline made a significant number of stops, and then randomly sample flights to hold the destination constant.
flights %>% 
        group_by(dest) %>% 
        summarise(num = length(unique(carrier))) %>% 
        arrange(desc(num))
## # A tibble: 105 x 2
##     dest   num
##    <chr> <int>
##  1   ATL     7
##  2   BOS     7
##  3   CLT     7
##  4   ORD     7
##  5   TPA     7
##  6   AUS     6
##  7   DCA     6
##  8   DTW     6
##  9   IAD     6
## 10   MSP     6
## # ... with 95 more rows
  • There are no airports that had flights from every carrier. Instead I will use only the destinations that had a low average delay so I can assume differences between destinations will be negligable.
low_delay_dests <- flights %>% 
  group_by(dest) %>% 
  summarise(avg.delay = mean(arr_delay, na.rm = TRUE),
            n = n()) %>% 
  filter(avg.delay > -2 & avg.delay < 12)
  • Confirm all carriers are still in the data set
flights %>% 
        filter(dest %in% low_delay_dests$dest) %>% 
        count(carrier)
## # A tibble: 16 x 2
##    carrier     n
##      <chr> <int>
##  1      9E 14005
##  2      AA 32426
##  3      AS   714
##  4      B6 53676
##  5      DL 47991
##  6      EV 32243
##  7      F9   685
##  8      FL  2337
##  9      HA   342
## 10      MQ 25861
## 11      OO    31
## 12      UA 57614
## 13      US 20535
## 14      VX  5143
## 15      WN  6852
## 16      YV   290
  • This leaves 71 destinations with low average delays. Now I can recheck the carriers average delays.
# Results with all destinations
flights %>% 
  group_by(carrier) %>% 
  summarise(avg.delay = mean(arr_delay, na.rm = TRUE),
            n = n()) %>% 
  arrange(desc(avg.delay))
## # A tibble: 16 x 3
##    carrier  avg.delay     n
##      <chr>      <dbl> <int>
##  1      F9 21.9207048   685
##  2      FL 20.1159055  3260
##  3      EV 15.7964311 54173
##  4      YV 15.5569853   601
##  5      OO 11.9310345    32
##  6      MQ 10.7747334 26397
##  7      WN  9.6491199 12275
##  8      B6  9.4579733 54635
##  9      9E  7.3796692 18460
## 10      UA  3.5580111 58665
## 11      US  2.1295951 20536
## 12      VX  1.7644644  5162
## 13      DL  1.6443409 48110
## 14      AA  0.3642909 32729
## 15      HA -6.9152047   342
## 16      AS -9.9308886   714
# Results filtered for low delay destinations
flights %>% 
        filter(dest %in% low_delay_dests$dest) %>% 
  group_by(carrier) %>% 
  summarise(avg.delay = mean(arr_delay, na.rm = TRUE),
            n = n()) %>% 
  arrange(desc(avg.delay))
## # A tibble: 16 x 3
##    carrier avg.delay     n
##      <chr>     <dbl> <int>
##  1      F9 21.920705   685
##  2      FL 20.744513  2337
##  3      EV 13.927878 32243
##  4      OO 12.250000    31
##  5      YV 12.045113   290
##  6      MQ 10.558850 25861
##  7      B6  9.401846 53676
##  8      WN  8.494365  6852
##  9      9E  8.007364 14005
## 10      UA  3.725334 57614
## 11      US  2.129595 20535
## 12      VX  1.815614  5143
## 13      DL  1.676433 47991
## 14      AA  0.410510 32426
## 15      HA -6.915205   342
## 16      AS -9.930889   714
  • Filtering out the destinations with high average delays doesn’t have a large effect on the rankings. F9 (Frontier) is still the worst airline with an average delay of 21.9 minutes
  1. What does the sort argument to count() do. When might you use it?
flights %>% 
        count(dest, sort = TRUE)
## # A tibble: 105 x 2
##     dest     n
##    <chr> <int>
##  1   ORD 17283
##  2   ATL 17215
##  3   LAX 16174
##  4   BOS 15508
##  5   MCO 14082
##  6   CLT 14064
##  7   SFO 13331
##  8   FLL 12055
##  9   MIA 11728
## 10   DCA  9705
## # ... with 95 more rows
  • The results are automatically sorted into descending order

5.7 Grouped Mutates (and Filters)

Grouping is most useful in conjunction with summarise(), but you can also do convenient operations with mutate() and filter()

5.7.1 Exercises

  1. Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

  2. Which plane (tailnum) has the worst on-time record?

flights %>% 
        group_by(tailnum) %>% 
        summarise(avg.delay = mean(arr_delay, na.rm = TRUE), n = n()) %>% 
        arrange(desc(avg.delay))
## # A tibble: 4,044 x 3
##    tailnum avg.delay     n
##      <chr>     <dbl> <int>
##  1  N844MH  320.0000     1
##  2  N911DA  294.0000     1
##  3  N922EV  276.0000     1
##  4  N587NW  264.0000     1
##  5  N851NW  219.0000     1
##  6  N928DN  201.0000     1
##  7  N7715E  188.0000     1
##  8  N654UA  185.0000     1
##  9  N665MQ  174.6667     6
## 10  N427SW  157.0000     1
## # ... with 4,034 more rows
  1. What time of day should you fly if you want to avoid delays as much as possible?
flights_midnight %>% 
        group_by(hour) %>% 
        summarise(avg.delay = mean(arr_delay, na.rm = TRUE)) %>% 
        arrange(avg.delay)
## # A tibble: 20 x 2
##     hour  avg.delay
##    <dbl>      <dbl>
##  1     7 -5.3044716
##  2     5 -4.7969072
##  3     6 -3.3844854
##  4     9 -1.4514074
##  5     8 -1.1132266
##  6    10  0.9539401
##  7    11  1.4819300
##  8    12  3.4890104
##  9    13  6.5447397
## 10    14  9.1976501
## 11    23 11.7552783
## 12    15 12.3241920
## 13    16 12.5976412
## 14    18 14.7887244
## 15    22 15.9671618
## 16    17 16.0402670
## 17    19 16.6558736
## 18    20 16.6761098
## 19    21 18.3869371
## 20     1        NaN
  • Early morning flights are the best, between 5-7am.
  1. For each destination, compute the total minutes of delay. For each, flight, compute the proportion of the total delay for its destination.
flights %>% 
        group_by(dest) %>% 
        filter(arr_delay > 0) %>% 
        mutate(tot.delay = sum(arr_delay),
               prop.delay = arr_delay / tot.delay) %>% 
        select(year:day, dest, arr_delay, tot.delay, prop.delay)
## # A tibble: 133,004 x 7
## # Groups:   dest [103]
##     year month   day  dest arr_delay tot.delay   prop.delay
##    <int> <int> <int> <chr>     <dbl>     <dbl>        <dbl>
##  1  2013     1     1   IAH        11     99391 1.106740e-04
##  2  2013     1     1   IAH        20     99391 2.012255e-04
##  3  2013     1     1   MIA        33    140424 2.350026e-04
##  4  2013     1     1   ORD        12    283046 4.239594e-05
##  5  2013     1     1   FLL        19    202605 9.377853e-05
##  6  2013     1     1   ORD         8    283046 2.826396e-05
##  7  2013     1     1   LAX         7    203226 3.444441e-05
##  8  2013     1     1   DFW        31    110009 2.817951e-04
##  9  2013     1     1   ATL        12    300299 3.996017e-05
## 10  2013     1     1   DTW        16    138258 1.157257e-04
## # ... with 132,994 more rows
  1. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag() explore how the delay of a flight is related to the delay of the immediately preceding flight.
flights_midnight %>% 
        group_by(origin) %>% 
        arrange(year, month, day, dep_time_min_mid) %>% 
        filter(!is.na(dep_delay)) %>% 
        mutate(prev.delay = lag(dep_delay)) %>% 
        ggplot(aes(x = dep_delay, y = prev.delay)) +
        geom_point() +
        geom_smooth(method = "lm")

  1. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
flights %>% 
        group_by(dest) %>% 
        mutate(avg.airtime = mean(air_time, na.rm = TRUE), std.airtime = scale(air_time), n = n()) %>% 
        select(year:day, dest, air_time, avg.airtime, std.airtime, n) %>% 
        filter(std.airtime < -4) %>% 
        arrange(std.airtime)
## # A tibble: 4 x 8
## # Groups:   dest [4]
##    year month   day  dest air_time avg.airtime std.airtime     n
##   <int> <int> <int> <chr>    <dbl>       <dbl>       <dbl> <int>
## 1  2013     7     2   MSP       93   150.57368   -4.899222  7185
## 2  2013     5    25   ATL       65   112.93045   -4.884688 17215
## 3  2013     5    13   GSP       55    93.39494   -4.722429   849
## 4  2013     3    23   BNA       70   114.38215   -4.049143  6333
  1. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
  • Rank by how many of the destinations a carrier flys to
pop_dest <- flights %>% 
        group_by(dest) %>% 
        summarise(num.carrier = length(unique(carrier))) %>% 
        filter(num.carrier >= 2) %>% 
        arrange(desc(num.carrier))

flights %>% 
        filter(dest %in% pop_dest$dest) %>% 
        group_by(carrier) %>% 
        summarise(num.dests = length(unique(dest))) %>% 
        arrange(desc(num.dests))
## # A tibble: 16 x 2
##    carrier num.dests
##      <chr>     <int>
##  1      EV        51
##  2      9E        48
##  3      UA        42
##  4      DL        39
##  5      B6        35
##  6      AA        19
##  7      MQ        19
##  8      WN        10
##  9      OO         5
## 10      US         5
## 11      VX         4
## 12      YV         3
## 13      FL         2
## 14      AS         1
## 15      F9         1
## 16      HA         1