This post covers the content and exercises for Ch 11: Data Import from R for Data Science. The chapter teaches how to read in plain text files of data.

library(tidyverse)

11.2 Getting started

Using {readr} to load text files.

11.2.2 Exercises

  1. What function would you use to read a file where fields were separated with “|”?
  • read_delim()
  1. Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?
  • col_names
  • col_types
  • locale
  • na
  • quoted_na
  • quote
  • trim_ws
  • n_max
  • guess_max
  • progress
  1. What are the most important arguments to read_fwf()?
  • file
  • col_positions
  1. Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or ’. By convention, read_csv() assumes that the quoting character will be “, and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?

"x,y\n1,'a,b'"

(text <- read_csv("x,y\n1,'a,b'", quote = "'"))
## # A tibble: 1 x 2
##       x     y
##   <int> <chr>
## 1     1   a,b
  1. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
# only 2 column names
# drop last column
read_csv("a,b\n1,2,3\n4,5,6")
## Warning: 2 parsing failures.
## row # A tibble: 2 x 5 col     row   col  expected    actual         file expected   <int> <chr>     <chr>     <chr>        <chr> actual 1     1  <NA> 2 columns 3 columns literal data file 2     2  <NA> 2 columns 3 columns literal data
## # A tibble: 2 x 2
##       a     b
##   <int> <int>
## 1     1     2
## 2     4     5
# 3 col names, 1st row has two obs, 2nd row has 4 obs
# add NA to 1st row, drop 4th col from 2nd row
read_csv("a,b,c\n1,2\n1,2,3,4")
## Warning: 2 parsing failures.
## row # A tibble: 2 x 5 col     row   col  expected    actual         file expected   <int> <chr>     <chr>     <chr>        <chr> actual 1     1  <NA> 3 columns 2 columns literal data file 2     2  <NA> 3 columns 4 columns literal data
## # A tibble: 2 x 3
##       a     b     c
##   <int> <int> <int>
## 1     1     2    NA
## 2     1     2     3
# 2 col names, 1st row has one obs
# add NA to 1st row
read_csv("a,b\n\"1")
## Warning: 2 parsing failures.
## row # A tibble: 2 x 5 col     row   col                     expected    actual         file expected   <int> <chr>                        <chr>     <chr>        <chr> actual 1     1     a closing quote at end of file           literal data file 2     1  <NA>                    2 columns 1 columns literal data
## # A tibble: 1 x 2
##       a     b
##   <int> <chr>
## 1     1  <NA>
# a and b are coerced to character since they contain text
read_csv("a,b\n1,2\na,b")
## # A tibble: 2 x 2
##       a     b
##   <chr> <chr>
## 1     1     2
## 2     a     b
# should use read_delim
read_delim("a;b\n1;3", delim = ";")
## # A tibble: 1 x 2
##       a     b
##   <int> <int>
## 1     1     3

11.3 Parsing a vector

11.3.5 Exercises

  1. What are the most important arguments to locale()?
#?locale()
  1. What happens if you try and set decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to “,”? What happens to the default value of decimal_mark when you set the grouping_mark to “.”?
locale(decimal_mark = ",", grouping_mark = ",")
## Error: `decimal_mark` and `grouping_mark` must be different
# error occurs

locale(decimal_mark = ",")
## <locale>
## Numbers:  123.456,78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
##         Thursday (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May
##         (May), June (Jun), July (Jul), August (Aug), September
##         (Sep), October (Oct), November (Nov), December (Dec)
## AM/PM:  AM/PM
# default gouping mark changed to "."

locale(grouping_mark = ".")
## <locale>
## Numbers:  123.456,78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
##         Thursday (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May
##         (May), June (Jun), July (Jul), August (Aug), September
##         (Sep), October (Oct), November (Nov), December (Dec)
## AM/PM:  AM/PM
# default decimal mark changed to ","
  1. I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.
  • From the readr vignette, time_format currently isn’t used but date_format is used to specify the format of the dates in the data
parse_guess("01/12/2017", locale = locale(date_format = "%d/%m/%Y"))
## [1] "2017-12-01"
  1. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
  • Luckily I’m in the US
  1. What’s the difference between read_csv() and read_csv2()?
  • The difference is in the default delimeter.
  • read_csv() uses a comma
  • read_csv2() uses a semicolon

6 What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.

  • UTF-8 has been the dominant character encoding for the World Wide Web since 2009, and as of November 2017 accounts for 90.3% of all Web pages.
  1. Generate the correct format string to parse each of the following dates and times:
d1 <- "January 1, 2010"
parse_date(d1, format = "%B %d, %Y")
## [1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, format = "%Y-%b-%d")
## [1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, format = "%d-%b-%Y")
## [1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, format = "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, format = "%m/%d/%y")
## [1] "2014-12-30"
t1 <- "1705"
parse_time(t1, format = "%H%M")
## 17:05:00
t2 <- "11:15:10.12 PM"
parse_time(t2, format = "%I:%M:%OS %p")
## 23:15:10.12