02. Reading, inspecting, and writing datasets
Basics of working with datasets in R
This Recipe guides you through the process of reading, inspecting, and writing datasets using R packages and functions in a Quarto environment. You’ll learn how to effectively combine code and narrative to create a reproducible document that can be shared with others.
Concepts and strategies
Quarto documents and code blocks
As you will remember from Recipes 0 and 1, Quarto documents can combine prose and code. The prose is written in Markdown and the code is written in R1. The code is contained in code blocks, which are opened by three backticks followed by the name of the programming language, `r`, in curly braces ({r}), and closed by another three backticks. For example, the following minimal Quarto document contains an R code block:
---
title: My Quarto Document
format: pdf
---
# Goals
This script ...
```{r}
#| label: code-block-name
# R code goes here
```
As you can see in the code block, the ...
Code blocks have various options that can be added using key-value pairs prefixed with `#|`. Some common options we will use in this Recipe are:

- `label`: A unique name for the code block. This is used to reference the code block.
- `echo`: A boolean value (`true` or `false`) that determines whether the code is displayed in the output document.
- `include`: A boolean value (`true` or `false`) that determines whether the output of the code is displayed in the output document.
- `message`: A boolean value (`true` or `false`) that determines whether messages from the code are displayed in the output document.
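To make these options concrete, here is a minimal sketch of a code block that sets all four (the label name and the code itself are just placeholders):

```{r}
#| label: example-options
#| echo: true
#| include: true
#| message: false

# A placeholder computation: the code is shown, its output is included,
# and any messages it produces are suppressed
1 + 1
```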
Setting up the environment
Before we can read, inspect, and write data, we need to load the packages that contain the functions we will use. We will use {readr} to read datasets into R and write datasets to disk and {dplyr} to inspect and transform (subset) the data.
There are a few ways to load packages into an R session. The most common way is to use the `library()` function, which loads a package into the R session and stops the script if the package is not available in the current computing environment.
For example, the following code block loads {readr} and {dplyr} into the R session:
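```{r}
#| label: load-packages
# Load packages
library(readr) # for reading and writing data
library(dplyr) # for inspecting and transforming data
```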
This code block assumes that {readr} and {dplyr} are installed on the current computing environment. If the packages are not installed, the code block will stop and display an error message, such as:
Error in library(readr) : there is no package called ‘readr’
This error can be addressed by installing the missing package with `install.packages("readr")` and then re-running the code block. Relying on manual installation is not ideal for reproducibility, however, because the code block will stop whenever the package is not installed on a given computing environment. We will consider a more reproducible approach later in the course.
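As a preview, one common pattern (just an illustration, not necessarily the approach we will adopt later) is to check whether a package is available and install it only if it is missing:

```{r}
#| eval: false
# Install {readr} only if it is not already available, then load it
if (!requireNamespace("readr", quietly = TRUE)) {
  install.packages("readr")
}
library(readr)
```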
Understanding the data
Now that we have our environment set up, we can read the dataset into R. But before we do, we should make sure that we understand the data by looking at the data documentation.
The dataset that we will read into our R session is based on the Brown Corpus (Francis and Kučera 1961). I’ve created a data origin file that contains the data documentation for the Brown Corpus, as we can see in Table 1.
# Read and display the data origin file
# (table formatting functions come from {knitr} and {kableExtra})
read_csv(file = "../data/original/brown_passives_do.csv") |>
  knitr::kable() |>
  kableExtra::kable_styling() |>
  kableExtra::column_spec(1, width = "15em")
Table 1: Data origin file for the Brown Corpus

| attribute | description |
|---|---|
| Resource name | Brown Corpus |
| Data source | http://korpus.uib.no/icame/brown/bcm.html |
| Data sampling frame | Edited American English prose from various genres, published in the United States during the calendar year 1961. |
| Data collection date(s) | Originally published in 1964, revised in 1971 and 1979. |
| Data format | Multiple formats including Form A (original), Form B (stripped version), Form C (tagged version), Bergen Forms I and II, and Brown MARC Form. |
| Data schema | 500 samples of approximately 2000 words each, covering a wide range of genres and styles. Includes coding for major and minor headings, special types (italics, bold, etc.), abbreviations, symbols, and other textual features. |
| License | Use restricted for scholarly research in linguistics, stylistics, and other disciplines. Specific copyright restrictions detailed in the manual. |
| Attribution | W. Nelson Francis and Henry Kucera, Brown University, 1964, revised 1971 and 1979. |
This data origin file provides an overview of the original data source. In this case, the dataset we will read into R is a subset of the Brown Corpus that aggregates the use of the passive voice by genre. This dataset was developed by the authors of {corpora} (Evert 2023). I’ve exported the dataset to a CSV file, which we will read into R.
The data dictionary which describes the dataset we will read appears in Table 2.
# Read and display the data documentation file
# (table formatting functions come from {knitr} and {kableExtra})
read_csv(file = "../data/derived/brown_passives_curated_dd.csv") |>
  knitr::kable() |>
  kableExtra::kable_styling()
Table 2: Data dictionary for the curated Brown passives dataset

| variable | name | variable_type | description |
|---|---|---|---|
| cat | Category | categorical | Genre categories represented by letters |
| passive | Passive | numeric | Number of passive verb phrases in the genre |
| n_w | Number of words | numeric | Number of words in the genre |
| n_s | Number of sentences | numeric | Number of sentences in the genre |
| name | Genre | categorical | Genre name |
With this information, we are now in a position to read and inspect the dataset.
Reading datasets into R with {readr}
We’ve now prepared our Quarto document by loading the packages we will use, and we have reviewed the dataset documentation so that we understand the dataset we will read into R. We are now ready to read the dataset into R.
R provides a number of functions for reading many types of data. We will explore many types of data and datasets in this course. For now, we will focus on reading rectangular data into R. Rectangular data is data that is organized in rows and columns, such as a spreadsheet.
One of the most common file formats for rectangular data is the comma-separated values (CSV) file. CSV files are text files in which lines represent rows and commas separate columns of data. For example, the sample CSV file snippet below contains three rows and three columns of data:
"word","frequency","part_of_speech"
"the",69971,"article"
"of",36412,"preposition"
"and",28853,"conjunction"
A CSV file is a type of delimited file, which means that the data is separated by a delimiter. In the case of a CSV file, the delimiter is a comma. Other types of delimited files use different delimiters, such as tab-separated values (TSV) files, which use a tab character, or files delimited by a pipe (`|`) or a semicolon (`;`).
{readr} provides functions to read rectangular datasets into R. The `read_csv()` function reads CSV files, the `read_tsv()` function reads TSV files, and the `read_delim()` function reads other types of delimited files.
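As a quick illustration (the file names here are hypothetical), each reader is given a path to the file, and `read_delim()` additionally takes the delimiter:

```{r}
#| eval: false
# Hypothetical files, shown only to illustrate the reader functions
read_csv(file = "words.csv")                 # comma-delimited
read_tsv(file = "words.tsv")                 # tab-delimited
read_delim(file = "words.txt", delim = "|")  # pipe-delimited
```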
Let’s use the `read_csv()` function to read the `brown_passives_curated.csv` file into R. To do this, we will use the `file` argument to specify the path to the file. The file “path” is the location of the file on the computer. We can specify this path in two ways:
- Relative path: The relative path is the path to the file relative to the current working directory. The current working directory is the directory in which the R session is running.
- Absolute path: The absolute path is the path to the file from the root directory of the computer.
For most purposes, the relative path is the better option because it is more portable. For example, if you share your code with someone else, they may have a different absolute path to the file, but they will likely have the same relative path to the file.
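To make the contrast concrete, here is a sketch; the absolute path is made up for illustration, while the relative path matches the project layout described below:

```{r}
#| eval: false
# Relative path: resolved from the directory containing this Quarto document
read_csv(file = "../data/derived/brown_passives_curated.csv")

# Absolute path: tied to one particular computer's directory layout (hypothetical)
read_csv(file = "/Users/someone/project/data/derived/brown_passives_curated.csv")
```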
Let’s say that the directory structure of our project is as follows:
project/
├── data/
│ ├── original/
│ │ └── brown_passives_do.csv
│ └── derived/
│ └── brown_passives_curated.csv
└── code/
└── reading-inspecting-writing.qmd
In this case, the relative path from `reading-inspecting-writing.qmd` to the `brown_passives_curated.csv` file is `../data/derived/brown_passives_curated.csv`. The `..` means “go up one directory” and the rest of the path is the path to the file from the `project/` directory.
With this in mind, we can read the `brown_passives_curated.csv` file into R with the following code block:
#| label: read-dataset-brown-passives-curated
# Read the dataset
brown_passives_df <- read_csv(file = "../data/derived/brown_passives_curated.csv")
Running the above code block in our Quarto document will read the dataset into R and assign it to the `brown_passives_df` variable. It will also show the code used to read the dataset into R. Furthermore, some functions will display messages in the output. For example, the `read_csv()` function will display a message describing the parsing options used to read the dataset into R.
This information can be helpful in an interactive session, as `read_csv()` tells us the dimensions of the dataset and the data types of each column. But this output is not necessary, and it is unnecessarily verbose in a reproducible document.
We can hide any messages produced by a function by using the `message: false` key-value pair in the code block. For example, the following code block will read the dataset into R and assign it to the `brown_passives_df` variable without displaying any messages:
#| label: read-dataset-brown-passives-curated
#| message: false
# Read the dataset
brown_passives_df <- read_csv(file = "../data/derived/brown_passives_curated.csv")
No messages are displayed in the document output.
Inspecting datasets with {dplyr}
The objective of this section is to demonstrate how to inspect and transform (subset) datasets using {dplyr}. We will use {dplyr} to inspect the dataset we read into R in the previous section.
Reading a CSV file into R will create a data frame object, which is why I assigned the result to `brown_passives_df`. The `df` suffix is a common naming convention for rectangular data frames. It is good practice to use a consistent naming convention for objects in your code; this makes the code easier to understand and helps avoid errors.
Let’s get an overview of the dataset by using the `glimpse()` function from {dplyr}. The `glimpse()` function displays the dimensions of the data frame and the data types of each column.
# Preview
glimpse(brown_passives_df)
Rows: 15
Columns: 5
$ cat <chr> "A", "B", "C", "D", "E", "F", "G", "H", "J", "K", "L", "M", "N…
$ passive <dbl> 892, 543, 283, 351, 853, 1034, 1460, 837, 2423, 352, 265, 104,…
$ n_w <dbl> 101196, 61535, 40749, 39029, 82010, 110363, 173017, 69446, 181…
$ n_s <dbl> 3684, 2399, 1459, 1372, 3286, 4387, 6537, 2012, 6311, 3983, 36…
$ name <chr> "press reportage", "press editorial", "press reviews", "religi…
If we want a more tabular view of the data, we can simply print the data frame to the console. It’s worth mentioning that all {readr} functions return tibbles, so we gain the benefits of tibbles whenever we read a dataset into R with {readr} functions. One of those benefits is that printing a data frame to the console, or to our document, will not dump all of the data.
# Print the data frame
brown_passives_df
# A tibble: 15 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 A 892 101196 3684 press reportage
2 B 543 61535 2399 press editorial
3 C 283 40749 1459 press reviews
4 D 351 39029 1372 religion
5 E 853 82010 3286 skills / hobbies
6 F 1034 110363 4387 popular lore
7 G 1460 173017 6537 belles lettres
8 H 837 69446 2012 miscellaneous
9 J 2423 181426 6311 learned
10 K 352 68599 3983 general fiction
11 L 265 57624 3673 detective
12 M 104 14433 873 science fiction
13 N 290 69909 4438 adventure
14 P 290 70476 4187 romance
15 R 146 21757 975 humour
By default, printing a tibble shows all columns, unless the columns are too numerous to display width-wise, and at most the first 10 rows; tibbles with only a few more rows than that, like ours, are printed in full.
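If you ever do want to see every row of a larger tibble, the `print()` method for tibbles accepts an `n` argument; a small sketch:

```{r}
#| eval: false
# Print every row of the tibble, regardless of its size
print(brown_passives_df, n = Inf)
```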
{dplyr} also provides a set of `slice_*()` functions which allow us to display the data in a tabular fashion, with some additional options. There are three `slice_*()` functions we will cover here:

- `slice_head()`: Select the first `n` rows of the data frame.
- `slice_tail()`: Select the last `n` rows of the data frame.
- `slice_sample()`: Select a random sample of `n` rows from the data frame.
For example, the following code block will select the first 5 rows of the data frame:
# Select the first 5 rows
slice_head(brown_passives_df, n = 5)
# A tibble: 5 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 A 892 101196 3684 press reportage
2 B 543 61535 2399 press editorial
3 C 283 40749 1459 press reviews
4 D 351 39029 1372 religion
5 E 853 82010 3286 skills / hobbies
We can also select the last 5 rows of the data frame with the slice_tail()
function:
# Select the last 5 rows
slice_tail(brown_passives_df, n = 5)
# A tibble: 5 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 L 265 57624 3673 detective
2 M 104 14433 873 science fiction
3 N 290 69909 4438 adventure
4 P 290 70476 4187 romance
5 R 146 21757 975 humour
Finally, we can select a random sample of 5 rows from the data frame with the slice_sample()
function:
# Select a random sample of 5 rows
slice_sample(brown_passives_df, n = 5)
# A tibble: 5 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 B 543 61535 2399 press editorial
2 D 351 39029 1372 religion
3 H 837 69446 2012 miscellaneous
4 R 146 21757 975 humour
5 P 290 70476 4187 romance
These functions can be helpful for getting a sense of the dataset in different ways. In combination with the `arrange()` function, we can also sort the data frame by a column or columns and then select the first or last rows.
For example, the following code block will sort the data frame by the `passive` column in ascending order and then select the first 5 rows:
# Sort by the `passive` column and select the first 5 rows
slice_head(arrange(brown_passives_df, passive), n = 5)
# A tibble: 5 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 M 104 14433 873 science fiction
2 R 146 21757 975 humour
3 L 265 57624 3673 detective
4 C 283 40749 1459 press reviews
5 N 290 69909 4438 adventure
If we want to sort in descending order, we can wrap the column name in `desc()`, as in `arrange(desc(passive))`.
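For example, a quick sketch (written in the same nested style as the previous block) that sorts by `passive` in descending order and keeps the first 5 rows:

```{r}
#| eval: false
# Sort by `passive` in descending order and select the first 5 rows
slice_head(arrange(brown_passives_df, desc(passive)), n = 5)
```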
Now, the previous code blocks do what we want, but they are not very readable. Enter the pipe operator. The pipe operator `|>` allows us to chain the output of one function to the input of another function. This allows us to write more readable code.
brown_passives_df |>
  arrange(passive) |>
  slice_head(n = 5)
# A tibble: 5 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 M 104 14433 873 science fiction
2 R 146 21757 975 humour
3 L 265 57624 3673 detective
4 C 283 40749 1459 press reviews
5 N 290 69909 4438 adventure
The result is the same but the code makes more sense. We can read the code from left to right, top to bottom, which is the order in which the functions are executed.
In addition to being more legible, using the pipe with each function on its own line allows us to add comments to each line of code. For example, the following code block is the same as the previous code block, but with comments added.
# Sort by the `passive` column and select the first 5 rows
brown_passives_df |>   # start with the data frame
  arrange(passive) |>  # sort by `passive` in ascending order
  slice_head(n = 5)    # keep the first 5 rows
It is good practice to add comments when writing code, as long as they make the code more readable and easier to understand for others and for your future self! If the comments are too verbose, and only repeat what the code is ‘saying’, then don’t include them.
Subsetting datasets with {dplyr}
Now that we have a sense of the data, we can subset the dataset to create variations of our original data frame. We can subset the data frame by selecting columns and/or rows.
In the R lesson “Packages and Functions”, we saw that base R provides the bracket (`[]`) operator to subset data frames. {dplyr} provides functions to subset data frames which can be more readable and easier to use.
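As a point of comparison (just a sketch; we will use the {dplyr} versions from here on), the same subsetting can be written with brackets or with {dplyr} functions:

```{r}
#| eval: false
# Base R bracket subsetting: columns by name, rows by a logical condition
brown_passives_df[, c("passive", "n_w")]
brown_passives_df[brown_passives_df$passive < 1000, ]

# The {dplyr} equivalents, covered below
select(brown_passives_df, passive, n_w)
filter(brown_passives_df, passive < 1000)
```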
Let’s first look at selecting columns. The `select()` function allows us to select columns by name. For example, the following code block will select the `passive` and `n_w` columns from the data frame:
# Select the `passive` and `n_w` columns
select(brown_passives_df, passive, n_w)
# A tibble: 15 × 2
passive n_w
<dbl> <dbl>
1 892 101196
2 543 61535
3 283 40749
4 351 39029
5 853 82010
6 1034 110363
7 1460 173017
8 837 69446
9 2423 181426
10 352 68599
11 265 57624
12 104 14433
13 290 69909
14 290 70476
15 146 21757
Beyond selecting columns, we can also reorder and rename columns. For example, the following code block will select the `passive` and `n_w` columns, rename the `n_w` column to `num_words`, and reorder the columns so that `num_words` is the first column:
# Select, rename, and reorder columns
brown_passives_df |>
  select(num_words = n_w, passive)
# A tibble: 15 × 2
num_words passive
<dbl> <dbl>
1 101196 892
2 61535 543
3 40749 283
4 39029 351
5 82010 853
6 110363 1034
7 173017 1460
8 69446 837
9 181426 2423
10 68599 352
11 57624 265
12 14433 104
13 69909 290
14 70476 290
15 21757 146
By selecting some columns and not others, we have effectively dropped the columns we did not select. If it is more convenient to name the columns to drop rather than the columns to keep, we can use the `select()` function with the `-` operator. For example, the following code block will drop the `cat` column from the data frame:
# Drop the `cat` column
brown_passives_df |>
  select(-cat)
# A tibble: 15 × 4
passive n_w n_s name
<dbl> <dbl> <dbl> <chr>
1 892 101196 3684 press reportage
2 543 61535 2399 press editorial
3 283 40749 1459 press reviews
4 351 39029 1372 religion
5 853 82010 3286 skills / hobbies
6 1034 110363 4387 popular lore
7 1460 173017 6537 belles lettres
8 837 69446 2012 miscellaneous
9 2423 181426 6311 learned
10 352 68599 3983 general fiction
11 265 57624 3673 detective
12 104 14433 873 science fiction
13 290 69909 4438 adventure
14 290 70476 4187 romance
15 146 21757 975 humour
Let’s now turn our attention to subsetting rows. The `filter()` function allows us to select rows by a logical condition. For example, the following code block will select rows where the values of the `passive` column are less than (`<`) 1,000:
# Select rows where `passive` is less than 1,000
brown_passives_df |>
  filter(passive < 1000)
# A tibble: 12 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 A 892 101196 3684 press reportage
2 B 543 61535 2399 press editorial
3 C 283 40749 1459 press reviews
4 D 351 39029 1372 religion
5 E 853 82010 3286 skills / hobbies
6 H 837 69446 2012 miscellaneous
7 K 352 68599 3983 general fiction
8 L 265 57624 3673 detective
9 M 104 14433 873 science fiction
10 N 290 69909 4438 adventure
11 P 290 70476 4187 romance
12 R 146 21757 975 humour
We can also use the `filter()` function to select rows by a character string. For example, the following code block will select rows where the values of the `name` column are equal to `"religion"`:
# Select rows where `name` is equal to "religion"
brown_passives_df |>
  filter(name == "religion")
# A tibble: 1 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 D 351 39029 1372 religion
The inequality operator `!=` can be used with character strings as well. To include multiple values, we can use the `%in%` operator, passing a vector of values to the `filter()` function. For example, the following code block will select rows where the values of the `name` column are equal to `"religion"`, `"learned"`, or `"detective"`:
# Select multiple values
brown_passives_df |>
  filter(name %in% c("religion", "learned", "detective"))
# A tibble: 3 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 D 351 39029 1372 religion
2 J 2423 181426 6311 learned
3 L 265 57624 3673 detective
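For completeness, here is a small sketch of the inequality operator `!=` mentioned above, which keeps every row except those matching the given value:

```{r}
#| eval: false
# Select rows where `name` is NOT equal to "religion"
brown_passives_df |>
  filter(name != "religion")
```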
Writing datasets to a file with {readr}
Finally, we can write data, including data frames, to a file with the `write_*()` functions from {readr}. The `write_*()` functions include:

- `write_csv()`: Write a data frame to a CSV file.
- `write_tsv()`: Write a data frame to a TSV file.
- `write_delim()`: Write a data frame to a delimited file with the specified delimiter (`|`, `;`, etc.); see the short sketch after this list.
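As a quick sketch (the output file names here are hypothetical), each writer takes the data frame as `x` and a path as `file`, and `write_delim()` also takes the delimiter:

```{r}
#| eval: false
# Hypothetical output files, shown only to illustrate the writer functions
write_csv(x = brown_passives_df, file = "passives.csv")
write_tsv(x = brown_passives_df, file = "passives.tsv")
write_delim(x = brown_passives_df, file = "passives.txt", delim = "|")
```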
To create a distinct data frame from the one we read into R, let’s subset our `brown_passives_df` data frame by columns and rows to create a new data frame that contains only the `passive`, `n_w`, and `name` columns and only the rows where the values of the `passive` column are greater than (`>`) 1,000, and assign it to `brown_passives_subset_df`.
# Subset the data frame
brown_passives_subset_df <-
  brown_passives_df |>
  select(passive, n_w, name) |>
  filter(passive > 1000)
Now the following code block will write the `brown_passives_subset_df` data frame to a CSV file at the specified file path:
# Write the data frame to a CSV file
write_csv(
  x = brown_passives_subset_df,
  file = "../data/derived/brown_passives_subset.csv"
)
Given the example directory structure we saw earlier, our new file appears in the `data/derived/` directory.
project/
├── data/
│ ├── original/
│ │ └── brown_passives_do.csv
│ └── derived/
│ ├── brown_passives_curated.csv
│ ├── brown_passives_curated_dd.csv
│ └── brown_passives_subset.csv
└── code/
└── reading-inspecting-writing.qmd
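One simple way to confirm the write succeeded (just a sanity check, not a required step) is to test that the file exists or read it back in:

```{r}
#| eval: false
# Confirm the new file exists on disk
file.exists("../data/derived/brown_passives_subset.csv")

# Or read it back in to inspect it
read_csv(file = "../data/derived/brown_passives_subset.csv")
```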
There is much more to learn about reading, inspecting, and writing datasets in R. We will introduce more functions and techniques in the coming lessons. For now, we have learned how to read, inspect, and write datasets using R functions and Quarto code blocks!
Check your understanding
- {readr} provides functions to read rectangular data into R.
- The ________ option in a code block determines whether the code is displayed in the output document.
- {dplyr} provides functions to create data dictionaries.
- ________ is used to read tab-separated values (TSV) files.
- Which function in {dplyr} is used to select columns by name?
- The R pipe operator `|>` allows us to chain the output of one function to the input of another function.
Lab preparation
In Lab 2 you will have the opportunity to apply the skills you learned in this Recipe to create a Quarto document that reads, inspects, and writes data.
In addition to the knowledge and skills you have developed in Labs 0 and 1, to complete Lab 2, you will need to be able to:
- Create code blocks in a Quarto document
- Understand the purpose of the `label`, `echo`, `message`, and `include` options in a code block
- Load packages into an R session with `library()`
- Understand how to read and create relative file paths
- Read datasets into R with the `read_csv()` function
- Inspect data frames with {dplyr} functions such as `glimpse()`, `slice_head()`, `slice_tail()`, `slice_sample()`, and `arrange()`
- Use the `|>` pipe operator to chain functions together
- Subset data frames with {dplyr} functions such as `select()` and `filter()`
- Write data frames to a file with the `write_csv()` function
References
Footnotes
Code blocks can be written in other programming languages as well, such as Python, Bash, etc.