02. Reading, inspecting, and writing datasets
Basics of working with datasets in R
This Recipe guides you through the process of reading, inspecting, and writing datasets using R packages and functions in a Quarto environment. You’ll learn how to effectively combine code and narrative to create a reproducible document that can be shared with others.
Concepts and strategies
Quarto documents and code blocks
As you will remember from Recipes 0 and 1, Quarto documents can combine prose and code. The prose is written in Markdown and the code is written in R1. The code is contained in code blocks, which are opened by three backticks followed by the name of the programming language, `r`, in curly braces ({r}), and closed by another three backticks. For example, the following minimal Quarto document contains an R code block:
---
title: My Quarto Document
format: pdf
---
# Goals
This script ...
```{r}
#| label: code-block-name
# R code goes here
```
As you can see in the code block, the ...
Code blocks have various options that can be added using key-value pairs prefixed with `#|`. Some common options we will use in this Recipe are:

- `label`: A unique name for the code block. This is used to reference the code block.
- `echo`: A boolean value (`true` or `false`) that determines whether the code is displayed in the output document.
- `include`: A boolean value (`true` or `false`) that determines whether the output of the code is displayed in the output document.
- `message`: A boolean value (`true` or `false`) that determines whether messages from the code are displayed in the output document.
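To make these options concrete, here is a minimal sketch of a code block that sets all four (the label name and the code itself are just placeholders):

```{r}
#| label: example-options
#| echo: true
#| include: true
#| message: false

# A placeholder computation: the code is shown, its output is included,
# and any messages it produces are suppressed
1 + 1
```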
Setting up the environment
Before we can read, inspect, and write data, we need to load the packages that contain the functions we will use. We will use {readr} to read datasets into R and write datasets to disk and {dplyr} to inspect and transform (subset) the data.
There are a few ways to load packages into an R session. The most common way is to use the `library()` function, which loads a package into the R session and stops the script if the package is not available in the current computing environment.
For example, the following code block loads {readr} and {dplyr} into the R session:
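```{r}
#| label: load-packages
# Load packages
library(readr) # for reading and writing data
library(dplyr) # for inspecting and transforming data
```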
This code block assumes that {readr} and {dplyr} are installed on the current computing environment. If the packages are not installed, the code block will stop and display an error message, such as:
Error in library(readr) : there is no package called ‘readr’
This error can be addressed by installing the missing package with `install.packages("readr")` and then re-running the code block. Relying on manual installation is not ideal for reproducibility, however, because the code block will stop whenever the package is not installed on a given computing environment. We will consider a more reproducible approach later in the course.
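As a preview, one common pattern (just an illustration, not necessarily the approach we will adopt later) is to check whether a package is available and install it only if it is missing:

```{r}
#| eval: false
# Install {readr} only if it is not already available, then load it
if (!requireNamespace("readr", quietly = TRUE)) {
  install.packages("readr")
}
library(readr)
```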
Understanding the data
Now that we have our environment set up, we can read the dataset into R. But before we do, we should make sure that we understand the data by looking at the data documentation.
The dataset that we will read into our R session is based on the Brown Corpus (Francis and Kučera 1961). I’ve created a data origin file that contains the data documentation for the Brown Corpus, as we can see in Table 1.
# Read and display the data origin file
# (table formatting functions come from {knitr} and {kableExtra})
read_csv(file = "../data/original/brown_passives_do.csv") |>
  knitr::kable() |>
  kableExtra::kable_styling() |>
  kableExtra::column_spec(1, width = "15em")
Table 1: Data origin file for the Brown Corpus

| attribute | description |
|---|---|
| Resource name | Brown Corpus |
| Data source | http://korpus.uib.no/icame/brown/bcm.html |
| Data sampling frame | Edited American English prose from various genres, published in the United States during the calendar year 1961. |
| Data collection date(s) | Originally published in 1964, revised in 1971 and 1979. |
| Data format | Multiple formats including Form A (original), Form B (stripped version), Form C (tagged version), Bergen Forms I and II, and Brown MARC Form. |
| Data schema | 500 samples of approximately 2000 words each, covering a wide range of genres and styles. Includes coding for major and minor headings, special types (italics, bold, etc.), abbreviations, symbols, and other textual features. |
| License | Use restricted for scholarly research in linguistics, stylistics, and other disciplines. Specific copyright restrictions detailed in the manual. |
| Attribution | W. Nelson Francis and Henry Kucera, Brown University, 1964, revised 1971 and 1979. |
This data origin file provides an overview of the original data source. In this case, the dataset we will read into R is a subset of the Brown Corpus that aggregates the use of the passive voice by genre. This dataset was developed by the authors of {corpora} (Evert 2023). I’ve exported the dataset to a CSV file, which we will read into R.
The data dictionary which describes the dataset we will read appears in Table 2.
# Read and display the data documentation file
# (table formatting functions come from {knitr} and {kableExtra})
read_csv(file = "../data/derived/brown_passives_curated_dd.csv") |>
  knitr::kable() |>
  kableExtra::kable_styling()
Table 2: Data dictionary for the curated Brown passives dataset

| variable | name | variable_type | description |
|---|---|---|---|
| cat | Category | categorical | Genre categories represented by letters |
| passive | Passive | numeric | Number of passive verb phrases in the genre |
| n_w | Number of words | numeric | Number of words in the genre |
| n_s | Number of sentences | numeric | Number of sentences in the genre |
| name | Genre | categorical | Genre name |
With this information, we are now in a position to read and inspect the dataset.
Reading datasets into R with {readr}
We’ve now prepared our Quarto document by loading the packages we will use, and we have reviewed the dataset documentation so that we understand the dataset we will read into R. We are now ready to read the dataset into R.
R provides a number of functions for reading many types of data. We will explore many types of data and datasets in this course. For now, we will focus on reading rectangular data into R. Rectangular data is data that is organized in rows and columns, such as a spreadsheet.
One of the most common file formats for rectangular data is the comma-separated values (CSV) file. CSV files are text files in which lines represent rows and commas separate columns of data. For example, the sample CSV file snippet below contains three rows and three columns of data:
"word","frequency","part_of_speech"
"the",69971,"article"
"of",36412,"preposition"
"and",28853,"conjunction"
A CSV file is a type of delimited file, which means that the data is separated by a delimiter. In the case of a CSV file, the delimiter is a comma. Other types of delimited files use different delimiters, such as tab-separated values (TSV) files, which use a tab character, or files delimited by a pipe (`|`) or a semicolon (`;`).
{readr} provides functions to read rectangular datasets into R. The `read_csv()` function reads CSV files, the `read_tsv()` function reads TSV files, and the `read_delim()` function reads other types of delimited files.
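As a quick illustration (the file names here are hypothetical), each reader is given a path to the file, and `read_delim()` additionally takes the delimiter:

```{r}
#| eval: false
# Hypothetical files, shown only to illustrate the reader functions
read_csv(file = "words.csv")                 # comma-delimited
read_tsv(file = "words.tsv")                 # tab-delimited
read_delim(file = "words.txt", delim = "|")  # pipe-delimited
```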
Let’s use the `read_csv()` function to read the `brown_passives_curated.csv` file into R. To do this, we will use the `file` argument to specify the path to the file. The file “path” is the location of the file on the computer. We can specify this path in two ways:
- Relative path: The relative path is the path to the file relative to the current working directory. The current working directory is the directory in which the R session is running.
- Absolute path: The absolute path is the path to the file from the root directory of the computer.
For most purposes, the relative path is the better option because it is more portable. For example, if you share your code with someone else, they may have a different absolute path to the file, but they will likely have the same relative path to the file.
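To make the contrast concrete, here is a sketch; the absolute path is made up for illustration, while the relative path matches the project layout described below:

```{r}
#| eval: false
# Relative path: resolved from the directory containing this Quarto document
read_csv(file = "../data/derived/brown_passives_curated.csv")

# Absolute path: tied to one particular computer's directory layout (hypothetical)
read_csv(file = "/Users/someone/project/data/derived/brown_passives_curated.csv")
```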
Let’s say that the directory structure of our project is as follows:
project/
├── data/
│ ├── original/
│ │ └── brown_passives_do.csv
│ └── derived/
│ └── brown_passives_curated.csv
└── code/
└── reading-inspecting-writing.qmd
In this case, the relative path from `reading-inspecting-writing.qmd` to the `brown_passives_curated.csv` file is `../data/derived/brown_passives_curated.csv`. The `..` means “go up one directory” and the rest of the path is the path to the file from the `project/` directory.
With this in mind, we can read the `brown_passives_curated.csv` file into R with the following code block:
#| label: read-dataset-brown-passives-curated
# Read the dataset
brown_passives_df <- read_csv(file = "../data/derived/brown_passives_curated.csv")
Running the above code block in our Quarto document will read the dataset into R and assign it to the `brown_passives_df` variable. It will also show the code used to read the dataset into R. Furthermore, some functions will display messages in the output. For example, the `read_csv()` function will display a message describing the parsing options used to read the dataset into R.
This information can be helpful in an interactive session, as `read_csv()` tells us the dimensions of the dataset and the data types of each column. But this output is not necessary, and it is unnecessarily verbose in a reproducible document.
We can hide any messages produced by a function by using the `message: false` key-value pair in the code block. For example, the following code block will read the dataset into R and assign it to the `brown_passives_df` variable without displaying any messages:
#| label: read-dataset-brown-passives-curated
#| message: false
# Read the dataset
brown_passives_df <- read_csv(file = "../data/derived/brown_passives_curated.csv")
No messages are displayed in the document output.
Inspecting datasets with {dplyr}
The objective of this section is to demonstrate how to inspect and transform (subset) datasets using {dplyr}. We will use {dplyr} to inspect the dataset we read into R in the previous section.
Reading a CSV file into R will create a data frame object, which is why I assigned the result to `brown_passives_df`. The `df` suffix is a common naming convention for rectangular data frames. It is good practice to use a consistent naming convention for objects in your code; this makes the code easier to understand and helps avoid errors.
Let’s get an overview of the dataset by using the `glimpse()` function from {dplyr}. The `glimpse()` function displays the dimensions of the data frame and the data types of each column.
# Preview
glimpse(brown_passives_df)
Rows: 15
Columns: 5
$ cat <chr> "A", "B", "C", "D", "E", "F", "G", "H", "J", "K", "L", "M", "N…
$ passive <dbl> 892, 543, 283, 351, 853, 1034, 1460, 837, 2423, 352, 265, 104,…
$ n_w <dbl> 101196, 61535, 40749, 39029, 82010, 110363, 173017, 69446, 181…
$ n_s <dbl> 3684, 2399, 1459, 1372, 3286, 4387, 6537, 2012, 6311, 3983, 36…
$ name <chr> "press reportage", "press editorial", "press reviews", "religi…
If we want a more tabular view of the data, we can simply print the data frame to the console. It’s worth mentioning that all {readr} functions return tibbles, so we gain the benefits of tibbles whenever we read a dataset into R with {readr} functions. One of those benefits is that printing a data frame to the console, or to our document, will not dump all of the data.
# Print the data frame
brown_passives_df
# A tibble: 15 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 A 892 101196 3684 press reportage
2 B 543 61535 2399 press editorial
3 C 283 40749 1459 press reviews
4 D 351 39029 1372 religion
5 E 853 82010 3286 skills / hobbies
6 F 1034 110363 4387 popular lore
7 G 1460 173017 6537 belles lettres
8 H 837 69446 2012 miscellaneous
9 J 2423 181426 6311 learned
10 K 352 68599 3983 general fiction
11 L 265 57624 3673 detective
12 M 104 14433 873 science fiction
13 N 290 69909 4438 adventure
14 P 290 70476 4187 romance
15 R 146 21757 975 humour
By default, printing a tibble shows all columns, unless the columns are too numerous to display width-wise, and at most the first 10 rows; tibbles with only a few more rows than that, like ours, are printed in full.
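If you ever do want to see every row of a larger tibble, the `print()` method for tibbles accepts an `n` argument; a small sketch:

```{r}
#| eval: false
# Print every row of the tibble, regardless of its size
print(brown_passives_df, n = Inf)
```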
{dplyr} also provides a set of `slice_*()` functions which allow us to display the data in a tabular fashion, with some additional options. There are three `slice_*()` functions we will cover here:

- `slice_head()`: Select the first `n` rows of the data frame.
- `slice_tail()`: Select the last `n` rows of the data frame.
- `slice_sample()`: Select a random sample of `n` rows from the data frame.
For example, the following code block will select the first 5 rows of the data frame:
# Select the first 5 rows
slice_head(brown_passives_df, n = 5)
# A tibble: 5 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 A 892 101196 3684 press reportage
2 B 543 61535 2399 press editorial
3 C 283 40749 1459 press reviews
4 D 351 39029 1372 religion
5 E 853 82010 3286 skills / hobbies
We can also select the last 5 rows of the data frame with the slice_tail()
function:
# Select the last 5 rows
slice_tail(brown_passives_df, n = 5)
# A tibble: 5 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 L 265 57624 3673 detective
2 M 104 14433 873 science fiction
3 N 290 69909 4438 adventure
4 P 290 70476 4187 romance
5 R 146 21757 975 humour
Finally, we can select a random sample of 5 rows from the data frame with the slice_sample()
function:
# Select a random sample of 5 rows
slice_sample(brown_passives_df, n = 5)
# A tibble: 5 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 B 543 61535 2399 press editorial
2 D 351 39029 1372 religion
3 H 837 69446 2012 miscellaneous
4 R 146 21757 975 humour
5 P 290 70476 4187 romance
These functions can be helpful for getting a sense of the dataset in different ways. In combination with the `arrange()` function, we can also sort the data frame by a column or columns and then select the first or last rows.
For example, the following code block will sort the data frame by the `passive` column in ascending order and then select the first 5 rows:
# Sort by the `passive` column and select the first 5 rows
slice_head(arrange(brown_passives_df, passive), n = 5)
# A tibble: 5 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 M 104 14433 873 science fiction
2 R 146 21757 975 humour
3 L 265 57624 3673 detective
4 C 283 40749 1459 press reviews
5 N 290 69909 4438 adventure
If we want to sort in descending order, we can wrap the column name in `desc()`, as in `arrange(desc(passive))`.
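For example, a quick sketch (written in the same nested style as the previous block) that sorts by `passive` in descending order and keeps the first 5 rows:

```{r}
#| eval: false
# Sort by `passive` in descending order and select the first 5 rows
slice_head(arrange(brown_passives_df, desc(passive)), n = 5)
```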
Now, the previous code blocks do what we want, but they are not very readable. Enter the pipe operator. The pipe operator `|>` allows us to chain the output of one function to the input of another function. This allows us to write more readable code.
brown_passives_df |>
  arrange(passive) |>
  slice_head(n = 5)
# A tibble: 5 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 M 104 14433 873 science fiction
2 R 146 21757 975 humour
3 L 265 57624 3673 detective
4 C 283 40749 1459 press reviews
5 N 290 69909 4438 adventure
The result is the same but the code makes more sense. We can read the code from left to right, top to bottom, which is the order in which the functions are executed.
In addition to being more legible, using the pipe with each function on its own line allows us to add comments to each line of code. For example, the following code block is the same as the previous code block, but with comments added.
# Sort by the `passive` column and select the first 5 rows
brown_passives_df |>   # start with the data frame
  arrange(passive) |>  # sort by `passive` in ascending order
  slice_head(n = 5)    # keep the first 5 rows
It is good practice to add comments when writing code, as long as they make the code more readable and easier to understand for others and for your future self! If the comments are too verbose, and only repeat what the code is ‘saying’, then don’t include them.
Subsetting datasets with {dplyr}
Now that we have a sense of the data, we can subset the dataset to create variations of our original data frame. We can subset the data frame by selecting columns and/or rows.
In the R lesson “Packages and Functions”, we saw that base R provides the bracket (`[]`) operator to subset data frames. {dplyr} provides functions to subset data frames which can be more readable and easier to use.
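As a point of comparison (just a sketch; we will use the {dplyr} versions from here on), the same subsetting can be written with brackets or with {dplyr} functions:

```{r}
#| eval: false
# Base R bracket subsetting: columns by name, rows by a logical condition
brown_passives_df[, c("passive", "n_w")]
brown_passives_df[brown_passives_df$passive < 1000, ]

# The {dplyr} equivalents, covered below
select(brown_passives_df, passive, n_w)
filter(brown_passives_df, passive < 1000)
```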
Let’s first look at selecting columns. The `select()` function allows us to select columns by name. For example, the following code block will select the `passive` and `n_w` columns from the data frame:
# Select the `passive` and `n_w` columns
select(brown_passives_df, passive, n_w)
# A tibble: 15 × 2
passive n_w
<dbl> <dbl>
1 892 101196
2 543 61535
3 283 40749
4 351 39029
5 853 82010
6 1034 110363
7 1460 173017
8 837 69446
9 2423 181426
10 352 68599
11 265 57624
12 104 14433
13 290 69909
14 290 70476
15 146 21757
Beyond selecting columns, we can also reorder and rename columns. For example, the following code block will select the `passive` and `n_w` columns, rename the `n_w` column to `num_words`, and reorder the columns so that `num_words` is the first column:
# Select, rename, and reorder columns
brown_passives_df |>
  select(num_words = n_w, passive)
# A tibble: 15 × 2
num_words passive
<dbl> <dbl>
1 101196 892
2 61535 543
3 40749 283
4 39029 351
5 82010 853
6 110363 1034
7 173017 1460
8 69446 837
9 181426 2423
10 68599 352
11 57624 265
12 14433 104
13 69909 290
14 70476 290
15 21757 146
By selecting some columns and not others, we have effectively dropped the columns we did not select. If it is more convenient to name the columns to drop rather than the columns to keep, we can use the `select()` function with the `-` operator. For example, the following code block will drop the `cat` column from the data frame:
# Drop the `cat` column
brown_passives_df |>
  select(-cat)
# A tibble: 15 × 4
passive n_w n_s name
<dbl> <dbl> <dbl> <chr>
1 892 101196 3684 press reportage
2 543 61535 2399 press editorial
3 283 40749 1459 press reviews
4 351 39029 1372 religion
5 853 82010 3286 skills / hobbies
6 1034 110363 4387 popular lore
7 1460 173017 6537 belles lettres
8 837 69446 2012 miscellaneous
9 2423 181426 6311 learned
10 352 68599 3983 general fiction
11 265 57624 3673 detective
12 104 14433 873 science fiction
13 290 69909 4438 adventure
14 290 70476 4187 romance
15 146 21757 975 humour
Let’s now turn our attention to subsetting rows. The `filter()` function allows us to select rows by a logical condition. For example, the following code block will select rows where the values of the `passive` column are less than (`<`) 1,000:
# Select rows where `passive` is less than 1,000
brown_passives_df |>
  filter(passive < 1000)
# A tibble: 12 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 A 892 101196 3684 press reportage
2 B 543 61535 2399 press editorial
3 C 283 40749 1459 press reviews
4 D 351 39029 1372 religion
5 E 853 82010 3286 skills / hobbies
6 H 837 69446 2012 miscellaneous
7 K 352 68599 3983 general fiction
8 L 265 57624 3673 detective
9 M 104 14433 873 science fiction
10 N 290 69909 4438 adventure
11 P 290 70476 4187 romance
12 R 146 21757 975 humour
We can also use the `filter()` function to select rows by a character string. For example, the following code block will select rows where the values of the `name` column are equal to `"religion"`:
# Select rows where `name` is equal to "religion"
brown_passives_df |>
  filter(name == "religion")
# A tibble: 1 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 D 351 39029 1372 religion
The inequality operator `!=` can be used with character strings as well. To include multiple values, we can use the `%in%` operator, passing a vector of values to the `filter()` function. For example, the following code block will select rows where the values of the `name` column are equal to `"religion"`, `"learned"`, or `"detective"`:
# Select multiple values
brown_passives_df |>
  filter(name %in% c("religion", "learned", "detective"))
# A tibble: 3 × 5
cat passive n_w n_s name
<chr> <dbl> <dbl> <dbl> <chr>
1 D 351 39029 1372 religion
2 J 2423 181426 6311 learned
3 L 265 57624 3673 detective
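For completeness, here is a small sketch of the inequality operator `!=` mentioned above, which keeps every row except those matching the given value:

```{r}
#| eval: false
# Select rows where `name` is NOT equal to "religion"
brown_passives_df |>
  filter(name != "religion")
```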
Writing datasets to a file with {readr}
Finally, we can write data, including data frames, to a file with the `write_*()` functions from {readr}. The `write_*()` functions include:

- `write_csv()`: Write a data frame to a CSV file.
- `write_tsv()`: Write a data frame to a TSV file.
- `write_delim()`: Write a data frame to a delimited file with the specified delimiter (`|`, `;`, etc.); see the short sketch after this list.
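As a quick sketch (the output file names here are hypothetical), each writer takes the data frame as `x` and a path as `file`, and `write_delim()` also takes the delimiter:

```{r}
#| eval: false
# Hypothetical output files, shown only to illustrate the writer functions
write_csv(x = brown_passives_df, file = "passives.csv")
write_tsv(x = brown_passives_df, file = "passives.tsv")
write_delim(x = brown_passives_df, file = "passives.txt", delim = "|")
```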
To create a distinct data frame from the one we read into R, let’s subset our `brown_passives_df` data frame by columns and rows to create a new data frame that contains only the `passive`, `n_w`, and `name` columns and only the rows where the values of the `passive` column are greater than (`>`) 1,000, and assign it to `brown_passives_subset_df`.
# Subset the data frame
brown_passives_subset_df <-
  brown_passives_df |>
  select(passive, n_w, name) |>
  filter(passive > 1000)
Now the following code block will write the `brown_passives_subset_df` data frame to a CSV file at the specified file path:
# Write the data frame to a CSV file
write_csv(
  x = brown_passives_subset_df,
  file = "../data/derived/brown_passives_subset.csv"
)
Given the example directory structure we saw earlier, our new file appears in the `data/derived/` directory.
project/
├── data/
│ ├── original/
│ │ └── brown_passives_do.csv
│ └── derived/
│ ├── brown_passives_curated.csv
│ ├── brown_passives_curated_dd.csv
│ └── brown_passives_subset.csv
└── code/
└── reading-inspecting-writing.qmd
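One simple way to confirm the write succeeded (just a sanity check, not a required step) is to test that the file exists or read it back in:

```{r}
#| eval: false
# Confirm the new file exists on disk
file.exists("../data/derived/brown_passives_subset.csv")

# Or read it back in to inspect it
read_csv(file = "../data/derived/brown_passives_subset.csv")
```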
There is much more to learn about reading, inspecting, and writing datasets in R. We will introduce more functions and techniques in the coming lessons. For now, we have learned how to read, inspect, and write datasets using R functions and Quarto code blocks!
Check your understanding
- {readr} provides functions to read rectangular data into R.
- The ________ option in a code block determines whether the code is displayed in the output document.
- {dplyr} provides functions to create data dictionaries.
- ________ is used to read tab-separated values (TSV) files.
- Which function in {dplyr} is used to select columns by name?
- The R pipe operator `|>` allows us to chain the output of one function to the input of another function.
Lab preparation
In Lab 2 you will have the opportunity to apply the skills you learned in this Recipe to create a Quarto document that reads, inspects, and writes data.
In addition to the knowledge and skills you have developed in Labs 0 and 1, to complete Lab 2, you will need to be able to:
- Create code blocks in a Quarto document
- Understand the purpose of the `label`, `echo`, `message`, and `include` options in a code block
- Load packages into an R session with `library()`
- Understand how to read and create relative file paths
- Read datasets into R with the `read_csv()` function
- Inspect data frames with {dplyr} functions such as `glimpse()`, `slice_head()`, `slice_tail()`, `slice_sample()`, and `arrange()`
- Use the `|>` pipe operator to chain functions together
- Subset data frames with {dplyr} functions such as `select()` and `filter()`
- Write data frames to a file with the `write_csv()` function
References
Footnotes
Code blocks can be written in other programming languages as well, such as Python, Bash, etc.