In this guide, we will explore how to create reproducible examples using {reprex}. Reproducible examples are essential for effective communication and collaboration among data scientists and statisticians.
Outcomes
Understand the importance of reproducible examples
Create a reproducible example using {reprex} and other tools
Share your reproducible example with others
Introduction
What is a reproducible example?
Reproducible examples are crucial for effectively communicating problems, solutions, and ideas in the world of data science. In most cases, a simple description of an issue or concept is not enough to convey the full context of the problem. A reproducible example provides a minimal, self-contained piece of code (and other relevant resources) that demonstrates a specific issue or concept. It includes:
A brief description of the problem or question and the expected output
The necessary (and only the necessary) data to reproduce the issue
The R code used to generate the output
The actual output, including any error messages or warnings
Why are reproducible examples important?
You may very well understand the problem you are facing, but others likely will not. By providing sufficient context to understand the problem, you can increase the likelihood of receiving a helpful response. Another reason to create reproducible examples is to help you think through the problem more clearly. By creating a minimal example, you may discover the source of the problem yourself!
Create a reproducible example
The trickiest part of asking a question about R code is often not the question itself, but providing this information in a self-contained, reproducible example. Luckily, there are a few R packages that provide tools to help you create reproducible examples. {reprex} (Bryan et al. 2024), {datapasta} (McBain et al. 2020), and creative uses of {knitr} and base R functions can help you create reproducible examples.
Table 1: Package options for creating reproducible examples

| Package | Description | Use case |
|---|---|---|
| {reprex} | Creates reproducible examples | General use |
| {datapasta} | Copy and paste data frames | Data manipulation |
| {knitr} | Swiss Army knife of rendering | Extract code from literate programming documents (e.g., Quarto) |
| Base R functions | dput(), dump(), sessionInfo() | Represent data as text and report environment settings |
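As a rough illustration of the base R route, dput() prints R code that recreates an object, so a small data structure can be pasted straight into a question. The sketch below uses a made-up two-row data frame (the name docs and its values are illustrative only) and checks that the text form rebuilds the same object.

```r
# A small data frame standing in for real data (illustrative values)
docs <- data.frame(
  doc_id = c(1, 2),
  text = c("This is a sentence.", "This is another sentence."),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object
dput(docs)

# Round trip: capture that text, then parse and evaluate it
docs_text <- paste(capture.output(dput(docs)), collapse = "\n")
docs_copy <- eval(parse(text = docs_text))

# TRUE when the rebuilt object matches the original
identical(docs, docs_copy)
```

Anyone who pastes the dput() output into their own session gets the same object without needing access to the original file.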
In this guide, we will focus on using {reprex} to create reproducible examples. {reprex} is a powerful tool that captures R code, input data, and output in a formatted output that can be easily shared with others. Let’s dive in!
Building blocks
Formatting code and code output
Let’s run through the building blocks of producing a reproducible example, starting with the following R code:
# Load packages
library(stringr)

# Sentences to tokenize
x <- c("This is a sentence.", "This is another sentence.")

# Tokenize the sentences
stringr::str_split(x, " ")
First, we need to describe the problem or question the code attempts to address. In this case, we are trying to tokenize the sentences in the vector x. We should also include the expected output as part of the description. Here, the code runs without an error, but it does not produce the desired output. On the one hand, punctuation is not removed and the words are not lowercased. On the other hand, the output is returned in a data structure we may not be familiar with, whereas we’d like to see a data frame with one word per row.
So our description could be:
I am trying to tokenize the sentences in the vector x. The expected output is a data frame with one word per row, where punctuation is removed and words are lowercased. The output should look something like this:
token
this
is
a
sentence
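The expected output can also be written as code, which lets readers compare their answer against it directly. A minimal sketch in base R, using only the values from the table in the description (the object name expected is illustrative):

```r
# Expected result for the first sentence, written out by hand
expected <- data.frame(
  token = c("this", "is", "a", "sentence")
)

# Print it so the target structure is visible in the question
expected
```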
Next, we need to include the necessary R code to reproduce the issue. This is where the {reprex} package comes in handy. We can use the reprex() function to create a reproducible example from the code. The reprex() function will capture the code, input data, and output in a formatted output that can be easily shared with others.
To capture our example code, we first need to load {reprex} in our R session:
library(reprex)
Next, we need to select and copy the code we want to include in the reproducible example. We can then call the reprex() function to create the example:
reprex()
reprex() will find the code we copied to the clipboard, run it, and generate a formatted output that includes the code, input data, and results. The output will be displayed in either a browser or preview pane and copied to the clipboard for easy sharing.
Here is the output of the code from the clipboard:
```r
# Load packages
library(stringr)
# Sentences to tokenize
x <- c("This is a sentence.", "This is another sentence.")
# Tokenize the sentences
stringr::str_split(x, " ")
#> [[1]]
#> [1] "This" "is" "a" "sentence."
#>
#> [[2]]
#> [1] "This" "is" "another" "sentence."
```

<sup>Created on 2024-06-23 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>
The default output of reprex() is a markdown document that can be shared on various platforms such as GitHub, Stack Overflow, or any other markdown-enabled site. The formatted output makes it easy for others to understand the problem and provide a solution. If you plan to share the output on a platform that does not support markdown, you can use the venue argument to specify a different output format. For example, you can get the reprex formatted as:
r for plain text
rtf for rich text format
html for HTML
So for example, to create a reprex formatted as plain text, you can use:
reprex(venue = "r")
This is a handy output if you want to share a code snippet in an email or a chat message!
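The clipboard is not the only way to hand code to reprex(). As a sketch, the code can also be passed as a character vector through the input argument, with html_preview = FALSE to skip the preview pane. Both arguments come from the {reprex} documentation rather than from this guide, and the variable name code is illustrative. The call is guarded so the snippet only renders when {reprex} and pandoc are available.

```r
# Code for the reprex, one element per line (same example as above)
code <- c(
  'library(stringr)',
  'x <- c("This is a sentence.", "This is another sentence.")',
  'stringr::str_split(x, " ")'
)

# Only render when {reprex} and pandoc are available on this machine
if (requireNamespace("reprex", quietly = TRUE) && rmarkdown::pandoc_available()) {
  # venue = "r" asks for a plain R script with the output as comments
  out <- reprex::reprex(input = code, venue = "r", html_preview = FALSE)
  writeLines(out)
}
```

Passing the code explicitly can be convenient in scripts or on systems where clipboard access is unreliable.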
Including data
In the previous example, our ‘data’ was the vector x. In more complex examples, you may need to include data frames or other data structures. Let’s say we are working on some code that aims to read some data from a file which has two columns doc_id and text, and calculate the number of words per document. The code we’ve written so far is giving us an error, and we need help from the community to debug it.
The code we have so far is:
# Load packages
library(tidyverse)
library(tidytext)

# Read the text file
data <- read_csv("data/text.csv")

# Tokenize the text
tokens_tbl <- data |>
  unnest_tokens(word, text) |>
  count(word) |>
  group_by(doc_id) |>
  summarize(doc_words = n())
This code produces the following error:
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `doc_id` is not found.
Run `rlang::last_trace()` to see where the error occurred.
In this case, we need to include a relevant dataset that can be used to reproduce the error. The first thing we should do is consider whether any built-in datasets can be used to reproduce the error. It is always easier to use a dataset that comes with R, as it is readily available to everyone. If there is no (easily accessible) built-in dataset that can be used, we can add our own data to the reprex. Ideally, we should include the smallest amount of data that is necessary to reproduce the error.
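To illustrate the idea of trimming a built-in dataset down to the minimum, here is a small sketch using mtcars, which ships with R. The dataset is only a stand-in and has nothing to do with the text data in this guide; the object name small is illustrative.

```r
# mtcars comes with R, so everyone can load it without extra files
# Keep only the rows and columns needed to show the problem
small <- mtcars[1:3, c("mpg", "cyl")]

# Print the trimmed data and confirm its size
small
dim(small)
```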
To get a better understanding of how we might proceed, let’s take a quick look at the data we are working with:
data
# A tibble: 10 × 2
doc_id text
<dbl> <chr>
1 1 The Sapir-Whorf hypothesis suggests language influences thought.
2 2 Cognitive dissonance occurs when beliefs contradict behaviors.
3 3 Plato's allegory of the cave explores perception vs. reality.
4 4 Object-oriented programming focuses on creating reusable code.
5 5 Chomsky's universal grammar theory proposes innate language ability.
6 6 The bystander effect explains reduced helping in crowds.
7 7 Descartes' 'I think, therefore I am' establishes existence.
8 8 Machine learning algorithms improve with more data.
9 9 Phonemes are the smallest units of sound in language.
10 10 The halting problem proves some computations are undecidable.
From the output, we can see that the data has two columns: doc_id and text. We can create a small data frame with this structure to include in the reprex. We can use the tribble() function from the {tibble} package to create the data frame:
# Create a small data frame
data <- tibble::tribble(
  ~doc_id, ~text,
  1, "This is a sentence.",
  2, "This is another sentence."
)

data
# A tibble: 2 × 2
doc_id text
<dbl> <chr>
1 1 This is a sentence.
2 2 This is another sentence.
Now that we have the code to create some sample data, we can replace the call to the read_csv() function with the code to create the data frame. Copy the new code to the clipboard and run reprex() again to create a new reproducible example:
# Load packages
library(tidyverse)
library(tidytext)

# Create a small data frame
data <- tibble::tribble(
  ~doc_id, ~text,
  1, "This is a sentence.",
  2, "This is another sentence."
)

# Tokenize the text
tokens_tbl <- data |>
  unnest_tokens(word, text) |>
  count(word) |>
  group_by(doc_id) |>
  summarize(doc_words = n())
With the default setting for markdown output, the reprex will look like this:
```r
# Load packages
library(tidyverse)
library(tidytext)
# Create a small data frame
data <- tibble::tribble(
  ~doc_id, ~text,
  1, "This is a sentence.",
  2, "This is another sentence."
)
# Tokenize the text
tokens_tbl <- data |>
  unnest_tokens(word, text) |>
  count(word) |>
  group_by(doc_id) |>
  summarize(doc_words = n())
#> Error in `group_by()`:
#> ! Must group by variables found in `.data`.
#> ✖ Column `doc_id` is not found.
```

<sup>Created on 2024-06-23 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>
Including session information
Another piece of information that can prove key to solving a problem is the R session information. This information describes some important details about your particular R environment. If others are not able to reproduce the error, the session information can help them understand the context in which the error occurred. It’s not always the case that the code itself is the problem, necessarily, but rather the mismatch between the code and the environment in which it is run.
Conveniently, the reprex() function can also include the session information in the output: set the argument session_info = TRUE. This can be a lot of information, but don’t worry, it is common practice to include it in a reprex.
Here is an example of how to include the session information in the reprex:
reprex(session_info = TRUE)
Now, the reprex will include the session information at the end of the output. As an example, I’ll include the session information in a (formatted) reprex:
# Load packages
library(tidyverse)
library(tidytext)

# Create a small data frame
data <- tibble::tribble(
  ~doc_id, ~text,
  1, "This is a sentence.",
  2, "This is another sentence."
)

# Tokenize the text
tokens_tbl <- data |>
  unnest_tokens(word, text) |>
  count(word) |>
  group_by(doc_id) |>
  summarize(doc_words = n())
#> Error in `group_by()`:
#> ! Must group by variables found in `.data`.
#> ✖ Column `doc_id` is not found.
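To get a feel for the kind of detail session information carries, independent of {reprex}, the base R function sessionInfo() can be inspected directly. This is only a sketch of a few of its fields; the values depend entirely on the machine it runs on, so no output is shown.

```r
# Collect details about the current R session
si <- sessionInfo()

# R version and platform the code is running on
si$R.version$version.string
si$platform

# Packages attached in this session beyond the base set (NULL if none)
names(si$otherPkgs)
```

Differences in any of these fields between two machines can be a first hint as to why code that fails in one place runs fine in another.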
In this guide, we have discussed the importance of reproducible examples and demonstrated how to create them using {reprex} in R. By creating clear and concise reprexes, you can effectively communicate problems, solutions, and ideas with your peers and collaborators. Give {reprex} a try and see how it can improve your workflow!
References
Bryan, Jennifer, Jim Hester, David Robinson, Hadley Wickham, and Christophe Dervieux. 2024. Reprex: Prepare Reproducible Example Code via the Clipboard. https://reprex.tidyverse.org.
McBain, Miles, Jonathan Carroll, Sharla Gelfand, Suthira Owlarn, and Garrick Aden-Buie. 2020. Datapasta: R Tools for Data Copy-Pasta. https://github.com/milesmcbain/datapasta.