05. Creating reproducible examples

Helping you help yourself

guides

In this guide, we will explore how to create reproducible examples using {reprex}. Reproducible examples are essential for effective communication and collaboration among data scientists and statisticians.

Outcomes

  • Understand the importance of reproducible examples
  • Create a reproducible example using {reprex} and other tools
  • Share your reproducible example with others

Introduction

What is a reproducible example?

Reproducible examples are crucial for effectively communicating problems, solutions, and ideas in the world of data science. In most cases, a simple description of an issue or concept is not enough to convey the full context of the problem. A reproducible example provides a minimal, self-contained piece of code (and other relevant resources) that demonstrates a specific issue or concept. It includes:

  • A brief description of the problem or question and the expected output
  • The necessary (and only the necessary) data to reproduce the issue
  • The R code used to generate the output
  • The actual output, including any error messages or warnings

Why are reproducible examples important?

You may very well understand the problem you are facing, but others likely will not. By providing sufficient context to understand the problem, you can increase the likelihood of receiving a helpful response. Another reason to create reproducible examples is to help you think through the problem more clearly. By creating a minimal example, you may discover the source of the problem yourself!

Create a reproducible example

The trickiest part of asking a question about R code is often not the question itself, but providing this information in a self-contained, reproducible example. Luckily, there are a few R packages that provide tools to help you create reproducible examples. {reprex}(Bryan et al. 2024), {datapasta}(McBain et al. 2020), and creative uses of {knitr} and base R functions can help you create reproducible examples.

Table 1: Package options for creating reproducible examples
Package Description Use case
{reprex} Creates reproducible examples General use
{datapasta} Copy and paste data frames Data manipulation
{knitr} Swiss Army knife of rendering Extract code from literate programming documents (i.e. Quarto)
Base R functions dput(), dump(), sessionInfo() Represent data as text and report environment settings

In this guide, we will focus on using {reprex} to create reproducible examples. {reprex} is a powerful tool that captures R code, input data, and output in a formatted output that can be easily shared with others. Let’s dive in!

Building blocks

Formatting code and code output

Let’s run through the building blocks of producing a reproducible example. Let’s start with a simple example. We’ll start with the following R code:

# Load packages
library(stringr)
# Sentences to tokenize
x <- c("This is a sentence.", "This is another sentence.")
# Tokenize the sentences
stringr::str_split(x, " ")
[[1]]
[1] "This"      "is"        "a"         "sentence."

[[2]]
[1] "This"      "is"        "another"   "sentence."

First, we need to describe the problem or question the code attempts to address. In this case, we are trying to tokenize the sentences in the vector x. We should also include the expected output as part of the description. Here, the code functions without an error, but it does not seem to produce the desired output. On the one hand, punctuation is not removed and the words are not lowercased. On the other hand, the output is returned in a data structure we may not be familiar with –we’d like to see a data frame with one word per row. Something like this:

So our description could be:

I am trying to tokenize the sentences in the vector x. The expected output is a data frame with one word per row, where punctuation is removed and words are lowercased. The output should look like something like this:

token
this
is
a
sentence

Next, we need to include the necessary R code to reproduce the issue. This is where the {reprex} package comes in handy. We can use the reprex() function to create a reproducible example from the code. The reprex() function will capture the code, input data, and output in a formatted output that can be easily shared with others.

To capture our example code, we first need to load {reprex} in our R session:

library(reprex)

Next, we need to select and copy the code we want to include in the reproducible example. We can then call the reprex() function to create the example:

reprex()

reprex() will find the code we copied to the clipboard, run the code, and will generate a formatted output that includes the code, input data, and results. The output will be displayed in either a browser or preview pane and copied to the clipboard for easy sharing.

Here is the output of the code from the clipboard:

```r
# Load packages
  library(stringr)
    # Sentences to tokenize
      x <- c("This is a sentence.", "This is another sentence.")
        # Tokenize the sentences
          stringr::str_split(x, " ")
#> [[1]]
#> [1] "This"      "is"        "a"         "sentence."
#>
#> [[2]]
#> [1] "This"      "is"        "another"   "sentence."

<sup>Created on 2024-06-23 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>
```

The default output of reprex() is a markdown document that can be shared on various platforms such as GitHub, Stack Overflow, or any other markdown-enabled site. The formatted output makes it easy for others to understand the problem and provide a solution. If you plan to share the output on a platform that does not support markdown, you can use the venue argument to specify a different output format. For example, to can get the reprex formatted as:

  • r for plain text
  • rtf for rich text format
  • html for HTML

So for example, to create a reprex formatted as plain text, you can use:

reprex(venue = "r")

This is a handy output if you want to share a code snippet in an email or a chat message!

Including data

In the previous example, our ‘data’ was the vector x. In more complex examples, you may need to include data frames or other data structures. Let’s say we are working on some code that aims to read some data from a file which has two columns doc_id and text, and calculate the number of words per document. The code we’ve written so far is giving us an error, and we need help from the community to debug it.

The code we have so far is:

# Load packages
library(tidyverse)
library(tidytext)

# Read the text file
data <- read_csv("data/text.csv")

# Tokenize the text
tokens_tbl <-
  data |>
  unnest_tokens(word, text) |>
  count(word) |>
  group_by(doc_id) |>
  summarize(doc_words = n())

This code produces the following error:

Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `doc_id` is not found.
Run `rlang::last_trace()` to see where the error occurred.

In this case, we need to include a relevant dataset that can be used to reproduce the error. Now, the first thing we should do is to consider if there are any built-in datasets that can be used to reproduce the error. It is always easier use a dataset that is comes with R, as it is readily available to everyone. If there is no (easily accessible) built-in dataset that can be used, we can add our own data to the reprex. Ideally, we should include the smallest amount of data that is necessary to reproduce the error.

To get a better understanding how we might proceed, let’s take a quick look at the data we are working with:

data
# A tibble: 10 × 2
   doc_id text                                                                
    <dbl> <chr>                                                               
 1      1 The Sapir-Whorf hypothesis suggests language influences thought.    
 2      2 Cognitive dissonance occurs when beliefs contradict behaviors.      
 3      3 Plato's allegory of the cave explores perception vs. reality.       
 4      4 Object-oriented programming focuses on creating reusable code.      
 5      5 Chomsky's universal grammar theory proposes innate language ability.
 6      6 The bystander effect explains reduced helping in crowds.            
 7      7 Descartes' 'I think, therefore I am' establishes existence.         
 8      8 Machine learning algorithms improve with more data.                 
 9      9 Phonemes are the smallest units of sound in language.               
10     10 The halting problem proves some computations are undecidable.       

From the output, we can see that the data has two columns: doc_id and text. We can create a small data frame with this structure to include in the reprex. We can use the tribble() function from the {tibble} package to create the data frame:

# Create a small data frame
data <- tibble::tribble(
  ~doc_id, ~text,
  1, "This is a sentence.",
  2, "This is another sentence."
)

data
# A tibble: 2 × 2
  doc_id text                     
   <dbl> <chr>                    
1      1 This is a sentence.      
2      2 This is another sentence.

Now that we have the code to create some sample data, we can replace the call to the read_csv() function with the code to create the data frame. Copy the new code to the clipboard and run reprex() again to create a new reproducible example:

# Load packages
library(tidyverse)
library(tidytext)

# Create a small data frame
data <- tibble::tribble(
  ~doc_id, ~text,
  1, "This is a sentence.",
  2, "This is another sentence."
)

# Tokenize the text
tokens_tbl <-
  data |>
  unnest_tokens(word, text) |>
  count(word) |>
  group_by(doc_id) |>
  summarize(doc_words = n())

We the default setting for markdown output, the reprex will look like this:


```r
# Load packages
library(tidyverse)
library(tidytext)

# Create a small data frame
data <- tibble::tribble(
  ~doc_id, ~text,
  1, "This is a sentence.",
  2, "This is another sentence."
)

# Tokenize the text
tokens_tbl <-
  data |>
  unnest_tokens(word, text) |>
  count(word) |>
  group_by(doc_id) |>
  summarize(doc_words = n())
#> Error in `group_by()`:
#> ! Must group by variables found in `.data`.
#> ✖ Column `doc_id` is not found.
```

<sup>Created on 2024-06-23 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>

Including session information

Another piece of information that can prove key to solving a problem is the R session information. This information describes some important details about your particular R environment. If others are not able to reproduce the error, the session information can help them understand the context in which the error occurred. It’s not always the case that the code itself is the problem, necessarily, but rather the mismatch between the code and the environment in which it is run.

Conviently, the reprex() function can also include the session information in the output. The argument session_info = TRUE will include the session information in the output. This can be a lot of information, but don’t worry, it is common practice to include this information in a reprex.

Here is an example of how to include the session information in the reprex:

reprex(session_info = TRUE)

Now, the reprex will include the session information at the end of the output. As an example, I’ll include the session information in a (formatted) reprex:

# Load packages
library(tidyverse)
library(tidytext)

# Create a small data frame
data <- tibble::tribble(
  ~doc_id, ~text,
  1, "This is a sentence.",
  2, "This is another sentence."
)

# Tokenize the text
tokens_tbl <-
  data |>
  unnest_tokens(word, text) |>
  count(word) |>
  group_by(doc_id) |>
  summarize(doc_words = n())
#> Error in `group_by()`:
#> ! Must group by variables found in `.data`.
#> ✖ Column `doc_id` is not found.

Created on 2024-06-23 with reprex v2.1.0

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.1 (2024-06-14)
#>  os       macOS Sonoma 14.5
#>  system   aarch64, darwin23.4.0
#>  ui       unknown
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2024-06-23
#>  pandoc   3.2 @ /opt/homebrew/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.2   2023-12-11 [1] CRAN (R 4.4.0)
#>  colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.4.0)
#>  digest        0.6.35  2024-03-11 [1] CRAN (R 4.4.0)
#>  dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
#>  evaluate      0.24.0  2024-06-10 [1] CRAN (R 4.4.0)
#>  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.4.0)
#>  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
#>  forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.4.0)
#>  fs            1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
#>  ggplot2     * 3.5.1   2024-04-23 [1] CRAN (R 4.4.0)
#>  glue          1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
#>  gtable        0.3.5   2024-04-22 [1] CRAN (R 4.4.0)
#>  hms           1.1.3   2023-03-21 [1] CRAN (R 4.4.0)
#>  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#>  janeaustenr   1.0.0   2022-08-26 [1] CRAN (R 4.4.0)
#>  knitr         1.47    2024-05-29 [1] CRAN (R 4.4.0)
#>  lattice       0.22-6  2024-03-20 [3] CRAN (R 4.4.1)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
#>  lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.4.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
#>  Matrix        1.7-0   2024-04-26 [3] CRAN (R 4.4.1)
#>  munsell       0.5.1   2024-04-01 [1] CRAN (R 4.4.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.4.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
#>  purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.4.0)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.4.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.4.0)
#>  R.oo          1.26.0  2024-01-24 [1] CRAN (R 4.4.0)
#>  R.utils       2.12.3  2023-11-18 [1] CRAN (R 4.4.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
#>  Rcpp          1.0.12  2024-01-09 [1] CRAN (R 4.4.0)
#>  readr       * 2.1.5   2024-01-10 [1] CRAN (R 4.4.0)
#>  reprex        2.1.0   2024-01-11 [1] CRAN (R 4.4.0)
#>  rlang         1.1.4   2024-06-04 [1] CRAN (R 4.4.0)
#>  rmarkdown     2.27    2024-05-17 [1] CRAN (R 4.4.0)
#>  scales        1.3.0   2023-11-28 [1] CRAN (R 4.4.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
#>  SnowballC     0.7.1   2023-04-25 [1] CRAN (R 4.4.0)
#>  stringi       1.8.4   2024-05-06 [1] CRAN (R 4.4.0)
#>  stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.4.0)
#>  styler        1.10.3  2024-04-07 [1] CRAN (R 4.4.0)
#>  tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
#>  tidyr       * 1.3.1   2024-01-24 [1] CRAN (R 4.4.0)
#>  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
#>  tidytext    * 0.4.2   2024-04-10 [1] CRAN (R 4.4.0)
#>  tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.4.0)
#>  timechange    0.3.0   2024-01-18 [1] CRAN (R 4.4.0)
#>  tokenizers    0.3.0   2022-12-22 [1] CRAN (R 4.4.0)
#>  tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.4.0)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
#>  withr         3.0.0   2024-01-16 [1] CRAN (R 4.4.0)
#>  xfun          0.44    2024-05-15 [1] CRAN (R 4.4.0)
#>  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.4.0)
#>
#>  [1] /Users/francojc/R/Library
#>  [2] /opt/homebrew/lib/R/4.4/site-library
#>  [3] /opt/homebrew/Cellar/r/4.4.1/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────

Conclusion

In this guide, we have discussed the importance of reproducible examples and demonstrated how to create them using {reprex} in R. By creating clear and concise reprexes, you can effectively communicate problems, solutions, and ideas with your peers and collaborators. Give {reprex} a try and see how it can improve your workflow!

References

Bryan, Jennifer, Jim Hester, David Robinson, Hadley Wickham, and Christophe Dervieux. 2024. Reprex: Prepare Reproducible Example Code via the Clipboard. https://reprex.tidyverse.org.
McBain, Miles, Jonathan Carroll, Sharla Gelfand, Suthira Owlarn, and Garrick Aden-Buie. 2020. Datapasta: R Tools for Data Copy-Pasta. https://github.com/milesmcbain/datapasta.