attribute | description |
---|---|
Resource name | Europarl Parallel Corpus |
Data source | https://www.statmt.org/europarl/ |
Data sampling frame | Spanish transcripts from the European Parliament proceedings |
Data collection date(s) | 1996–2011 |
Data format | TXT files with ‘.es’ for source (Spanish) and ‘.en’ for target (English) files. |
Data schema | Line-by-line unannotated parallel text |
License | See: https://www.europarl.europa.eu/legal-notice/en/ |
Attribution | Please cite the paper: Koehn, P. 2005. ‘Europarl: A Parallel Corpus for Statistical Machine Translation.’ MT Summit X, 12–16. |
6 Curate
In this chapter, we will now look at the next step in a text analysis project: data curation. That is, the process of converting the original data we acquire to a tidy dataset. Acquired data can come in a wide variety of formats. These formats tend to signal the richness of the metadata that is included in the file content. We will consider three general types of content formats: (1) unstructured data, (2) structured data, and (3) semi-structured data. Regardless of the file type and the structure of the data, it will be necessary to consider how to curate a dataset such that the structure reflects the basic unit of analysis that we wish to investigate. The resulting dataset will form the base from which we will work to further transform the dataset such that it aligns with the unit(s) of observation required for the analysis method that we will implement. Once the dataset is curated, we will create a data dictionary that describes the dataset and the variables that are included in the dataset for transparency and reproducibility.
6.1 Unstructured
The bulk of textual data is of the unstructured variety. Unstructured data is data that has not been organized to make the information it contains machine-readable. Remember that text in itself is not information. Only when given explicit context in the form of metadata does text become informative. Metadata can be linguistic or non-linguistic in nature. For unstructured data, then, there is little to no metadata directly associated with the data.
Reading data
Some of the common file formats which contain unstructured data include TXT, PDF, and DOCX. Although these formats are unstructured, they are not the same. Reading these files into R requires different techniques and tools.
There are many ways to read TXT files into R and many packages that can be used to do so. For example, using {readr}, we can choose to read an entire file into a single character string with read_file(), or read the file by lines with read_lines(), in which case each line becomes an element of a character vector.
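For instance, a minimal sketch with a hypothetical file at data/original/sample.txt:

# Load package
library(readr)

# Read the entire file into a single character string
sample_chr <- read_file("data/original/sample.txt")

# Read the file by lines into a character vector, one element per line
sample_lines_chr <- read_lines("data/original/sample.txt")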
Less commonly used in prepared data resources, PDF and DOCX files are more complex than TXT files, as they contain formatting and embedded document metadata. These attributes, however, are primarily for visual presentation, not machine-readability. Needless to say, we need an alternate strategy to extract the text content from these files and, potentially, some of the metadata. For example, using {readtext} (Benoit & Obeng, 2024), we can read the text content of PDF and DOCX files into a data frame with readtext(), where each document's text is stored as a character string in a text column.
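As a sketch, assuming a hypothetical PDF file at data/original/report.pdf:

# Load package
library(readtext)

# Read the text content (and basic document metadata) from a PDF file
report_rt <- readtext("data/original/report.pdf")

# The result is a data frame with doc_id and text columns
report_rt$text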
Whether in TXT, PDF, or DOCX format, the resulting data structure will require further processing to convert the data into a tidy dataset.
Orientation
As an example of curating an unstructured source of corpus data, let’s take a look at the Europarl Parallel Corpus (Koehn, 2005). This corpus contains parallel texts (source and translated documents) from the European Parliament proceedings between 1996 and 2011 in some 21 European languages.
Let’s assume we selected this corpus because we are interested in researching Spanish to English translations. After consulting the corpus website, downloading the archive file, and inspecting the unarchived structure, we have the file structure seen in Snippet 6.1.
Snippet 6.1 Project directory structure for the Europarl Parallel Corpus
project/
├── process/
│   ├── 1-acquire-data.qmd
│   ├── 2-curate-data.qmd
│   └── ...
├── data/
│   ├── analysis/
│   ├── derived/
│   └── original/
│       ├── europarl_do.csv
│       └── europarl/
│           ├── europarl-v7.es-en.en
│           └── europarl-v7.es-en.es
├── reports/
├── DESCRIPTION
├── Makefile
└── README
The europarl_do.csv file contains the data origin information documented as part of the acquisition process. The contents are seen in Table 6.1.
Now let’s get familiar with the corpus directory structure and the files. In Snippet 6.1, we see that there are two corpus files, europarl-v7.es-en.es and europarl-v7.es-en.en, that contain the source and target language texts, respectively. The file names indicate that the files contain Spanish-English parallel texts. The .es and .en extensions indicate the language of the text.
Looking at the beginning of the .es and .en files, in Snippet 6.2 and Snippet 6.3, we see that the files contain a series of lines in either the source or target language.
Snippet 6.2 europarl-v7.es-en.es file
Reanudación del período de sesiones
Declaro reanudado el período de sesiones del Parlamento Europeo, interrumpido el viernes 17 de diciembre pasado, y reitero a Sus Señorías mi deseo de que hayan tenido unas buenas vacaciones.
Como todos han podido comprobar, el gran "efecto del año 2000" no se ha producido. En cambio, los ciudadanos de varios de nuestros países han sido víctimas de catástrofes naturales verdaderamente terribles.
Sus Señorías han solicitado un debate sobre el tema para los próximos días, en el curso de este período de sesiones. A la espera de que se produzca, de acuerdo con muchos colegas que me lo han pedido, pido que hagamos un minuto de silencio en memoria de todas las víctimas de las tormentas, en los distintos países de la Unión Europea afectados.
We can clearly appreciate that the data is unstructured. That is, there is no explicit metadata associated with the data; the data is just a series of character strings separated by lines. The only information that we can surmise from the structure of the data is that the texts are line-aligned and that the data in each file corresponds to the source and target languages, respectively.
Snippet 6.3 europarl-v7.es-en.en file
Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
You have requested a debate on this subject in the course of the next few days, during this part-session. In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.
Now, before embarking on a data curation process, it is advisable to define the structure of the data that we want to create. I call this the “idealized structure” of the data. For a curated dataset, we want to reflect the contents of the original data, yet in a tidy format, to maintain the integrity of, and connection with, the original data.
Given what we know about the data, we can define the idealized structure of the data as seen in Table 6.2.
variable | name | type | description |
---|---|---|---|
type | Document type | character | Contains the type of document, either ‘Source’ or ‘Target’ |
lines | Lines | character | Contains the text of each line in the document |
Our task now is to develop code that will read the original data and render the idealized structure as a curated dataset for each corpus file. We will then write the datasets to the data/derived/ directory. The code we develop will be added to the 2-curate-data.qmd file. And finally, the datasets will be documented with a data dictionary file.
Tidy the data
To create the idealized dataset structure in Table 6.2, let’s start by reading the files into R by lines. As the files are aligned by lines, we will use the read_lines() function to read the files into character vectors.
Example 6.1
# Load package
library(readr)
# Read Europarl files .es and .en
europarl_es_chr <-
read_lines("../data/original/europarl-v7.es-en.es")
europarl_en_chr <-
read_lines("../data/original/europarl-v7.es-en.en")
Using the read_lines() function, we read each line of the files into a character vector. Since the Europarl corpus is a parallel corpus, the lines in the source and target files are aligned. This means that the first line in the source file corresponds to the first line in the target file, the second line in the source file corresponds to the second line in the target file, and so on. This alignment is important for the analysis of parallel corpora, as it allows us to compare the source and target texts line by line.
Let’s inspect our character vectors to ensure that they are of the length and structure we expect. We can use the length() function to get the number of lines in each file and the head() function to preview the first few lines of each file.
Example 6.2
# Inspect Spanish character vector
length(europarl_es_chr)
[1] 1965734
head(europarl_es_chr, 5)
[1] "Reanudación del período de sesiones"
[2] "Declaro reanudado el período de sesiones del Parlamento Europeo, interrumpido el viernes 17 de diciembre pasado, y reitero a Sus Señorías mi deseo de que hayan tenido unas buenas vacaciones."
[3] "Como todos han podido comprobar, el gran \"efecto del año 2000\" no se ha producido. En cambio, los ciudadanos de varios de nuestros países han sido víctimas de catástrofes naturales verdaderamente terribles."
[4] "Sus Señorías han solicitado un debate sobre el tema para los próximos días, en el curso de este período de sesiones."
[5] "A la espera de que se produzca, de acuerdo con muchos colegas que me lo han pedido, pido que hagamos un minuto de silencio en memoria de todas las víctimas de las tormentas, en los distintos países de la Unión Europea afectados."
# Inspect English character vector
length(europarl_en_chr)
[1] 1965734
head(europarl_en_chr, 5)
[1] "Resumption of the session"
[2] "I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period."
[3] "Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful."
[4] "You have requested a debate on this subject in the course of the next few days, during this part-session."
[5] "In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union."
The output of Example 6.2 shows that the number of lines in each file is the same. This is good. If the number of lines in each file was different, we would need to figure out why and fix it. We also see that the content of the files is aligned as expected.
Let’s now create a dataset for each of the character vectors. We will use the tibble() function from {tibble} to create a data frame object with the character vector as the lines column and a type column with the value ‘Source’ for the Spanish file and ‘Target’ for the English file. We will assign the output to two new objects, europarl_source_df and europarl_target_df, respectively, as seen in Example 6.3.
Example 6.3
# Load package
library(tibble)

# Create source data frame
europarl_source_df <-
tibble(
type = "Source",
lines = europarl_es_chr
)
# Create target data frame
europarl_target_df <-
tibble(
type = "Target",
lines = europarl_en_chr
)
Inspecting these data frames with glimpse() in Example 6.4, we can see whether they have the structure we expect.
Example 6.4
# Preview source
glimpse(europarl_source_df)
# Preview target
glimpse(europarl_target_df)
Rows: 1,965,734
Columns: 2
$ type <chr> "Source", "Source", "Source", "Source", "Source", "Source", "Sou…
$ lines <chr> "Reanudación del período de sesiones", "Declaro reanudado el per…
Rows: 1,965,734
Columns: 2
$ type <chr> "Target", "Target", "Target", "Target", "Target", "Target", "Tar…
$ lines <chr> "Resumption of the session", "I declare resumed the session of t…
We now have our type and lines columns and the associated observations for our idealized dataset in Table 6.2. We can now write these datasets to the data/derived/ directory using write_csv() and create corresponding data dictionary files.
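A sketch of this writing step, with {readr} already loaded from Example 6.1 (the derived file names here are our own):

# Write the curated datasets to the derived data directory
write_csv(europarl_source_df, "../data/derived/europarl_source_curated.csv")
write_csv(europarl_target_df, "../data/derived/europarl_target_curated.csv")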
6.2 Structured
Structured data already reflects the physical and semantic structure of a tidy dataset. This means that the data is already in a tabular format and the relationships between columns and rows are well-defined. Therefore, the heavy lifting of curating the data is already done. Two questions remain, however. The first, a logistical one, is what file format the dataset is in and how to read it into R. The second, more research-oriented, is whether the data may benefit from additional curation and documentation to make it more amenable to analysis and more understandable to others.
Reading datasets
Let’s consider some common formats for structured data, i.e. datasets, and how to read them into R. First, we will consider R-native formats, such as package datasets and RDS files. Then we will consider non-native formats, such as relational databases and datasets produced by other software. Finally, we will consider software-agnostic formats, such as CSV.
R and some R packages provide structured datasets that are available for use directly within R. For example, {languageR} (Baayen & Shafaei-Bajestan, 2019) provides the dative dataset, which contains the realization of the dative as NP or PP in the Switchboard corpus and the Treebank Wall Street Journal collection. {janeaustenr} (Silge, 2022) provides the austen_books dataset, a dataset of Jane Austen’s novels. Package datasets are loaded into an R session using either the data() function, if the package is loaded, or the :: operator, if the package is not loaded: data(dative) or languageR::dative, respectively.
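For instance, either route makes the dative dataset available in the session:

# Load the package, then load the dataset
library(languageR)
data(dative)

# Or access the dataset without attaching the package
dative <- languageR::dative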
R also provides a native file format for storing R objects, the RDS file. Any R object, including a data frame, can be written from an R session to disk using the write_rds() function from {readr}. The .rds files are written to disk in a binary format that is not human-readable, which is not ideal for transparent data sharing. However, the files and the R objects can be read back into an R session using the read_rds() function with all the attributes intact, such as vector types, factor levels, etc.
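As a brief sketch, where my_df is a placeholder for any data frame in the session and the file path is illustrative:

# Load package
library(readr)

# Write an R object to disk in RDS format
write_rds(my_df, "data/derived/my_df.rds")

# Read it back with vector types, factor levels, etc. intact
my_df <- read_rds("data/derived/my_df.rds")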
R provides a suite of tools for importing data from non-native structured sources such as databases and datasets from software such as SPSS, SAS, and Stata. For instance, if you are working with data stored in a relational database such as MySQL, PostgreSQL, or SQLite, you can use {DBI} (R Special Interest Group on Databases (R-SIG-DB), Wickham, & Müller, 2024) to connect to the database and {dbplyr} (Wickham, Girlich, & Ruiz, 2024) to query the database using the SQL language. Files from SPSS (.sav), SAS (.sas7bdat), and Stata (.dta) can be read into R using {haven} (Wickham, Miller, & Smith, 2023).
Software-agnostic file formats include delimited files, such as CSV, TSV, etc. These file formats lack the robust structural attributes of the other formats, but balance this shortcoming by storing structured data in a more accessible, human-readable format. Delimited files are plain text files which use a delimiter, such as a comma (,), tab (\t), or pipe (|), to separate the columns, with rows separated by line breaks. For example, a CSV file is a delimited file where the columns are separated by commas, as seen in Example 6.5.
Example 6.5
column_1,column_2,column_3
row 1 value 1,row 1 value 2,row 1 value 3
row 2 value 1,row 2 value 2,row 2 value 3
Given the accessibility of delimited files, they are a common format for sharing structured data in reproducible research. It is not surprising, then, that this is the format which we have chosen for the derived datasets in this book.
Orientation
With an understanding of the various structured formats, we can now turn to considerations about how the original dataset is structured and how that structure is to be used for a given research project. As an example, we will work with the CABNC datasets acquired in Chapter 5. The structure of the original dataset is shown in Snippet 6.4.
Snippet 6.4 Directory structure for the CABNC datasets
data/
├── analysis/
├── derived/
└── original/
    ├── cabnc_do.csv
    └── cabnc/
        ├── participants.csv
        ├── token_types.csv
        ├── tokens.csv
        ├── transcripts.csv
        └── utterances.csv
In addition to other important information, the data origin file cabnc_do.csv, shown in Table 6.3, informs us that the datasets are related by common variables.
attribute | description |
---|---|
Resource name | CABNC. |
Data source | https://ca.talkbank.org/access/CABNC.html, doi:10.21415/T55Q5R |
Data sampling frame | Over 400 British English speakers from across the UK, stratified by age, gender, social group, and region, recording their language output over a set period of time. |
Data collection date(s) | 1992. |
Data format | CSV Files |
Data schema | The recordings are linked by filename and the participants are linked by who. |
License | CC BY NC SA 3.0 |
Attribution | Saul Albert, Laura E. de Ruiter, and J.P. de Ruiter (2015) CABNC: the Jeffersonian transcription of the Spoken British National Corpus. https://saulalbert.github.io/CABNC/. |
The CABNC datasets are structured in a relational format, which means that the data is stored in multiple tables that are related to each other. The tables are related by a common column or set of columns, called keys. A key is used to join the tables together to create a single dataset. There are two keys in the CABNC datasets: filename, which links the recording-oriented datasets, and who, which links the participant-oriented datasets.
Now, let’s envision a scenario in which we are preparing our data for a study that aims to investigate the relationship between speaker demographics and utterances. In their original format, the CABNC datasets hold information about utterances and speakers in separate datasets, cabnc_utterances and cabnc_participants, respectively. Ideally, we would like to curate these datasets such that the information about the utterances and the speakers is ready to be joined as part of the dataset transformation process, while still retaining the relevant original structure. This usually involves removing redundant and/or uninformative variables, adjusting variable names, and writing these datasets and their documentation files to disk.
Tidy the dataset
With these goals in mind, let’s start the process of curation by reading the relevant datasets into an R session. Since we are working with CSV files, we will use the read_csv() function, as seen in Example 6.6.
Example 6.6
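A minimal sketch of this reading step, assuming the file locations shown in Snippet 6.4:

# Load package
library(readr)

# Read the CABNC utterance and participant datasets
cabnc_utterances <-
  read_csv("../data/original/cabnc/utterances.csv")

cabnc_participants <-
  read_csv("../data/original/cabnc/participants.csv")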
The next step is to inspect the structure of the datasets. We can use the glimpse() function for this task.
Example 6.7
#Preview the structure of the datasets
glimpse(cabnc_utterances)
glimpse(cabnc_participants)
Rows: 235,901
Columns: 10
$ filename <chr> "KB0RE000", "KB0RE000", "KB0RE000", "KB0RE000", "KB0RE000", …
$ path <chr> "ca/CABNC/KB0/KB0RE000", "ca/CABNC/KB0/KB0RE000", "ca/CABNC/…
$ utt_num <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ who <chr> "PS002", "PS006", "PS002", "PS006", "PS002", "PS006", "PS002…
$ role <chr> "Unidentified", "Unidentified", "Unidentified", "Unidentifie…
$ postcodes <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ gems <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ utterance <chr> "You enjoyed yourself in America", "Eh", "did you", "Oh I co…
$ startTime <dbl> 0.208, 2.656, 2.896, 3.328, 5.088, 6.208, 8.320, 8.480, 10.2…
$ endTime <dbl> 2.67, 2.90, 3.33, 5.26, 6.02, 8.50, 9.31, 11.23, 14.34, 15.9…
Rows: 6,190
Columns: 13
$ filename <chr> "KB0RE004", "KB0RE004", "KB0RE004", "KB0RE006", "KB0RE006", …
$ path <chr> "ca/CABNC/0missing/KB0RE004", "ca/CABNC/0missing/KB0RE004", …
$ who <chr> "PS008", "PS009", "KB0PSUN", "PS007", "PS008", "PS009", "KB0…
$ name <chr> "John", "Gethyn", "Unknown_speaker", "Alan", "John", "Gethyn…
$ role <chr> "Unidentified", "Unidentified", "Unidentified", "Unidentifie…
$ language <chr> "eng", "eng", "eng", "eng", "eng", "eng", "eng", "eng", "eng…
$ monthage <dbl> 481, 481, 13, 949, 481, 481, 13, 637, 565, 13, 637, 565, 13,…
$ age <chr> "40;01.01", "40;01.01", "1;01.01", "79;01.01", "40;01.01", "…
$ sex <chr> "male", "male", "male", "male", "male", "male", "male", "mal…
$ numwords <dbl> 28, 360, 156, 1610, 791, 184, 294, 93, 3, 0, 128, 24, 0, 150…
$ numutts <dbl> 1, 9, 27, 7, 5, 7, 6, 5, 1, 0, 11, 6, 0, 110, 74, 96, 12, 1,…
$ avgutt <dbl> 28.00, 40.00, 5.78, 230.00, 158.20, 26.29, 49.00, 18.60, 3.0…
$ medianutt <dbl> 28, 39, 5, 84, 64, 9, 3, 15, 3, 0, 9, 3, 0, 7, 6, 4, 3, 12, …
From visual inspection of the output of Example 6.7, we can see that there are common variables in both datasets. In particular, we see the filename and who variables mentioned in the data origin file cabnc_do.csv.
The next step is to consider the variables that will be useful for future analysis. Since we are creating a curated dataset, the goal will be to retain as much information as possible from the original datasets. There are cases, however, in which there may be variables that are not informative and, thus, will not prove useful for any analysis. These removable variables tend to be of one of two types: variables which show no variation across observations and variables where the information is redundant.
As an example case, let’s look at the cabnc_participants data frame. We can use the skim() function from {skimr} to get a summary of the variables in the dataset, and add the yank() function to look at one variable type at a time. We will start with the character variables, as seen in Example 6.8.
Example 6.8
# Load package
library(skimr)

# Summarize character variables
cabnc_participants |>
  skim() |>
  yank("character")
── Variable type: character ────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 filename 0 1 8 8 0 2020 0
2 path 0 1 21 26 0 2020 0
3 who 0 1 4 7 0 581 0
4 name 0 1 3 25 0 269 0
5 role 0 1 12 12 0 1 0
6 language 0 1 3 3 0 1 0
7 age 0 1 7 8 0 83 0
8 sex 0 1 4 6 0 2 0
We see from the output in Example 6.8 that the variables role and language have a single unique value. This means that these variables do not show any variation across observations. We will remove these variables from the dataset.
Continuing on, let’s look for redundant variables. We see that the variables filename and path have the same number of unique values. If we combine this with the visual summary in Example 6.7, we can see that the path variable is redundant. We will remove this variable from the dataset as well.
Another potentially redundant pair of variables is who and name, both of which are speaker identifiers. The who variable is a unique identifier, but there may be some redundancy with the name variable; that is, there may be two speakers with the same name. We can check this by looking at the number of unique values in the who and name variables from the skim() output in Example 6.8: who has 581 unique values and name has only 269. This suggests that there are multiple speakers with the same name.
Another way to explore this is to look at the number of unique values in the who variable for each unique value in the name variable. We can do this using the group_by() and summarize() functions from {dplyr}. For each value of name, we count the number of unique values in who with n_distinct() and then sort the results in descending order.
Example 6.9
# Load package
library(dplyr)

# Count distinct speaker codes per name
cabnc_participants |>
  group_by(name) |>
  summarize(n = n_distinct(who)) |>
  arrange(desc(n)) |>
  slice_head(n = 5)
# A tibble: 5 × 2
name n
<chr> <int>
1 None 59
2 Unknown_speaker 59
3 Group_of_unknown_speakers 21
4 Chris 9
5 David 9
It is good that we performed the check in Example 6.9 beforehand. In addition to speakers who share a name, such as ‘Chris’ and ‘David’, we also have multiple speakers grouped under generic codes, such as ‘None’ and ‘Unknown_speaker’. It is clear that name adds no reliable information beyond who and can be dropped.
With this in mind, we can safely remove the following variables from the dataset: role, language, name, and path. To drop variables from a data frame we can use the select() function in combination with the - operator, which tells select() to drop the variable that follows it.
Example 6.10
# Drop variables
cabnc_participants <-
cabnc_participants |>
select(-role, -language, -name, -path)
# Preview the dataset
glimpse(cabnc_participants)
Rows: 6,190
Columns: 9
$ filename <chr> "KB0RE004", "KB0RE004", "KB0RE004", "KB0RE006", "KB0RE006", …
$ who <chr> "PS008", "PS009", "KB0PSUN", "PS007", "PS008", "PS009", "KB0…
$ monthage <dbl> 481, 481, 13, 949, 481, 481, 13, 637, 565, 13, 637, 565, 13,…
$ age <chr> "40;01.01", "40;01.01", "1;01.01", "79;01.01", "40;01.01", "…
$ sex <chr> "male", "male", "male", "male", "male", "male", "male", "mal…
$ numwords <dbl> 28, 360, 156, 1610, 791, 184, 294, 93, 3, 0, 128, 24, 0, 150…
$ numutts <dbl> 1, 9, 27, 7, 5, 7, 6, 5, 1, 0, 11, 6, 0, 110, 74, 96, 12, 1,…
$ avgutt <dbl> 28.00, 40.00, 5.78, 230.00, 158.20, 26.29, 49.00, 18.60, 3.0…
$ medianutt <dbl> 28, 39, 5, 84, 64, 9, 3, 15, 3, 0, 9, 3, 0, 7, 6, 4, 3, 12, …
Now we have a data frame with 9 more informative variables which describe the participants. We would then repeat this process for the cabnc_utterances dataset to remove its redundant and uninformative variables, as sketched below.
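A sketch of that repeated step, assuming the same checks flag path and role as uninformative and postcodes and gems as empty (they are all NA in the glimpse() output in Example 6.7):

# Drop empty and uninformative variables from the utterances
cabnc_utterances <-
  cabnc_utterances |>
  select(-path, -role, -postcodes, -gems)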
Another, optional, step is to rename and/or reorder the variables to make the dataset more understandable. Let’s organize the columns to read left to right from most general to most specific. Again, we turn to the select() function, this time listing the variables in the order we want them to appear in the dataset. We will take this opportunity to rename some of the variables so that their names are more informative.
Example 6.11
# Rename variables
cabnc_participants <-
cabnc_participants |>
select(
doc_id = filename,
part_id = who,
part_age = monthage,
part_sex = sex,
num_words = numwords,
num_utts = numutts,
avg_utt_len = avgutt,
median_utt_len = medianutt
)
# Preview the dataset
glimpse(cabnc_participants)
Rows: 6,190
Columns: 8
$ doc_id <chr> "KB0RE004", "KB0RE004", "KB0RE004", "KB0RE006", "KB0RE0…
$ part_id <chr> "PS008", "PS009", "KB0PSUN", "PS007", "PS008", "PS009",…
$ part_age <dbl> 481, 481, 13, 949, 481, 481, 13, 637, 565, 13, 637, 565…
$ part_sex <chr> "male", "male", "male", "male", "male", "male", "male",…
$ num_words <dbl> 28, 360, 156, 1610, 791, 184, 294, 93, 3, 0, 128, 24, 0…
$ num_utts <dbl> 1, 9, 27, 7, 5, 7, 6, 5, 1, 0, 11, 6, 0, 110, 74, 96, 1…
$ avg_utt_len <dbl> 28.00, 40.00, 5.78, 230.00, 158.20, 26.29, 49.00, 18.60…
$ median_utt_len <dbl> 28, 39, 5, 84, 64, 9, 3, 15, 3, 0, 9, 3, 0, 7, 6, 4, 3,…
After running Example 6.11, the variables are in the desired order and carry more informative names. Now let’s sort the rows by doc_id and part_id so that the dataset is sensibly organized. The arrange() function takes a data frame and a list of variables to sort by, in the order they are listed.
Example 6.12
# Sort rows
cabnc_participants <-
cabnc_participants |>
arrange(doc_id, part_id)
# Preview the dataset
cabnc_participants |>
slice_head(n = 5)
# A tibble: 5 × 8
doc_id part_id part_age part_sex num_words num_utts avg_utt_len median_utt_len
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 KB0RE… KB0PSUN 13 male 2 2 1 1
2 KB0RE… PS002 721 female 759 74 10.3 7
3 KB0RE… PS006 601 male 399 64 6.23 5
4 KB0RE… KB0PSUN 13 male 7 3 2.33 1
5 KB0RE… PS005 481 female 257 32 8.03 8
Applying the sorting in Example 6.12, we can see that the rows are now in our desired order: the dataset reads left to right from document- to participant-oriented attributes, and top to bottom by document and participant.
6.3 Semi-structured
Between unstructured and structured data falls semi-structured data. As the name suggests, it is a hybrid format: structured metadata is included alongside unstructured elements. The file formats and the approaches to encoding the structured aspects of the data vary widely from resource to resource, so curating semi-structured data often requires more detailed attention to the structure of the data and more sophisticated programming strategies to produce a tidy dataset.
Reading data
The file formats associated with semi-structured data span a wide range, from more structured-leaning formats, such as XML, HTML, and JSON, to more unstructured-leaning formats, such as annotated TXT files. Annotated TXT files may in fact appear with the .txt extension, but may also appear with other, sometimes resource-specific, extensions, such as .utt for the Switchboard Dialog Act Corpus or .cha for the Child Language Data Exchange System (CHILDES) annotation files.
The more structured file formats use standard conventions and therefore can be read into an R session with format-specific functions. Say, for example, we are working with data in a JSON file format. We can read the data into an R session with the read_json() function from {jsonlite} (Ooms, 2023). For XML and HTML files, {rvest} (Wickham, 2024) provides the read_xml() and read_html() functions.
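For instance, a brief sketch with hypothetical annotations.json and page.html files:

# Load packages
library(jsonlite)
library(rvest)

# Read a JSON file into a nested list
annotations <- read_json("data/original/annotations.json")

# Read an HTML file and extract the text of its paragraph nodes
page_html <- read_html("data/original/page.html")
paragraphs <-
  page_html |>
  html_elements("p") |>
  html_text()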
Semi-structured data in TXT files can be read either as a whole file or by lines. The choice of approach depends on the structure of the data. If the data structure is line-based, then read_lines() often makes more sense than read_file(). However, in some cases, the data may be structured in a way that requires the entire file to be read into an R session and then parsed.
Orientation
To provide an example of the curation process using semi-structured data, we will work with the Europarl corpus of native, non-native and translated texts (ENNTT) (Nisioi, Rabinovich, Dinu, & Wintner, 2016). The ENNTT corpus contains native, non-native, and translated English drawn from European Parliament proceedings. Let’s look at the directory structure for the ENNTT corpus in Snippet 6.5.
Snippet 6.5 Data directory structure for the ENNTT corpus
data/
├── analysis/
├── derived/
└── original/
    ├── enntt_do.csv
    └── enntt/
        ├── natives.dat
        ├── natives.tok
        ├── nonnatives.dat
        ├── nonnatives.tok
        ├── translations.dat
        └── translations.tok
We now inspect the data origin file for the ENNTT corpus, enntt_do.csv, in Table 6.4.
attribute | description |
---|---|
Resource name | Europarl corpus of Native, Non-native and Translated Texts — ENNTT |
Data source | https://github.com/senisioi/enntt-release |
Data sampling frame | English, European Parliament texts, transcribed discourse, political genre |
Data collection date(s) | Not specified in the repository |
Data format | .tok, .dat |
Data schema | .tok files contain the actual text; .dat files contain the annotations corresponding to each line in the .tok files. |
License | Not specified. Contact the authors for more information. |
Attribution | Nisioi, S., Rabinovich, E., Dinu, L. P., & Wintner, S. (2016). A corpus of native, non-native and translated texts. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). |
According to the data origin file, there are two important file types, .dat and .tok. The .dat files contain annotations and the .tok files contain the actual text. Let’s inspect the first couple of lines of the .dat file for the non-native speakers, nonnatives.dat, in Snippet 6.6.
Snippet 6.6 Example .dat file for the non-native speakers
LINE STATE="Poland" MEPID="96779" LANGUAGE="EN" NAME="Danuta Hübner," SEQ_SPEAKER_ID="184" SESSION_ID="ep-05-11-17"/>
<LINE STATE="Poland" MEPID="96779" LANGUAGE="EN" NAME="Danuta Hübner," SEQ_SPEAKER_ID="184" SESSION_ID="ep-05-11-17"/> <
We see that the .dat file contains annotations for various session and speaker attributes. The format of the annotations is XML-like. XML is a form of markup language, like YAML, JSON, etc. Markup languages are used to annotate text with additional information about the structure, meaning, and/or presentation of the text. In XML, structure is built up by nesting nodes. The nodes are named with tags, which are enclosed in angle brackets, < and >. Nodes are opened with <TAG> and closed with </TAG>. In Snippet 6.7 we see an example of a simple XML file structure.
Snippet 6.7 Example .xml file structure
<?xml version="1.0" encoding="UTF-8"?>
book category="fiction">
<title lang="en">The Catcher in the Rye</title>
<author>J.D. Salinger</author>
<year>1951</year>
<book> </
In Snippet 6.7 there are four nodes, three of which are nested inside of the <book> node. The <book> node in this example is the root node; XML files require a root node. Nodes can also have attributes, such as the category attribute in the <book> node, but attributes are not required. Furthermore, XML files also require a declaration, which is the first line in Snippet 6.7. The declaration specifies the version of XML used and the encoding.
The .dat file, then, is not strict XML, but it is similar in that it contains nodes and attributes. An XML variant you are likely familiar with, HTML, has more relaxed rules than XML: it is a markup language used to annotate text with information about the organization and presentation of text on the web, and it does not require a root node or a declaration, much like our .dat file. Suffice it to say that the .dat file can safely be treated as HTML.
The .tok file for the non-native speakers, nonnatives.tok, seen in Snippet 6.8, contains the actual text for each line in the corpus.
Snippet 6.8 Example .tok file for the non-native speakers
The Commission is following with interest the planned construction of a nuclear power plant in Akkuyu , Turkey and recognises the importance of ensuring that the construction of the new plant follows the highest internationally accepted nuclear safety standards . According to our information , the decision on the selection of a bidder has not been taken yet .
In a study in which we are interested in contrasting the language of natives and non-natives, we will want to combine the .dat and .tok files for these groups of speakers.
The question is what attributes we want to include in the curated dataset. Given the research focus, we will not need the LANGUAGE or NAME attributes. We may also want to modify the attribute names so they are a bit more descriptive.
An idealized version of the curated dataset based on this criteria is shown in Table 6.5.
variable | name | type | description |
---|---|---|---|
session_id | Session ID | character | Unique identifier for each session. |
speaker_id | Speaker ID | integer | Unique identifier for each speaker. |
state | State | character | The political state of the speaker. |
type | Type | character | Indicates whether the text is native or non-native |
session_seq | Session Sequence | integer | The sequence of the text in the session. |
text | Text | character | Contains the text of the line, and maintains the structure of the original data. |
Tidy the data
Now that we have a better understanding of the corpus data and our target curated dataset structure, let’s work to extract and organize the data from the native and non-native files.
The general approach we will take, for the natives and then the non-natives, is to read in the .dat file as an HTML file, extract the line nodes and their attributes, and combine them into a data frame. Then we will read in the .tok file by lines and combine the two into a single data frame.
Starting with the natives, we use {rvest} to read in the .dat file as an HTML file with the read_html() function and then extract the line nodes with the html_elements() function, as in Example 6.13.
Example 6.13
# Load packages
library(rvest)
# Read in *.dat* file as HTML
ns_dat_lines <-
read_html("../data/original/enntt/natives.dat") |>
html_elements("line")
# Inspect
class(ns_dat_lines)
typeof(ns_dat_lines)
length(ns_dat_lines)
[1] "xml_nodeset"
[1] "list"
[1] 116341
We can see that the ns_dat_lines object is a special type of list, an xml_nodeset, which contains 116,341 line nodes. Let’s now jump out of sequence and read in the .tok file as a text file, in Example 6.14, again by lines using read_lines(), and compare the two to make sure that our approach will work.
Example 6.14
# Read in *.tok* file by lines
ns_tok_lines <-
read_lines("../data/enntt/original/natives.tok")
# Inspect
class(ns_tok_lines)
typeof(ns_tok_lines)
length(ns_tok_lines)
[1] "character"
[1] "character"
[1] 116341
We do, in fact, have the same number of lines in the .dat and .tok files. So we can proceed with extracting the attributes from the line nodes and combining them with the text from the .tok file.
Let’s start by listing the attributes of the first line node in the ns_dat_lines object. To do this we will draw on the pluck() function from {purrr} (Wickham & Henry, 2023) to extract the first line node. Then, we use the html_attrs() function to get the attribute names and their values, as in Example 6.15.
Example 6.15
# Load package
library(purrr)
# List attributes line node 1
ns_dat_lines |>
pluck(1) |>
html_attrs()
state mepid language name
"United Kingdom" "2099" "EN" "Evans, Robert J"
seq_speaker_id session_id
"2" "ep-00-01-17"
No surprise here; these are the same attributes we saw in the .dat file preview in Snippet 6.6. At this point, it’s good to make a plan for how to associate the attribute names with the column names in our curated dataset.
- session_id = session_id
- speaker_id = mepid
- state = state
- session_seq = seq_speaker_id
We can do this one attribute at a time using the html_attr() function and then combine them into a data frame with the tibble() function, as in Example 6.16.
Example 6.16
# Extract attributes from first line node
session_id <- ns_dat_lines |> pluck(1) |> html_attr("session_id")
speaker_id <- ns_dat_lines |> pluck(1) |> html_attr("mepid")
state <- ns_dat_lines |> pluck(1) |> html_attr("state")
session_seq <- ns_dat_lines |> pluck(1) |> html_attr("seq_speaker_id")
# Combine into data frame
tibble(session_id, speaker_id, state, session_seq)
# A tibble: 1 × 4
session_id speaker_id state session_seq
<chr> <chr> <chr> <chr>
1 ep-00-01-17 2099 United Kingdom 2
The results from Example 6.16 show that the attributes have been extracted and mapped to our idealized column names, but doing this for each line node would be tedious. A function that extracts the attributes and values from a line node and adds them to a data frame would simplify this process. The function in Example 6.17 does just that.
Example 6.17
# Function to extract attributes from line node
extract_dat_attrs <- function(line_node) {
session_id <- line_node |> html_attr("session_id")
speaker_id <- line_node |> html_attr("mepid")
state <- line_node |> html_attr("state")
session_seq <- line_node |> html_attr("seq_speaker_id")
tibble(session_id, speaker_id, state, session_seq)
}
It’s a good idea to test out the function to verify that it works as expected. We can do this by passing various indices of the ns_dat_lines object to the function, as in Example 6.18.
Example 6.18
# Test function
ns_dat_lines |> pluck(1) |> extract_dat_attrs()
ns_dat_lines |> pluck(20) |> extract_dat_attrs()
ns_dat_lines |> pluck(100) |> extract_dat_attrs()
# A tibble: 1 × 4
session_id speaker_id state session_seq
<chr> <chr> <chr> <chr>
1 ep-00-01-17 2099 United Kingdom 2
# A tibble: 1 × 4
session_id speaker_id state session_seq
<chr> <chr> <chr> <chr>
1 ep-00-01-17 1309 United Kingdom 40
# A tibble: 1 × 4
session_id speaker_id state session_seq
<chr> <chr> <chr> <chr>
1 ep-00-01-18 4549 United Kingdom 28
It looks like the extract_dat_attrs() function is ready for prime time. Let’s now apply it to all of the line nodes in the ns_dat_lines object using the map_dfr() function from {purrr}, as in Example 6.19.
Example 6.19
# Extract attributes from all line nodes
ns_dat_attrs <-
ns_dat_lines |>
map_dfr(extract_dat_attrs)
# Inspect
glimpse(ns_dat_attrs)
Rows: 116,341
Columns: 4
$ session_id <chr> "ep-00-01-17", "ep-00-01-17", "ep-00-01-17", "ep-00-01-17"…
$ speaker_id <chr> "2099", "2099", "2099", "4548", "4548", "4541", "4541", "4…
$ state <chr> "United Kingdom", "United Kingdom", "United Kingdom", "Uni…
$ session_seq <chr> "2", "2", "2", "4", "4", "12", "12", "12", "12", "12", "12…
We can see that the ns_dat_attrs object is a data frame with 116,341 rows and 4 columns, just as we expected. We can now combine the ns_dat_attrs data frame with the ns_tok_lines vector to create a single data frame with the attributes and the text. This is done with the mutate() function, assigning the ns_tok_lines vector to a new column named text, as in Example 6.20.
Example 6.20
# Combine attributes and text
ns_dat <-
ns_dat_attrs |>
mutate(text = ns_tok_lines)
# Inspect
glimpse(ns_dat)
Rows: 116,341
Columns: 5
$ session_id <chr> "ep-00-01-17", "ep-00-01-17", "ep-00-01-17", "ep-00-01-17"…
$ speaker_id <chr> "2099", "2099", "2099", "4548", "4548", "4541", "4541", "4…
$ state <chr> "United Kingdom", "United Kingdom", "United Kingdom", "Uni…
$ session_seq <chr> "2", "2", "2", "4", "4", "12", "12", "12", "12", "12", "12…
$ text <chr> "You will be aware from the press and television that ther…
This gives us the data for the native speakers. We can now repeat this process for the non-native speakers, or, better, wrap the steps in a function that does it for us, as sketched below.
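A sketch of such a function, wrapping the steps above and adding the type column from Table 6.5 (the function name, the type labels, and the column ordering via select() are our own choices):

# Function to curate an ENNTT .dat/.tok file pair
curate_enntt <- function(dat_path, tok_path, type_label) {
  # Parse the .dat file and extract the attributes from each line node
  dat_attrs <-
    read_html(dat_path) |>
    html_elements("line") |>
    map_dfr(extract_dat_attrs)

  # Read the aligned .tok file by lines, add the type label and text,
  # and order the columns to match Table 6.5
  dat_attrs |>
    mutate(
      type = type_label,
      text = read_lines(tok_path)
    ) |>
    select(session_id, speaker_id, state, type, session_seq, text)
}

# Apply to the native and non-native files
enntt_ns_df <- curate_enntt(
  "../data/original/enntt/natives.dat",
  "../data/original/enntt/natives.tok",
  type_label = "Native"
)

enntt_nns_df <- curate_enntt(
  "../data/original/enntt/nonnatives.dat",
  "../data/original/enntt/nonnatives.tok",
  type_label = "Non-native"
)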
After applying the curation steps to both the native and non-native data, we will have two data frames, enntt_ns_df and enntt_nns_df, respectively, that meet the idealized structure for the curated ENNTT corpus datasets shown in Table 6.5. The enntt_ns_df and enntt_nns_df data frames are then ready to be written to disk and documented.
6.4 Documentation
After applying the curation steps to our data, we will now want to write the dataset to disk and to do our best to document the process and the resulting dataset.
Since data frames are tabular, we have various options for the file type to write. Many of these formats are software-specific, such as .xlsx for Microsoft Excel, .sav for SPSS, .dta for Stata, and .rds for R. We will use the CSV format, since it is a common format that can be read by many software packages, and the write_csv() function from {readr} to write the dataset to disk.
Now the question is where to save our CSV file. Since our dataset is derived by our work, we will add it to the derived/ directory. If you are working with multiple data sources within the same project, it is a good idea to create a sub-directory for each dataset. This will help keep the project organized and make it easier to find and access the datasets.
The final step, as always, is to provide documentation. For datasets the documentation is a data dictionary, as discussed in Section 2.3.2. As with data origin files, you can use spreadsheet software to create and edit the data dictionary.
In {qtkit} we have a function, create_data_dictionary(), that will generate the scaffolding for a data dictionary. The function takes two arguments, data and file_path. It reads the dataset columns and provides a template for the data dictionary.
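For example, a sketch of writing and documenting one of the ENNTT datasets (the output file names here are illustrative):

# Load package
library(qtkit)

# Write the curated dataset to disk
write_csv(enntt_ns_df, "../data/derived/enntt_ns_curated.csv")

# Generate a data dictionary template to fill in by hand
create_data_dictionary(
  data = enntt_ns_df,
  file_path = "../data/derived/enntt_ns_curated_dd.csv"
)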
An example data dictionary, for the enntt_ns_df dataset, is shown in Table 6.6.
variable | name | type | description |
---|---|---|---|
session_id | Session ID | categorical | Unique identifier for each session |
speaker_id | Speaker ID | categorical | Unique identifier for each speaker |
state | State | categorical | Name of the state or country the session is linked to |
session_seq | Session Sequence | ordinal | Sequence number in the session |
text | Text | categorical | Text transcript of the session |
type | Type | categorical | The type of the speaker, whether native or nonnative |
Activities
The following activities build on your skills and knowledge in using R to read, inspect, and write data and datasets. In these activities you will have an opportunity to learn and apply these skills to the task of curating datasets, a vital component of text analysis research that uses unstructured and semi-structured data.
Summary
In this chapter we looked at the process of structuring data into a dataset. This included a discussion of three main types of data: unstructured, structured, and semi-structured. The level of structure of the original data(set) will vary from resource to resource, and by the same token so will the file format used to support the level of metadata included. Data curation results in a dataset that is saved separately from the original data in order to maintain modularity between what the data(set) looks like before we intervene and afterwards. Since multiple analysis approaches can be applied to the original data in a research project, this curated dataset serves as the point of departure for each of the subsequent datasets derived through the transformation steps. In addition to the code we use to derive the curated dataset’s structure, we also include a data dictionary which documents the variables and measures in the curated dataset.