attribute | description |
---|---|
Resource name | Europarl Parallel Corpus |
Data source | https://www.statmt.org/europarl/ |
Data sampling frame | Spanish transcripts from the European Parliament proceedings |
Data collection date(s) | 1996–2011 |
Data format | TXT files with ‘.es’ for source (Spanish) and ‘.en’ for target (English) files. |
Data schema | Line-by-line unannotated parallel text |
License | See: https://www.europarl.europa.eu/legal-notice/en/ |
Attribution | Please cite the paper: Koehn, P. 2005. ‘Europarl: A Parallel Corpus for Statistical Machine Translation.’ MT Summit X, 12–16. |
6 Curate
In this chapter, we will now look at the next step in a text analysis project: data curation. That is, the process of converting the original data we acquire to a tidy dataset. Acquired data can come in a wide variety of formats. These formats tend to signal the richness of the metadata that is included in the file content. We will consider three general types of content formats: (1) unstructured data, (2) structured data, and (3) semi-structured data. Regardless of the file type and the structure of the data, it will be necessary to consider how to curate a dataset such that the structure reflects the basic unit of analysis that we wish to investigate. The resulting dataset will form the base from which we will work to further transform the dataset such that it aligns with the unit(s) of observation required for the analysis method that we will implement. Once the dataset is curated, we will create a data dictionary that describes the dataset and the variables that are included in the dataset for transparency and reproducibility.
6.1 Unstructured
The bulk of textual data is of the unstructured variety. Unstructured data is data that has not been organized to make the information it contains machine-readable. Remember that text in itself is not information. Only when given explicit context in the form of metadata does text become informative. Metadata can be linguistic or non-linguistic in nature. For unstructured data, then, there is little to no metadata directly associated with the data.
Reading data
Some of the common file formats which contain unstructured data include TXT, PDF, and DOCX. Although these formats are unstructured, they are not the same. Reading these files into R requires different techniques and tools.
There are many ways to read TXT files into R and many packages that can be used to do so. For example, using {readr}, we can choose to read an entire file into a single character string with read_file(), or read the file by lines with read_lines(), in which case each line becomes an element of a character vector.
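For instance, a minimal sketch with a hypothetical file at data/original/sample.txt:

# Load package
library(readr)

# Read the entire file into a single character string
sample_chr <- read_file("data/original/sample.txt")

# Read the file by lines into a character vector, one element per line
sample_lines_chr <- read_lines("data/original/sample.txt")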
Less commonly used in prepared data resources, PDF and DOCX files are more complex than TXT files, as they contain formatting and embedded document metadata. These attributes, however, are primarily for visual presentation, not machine-readability. Needless to say, we need an alternate strategy to extract the text content from these files and, potentially, some of the metadata. For example, using {readtext} (Benoit & Obeng, 2024), we can read the text content of PDF and DOCX files into a data frame with readtext(), where each document's text is stored as a character string in a text column.
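As a sketch, assuming a hypothetical PDF file at data/original/report.pdf:

# Load package
library(readtext)

# Read the text content (and basic document metadata) from a PDF file
report_rt <- readtext("data/original/report.pdf")

# The result is a data frame with doc_id and text columns
report_rt$text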
Whether in TXT, PDF, or DOCX format, the resulting data structure will require further processing to convert the data into a tidy dataset.
Orientation
As an example of curating an unstructured source of corpus data, let’s take a look at the Europarl Parallel Corpus (Koehn, 2005). This corpus contains parallel texts (source and translated documents) from the European Parliament proceedings between 1996 and 2011 in some 21 European languages.
Let’s assume we selected this corpus because we are interested in researching Spanish to English translations. After consulting the corpus website, downloading the archive file, and inspecting the unarchived structure, we have the file structure seen in Snippet 6.1.
Snippet 6.1 Project directory structure for the Europarl Parallel Corpus
project/
├── process/
│   ├── 1-acquire-data.qmd
│   ├── 2-curate-data.qmd
│   └── ...
├── data/
│   ├── analysis/
│   ├── derived/
│   └── original/
│       ├── europarl_do.csv
│       └── europarl/
│           ├── europarl-v7.es-en.en
│           └── europarl-v7.es-en.es
├── reports/
├── DESCRIPTION
├── Makefile
└── README
The europarl_do.csv file contains the data origin information documented as part of the acquisition process. The contents are seen in Table 6.1.
Now let’s get familiar with the corpus directory structure and the files. In Snippet 6.1, we see that there are two corpus files, europarl-v7.es-en.es and europarl-v7.es-en.en, that contain the source and target language texts, respectively. The file names indicate that the files contain Spanish-English parallel texts. The .es and .en extensions indicate the language of the text.
Looking at the beginning of the .es and .en files, in Snippet 6.2 and Snippet 6.3, we see that the files contain a series of lines in either the source or target language.
Snippet 6.2 europarl-v7.es-en.es file
Reanudación del período de sesiones
Declaro reanudado el período de sesiones del Parlamento Europeo, interrumpido el viernes 17 de diciembre pasado, y reitero a Sus Señorías mi deseo de que hayan tenido unas buenas vacaciones.
Como todos han podido comprobar, el gran "efecto del año 2000" no se ha producido. En cambio, los ciudadanos de varios de nuestros países han sido víctimas de catástrofes naturales verdaderamente terribles.
Sus Señorías han solicitado un debate sobre el tema para los próximos días, en el curso de este período de sesiones. A la espera de que se produzca, de acuerdo con muchos colegas que me lo han pedido, pido que hagamos un minuto de silencio en memoria de todas las víctimas de las tormentas, en los distintos países de la Unión Europea afectados.
We can clearly appreciate that the data is unstructured. That is, there is no explicit metadata associated with the data; the data is just a series of character strings separated by lines. The only information that we can surmise from the structure of the data is that the texts are line-aligned and that the data in each file corresponds to the source and target languages, respectively.
Snippet 6.3 europarl-v7.es-en.en file
Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
You have requested a debate on this subject in the course of the next few days, during this part-session. In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.
Now, before embarking on a data curation process, it is advisable to define the structure of the data that we want to create. I call this the “idealized structure” of the data. For a curated dataset, we want to reflect the contents of the original data, yet in a tidy format, to maintain the integrity of, and connection with, the original data.
Given what we know about the data, we can define the idealized structure of the data as seen in Table 6.2.
variable | name | type | description |
---|---|---|---|
type | Document type | character | Contains the type of document, either ‘Source’ or ‘Target’ |
lines | Lines | character | Contains the text of each line in the document |
Our task now is to develop code that will read the original data and render the idealized structure as a curated dataset for each corpus file. We will then write the datasets to the data/derived/ directory. The code we develop will be added to the 2-curate-data.qmd file. And finally, the datasets will be documented with a data dictionary file.
Tidy the data
To create the idealized dataset structure in Table 6.2, let’s start by reading the files into R by lines. As the files are aligned by lines, we will use the read_lines() function to read the files into character vectors.
Example 6.1
# Load package
library(readr)
# Read Europarl files .es and .en
europarl_es_chr <-
read_lines("../data/original/europarl-v7.es-en.es")
europarl_en_chr <-
read_lines("../data/original/europarl-v7.es-en.en")
Using the read_lines() function, we read each line of the files into a character vector. Since the Europarl corpus is a parallel corpus, the lines in the source and target files are aligned. This means that the first line in the source file corresponds to the first line in the target file, the second line in the source file corresponds to the second line in the target file, and so on. This alignment is important for the analysis of parallel corpora, as it allows us to compare the source and target texts line by line.
Let’s inspect our character vectors to ensure that they are of the length and structure we expect. We can use the length() function to get the number of lines in each file and the head() function to preview the first few lines of each file.
Example 6.2
# Inspect Spanish character vector
length(europarl_es_chr)
[1] 1965734
head(europarl_es_chr, 5)
[1] "Reanudación del período de sesiones"
[2] "Declaro reanudado el período de sesiones del Parlamento Europeo, interrumpido el viernes 17 de diciembre pasado, y reitero a Sus Señorías mi deseo de que hayan tenido unas buenas vacaciones."
[3] "Como todos han podido comprobar, el gran \"efecto del año 2000\" no se ha producido. En cambio, los ciudadanos de varios de nuestros países han sido víctimas de catástrofes naturales verdaderamente terribles."
[4] "Sus Señorías han solicitado un debate sobre el tema para los próximos días, en el curso de este período de sesiones."
[5] "A la espera de que se produzca, de acuerdo con muchos colegas que me lo han pedido, pido que hagamos un minuto de silencio en memoria de todas las víctimas de las tormentas, en los distintos países de la Unión Europea afectados."
# Inspect English character vector
length(europarl_en_chr)
[1] 1965734
head(europarl_en_chr, 5)
[1] "Resumption of the session"
[2] "I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period."
[3] "Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful."
[4] "You have requested a debate on this subject in the course of the next few days, during this part-session."
[5] "In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union."
The output of Example 6.2 shows that the number of lines in each file is the same. This is good. If the number of lines in each file was different, we would need to figure out why and fix it. We also see that the content of the files is aligned as expected.
Let’s now create a dataset for each of the character vectors. We will use the tibble() function from {tibble} to create a data frame object with the character vector as the lines column and a type column with the value ‘Source’ for the Spanish file and ‘Target’ for the English file. We will assign the output to two new objects, europarl_source_df and europarl_target_df, respectively, as seen in Example 6.3.
Example 6.3
# Load package
library(tibble)

# Create source data frame
europarl_source_df <-
tibble(
type = "Source",
lines = europarl_es_chr
)
# Create target data frame
europarl_target_df <-
tibble(
type = "Target",
lines = europarl_en_chr
)
Inspecting these data frames with glimpse() in Example 6.4, we can see whether they have the structure we expect.
Example 6.4
# Preview source
glimpse(europarl_source_df)
# Preview target
glimpse(europarl_target_df)
Rows: 1,965,734
Columns: 2
$ type <chr> "Source", "Source", "Source", "Source", "Source", "Source", "Sou…
$ lines <chr> "Reanudación del período de sesiones", "Declaro reanudado el per…
Rows: 1,965,734
Columns: 2
$ type <chr> "Target", "Target", "Target", "Target", "Target", "Target", "Tar…
$ lines <chr> "Resumption of the session", "I declare resumed the session of t…
We now have our type and lines columns and the associated observations for our idealized dataset in Table 6.2. We can now write these datasets to the data/derived/ directory using write_csv() and create corresponding data dictionary files.
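A sketch of this writing step, with {readr} already loaded from Example 6.1 (the derived file names here are our own):

# Write the curated datasets to the derived data directory
write_csv(europarl_source_df, "../data/derived/europarl_source_curated.csv")
write_csv(europarl_target_df, "../data/derived/europarl_target_curated.csv")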
6.2 Structured
Structured data already reflects the physical and semantic structure of a tidy dataset. This means that the data is already in a tabular format and the relationships between columns and rows are well-defined. Therefore, the heavy lifting of curating the data is already done. Two questions remain, however. The first, a logistical one, is what file format the dataset is in and how to read it into R. The second, more research-oriented, is whether the data may benefit from additional curation and documentation to make it more amenable to analysis and more understandable to others.
Reading datasets
Let’s consider some common formats for structured data, i.e. datasets, and how to read them into R. First, we will consider R-native formats, such as package datasets and RDS files. Then we will consider non-native formats, such as relational databases and datasets produced by other software. Finally, we will consider software-agnostic formats, such as CSV.
R and some R packages provide structured datasets that are available for use directly within R. For example, {languageR} (Baayen & Shafaei-Bajestan, 2019) provides the dative dataset, which contains the realization of the dative as NP or PP in the Switchboard corpus and the Treebank Wall Street Journal collection. {janeaustenr} (Silge, 2022) provides the austen_books dataset, a dataset of Jane Austen’s novels. Package datasets are loaded into an R session using either the data() function, if the package is loaded, or the :: operator, if the package is not loaded: data(dative) or languageR::dative, respectively.
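For instance, either route makes the dative dataset available in the session:

# Load the package, then load the dataset
library(languageR)
data(dative)

# Or access the dataset without attaching the package
dative <- languageR::dative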
R also provides a native file format for storing R objects, the RDS file. Any R object, including a data frame, can be written from an R session to disk using the write_rds() function from {readr}. The .rds files are written to disk in a binary format that is not human-readable, which is not ideal for transparent data sharing. However, the files and the R objects can be read back into an R session using the read_rds() function with all the attributes intact, such as vector types, factor levels, etc.
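As a brief sketch, where my_df is a placeholder for any data frame in the session and the file path is illustrative:

# Load package
library(readr)

# Write an R object to disk in RDS format
write_rds(my_df, "data/derived/my_df.rds")

# Read it back with vector types, factor levels, etc. intact
my_df <- read_rds("data/derived/my_df.rds")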
R provides a suite of tools for importing data from non-native structured sources such as databases and datasets from software such as SPSS, SAS, and Stata. For instance, if you are working with data stored in a relational database such as MySQL, PostgreSQL, or SQLite, you can use {DBI} (R Special Interest Group on Databases (R-SIG-DB), Wickham, & Müller, 2024) to connect to the database and {dbplyr} (Wickham, Girlich, & Ruiz, 2024) to query the database using the SQL language. Files from SPSS (.sav), SAS (.sas7bdat), and Stata (.dta) can be read into R using {haven} (Wickham, Miller, & Smith, 2023).
Software-agnostic file formats include delimited files, such as CSV, TSV, etc. These file formats lack the robust structural attributes of the other formats, but balance this shortcoming by storing structured data in a more accessible, human-readable format. Delimited files are plain text files which use a delimiter, such as a comma (,), tab (\t), or pipe (|), to separate the columns, with rows separated by line breaks. For example, a CSV file is a delimited file where the columns are separated by commas, as seen in Example 6.5.
Example 6.5
column_1,column_2,column_3
row 1 value 1,row 1 value 2,row 1 value 3
row 2 value 1,row 2 value 2,row 2 value 3
Given the accessibility of delimited files, they are a common format for sharing structured data in reproducible research. It is not surprising, then, that this is the format which we have chosen for the derived datasets in this book.
Orientation
With an understanding of the various structured formats, we can now turn to considerations about how the original dataset is structured and how that structure is to be used for a given research project. As an example, we will work with the CABNC datasets acquired in Chapter 5. The structure of the original dataset is shown in Snippet 6.4.
Snippet 6.4 Directory structure for the CABNC datasets
data/
├── analysis/
├── derived/
└── original/
    ├── cabnc_do.csv
    └── cabnc/
        ├── participants.csv
        ├── token_types.csv
        ├── tokens.csv
        ├── transcripts.csv
        └── utterances.csv
In addition to other important information, the data origin file cabnc_do.csv, shown in Table 6.3, informs us that the datasets are related by common variables.
attribute | description |
---|---|
Resource name | CABNC. |
Data source | https://ca.talkbank.org/access/CABNC.html, doi:10.21415/T55Q5R |
Data sampling frame | Over 400 British English speakers from across the UK, stratified by age, gender, social group, and region, recording their language output over a set period of time. |
Data collection date(s) | 1992. |
Data format | CSV Files |
Data schema | The recordings are linked by filename and the participants are linked by who. |
License | CC BY NC SA 3.0 |
Attribution | Saul Albert, Laura E. de Ruiter, and J.P. de Ruiter (2015) CABNC: the Jeffersonian transcription of the Spoken British National Corpus. https://saulalbert.github.io/CABNC/. |
The CABNC datasets are structured in a relational format, which means that the data is stored in multiple tables that are related to each other. The tables are related by a common column or set of columns, called keys. A key is used to join the tables together to create a single dataset. There are two keys in the CABNC datasets: filename, which links the recording-oriented datasets, and who, which links the participant-oriented datasets.
Now, let’s envision a scenario in which we are preparing our data for a study that aims to investigate the relationship between speaker demographics and utterances. In their original format, the CABNC datasets hold information about utterances and speakers in separate datasets, cabnc_utterances and cabnc_participants, respectively. Ideally, we would like to curate these datasets such that the information about the utterances and the speakers is ready to be joined as part of the dataset transformation process, while still retaining the relevant original structure. This usually involves removing redundant and/or uninformative variables, adjusting variable names, and writing these datasets and their documentation files to disk.
Tidy the dataset
With these goals in mind, let’s start the process of curation by reading the relevant datasets into an R session. Since we are working with CSV files, we will use the read_csv() function, as seen in Example 6.6.
Example 6.6
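A minimal sketch of this reading step, assuming the file locations shown in Snippet 6.4:

# Load package
library(readr)

# Read the CABNC utterance and participant datasets
cabnc_utterances <-
  read_csv("../data/original/cabnc/utterances.csv")

cabnc_participants <-
  read_csv("../data/original/cabnc/participants.csv")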
The next step is to inspect the structure of the datasets. We can use the glimpse() function for this task.
Example 6.7
#Preview the structure of the datasets
glimpse(cabnc_utterances)
glimpse(cabnc_participants)
Rows: 235,901
Columns: 10
$ filename <chr> "KB0RE000", "KB0RE000", "KB0RE000", "KB0RE000", "KB0RE000", …
$ path <chr> "ca/CABNC/KB0/KB0RE000", "ca/CABNC/KB0/KB0RE000", "ca/CABNC/…
$ utt_num <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ who <chr> "PS002", "PS006", "PS002", "PS006", "PS002", "PS006", "PS002…
$ role <chr> "Unidentified", "Unidentified", "Unidentified", "Unidentifie…
$ postcodes <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ gems <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ utterance <chr> "You enjoyed yourself in America", "Eh", "did you", "Oh I co…
$ startTime <dbl> 0.208, 2.656, 2.896, 3.328, 5.088, 6.208, 8.320, 8.480, 10.2…
$ endTime <dbl> 2.67, 2.90, 3.33, 5.26, 6.02, 8.50, 9.31, 11.23, 14.34, 15.9…
Rows: 6,190
Columns: 13
$ filename <chr> "KB0RE004", "KB0RE004", "KB0RE004", "KB0RE006", "KB0RE006", …
$ path <chr> "ca/CABNC/0missing/KB0RE004", "ca/CABNC/0missing/KB0RE004", …
$ who <chr> "PS008", "PS009", "KB0PSUN", "PS007", "PS008", "PS009", "KB0…
$ name <chr> "John", "Gethyn", "Unknown_speaker", "Alan", "John", "Gethyn…
$ role <chr> "Unidentified", "Unidentified", "Unidentified", "Unidentifie…
$ language <chr> "eng", "eng", "eng", "eng", "eng", "eng", "eng", "eng", "eng…
$ monthage <dbl> 481, 481, 13, 949, 481, 481, 13, 637, 565, 13, 637, 565, 13,…
$ age <chr> "40;01.01", "40;01.01", "1;01.01", "79;01.01", "40;01.01", "…
$ sex <chr> "male", "male", "male", "male", "male", "male", "male", "mal…
$ numwords <dbl> 28, 360, 156, 1610, 791, 184, 294, 93, 3, 0, 128, 24, 0, 150…
$ numutts <dbl> 1, 9, 27, 7, 5, 7, 6, 5, 1, 0, 11, 6, 0, 110, 74, 96, 12, 1,…
$ avgutt <dbl> 28.00, 40.00, 5.78, 230.00, 158.20, 26.29, 49.00, 18.60, 3.0…
$ medianutt <dbl> 28, 39, 5, 84, 64, 9, 3, 15, 3, 0, 9, 3, 0, 7, 6, 4, 3, 12, …
From visual inspection of the output of Example 6.7, we can see that there are common variables in both datasets. In particular, we see the filename and who variables mentioned in the data origin file cabnc_do.csv.
The next step is to consider the variables that will be useful for future analysis. Since we are creating a curated dataset, the goal will be to retain as much information as possible from the original datasets. There are cases, however, in which there may be variables that are not informative and, thus, will not prove useful for any analysis. These removable variables tend to be of one of two types: variables which show no variation across observations and variables where the information is redundant.
As an example case, let’s look at the cabnc_participants data frame. We can use the skim() function from {skimr} to get a summary of the variables in the dataset, and add the yank() function to look at one variable type at a time. We will start with the character variables, as seen in Example 6.8.
Example 6.8
# Load package
library(skimr)

# Summarize character variables
cabnc_participants |>
  skim() |>
  yank("character")
── Variable type: character ────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 filename 0 1 8 8 0 2020 0
2 path 0 1 21 26 0 2020 0
3 who 0 1 4 7 0 581 0
4 name 0 1 3 25 0 269 0
5 role 0 1 12 12 0 1 0
6 language 0 1 3 3 0 1 0
7 age 0 1 7 8 0 83 0
8 sex 0 1 4 6 0 2 0
We see from the output in Example 6.8 that the variables role and language have a single unique value. This means that these variables do not show any variation across observations. We will remove these variables from the dataset.
Continuing on, let’s look for redundant variables. We see that the variables filename and path have the same number of unique values. If we combine this with the visual summary in Example 6.7, we can see that the path variable is redundant. We will remove this variable from the dataset as well.
Another potentially redundant pair of variables is who and name, both of which are speaker identifiers. The who variable is a unique identifier, but there may be some redundancy with the name variable; that is, there may be two speakers with the same name. We can check this by looking at the number of unique values in the who and name variables from the skim() output in Example 6.8: who has 581 unique values and name has only 269. This suggests that there are multiple speakers with the same name.
Another way to explore this is to look at the number of unique values in the who variable for each unique value in the name variable. We can do this using the group_by() and summarize() functions from {dplyr}. For each value of name, we count the number of unique values in who with n_distinct() and then sort the results in descending order.
Example 6.9
# Load package
library(dplyr)

# Count distinct speaker codes per name
cabnc_participants |>
  group_by(name) |>
  summarize(n = n_distinct(who)) |>
  arrange(desc(n)) |>
  slice_head(n = 5)
# A tibble: 5 × 2
name n
<chr> <int>
1 None 59
2 Unknown_speaker 59
3 Group_of_unknown_speakers 21
4 Chris 9
5 David 9
It is good that we performed the check in Example 6.9 beforehand. In addition to speakers who share a name, such as ‘Chris’ and ‘David’, we also have multiple speakers grouped under generic codes, such as ‘None’ and ‘Unknown_speaker’. It is clear that name adds no reliable information beyond who and can be dropped.
With this in mind, we can safely remove the following variables from the dataset: role, language, name, and path. To drop variables from a data frame we can use the select() function in combination with the - operator, which tells select() to drop the variable that follows it.
Example 6.10
# Drop variables
cabnc_participants <-
cabnc_participants |>
select(-role, -language, -name, -path)
# Preview the dataset
glimpse(cabnc_participants)
Rows: 6,190
Columns: 9
$ filename <chr> "KB0RE004", "KB0RE004", "KB0RE004", "KB0RE006", "KB0RE006", …
$ who <chr> "PS008", "PS009", "KB0PSUN", "PS007", "PS008", "PS009", "KB0…
$ monthage <dbl> 481, 481, 13, 949, 481, 481, 13, 637, 565, 13, 637, 565, 13,…
$ age <chr> "40;01.01", "40;01.01", "1;01.01", "79;01.01", "40;01.01", "…
$ sex <chr> "male", "male", "male", "male", "male", "male", "male", "mal…
$ numwords <dbl> 28, 360, 156, 1610, 791, 184, 294, 93, 3, 0, 128, 24, 0, 150…
$ numutts <dbl> 1, 9, 27, 7, 5, 7, 6, 5, 1, 0, 11, 6, 0, 110, 74, 96, 12, 1,…
$ avgutt <dbl> 28.00, 40.00, 5.78, 230.00, 158.20, 26.29, 49.00, 18.60, 3.0…
$ medianutt <dbl> 28, 39, 5, 84, 64, 9, 3, 15, 3, 0, 9, 3, 0, 7, 6, 4, 3, 12, …
Now we have a data frame with 9 more informative variables which describe the participants. We would then repeat this process for the cabnc_utterances dataset to remove its redundant and uninformative variables, as sketched below.
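A sketch of that repeated step, assuming the same checks flag path and role as uninformative and postcodes and gems as empty (they are all NA in the glimpse() output in Example 6.7):

# Drop empty and uninformative variables from the utterances
cabnc_utterances <-
  cabnc_utterances |>
  select(-path, -role, -postcodes, -gems)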
Another, optional, step is to rename and/or reorder the variables to make the dataset more understandable. Let’s organize the columns to read left to right from most general to most specific. Again, we turn to the select() function, this time listing the variables in the order we want them to appear in the dataset. We will take this opportunity to rename some of the variables so that their names are more informative.
Example 6.11
# Rename variables
cabnc_participants <-
cabnc_participants |>
select(
doc_id = filename,
part_id = who,
part_age = monthage,
part_sex = sex,
num_words = numwords,
num_utts = numutts,
avg_utt_len = avgutt,
median_utt_len = medianutt
)
# Preview the dataset
glimpse(cabnc_participants)
Rows: 6,190
Columns: 8
$ doc_id <chr> "KB0RE004", "KB0RE004", "KB0RE004", "KB0RE006", "KB0RE0…
$ part_id <chr> "PS008", "PS009", "KB0PSUN", "PS007", "PS008", "PS009",…
$ part_age <dbl> 481, 481, 13, 949, 481, 481, 13, 637, 565, 13, 637, 565…
$ part_sex <chr> "male", "male", "male", "male", "male", "male", "male",…
$ num_words <dbl> 28, 360, 156, 1610, 791, 184, 294, 93, 3, 0, 128, 24, 0…
$ num_utts <dbl> 1, 9, 27, 7, 5, 7, 6, 5, 1, 0, 11, 6, 0, 110, 74, 96, 1…
$ avg_utt_len <dbl> 28.00, 40.00, 5.78, 230.00, 158.20, 26.29, 49.00, 18.60…
$ median_utt_len <dbl> 28, 39, 5, 84, 64, 9, 3, 15, 3, 0, 9, 3, 0, 7, 6, 4, 3,…
After running Example 6.11, the variables are in the desired order and carry more informative names. Now let’s sort the rows by doc_id and part_id so that the dataset is sensibly organized. The arrange() function takes a data frame and a list of variables to sort by, in the order they are listed.
Example 6.12
# Sort rows
cabnc_participants <-
cabnc_participants |>
arrange(doc_id, part_id)
# Preview the dataset
cabnc_participants |>
slice_head(n = 5)
# A tibble: 5 × 8
doc_id part_id part_age part_sex num_words num_utts avg_utt_len median_utt_len
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 KB0RE… KB0PSUN 13 male 2 2 1 1
2 KB0RE… PS002 721 female 759 74 10.3 7
3 KB0RE… PS006 601 male 399 64 6.23 5
4 KB0RE… KB0PSUN 13 male 7 3 2.33 1
5 KB0RE… PS005 481 female 257 32 8.03 8
Applying the sorting in Example 6.12, we can see that the rows are now in our desired order: the dataset reads left to right from document- to participant-oriented attributes, and top to bottom by document and participant.
6.3 Semi-structured
Between unstructured and structured data falls semi-structured data. As the name suggests, it is a hybrid format: structured metadata is included alongside unstructured elements. The file formats and the approaches to encoding the structured aspects of the data vary widely from resource to resource, so curating semi-structured data often requires more detailed attention to the structure of the data and more sophisticated programming strategies to produce a tidy dataset.
Reading data
The file formats associated with semi-structured data span a wide range, from more structured-leaning formats, such as XML, HTML, and JSON, to more unstructured-leaning formats, such as annotated TXT files. Annotated TXT files may in fact appear with the .txt extension, but may also appear with other, sometimes resource-specific, extensions, such as .utt for the Switchboard Dialog Act Corpus or .cha for the Child Language Data Exchange System (CHILDES) annotation files.
The more structured file formats use standard conventions and therefore can be read into an R session with format-specific functions. Say, for example, we are working with data in a JSON file format. We can read the data into an R session with the read_json() function from {jsonlite} (Ooms, 2023). For XML and HTML files, {rvest} (Wickham, 2024) provides the read_xml() and read_html() functions.
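For instance, a brief sketch with hypothetical annotations.json and page.html files:

# Load packages
library(jsonlite)
library(rvest)

# Read a JSON file into a nested list
annotations <- read_json("data/original/annotations.json")

# Read an HTML file and extract the text of its paragraph nodes
page_html <- read_html("data/original/page.html")
paragraphs <-
  page_html |>
  html_elements("p") |>
  html_text()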
Semi-structured data in TXT files can be read either as a whole file or by lines. The choice of approach depends on the structure of the data. If the data structure is line-based, then read_lines() often makes more sense than read_file(). However, in some cases, the data may be structured in a way that requires the entire file to be read into an R session and then parsed.
Orientation
To provide an example of the curation process using semi-structured data, we will work with the Europarl corpus of native, non-native and translated texts (ENNTT) (Nisioi, Rabinovich, Dinu, & Wintner, 2016). The ENNTT corpus contains native, non-native, and translated English drawn from European Parliament proceedings. Let’s look at the directory structure for the ENNTT corpus in Snippet 6.5.
Snippet 6.5 Data directory structure for the ENNTT corpus
data/
├── analysis/
├── derived/
└── original/
    ├── enntt_do.csv
    └── enntt/
        ├── natives.dat
        ├── natives.tok
        ├── nonnatives.dat
        ├── nonnatives.tok
        ├── translations.dat
        └── translations.tok
We now inspect the data origin file for the ENNTT corpus, enntt_do.csv, in Table 6.4.
attribute | description |
---|---|
Resource name | Europarl corpus of Native, Non-native and Translated Texts — ENNTT |
Data source | https://github.com/senisioi/enntt-release |
Data sampling frame | English, European Parliament texts, transcribed discourse, political genre |
Data collection date(s) | Not specified in the repository |
Data format | .tok, .dat |
Data schema | .tok files contain the actual text; .dat files contain the annotations corresponding to each line in the .tok files. |
License | Not specified. Contact the authors for more information. |
Attribution | Nisioi, S., Rabinovich, E., Dinu, L. P., & Wintner, S. (2016). A corpus of native, non-native and translated texts. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). |
According to the data origin file, there are two important file types, .dat and .tok. The .dat files contain annotations and the .tok files contain the actual text. Let’s inspect the first couple of lines of the .dat file for the non-native speakers, nonnatives.dat, in Snippet 6.6.
Snippet 6.6 Example .dat file for the non-native speakers
LINE STATE="Poland" MEPID="96779" LANGUAGE="EN" NAME="Danuta Hübner," SEQ_SPEAKER_ID="184" SESSION_ID="ep-05-11-17"/>
<LINE STATE="Poland" MEPID="96779" LANGUAGE="EN" NAME="Danuta Hübner," SEQ_SPEAKER_ID="184" SESSION_ID="ep-05-11-17"/> <
We see that the .dat file contains annotations for various session and speaker attributes. The format of the annotations is XML-like. XML is a form of markup language, like YAML, JSON, etc. Markup languages are used to annotate text with additional information about the structure, meaning, and/or presentation of the text. In XML, structure is built up by nesting nodes. The nodes are named with tags, which are enclosed in angle brackets, < and >. Nodes are opened with <TAG> and closed with </TAG>. In Snippet 6.7 we see an example of a simple XML file structure.
Snippet 6.7 Example .xml file structure
<?xml version="1.0" encoding="UTF-8"?>
book category="fiction">
<title lang="en">The Catcher in the Rye</title>
<author>J.D. Salinger</author>
<year>1951</year>
<book> </
In Snippet 6.7 there are four nodes, three of which are nested inside of the <book> node. The <book> node in this example is the root node; XML files require a root node. Nodes can also have attributes, such as the category attribute in the <book> node, but attributes are not required. Furthermore, XML files also require a declaration, which is the first line in Snippet 6.7. The declaration specifies the version of XML used and the encoding.
The .dat file, then, is not strict XML, but it is similar in that it contains nodes and attributes. An XML variant you are likely familiar with, HTML, has more relaxed rules than XML: it is a markup language used to annotate text with information about the organization and presentation of text on the web, and it does not require a root node or a declaration, much like our .dat file. Suffice it to say that the .dat file can safely be treated as HTML.
The .tok file for the non-native speakers, nonnatives.tok, seen in Snippet 6.8, contains the actual text for each line in the corpus.
Snippet 6.8 Example .tok file for the non-native speakers
The Commission is following with interest the planned construction of a nuclear power plant in Akkuyu , Turkey and recognises the importance of ensuring that the construction of the new plant follows the highest internationally accepted nuclear safety standards . According to our information , the decision on the selection of a bidder has not been taken yet .
In a study in which we are interested in contrasting the language of natives and non-natives, we will want to combine the .dat and .tok files for these groups of speakers.
The question is what attributes we want to include in the curated dataset. Given the research focus, we will not need the LANGUAGE or NAME attributes. We may also want to modify the attribute names so they are a bit more descriptive.
An idealized version of the curated dataset based on this criteria is shown in Table 6.5.
variable | name | type | description |
---|---|---|---|
session_id | Session ID | character | Unique identifier for each session. |
speaker_id | Speaker ID | integer | Unique identifier for each speaker. |
state | State | character | The political state of the speaker. |
type | Type | character | Indicates whether the text is native or non-native |
session_seq | Session Sequence | integer | The sequence of the text in the session. |
text | Text | character | Contains the text of the line, and maintains the structure of the original data. |
Tidy the data
Now that we have a better understanding of the corpus data and our target curated dataset structure, let’s work to extract and organize the data from the native and non-native files.
The general approach we will take, for the natives and then the non-natives, is to read in the .dat file as an HTML file, extract the line nodes and their attributes, and combine them into a data frame. Then we will read in the .tok file by lines and combine the two into a single data frame.
Starting with the natives, we use {rvest} to read in the .dat file as an HTML file with the read_html() function and then extract the line nodes with the html_elements() function, as in Example 6.13.
Example 6.13
# Load packages
library(rvest)
# Read in *.dat* file as HTML
ns_dat_lines <-
read_html("../data/original/enntt/natives.dat") |>
html_elements("line")
# Inspect
class(ns_dat_lines)
typeof(ns_dat_lines)
length(ns_dat_lines)
[1] "xml_nodeset"
[1] "list"
[1] 116341
We can see that the ns_dat_lines object is a special type of list, an xml_nodeset, which contains 116,341 line nodes. Let’s now jump out of sequence and read in the .tok file as a text file, in Example 6.14, again by lines using read_lines(), and compare the two to make sure that our approach will work.
Example 6.14
# Read in *.tok* file by lines
ns_tok_lines <-
read_lines("../data/enntt/original/natives.tok")
# Inspect
class(ns_tok_lines)
typeof(ns_tok_lines)
length(ns_tok_lines)
[1] "character"
[1] "character"
[1] 116341
We do, in fact, have the same number of lines in the .dat and .tok files. So we can proceed with extracting the attributes from the line nodes and combining them with the text from the .tok file.
Let’s start by listing the attributes of the first line node in the ns_dat_lines object. To do this we will draw on the pluck() function from {purrr} (Wickham & Henry, 2023) to extract the first line node. Then, we use the html_attrs() function to get the attribute names and their values, as in Example 6.15.
Example 6.15
# Load package
library(purrr)
# List attributes line node 1
ns_dat_lines |>
pluck(1) |>
html_attrs()
state mepid language name
"United Kingdom" "2099" "EN" "Evans, Robert J"
seq_speaker_id session_id
"2" "ep-00-01-17"
No surprise here; these are the same attributes we saw in the .dat file preview in Snippet 6.6. At this point, it’s good to make a plan for how to associate the attribute names with the column names in our curated dataset.
- session_id = session_id
- speaker_id = mepid
- state = state
- session_seq = seq_speaker_id
We can do this one attribute at a time using the html_attr() function and then combine them into a data frame with the tibble() function, as in Example 6.16.
Example 6.16
# Extract attributes from first line node
session_id <- ns_dat_lines |> pluck(1) |> html_attr("session_id")
speaker_id <- ns_dat_lines |> pluck(1) |> html_attr("mepid")
state <- ns_dat_lines |> pluck(1) |> html_attr("state")
session_seq <- ns_dat_lines |> pluck(1) |> html_attr("seq_speaker_id")
# Combine into data frame
tibble(session_id, speaker_id, state, session_seq)
# A tibble: 1 × 4
session_id speaker_id state session_seq
<chr> <chr> <chr> <chr>
1 ep-00-01-17 2099 United Kingdom 2
The results from Example 6.16 show that the attributes have been extracted and mapped to our idealized column names, but doing this for each line node would be tedious. A function that extracts the attributes and values from a line node and adds them to a data frame would simplify this process. The function in Example 6.17 does just that.
Example 6.17
# Function to extract attributes from line node
extract_dat_attrs <- function(line_node) {
session_id <- line_node |> html_attr("session_id")
speaker_id <- line_node |> html_attr("mepid")
state <- line_node |> html_attr("state")
session_seq <- line_node |> html_attr("seq_speaker_id")
tibble(session_id, speaker_id, state, session_seq)
}
It’s a good idea to test out the function to verify that it works as expected. We can do this by passing various indices of the ns_dat_lines object to the function, as in Example 6.18.
Example 6.18
# Test function
ns_dat_lines |> pluck(1) |> extract_dat_attrs()
ns_dat_lines |> pluck(20) |> extract_dat_attrs()
ns_dat_lines |> pluck(100) |> extract_dat_attrs()
# A tibble: 1 × 4
session_id speaker_id state session_seq
<chr> <chr> <chr> <chr>
1 ep-00-01-17 2099 United Kingdom 2
# A tibble: 1 × 4
session_id speaker_id state session_seq
<chr> <chr> <chr> <chr>
1 ep-00-01-17 1309 United Kingdom 40
# A tibble: 1 × 4
session_id speaker_id state session_seq
<chr> <chr> <chr> <chr>
1 ep-00-01-18 4549 United Kingdom 28
It looks like the extract_dat_attrs() function is ready for prime time. Let’s now apply it to all of the line nodes in the ns_dat_lines object using the map_dfr() function from {purrr}, as in Example 6.19.
Example 6.19
# Extract attributes from all line nodes
ns_dat_attrs <-
ns_dat_lines |>
map_dfr(extract_dat_attrs)
# Inspect
glimpse(ns_dat_attrs)
Rows: 116,341
Columns: 4
$ session_id <chr> "ep-00-01-17", "ep-00-01-17", "ep-00-01-17", "ep-00-01-17"…
$ speaker_id <chr> "2099", "2099", "2099", "4548", "4548", "4541", "4541", "4…
$ state <chr> "United Kingdom", "United Kingdom", "United Kingdom", "Uni…
$ session_seq <chr> "2", "2", "2", "4", "4", "12", "12", "12", "12", "12", "12…
We can see that the ns_dat_attrs object is a data frame with 116,341 rows and 4 columns, just as we expected. We can now combine the ns_dat_attrs data frame with the ns_tok_lines vector to create a single data frame with the attributes and the text. This is done with the mutate() function, assigning the ns_tok_lines vector to a new column named text, as in Example 6.20.
Example 6.20
# Combine attributes and text
ns_dat <-
ns_dat_attrs |>
mutate(text = ns_tok_lines)
# Inspect
glimpse(ns_dat)
Rows: 116,341
Columns: 5
$ session_id <chr> "ep-00-01-17", "ep-00-01-17", "ep-00-01-17", "ep-00-01-17"…
$ speaker_id <chr> "2099", "2099", "2099", "4548", "4548", "4541", "4541", "4…
$ state <chr> "United Kingdom", "United Kingdom", "United Kingdom", "Uni…
$ session_seq <chr> "2", "2", "2", "4", "4", "12", "12", "12", "12", "12", "12…
$ text <chr> "You will be aware from the press and television that ther…
This gives us the data for the native speakers. We can now repeat this process for the non-native speakers, or, better, wrap the steps in a function that does it for us, as sketched below.
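A sketch of such a function, wrapping the steps above and adding the type column from Table 6.5 (the function name, the type labels, and the column ordering via select() are our own choices):

# Function to curate an ENNTT .dat/.tok file pair
curate_enntt <- function(dat_path, tok_path, type_label) {
  # Parse the .dat file and extract the attributes from each line node
  dat_attrs <-
    read_html(dat_path) |>
    html_elements("line") |>
    map_dfr(extract_dat_attrs)

  # Read the aligned .tok file by lines, add the type label and text,
  # and order the columns to match Table 6.5
  dat_attrs |>
    mutate(
      type = type_label,
      text = read_lines(tok_path)
    ) |>
    select(session_id, speaker_id, state, type, session_seq, text)
}

# Apply to the native and non-native files
enntt_ns_df <- curate_enntt(
  "../data/original/enntt/natives.dat",
  "../data/original/enntt/natives.tok",
  type_label = "Native"
)

enntt_nns_df <- curate_enntt(
  "../data/original/enntt/nonnatives.dat",
  "../data/original/enntt/nonnatives.tok",
  type_label = "Non-native"
)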
After applying the curation steps to both the native and non-native data, we will have two data frames, enntt_ns_df and enntt_nns_df, respectively, that meet the idealized structure for the curated ENNTT corpus datasets shown in Table 6.5. The enntt_ns_df and enntt_nns_df data frames are then ready to be written to disk and documented.
6.4 Documentation
After applying the curation steps to our data, we will now want to write the dataset to disk and to do our best to document the process and the resulting dataset.
Since data frames are tabular, we have various options for the file type to write. Many of these formats are software-specific, such as .xlsx for Microsoft Excel, .sav for SPSS, .dta for Stata, and .rds for R. We will use the CSV format, since it is a common format that can be read by many software packages, and the write_csv() function from {readr} to write the dataset to disk.
Now the question is where to save our CSV file. Since our dataset is derived by our work, we will add it to the derived/ directory. If you are working with multiple data sources within the same project, it is a good idea to create a sub-directory for each dataset. This will help keep the project organized and make it easier to find and access the datasets.
The final step, as always, is to provide documentation. For datasets the documentation is a data dictionary, as discussed in Section 2.3.2. As with data origin files, you can use spreadsheet software to create and edit the data dictionary.
In {qtkit} we have a function, create_data_dictionary(), that will generate the scaffolding for a data dictionary. The function takes two arguments, data and file_path. It reads the dataset columns and provides a template for the data dictionary.
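For example, a sketch of writing and documenting one of the ENNTT datasets (the output file names here are illustrative):

# Load package
library(qtkit)

# Write the curated dataset to disk
write_csv(enntt_ns_df, "../data/derived/enntt_ns_curated.csv")

# Generate a data dictionary template to fill in by hand
create_data_dictionary(
  data = enntt_ns_df,
  file_path = "../data/derived/enntt_ns_curated_dd.csv"
)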
An example data dictionary, for the enntt_ns_df dataset, is shown in Table 6.6.
variable | name | type | description |
---|---|---|---|
session_id | Session ID | categorical | Unique identifier for each session |
speaker_id | Speaker ID | categorical | Unique identifier for each speaker |
state | State | categorical | Name of the state or country the session is linked to |
session_seq | Session Sequence | ordinal | Sequence number in the session |
text | Text | categorical | Text transcript of the session |
type | Type | categorical | The type of the speaker, whether native or nonnative |
Activities
The following activities build on your skills and knowledge in using R to read, inspect, and write data and datasets. In these activities you will have an opportunity to learn and apply these skills to the task of curating datasets, a vital component of text analysis research that uses unstructured and semi-structured data.
Summary
In this chapter we looked at the process of structuring data into a dataset. This included a discussion of three main types of data: unstructured, structured, and semi-structured. The level of structure of the original data(set) will vary from resource to resource, and by the same token so will the file format used to support the level of metadata included. Data curation results in a dataset that is saved separately from the original data in order to maintain modularity between what the data(set) looks like before we intervene and afterwards. Since multiple analysis approaches can be applied to the original data in a research project, this curated dataset serves as the point of departure for each of the subsequent datasets derived through the transformation steps. In addition to the code we use to derive the curated dataset’s structure, we also include a data dictionary which documents the variables and measures in the curated dataset.