
This function processes and curates ENNTT (European Parliament) data from a specified directory. It handles both .dat files (containing XML metadata) and .tok files (containing text content).

Usage

curate_enntt_data(dir_path)

Arguments

dir_path

A string. The path to the directory containing the ENNTT data files. Must be an existing directory.

Value

A data frame containing the curated ENNTT data, with the following columns:

  • session_id: Parliamentary session identifier

  • speaker_id: Speaker's MEP ID

  • state: Representative's state/country

  • session_seq: Sequential position in session

  • text: Speech content

  • type: Corpus type identifier

Details

The function expects a directory containing paired .dat and .tok files with matching names, as found in the raw ENNTT data (https://github.com/senisioi/enntt-release). The .dat files should contain XML-formatted metadata with the following attributes:

  • session_id: Unique identifier for the parliamentary session

  • mepid: Member of European Parliament ID

  • state: Country or state representation

  • seq_speaker_id: Sequential ID within the session

The .tok files should contain the corresponding text content, one entry per line.
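
To make the expected layout concrete, the following is a minimal base R sketch of one .dat/.tok pair and of how its lines could be combined. It is not the package's implementation. The `<LINE .../>` element name, the attribute casing, the helper `get_attr()`, the file name `test1`, and the idea that `type` comes from the file name are all assumptions made for illustration.

```r
# Hypothetical illustration: write one .dat/.tok pair to a temporary directory.
# The <LINE .../> element name and lowercase attribute names are assumptions.
tmp_dir <- file.path(tempdir(), "enntt_sketch")
dir.create(tmp_dir, showWarnings = FALSE)

dat_lines <- c(
  '<LINE session_id="EP-2020-001" mepid="MEP123" state="FR" seq_speaker_id="1"/>',
  '<LINE session_id="EP-2020-001" mepid="MEP124" state="DE" seq_speaker_id="2"/>'
)
tok_lines <- c(
  "This is a sample speech from France.",
  "This is another speech from Germany."
)
writeLines(dat_lines, file.path(tmp_dir, "test1.dat"))
writeLines(tok_lines, file.path(tmp_dir, "test1.tok"))

# Hypothetical helper: pull one attribute value out of each metadata line
# (NA if the attribute is absent).
get_attr <- function(lines, name) {
  pattern <- paste0("[[:space:]]", name, '="([^"]*)"')
  hits <- regmatches(lines, regexec(pattern, lines))
  vapply(hits, function(h) if (length(h) == 2) h[2] else NA_character_,
         character(1))
}

dat <- readLines(file.path(tmp_dir, "test1.dat"))
tok <- readLines(file.path(tmp_dir, "test1.tok"))

# Line i of the .dat file is taken to describe line i of the .tok file.
stopifnot(length(dat) == length(tok))

sketch <- data.frame(
  session_id  = get_attr(dat, "session_id"),
  speaker_id  = get_attr(dat, "mepid"),
  state       = get_attr(dat, "state"),
  session_seq = get_attr(dat, "seq_speaker_id"),
  text        = tok,
  type        = "test1",  # assumption: taken from the shared file name
  stringsAsFactors = FALSE
)
```

The sketch pairs records purely by line position, which is why it checks that both files have the same number of lines before combining them.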

Examples

# Example using simulated data bundled with the package
example_data <- system.file("extdata", "simul_enntt", package = "qtkit")
curated_data <- curate_enntt_data(example_data)

str(curated_data)
#> 'data.frame':	2 obs. of  6 variables:
#>  $ session_id : chr  "EP-2020-001" "EP-2020-001"
#>  $ speaker_id : chr  "MEP123" "MEP124"
#>  $ state      : chr  "FR" "DE"
#>  $ session_seq: chr  "1" "2"
#>  $ text       : chr  "This is a sample speech from France." "This is another speech from Germany."
#>  $ type       : chr  "test1" "test1"
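
The `str()` output above shows every column as character, including session_seq. If numeric ordering within a session is needed, a follow-up step along these lines may help. This is a minimal sketch that uses a hand-built stand-in for the curated output, so it runs without the package installed.

```r
# Stand-in for the curated output shown above (hand-built, not produced by qtkit).
curated_data <- data.frame(
  session_id  = c("EP-2020-001", "EP-2020-001"),
  speaker_id  = c("MEP123", "MEP124"),
  state       = c("FR", "DE"),
  session_seq = c("1", "2"),
  text        = c("This is a sample speech from France.",
                  "This is another speech from Germany."),
  type        = c("test1", "test1"),
  stringsAsFactors = FALSE
)

# Convert the sequence column to integer so that "10" sorts after "2".
curated_data$session_seq <- as.integer(curated_data$session_seq)

# Order speeches by session, then by position within the session.
curated_data <- curated_data[
  order(curated_data$session_id, curated_data$session_seq),
]
```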