
Process and curate Switchboard Dialog Act (SWDA) data by reading all .utt files in a specified directory and combining them into a single data frame.

Usage

curate_swda_data(dir_path)

Arguments

dir_path

Character string. Path to an existing directory that contains .utt files, either directly or in subdirectories.

Value

A data frame containing the curated SWDA data, one row per utterance, with the following character columns:

  • doc_id: Document identifier

  • damsl_tag: Dialog act annotation

  • speaker_id: Unique speaker identifier

  • speaker: Speaker designation (A or B)

  • turn_num: Turn number in conversation

  • utterance_num: Utterance number

  • utterance_text: Actual spoken text

Details

The function expects a directory containing .utt files, or subdirectories with .utt files, as found in the raw SWDA data (Linguistic Data Consortium, LDC97S62: Switchboard Dialog Act Corpus).
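To preview which files a directory laid out this way contains, the .utt files can be listed recursively with base R. This is a minimal sketch, not the package's internal implementation; the temporary directory layout and all file names below are hypothetical.

```r
# Build a hypothetical directory layout: one .utt file at the top level
# and one inside a subdirectory (all names are made up for illustration).
root <- file.path(tempdir(), "swda_demo")
dir.create(file.path(root, "sw00utt"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "sw_0001_4325.utt"))
file.create(file.path(root, "sw00utt", "sw_0002_4330.utt"))
file.create(file.path(root, "README.txt"))  # not a .utt file, so it is skipped

# List every .utt file under root, descending into subdirectories.
utt_files <- list.files(
  root,
  pattern = "\\.utt$",
  recursive = TRUE,
  full.names = TRUE
)

basename(utt_files)
```

Checking the result of such a listing before calling curate_swda_data() is a quick way to confirm that dir_path points at the intended data.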

Examples

# Example using simulated data bundled with the package
example_data <- system.file("extdata", "simul_swda", package = "qtkit")
swda_data <- curate_swda_data(example_data)

str(swda_data)
#> 'data.frame':	54 obs. of  7 variables:
#>  $ doc_id        : chr  "4325" "4325" "4325" "4325" ...
#>  $ damsl_tag     : chr  "o" "qw" "qy^d" "+" ...
#>  $ speaker_id    : chr  "1632" "1632" "1519" "1632" ...
#>  $ speaker       : chr  "A" "A" "B" "A" ...
#>  $ turn_num      : chr  "1" "1" "2" "3" ...
#>  $ utterance_num : chr  "1" "2" "1" "1" ...
#>  $ utterance_text: chr  "Meep.  /" "{D Flim, }" "[ [ Ip gost, +" "Whip kax fo splurience [ fo yip, + fo yip ] haz, thun wip chog nare? /" ...
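
As the str() output shows, turn_num and utterance_num are returned as character vectors. For sorting or arithmetic they can be converted to integers after curation. The sketch below assumes nothing beyond base R; to stay self-contained it rebuilds the first four rows by hand from the output above (swda_head is a hypothetical name) rather than calling the package.

```r
# First four rows, copied from the str() output above.
swda_head <- data.frame(
  doc_id = rep("4325", 4),
  damsl_tag = c("o", "qw", "qy^d", "+"),
  speaker_id = c("1632", "1632", "1519", "1632"),
  speaker = c("A", "A", "B", "A"),
  turn_num = c("1", "1", "2", "3"),
  utterance_num = c("1", "2", "1", "1"),
  utterance_text = c(
    "Meep.  /",
    "{D Flim, }",
    "[ [ Ip gost, +",
    "Whip kax fo splurience [ fo yip, + fo yip ] haz, thun wip chog nare? /"
  ),
  stringsAsFactors = FALSE
)

# Convert the counters from character to integer.
swda_head$turn_num <- as.integer(swda_head$turn_num)
swda_head$utterance_num <- as.integer(swda_head$utterance_num)

# Count utterances per speaker.
speaker_counts <- table(swda_head$speaker)
speaker_counts
```

The same two as.integer() calls can be applied to the full swda_data object from the example above.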