Calculates frequency and dispersion metrics for the types (e.g., terms, lemmas) in tokenized text data.

Usage

calc_type_metrics(data, type, document, frequency = NULL, dispersion = NULL)

Arguments

data

A data frame containing the tokenized text data.

type

The variable in data that contains the type (e.g., term, lemma) to analyze.

document

The variable in data that contains the document IDs.

frequency

A character vector naming the frequency metrics to calculate. If NULL (default), only type and n are returned. Options: 'all', 'rf' (relative frequency), 'orf' (observed relative frequency, per 100). Multiple options can be combined, e.g., c("rf", "orf").

dispersion

A character vector naming the dispersion metrics to calculate. If NULL (default), only type and n are returned. Options: 'all', 'df' (Document Frequency), 'idf' (Inverse Document Frequency), 'dp' (Gries' Deviation of Proportions). Multiple options can be combined, e.g., c("df", "idf").

Value

A data frame with columns:

  • type: The unique types from the input data.

  • n: The frequency of each type across all documents.

Optionally (based on the frequency and dispersion arguments):

  • rf: The relative frequency of each type across all documents.

  • orf: The observed relative frequency (per 100) of each type across all documents.

  • df: The document frequency of each type.

  • idf: The inverse document frequency of each type.

  • dp: Gries' Deviation of Proportions of each type.
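
The columns above can be illustrated with a minimal base R sketch on a toy data frame. This is not the package's implementation. It assumes the common definitions: idf as the natural log of the number of documents divided by df, and the unnormalized form of Gries' DP (half the summed absolute differences between each document's share of a type's occurrences and that document's share of the corpus). The object names `tokens`, `counts`, and `metrics` are hypothetical, and `calc_type_metrics()` may differ in log base, smoothing, or normalization.

```r
# Toy tokenized data, one row per token (hypothetical example data)
tokens <- data.frame(
  document = c("d1", "d1", "d1", "d1", "d2", "d2", "d3", "d3", "d3", "d3"),
  type     = c("a",  "a",  "b",  "c",  "a",  "b",  "a",  "a",  "a",  "c"),
  stringsAsFactors = FALSE
)

# Type-by-document count matrix
counts <- table(tokens$type, tokens$document)

n_tokens <- sum(counts)   # corpus size in tokens
n_docs   <- ncol(counts)  # number of documents

# Frequency metrics
n   <- rowSums(counts)    # raw frequency of each type
rf  <- n / n_tokens       # relative frequency
orf <- rf * 100           # relative frequency per 100 tokens

# Dispersion metrics
doc_freq <- rowSums(counts > 0)     # documents containing the type
idf      <- log(n_docs / doc_freq)  # assumed definition: ln(N / df)

# Gries' DP (unnormalized, assumed): compare each type's distribution over
# documents (v) with each document's share of the corpus (s)
s  <- colSums(counts) / n_tokens
v  <- prop.table(counts, margin = 1)
dp <- 0.5 * rowSums(abs(sweep(v, 2, s)))

metrics <- data.frame(
  type = rownames(counts),
  n    = as.vector(n),
  rf   = as.vector(rf),
  orf  = as.vector(orf),
  df   = as.vector(doc_freq),
  idf  = as.vector(idf),
  dp   = as.vector(dp)
)
metrics
```

Under these assumptions, a type that occurs in every document gets an idf of 0, and a dp near 0 indicates that a type is spread across documents roughly in proportion to their sizes.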

References

Gries, Stefan Th. (2023). Statistical Methods in Corpus Linguistics. In Readings in Corpus Linguistics: A Teaching and Research Guide for Scholars in Nigeria and Beyond, pp. 78-114.

Examples

data_path <- system.file("extdata", "types_data.rds", package = "qtkit")
data <- readRDS(data_path)
calc_type_metrics(
  data = data,
  type = type,
  document = document,
  frequency = c("rf", "orf"),
  dispersion = c("df", "idf")
)
#> # A tibble: 5 × 6
#>   type      n    rf   orf    df   idf
#>   <chr> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 A        20   0.2    20     3     0
#> 2 B        20   0.2    20     3     0
#> 3 C        20   0.2    20     3     0
#> 4 D        20   0.2    20     3     0
#> 5 E        20   0.2    20     3     0