This function calculates type metrics for tokenized text data.
Arguments
- data
A data frame containing the tokenized text data.
- type
The variable in data that contains the type (e.g., term, lemma) to analyze.
- document
The variable in data that contains the document IDs.
- frequency
A character vector indicating which frequency metrics to use. If NULL (default), only type and n are returned. Other options: 'all'; 'rf' calculates relative frequency; 'orf' calculates observed relative frequency. Multiple options can be specified: c("rf", "orf").
- dispersion
A character vector indicating which dispersion metrics to use. If NULL (default), only type and n are returned. Other options: 'all'; 'df' calculates document frequency; 'idf' calculates inverse document frequency; 'dp' calculates Gries' Deviation of Proportions. Multiple options can be specified: c("df", "idf").
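As a rough illustration of what the frequency options report, the following base R sketch computes n, rf, and orf by hand. The toy data frame and the definitions used (rf as a type's share of all tokens, orf as that share per 100 tokens) are assumptions made for illustration; they are consistent with the example output on this page but are not taken from the package source.

```r
# Toy tokenized data, one row per token (hypothetical, not the package's data)
tokens <- data.frame(
  document = c("d1", "d1", "d1", "d2", "d2", "d3"),
  type     = c("a",  "b",  "a",  "a",  "c",  "b"),
  stringsAsFactors = FALSE
)

# n: raw count of each type across all documents
counts <- table(tokens$type)
n <- as.vector(counts)

# rf: each type's share of all tokens (assumed definition)
rf <- n / sum(n)

# orf: the same share expressed per 100 tokens (assumed definition)
orf <- rf * 100

freq_metrics <- data.frame(
  type = names(counts), n = n, rf = rf, orf = orf,
  stringsAsFactors = FALSE
)
print(freq_metrics)
```

With these definitions the rf column always sums to 1 and the orf column to 100, which is a quick sanity check on any output.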
Value
A data frame with columns:
- type: The unique types from the input data.
- n: The frequency of each type across all documents.
Optionally (based on the frequency and dispersion arguments):
- rf: The relative frequency of each type across all documents.
- orf: The observed relative frequency (per 100) of each type across all documents.
- df: The document frequency of each type.
- idf: The inverse document frequency of each type.
- dp: Gries' Deviation of Proportions of each type.
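The dispersion columns can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the toy data is hypothetical, idf is taken as the natural log of the number of documents divided by df, and dp is the unnormalized form of Gries' measure (half the summed absolute difference between each document's share of the corpus and its share of the type's occurrences). The package may use a different log base or a normalized dp.

```r
# Toy tokenized data, one row per token (hypothetical, not the package's data)
tokens <- data.frame(
  document = c("d1", "d1", "d1", "d2", "d2", "d3"),
  type     = c("a",  "b",  "a",  "a",  "c",  "b"),
  stringsAsFactors = FALSE
)

# Type-by-document count matrix: rows are types, columns are documents
tdm <- unclass(table(tokens$type, tokens$document))

# df: number of documents in which each type occurs at least once
doc_freq <- rowSums(tdm > 0)

# idf: log(number of documents / df); the natural log is an assumption
n_docs <- ncol(tdm)
idf <- log(n_docs / doc_freq)

# dp: compare expected and observed distributions over documents
s <- colSums(tdm) / sum(tdm)   # each document's share of the whole corpus
v <- tdm / rowSums(tdm)        # each document's share of a type's occurrences
dp <- 0.5 * rowSums(abs(sweep(v, 2, s)))

disp_metrics <- data.frame(
  type = rownames(tdm), df = doc_freq, idf = idf, dp = dp,
  stringsAsFactors = FALSE
)
print(disp_metrics)
```

A dp near 0 means the type is spread across documents roughly in proportion to their sizes; values closer to 1 mean it is concentrated in a small part of the corpus.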
References
Gries, Stefan Th. (2023). Statistical Methods in Corpus Linguistics. In Readings in Corpus Linguistics: A Teaching and Research Guide for Scholars in Nigeria and Beyond, pp. 78-114.
Examples
data_path <- system.file("extdata", "types_data.rds", package = "qtkit")
data <- readRDS(data_path)
calc_type_metrics(
  data = data,
  type = type,
  document = document,
  frequency = c("rf", "orf"),
  dispersion = c("df", "idf")
)
#> # A tibble: 5 × 6
#>   type      n    rf   orf    df   idf
#>   <chr> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 A        20   0.2    20     3     0
#> 2 B        20   0.2    20     3     0
#> 3 C        20   0.2    20     3     0
#> 4 D        20   0.2    20     3     0
#> 5 E        20   0.2    20     3     0
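The printed values can be checked by hand. The sketch below reads n and df off the output above and recomputes the other columns. The formulas and the document count of 3 are assumptions: the count is inferred from df = 3 together with idf = 0, which is what log(N / df) gives when every type occurs in every document.

```r
# Values read off the printed tibble
n <- c(A = 20, B = 20, C = 20, D = 20, E = 20)
doc_freq <- 3

# Assumed definitions, consistent with the printed columns
rf  <- n / sum(n)   # 20 of 100 tokens per type
orf <- rf * 100     # the same share per 100 tokens

n_docs <- 3         # assumption: inferred from df = 3 and idf = 0
idf <- log(n_docs / doc_freq)

print(rbind(rf = rf, orf = orf))
print(idf)
```

Because log(1) is 0 in any base, the idf column here does not reveal which log base the function uses.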