Calculates frequency and dispersion metrics for each type (e.g., term or lemma) in tokenized text data.
Arguments
- data
A data frame containing the tokenized text data.
- type
The variable in data that contains the type (e.g., term, lemma) to analyze.
- documents
The variable in data that contains the document IDs.
- frequency
A character vector indicating which frequency metrics to use. If NULL (default), only type and n are returned. Other options: 'all'; 'rf' calculates relative frequency; 'orf' calculates observed relative frequency. Multiple options can be specified: c("rf", "orf").
- dispersion
A character vector indicating which dispersion metrics to use. If NULL (default), only type and n are returned. Other options: 'all'; 'df' calculates Document Frequency; 'idf' calculates Inverse Document Frequency; 'dp' calculates Gries' Deviation of Proportions. Multiple options can be specified: c("df", "idf").
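To make the frequency and document-frequency options concrete, the following base R sketch computes n, rf, orf, df, and idf by hand on a small token table. It is an illustration under stated assumptions, not the function's actual implementation: the object names (tokens, metrics) are made up for this example, and idf is assumed to be the natural log of the number of documents divided by df, which may differ from what the package uses.

```r
# Toy token table: one row per token, in the same shape as the `data` argument.
tokens <- data.frame(
  term = c("word1", "word1", "word2", "word2", "word2", "word3"),
  documents = c("doc1", "doc2", "doc1", "doc1", "doc2", "doc2")
)

# n: raw frequency of each type across all documents.
n <- table(tokens$term)

# rf: share of all tokens accounted for by each type.
rf <- as.vector(n) / sum(n)

# orf: the same share rescaled to "per 100 tokens".
orf <- rf * 100

# df: number of distinct documents in which each type occurs.
df <- tapply(tokens$documents, tokens$term, function(d) length(unique(d)))

# idf: ASSUMED here to be log(N / df), with N the number of documents;
# the package may use a different log base or smoothing.
n_docs <- length(unique(tokens$documents))
idf <- log(n_docs / df)

metrics <- data.frame(
  type = names(n),
  n = as.vector(n),
  rf = rf,
  orf = orf,
  df = as.vector(df),
  idf = as.vector(idf)
)
print(metrics)
```

Under these assumptions, word3 occurs in only one of the two documents, so it gets the lowest df and the only non-zero idf.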
Value
A data frame with columns:
- type: The unique types from the input data.
- n: The frequency of each type across all documents.

Optionally (based on the frequency and dispersion arguments):

- rf: The relative frequency of each type across all documents.
- orf: The observed relative frequency (per 100) of each type across all documents.
- df: The document frequency of each type.
- idf: The inverse document frequency of each type.
- dp: Gries' Deviation of Proportions of each type.
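The dp column can be read against the standard definition of Gries' Deviation of Proportions: half the sum of absolute differences between the share of a type's tokens found in each document and that document's share of the whole corpus. The base R sketch below follows that textbook definition on a toy token table. The object names are hypothetical, and whether the package applies any further normalization is not stated here, so treat it as an approximation of the idea rather than the function's code.

```r
# Toy token table: one row per token.
tokens <- data.frame(
  term = c("word1", "word1", "word2", "word2", "word2", "word3"),
  documents = c("doc1", "doc2", "doc1", "doc1", "doc2", "doc2")
)

# Type-by-document contingency table of token counts.
counts <- table(tokens$term, tokens$documents)

# s: expected proportions, i.e. each document's share of all tokens.
s <- colSums(counts) / sum(counts)

# v: observed proportions, i.e. how each type's tokens are split across documents.
v <- counts / rowSums(counts)

# DP: half the summed absolute differences between observed and expected shares.
dp <- 0.5 * rowSums(abs(sweep(v, 2, s)))
print(round(dp, 3))
```

Values near 0 indicate a type spread across documents in proportion to their sizes; values approaching 1 indicate a type concentrated in a small part of the corpus. In this toy data, word1 is split evenly across two equally sized documents (DP of 0), while word3 appears in only one (DP of 0.5).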
References
Gries, Stefan Th. (2023). Statistical Methods in Corpus Linguistics. In Readings in Corpus Linguistics: A Teaching and Research Guide for Scholars in Nigeria and Beyond, pp. 78-114.
Examples
if (FALSE) {
  # Toy tokenized data: one row per token
  data <- data.frame(
    term = c("word1", "word1", "word2", "word2", "word2", "word3"),
    documents = c("doc1", "doc2", "doc1", "doc1", "doc2", "doc2")
  )

  # Request two frequency metrics and two dispersion metrics
  calc_type_metrics(
    data = data,
    type = term,
    documents = documents,
    frequency = c("rf", "orf"),
    dispersion = c("df", "idf")
  )
}