Calculate Type Metrics for Text Data — calc_type_metrics

Calculates frequency and dispersion metrics for types (e.g., terms, lemmas) in tokenized text data.

Usage

calc_type_metrics(data, type, documents, frequency = NULL, dispersion = NULL)

Arguments

data: A data frame containing the tokenized text data.
type: The variable in data that contains the type (e.g., term, lemma) to analyze.
documents: The variable in data that contains the document IDs.
frequency: A character vector indicating which frequency metrics to calculate. If NULL (default), only type and n are returned. Options: 'all' (all frequency metrics), 'rf' (relative frequency), 'orf' (observed relative frequency). Multiple options can be combined, e.g., c("rf", "orf").
dispersion: A character vector indicating which dispersion metrics to calculate. If NULL (default), only type and n are returned. Options: 'all' (all dispersion metrics), 'df' (document frequency), 'idf' (inverse document frequency), 'dp' (Gries' Deviation of Proportions). Multiple options can be combined, e.g., c("df", "idf").
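
As a rough illustration of what the frequency options describe, the sketch below computes n, rf, and orf by hand in base R. It does not call calc_type_metrics. The definitions used (rf as the count divided by the total number of tokens, orf as rf scaled per 100) are assumptions based on the descriptions on this page, not a confirmed account of the package's internals, and the names tokens and freq are hypothetical.

```r
# Toy data mirroring the example on this page (one row per token).
tokens <- data.frame(
  term = c("word1", "word1", "word2", "word2", "word2", "word3"),
  documents = c("doc1", "doc2", "doc1", "doc1", "doc2", "doc2")
)

# n: raw count of each type across all documents.
n <- table(tokens$term)

# Assumed definitions (not taken from the package source):
#   rf  = n / total number of tokens
#   orf = rf expressed per 100 tokens
freq <- data.frame(
  type = names(n),
  n = as.integer(n),
  rf = as.numeric(n) / sum(n),
  orf = as.numeric(n) / sum(n) * 100
)

print(freq)
```

Under these assumptions rf and orf carry the same information on different scales, so requesting both mainly affects how the values read in a report.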

Value

A data frame with columns:

type: The unique types from the input data.
n: The frequency of each type across all documents.

Optionally (based on the frequency and dispersion arguments):

rf: The relative frequency of each type across all documents.
orf: The observed relative frequency (per 100) of each type across all documents.
df: The document frequency of each type.
idf: The inverse document frequency of each type.
dp: Gries' Deviation of Proportions of each type.
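
The dispersion columns listed above can be sketched by hand in base R as follows. This is an illustration under stated assumptions, not the package's implementation: df is taken as the number of documents containing the type, idf as the natural log of the number of documents divided by df, and dp as the unnormalized Deviation of Proportions (half the summed absolute differences between each document's share of the type and its share of the corpus). The log base, the lack of normalization, and the names tokens, tdm, and disp are all assumptions.

```r
# Toy data mirroring the example on this page (one row per token).
tokens <- data.frame(
  term = c("word1", "word1", "word2", "word2", "word2", "word3"),
  documents = c("doc1", "doc2", "doc1", "doc1", "doc2", "doc2")
)

# Type-by-document count matrix: rows are types, columns are documents.
tdm <- unclass(table(tokens$term, tokens$documents))

# df: number of documents in which each type occurs at least once.
df <- rowSums(tdm > 0)

# idf (assumed form): natural log of (number of documents / df).
idf <- log(ncol(tdm) / df)

# dp (assumed unnormalized form):
#   s = each document's share of all tokens in the corpus
#   v = each document's share of a given type's occurrences
#   dp = 0.5 * sum(|v - s|) per type
s <- colSums(tdm) / sum(tdm)
v <- tdm / rowSums(tdm)
dp <- 0.5 * rowSums(abs(sweep(v, 2, s)))

disp <- data.frame(
  type = rownames(tdm),
  df = unname(df),
  idf = unname(idf),
  dp = unname(dp)
)

print(disp)
```

In this sketch a dp near 0 means a type is spread across documents in proportion to their sizes, while larger values indicate concentration in fewer documents.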

References

Gries, Stefan Th. (2023). Statistical Methods in Corpus Linguistics. In Readings in Corpus Linguistics: A Teaching and Research Guide for Scholars in Nigeria and Beyond, pp. 78-114.

Examples

if (FALSE) {
  # Toy tokenized data: one row per token, with its document ID
  data <- data.frame(
    term = c("word1", "word1", "word2", "word2", "word2", "word3"),
    documents = c("doc1", "doc2", "doc1", "doc1", "doc2", "doc2")
  )

  # Request relative frequencies plus document and inverse document frequency
  calc_type_metrics(
    data = data,
    type = term,
    documents = documents,
    frequency = c("rf", "orf"),
    dispersion = c("df", "idf")
  )
}