Calculates frequency and dispersion metrics for types (terms or tokens) in tokenized text data. The metrics describe how often each type occurs and how evenly it is distributed across the documents of a corpus.

Usage

calc_type_metrics(data, type, document, frequency = NULL, dispersion = NULL)

Arguments

data

Data frame. Contains the tokenized text data with document IDs and types/terms.

type

Symbol. Column in data containing the types to analyze (e.g., terms, lemmas).

document

Symbol. Column in data containing the document identifiers.

frequency

Character vector. Frequency metrics to calculate:

  • NULL (default): Returns only type counts

  • 'all': All available metrics

  • 'rf': Relative frequency

  • 'orf': Observed relative frequency (per 100)

dispersion

Character vector. Dispersion metrics to calculate:

  • NULL (default): Returns only type counts

  • 'all': All available metrics

  • 'df': Document frequency

  • 'idf': Inverse document frequency

  • 'dp': Gries' deviation of proportions

Value

Data frame containing requested metrics:

  • type: Unique types from input data

  • n: Raw frequency count

  • rf: Relative frequency (if requested)

  • orf: Observed relative frequency per 100 (if requested)

  • df: Document frequency (if requested)

  • idf: Inverse document frequency (if requested)

  • dp: Deviation of proportions (if requested)

Details

The function creates a term-document matrix internally and calculates the requested metrics. Frequency metrics show how often types occur, while dispersion metrics show how evenly they are distributed across documents.
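
As a rough illustration of these metrics, the base R sketch below builds a small type-by-document count matrix and derives n, rf, orf, df, and idf from it. The toy `tokens` data frame is hypothetical, and the idf formula shown (unsmoothed natural log of the number of documents over document frequency) is an assumption; the package's internal computation may differ.

```r
# Hypothetical toy data: one row per token
tokens <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d3"),
  type   = c("a",  "a",  "b",  "a",  "c",  "a")
)

# Type-by-document count matrix (rows = types, columns = documents)
tdm <- unclass(table(tokens$type, tokens$doc_id))

n   <- rowSums(tdm)          # raw frequency of each type
rf  <- n / sum(n)            # relative frequency (proportion of all tokens)
orf <- rf * 100              # relative frequency per 100 tokens
df  <- rowSums(tdm > 0)      # number of documents containing the type
idf <- log(ncol(tdm) / df)   # assumed idf variant: natural log, no smoothing

metrics <- data.frame(
  type = names(n),
  n    = as.vector(n),
  rf   = as.vector(rf),
  orf  = as.vector(orf),
  df   = as.vector(df),
  idf  = as.vector(idf)
)
metrics
```

In this toy corpus, type a occurs in every document, so its idf under the assumed formula is 0, while b and c each occur in only one of the three documents.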

The 'dp' metric (Gries' Deviation of Proportions) ranges from 0 (perfectly even distribution) to 1 (completely clumped distribution).
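
A minimal sketch of how DP can be computed by hand, assuming the standard unnormalized formula: half the sum of absolute differences between each document's share of the corpus and that document's share of the type's occurrences. The toy data are hypothetical, and the package may apply a different variant (for example a normalized DP), so these values need not match its output.

```r
# Hypothetical toy data: two documents of four tokens each
tokens <- data.frame(
  doc_id = c("d1", "d1", "d1", "d1", "d2", "d2", "d2", "d2"),
  type   = c("a",  "a",  "b",  "c",  "a",  "b",  "b",  "c")
)

# Type-by-document count matrix (rows = types, columns = documents)
tdm <- unclass(table(tokens$type, tokens$doc_id))

# s: expected proportions, i.e. each document's share of all tokens
s <- colSums(tdm) / sum(tdm)

# v: observed proportions, i.e. each document's share of a type's occurrences
v <- tdm / rowSums(tdm)

# DP per type: half the summed absolute differences between v and s
dp <- 0.5 * rowSums(abs(sweep(v, 2, s)))
dp
```

Here type c is split across the two documents in proportion to their size, so its DP is 0, while a and b each lean toward one document and get a DP of 1/6.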

References

Gries, Stefan Th. (2023). Statistical Methods in Corpus Linguistics. In Readings in Corpus Linguistics: A Teaching and Research Guide for Scholars in Nigeria and Beyond, pp. 78-114.

Examples

data_path <- system.file("extdata", "types_data.rds", package = "qtkit")
df <- readRDS(data_path)
calc_type_metrics(
  data = df,
  type = letter,
  document = doc_id,
  frequency = c("rf", "orf"),
  dispersion = "dp"
)
#>   type  n  rf orf        dp
#> A    A 20 0.2  20 0.1176471
#> B    B 20 0.2  20 0.1029412
#> C    C 20 0.2  20 0.1764706
#> D    D 20 0.2  20 0.1176471
#> E    E 20 0.2  20 0.1176471