This function calculates association metrics (Pointwise Mutual Information (PMI), Dice's Coefficient, and G-score) for the bigrams in a corpus. The input data frame must contain a document index column, a token index column, and a 'type' column holding the tokens.

Usage

calc_assoc_metrics(
  data,
  doc_index,
  token_index,
  type,
  association = "all",
  verbose = FALSE
)

Arguments

data

A data frame containing the corpus.

doc_index

A string giving the name of the column in 'data' that holds the document index.

token_index

A string giving the name of the column in 'data' that holds the token index, that is, the position of each token within its document.

type

A string giving the name of the column in 'data' that holds the tokens or terms.

association

A character vector specifying which metrics to calculate. Can be any combination of 'pmi' (Pointwise Mutual Information), 'dice_coeff' (Dice's Coefficient), 'g_score' (G-score), or 'all' (calculate all metrics). Default is 'all'.

verbose

A logical value indicating whether to keep the intermediate probability columns ('p_xy', 'p_x', 'p_y') in the result. Default is FALSE.

Value

A data frame with one row per bigram and columns for each calculated metric. If 'verbose' is TRUE, the intermediate probabilities used in the calculations are also included.
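
To make the relationship between the intermediate probabilities and the metrics concrete, here is a minimal base-R sketch. It is not the package's implementation. It assumes that bigrams are adjacent tokens within the same document, that 'p_x' and 'p_y' are the marginal probabilities of a type in the first and second bigram slot, that PMI uses a base-2 logarithm, and that the output columns are named after the 'association' options. The G-score is left out because its exact formulation here is not documented on this page.

```r
# Toy corpus with the same column layout the function expects.
corpus <- data.frame(
  doc_index   = c(1, 1, 1, 2, 2),
  token_index = c(1, 2, 3, 1, 2),
  type        = c("a", "b", "c", "a", "b"),
  stringsAsFactors = FALSE
)

# Sort by document and position so that neighbouring rows are neighbouring tokens.
corpus <- corpus[order(corpus$doc_index, corpus$token_index), ]

# Pair each token with its successor and drop pairs that span two documents.
n <- nrow(corpus)
same_doc <- corpus$doc_index[-n] == corpus$doc_index[-1]
pairs <- data.frame(
  x    = corpus$type[-n][same_doc],
  y    = corpus$type[-1][same_doc],
  n_xy = 1,
  stringsAsFactors = FALSE
)

# One row per distinct bigram, with its frequency.
bigrams <- aggregate(n_xy ~ x + y, data = pairs, FUN = sum)

# Intermediate probabilities (assumed definitions, see the note above).
total <- sum(bigrams$n_xy)
n_x <- tapply(bigrams$n_xy, bigrams$x, sum)
n_y <- tapply(bigrams$n_xy, bigrams$y, sum)
bigrams$p_xy <- bigrams$n_xy / total
bigrams$p_x  <- as.numeric(n_x[as.character(bigrams$x)]) / total
bigrams$p_y  <- as.numeric(n_y[as.character(bigrams$y)]) / total

# Textbook definitions of the two metrics.
bigrams$pmi        <- log2(bigrams$p_xy / (bigrams$p_x * bigrams$p_y))
bigrams$dice_coeff <- 2 * bigrams$p_xy / (bigrams$p_x + bigrams$p_y)

print(bigrams)
```

In this toy corpus the pair that would join the last token of document 1 to the first token of document 2 is discarded, so only the bigrams "a b" and "b c" remain.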

Examples

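if (FALSE) {
library(dplyr)

# Toy corpus: document 1 has three tokens, document 2 has one.
data <- tibble::tibble(
  doc_index = c(1, 1, 1, 2),
  token_index = c(1, 2, 3, 1),
  type = c("word1", "word2", "word3", "word2")
)

# Calculate all association metrics (the default) for the bigrams in `data`.
calc_assoc_metrics(data, doc_index, token_index, type)
}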