Calculates association metrics for the bigrams in a corpus: pointwise mutual information (PMI), Dice's coefficient, and G-score.

Usage

calc_assoc_metrics(
  data,
  doc_index,
  token_index,
  type,
  association = "all",
  verbose = FALSE
)

Arguments

data

A data frame containing the corpus.

doc_index

Column in 'data' that identifies the document each token belongs to.

token_index

Column in 'data' that gives each token's position within its document.

type

Column in 'data' that holds the tokens or terms.

association

A character vector specifying which metrics to calculate. Can be any combination of 'pmi', 'dice_coeff', 'g_score', or 'all'. Default is 'all'.

verbose

A logical value. If TRUE, the intermediate probability columns are kept in the output. Default is FALSE.

Value

A data frame with one row per bigram, containing the token columns 'x' and 'y', a count column 'n', and one column for each calculated metric.
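To make the metric columns concrete, here is a minimal base R sketch of how such values can be derived from bigram counts. It is not the package's implementation. The toy corpus is made up, the marginal probabilities are taken from the bigram table (an assumption; the package may compute them differently), and the 'g_score' line is inferred from the example output on this page, where each g_score equals pmi + log(n / 11) and 11 is the sum of the 'n' column. PMI and Dice's coefficient follow their usual textbook definitions, using natural logarithms.

```r
# Toy corpus, one row per token (made-up data; column names mirror the arguments)
toy <- data.frame(
  doc_index   = c(1, 1, 1, 2, 2, 2),
  token_index = c(1, 2, 3, 1, 2, 3),
  type        = c("a", "b", "c", "a", "b", "d"),
  stringsAsFactors = FALSE
)

# Sort tokens, then pair each token (x) with the next token (y) in the same document
toy <- toy[order(toy$doc_index, toy$token_index), ]
first  <- toy[-nrow(toy), ]
second <- toy[-1, ]
same_doc <- first$doc_index == second$doc_index
bigrams <- data.frame(
  x = first$type[same_doc],
  y = second$type[same_doc],
  n = 1L,
  stringsAsFactors = FALSE
)

# Count each distinct bigram
counts <- aggregate(n ~ x + y, data = bigrams, FUN = sum)

# Joint and marginal probabilities (assumption: marginals come from the bigram table)
total <- sum(counts$n)
p_x <- tapply(counts$n, counts$x, sum) / total
p_y <- tapply(counts$n, counts$y, sum) / total
counts$p_xy <- counts$n / total
counts$p_x  <- as.numeric(p_x[as.character(counts$x)])
counts$p_y  <- as.numeric(p_y[as.character(counts$y)])

# Pointwise mutual information: log of observed over expected joint probability
counts$pmi <- log(counts$p_xy / (counts$p_x * counts$p_y))

# Dice's coefficient: twice the joint probability over the sum of the marginals
counts$dice_coeff <- 2 * counts$p_xy / (counts$p_x + counts$p_y)

# G-score as inferred from this page's example output (assumption, not confirmed)
counts$g_score <- counts$pmi + log(counts$p_xy)

print(counts[, c("x", "y", "n", "pmi", "dice_coeff", "g_score")])
```

Because the marginals here are an assumption, the numbers from this sketch should not be expected to match the package's output for the same data.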

Examples

data_path <- system.file("extdata", "bigrams_data.rds", package = "qtkit")
data <- readRDS(data_path)

calc_assoc_metrics(data, doc_index, token_index, type)
#>       y     x n       pmi dice_coeff    g_score
#> 1 word2 word1 1 1.7917595  0.6857143 -0.6061358
#> 2 word2 word3 1 0.6931472  0.4210526 -1.7047481
#> 3 word3 word2 2 1.3862944  0.8275862 -0.3184537
#> 4 word3 word4 1 0.2876821  0.3478261 -2.1102132
#> 5 word4 word3 2 0.9808293  0.6956522 -0.7239188
#> 6 word4 word5 1 0.6931472  0.4137931 -1.7047481
#> 7 word5 word4 2 1.3862944  0.8421053 -0.3184537
#> 8 word6 word5 1 1.7917595  0.7058824 -0.6061358
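
The 'association' and 'verbose' arguments described above could be exercised as in the sketch below. It assumes the qtkit package and its bundled example data are installed, and it skips the calls otherwise. The object names 'pmi_dice' and 'with_probs' are illustrative only. No output is shown because the exact columns returned were not verified here.

```r
# Runs only if qtkit is available (assumption: bundled example data is present)
if (requireNamespace("qtkit", quietly = TRUE)) {
  data_path <- system.file("extdata", "bigrams_data.rds", package = "qtkit")
  data <- readRDS(data_path)

  # Request a subset of metrics: PMI and Dice's coefficient only
  pmi_dice <- qtkit::calc_assoc_metrics(
    data, doc_index, token_index, type,
    association = c("pmi", "dice_coeff")
  )

  # Request all metrics and keep the intermediate probability columns
  with_probs <- qtkit::calc_assoc_metrics(
    data, doc_index, token_index, type,
    verbose = TRUE
  )

  # Inspect which columns each call returned
  print(names(pmi_dice))
  print(names(with_probs))
}
```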