Calculates association metrics (pointwise mutual information (PMI), Dice's coefficient, and G-score) for bigrams in a corpus.
Usage
calc_assoc_metrics(
  data,
  doc_index,
  token_index,
  type,
  association = "all",
  verbose = FALSE
)
Arguments
- data
A data frame containing the corpus.
- doc_index
Column in 'data' which represents the document index.
- token_index
Column in 'data' which represents the token index.
- type
Column in 'data' which represents the tokens or terms.
- association
A character vector specifying which metrics to calculate: one or more of 'pmi', 'dice_coeff', and 'g_score', or 'all' to calculate all three. Default is 'all'.
- verbose
A logical value indicating whether to keep the intermediate probability columns. Default is FALSE.
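The arguments above imply a data frame with one column each for the document index, the token index, and the token itself. The following is a minimal sketch of such an input, assuming a one-row-per-token layout; the toy documents and the name `corpus` are made up for illustration and are not taken from the package's bundled data.

```r
# Toy corpus: two short "documents" as character vectors (illustrative only).
docs <- list(
  c("word1", "word2", "word3"),
  c("word3", "word4", "word5")
)

# Assumed layout: one row per token, with columns named after the
# function's arguments so they can be passed as-is.
corpus <- data.frame(
  doc_index = rep(seq_along(docs), lengths(docs)),  # which document
  token_index = sequence(lengths(docs)),            # position within document
  type = unlist(docs),                              # the token itself
  stringsAsFactors = FALSE
)

print(corpus)
```

A data frame shaped like this could then presumably be passed as `calc_assoc_metrics(corpus, doc_index, token_index, type, association = "pmi")`; that call is not run here.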
Examples
data_path <- system.file("extdata", "bigrams_data.rds", package = "qtkit")
data <- readRDS(data_path)
calc_assoc_metrics(data, doc_index, token_index, type)
#>       y     x n       pmi dice_coeff    g_score
#> 1 word2 word1 1 1.7917595  0.6857143 -0.6061358
#> 2 word2 word3 1 0.6931472  0.4210526 -1.7047481
#> 3 word3 word2 2 1.3862944  0.8275862 -0.3184537
#> 4 word3 word4 1 0.2876821  0.3478261 -2.1102132
#> 5 word4 word3 2 0.9808293  0.6956522 -0.7239188
#> 6 word4 word5 1 0.6931472  0.4137931 -1.7047481
#> 7 word5 word4 2 1.3862944  0.8421053 -0.3184537
#> 8 word6 word5 1 1.7917595  0.7058824 -0.6061358
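To make the three metric columns easier to read, here is a rough sketch of how they can be computed from a bigram's joint probability and the two tokens' marginal probabilities. The helper `assoc_from_probs()` is hypothetical and not part of qtkit. The formulas and the probabilities used are assumptions, chosen because they reproduce the first row of the output above; they are not taken from the package source.

```r
# Hypothetical helper (not part of qtkit): association metrics from a
# bigram's joint probability p_xy and the marginal probabilities p_x, p_y.
assoc_from_probs <- function(p_xy, p_x, p_y) {
  c(
    # Pointwise mutual information: log ratio of observed to expected
    pmi = log(p_xy / (p_x * p_y)),
    # Dice's coefficient: overlap relative to the two marginals
    dice_coeff = 2 * p_xy / (p_x + p_y),
    # Assumed G-score form: the expression that matches the printed values
    g_score = 2 * log(p_xy) - log(p_x) - log(p_y)
  )
}

# Assumed probabilities, inferred from the printed output rather than
# read from the data: they reproduce row 1 (word2, word1).
m <- assoc_from_probs(p_xy = 1 / 11, p_x = 2 / 11, p_y = 1 / 12)
print(round(m, 7))
```

Under these assumptions the rounded values agree with row 1 of the example output (pmi 1.7917595, dice_coeff 0.6857143, g_score -0.6061358). In this form the G-score equals the PMI plus the log of the joint probability, which would explain why rows with the same `n` and the same `pmi` also share a `g_score`.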