Calculates association metrics (pointwise mutual information (PMI), Dice's coefficient, and G-score) for the bigrams in a corpus.
Usage
calc_assoc_metrics(
  data,
  doc_index,
  token_index,
  type,
  association = "all",
  verbose = FALSE
)
Arguments
- data
A data frame containing the corpus.
- doc_index
Column in 'data' which represents the document index.
- token_index
Column in 'data' which represents the token index.
- type
Column in 'data' which represents the tokens or terms.
- association
A character vector specifying which metrics to calculate: any combination of 'pmi', 'dice_coeff', and 'g_score', or 'all' to calculate all three. Default is 'all'.
- verbose
A logical value indicating whether to keep the intermediate probability columns. Default is FALSE.
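As a rough illustration of the input these arguments describe, the sketch below builds a small data frame with one token per row and pairs up adjacent tokens within each document. The toy data, the object names `toy` and `bigrams`, and the reading of `x` as the first token and `y` as the second are assumptions made for illustration; this is not the package's implementation.

```r
# Hypothetical toy corpus: one row per token, with a document id,
# the token's position within its document, and the token itself.
toy <- data.frame(
  doc_index   = c(1, 1, 1, 2, 2, 2),
  token_index = c(1, 2, 3, 1, 2, 3),
  type        = c("word1", "word2", "word3", "word2", "word3", "word4"),
  stringsAsFactors = FALSE
)

# Pair each token with the one that follows it in the same document.
# Assumption: x is the first token of the bigram, y the second.
bigrams <- do.call(rbind, lapply(split(toy, toy$doc_index), function(d) {
  d <- d[order(d$token_index), ]
  data.frame(
    x = head(d$type, -1),  # every token except the last
    y = tail(d$type, -1),  # every token except the first
    stringsAsFactors = FALSE
  )
}))

print(bigrams)
```

Splitting by document before pairing keeps a bigram from spanning a document boundary, which is presumably why the function asks for both a document index and a token index.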
Examples
data_path <- system.file("extdata", "bigrams_data.rds", package = "qtkit")
data <- readRDS(data_path)
calc_assoc_metrics(data, doc_index, token_index, type)
#>        y     x       pmi dice_coeff    g_score
#> 1  word2 word1 1.8787708  0.7272727 -0.5191244
#> 2  word2 word2      -Inf  0.0000000       -Inf
#> 3  word2 word3 0.7801586  0.4363636 -1.6177367
#> 4  word2 word4      -Inf  0.0000000       -Inf
#> 5  word2 word5      -Inf  0.0000000       -Inf
#> 6  word3 word1      -Inf  0.0000000       -Inf
#> 7  word3 word2 1.4733057  0.8727273 -0.2314424
#> 8  word3 word3      -Inf  0.0000000       -Inf
#> 9  word3 word4 0.3746934  0.3636364 -2.0232018
#> 10 word3 word5      -Inf  0.0000000       -Inf
#> 11 word4 word1      -Inf  0.0000000       -Inf
#> 12 word4 word2      -Inf  0.0000000       -Inf
#> 13 word4 word3 1.0678406  0.7272727 -0.6369075
#> 14 word4 word4      -Inf  0.0000000       -Inf
#> 15 word4 word5 0.7801586  0.4363636 -1.6177367
#> 16 word5 word1      -Inf  0.0000000       -Inf
#> 17 word5 word2      -Inf  0.0000000       -Inf
#> 18 word5 word3      -Inf  0.0000000       -Inf
#> 19 word5 word4 1.4733057  0.8727273 -0.2314424
#> 20 word5 word5      -Inf  0.0000000       -Inf
#> 21 word6 word1      -Inf  0.0000000       -Inf
#> 22 word6 word2      -Inf  0.0000000       -Inf
#> 23 word6 word3      -Inf  0.0000000       -Inf
#> 24 word6 word4      -Inf  0.0000000       -Inf
#> 25 word6 word5 1.8787708  0.7272727 -0.5191244
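To see how scores of this kind can arise, the sketch below computes PMI and Dice's coefficient from raw bigram counts using common textbook definitions. The toy bigrams, the choice of log base 2, and the way probabilities are estimated are all assumptions, so the numbers will not necessarily match what `calc_assoc_metrics()` returns. The G-score is left out because its exact definition is not given here.

```r
# Hypothetical bigram list (x = first token, y = second token).
x <- c("a", "b", "a", "b", "c")
y <- c("b", "c", "b", "a", "a")

n <- length(x)                   # total number of bigrams
joint <- unclass(table(x, y))    # count matrix; 0 for unseen pairs
f_x <- rowSums(joint)            # how often each token occurs first
f_y <- colSums(joint)            # how often each token occurs second

# PMI (textbook form, log base 2 assumed): log2( P(x, y) / (P(x) * P(y)) )
p_xy <- joint / n
expected <- outer(f_x / n, f_y / n)
pmi <- log2(p_xy / expected)

# Dice's coefficient (textbook form): 2 * f(x, y) / (f(x) + f(y))
dice <- 2 * joint / outer(f_x, f_y, "+")

print(round(pmi, 3))
print(round(dice, 3))
```

Under these definitions a pair that never occurs has a joint count of zero, so its PMI is `log2(0) = -Inf` and its Dice's coefficient is 0. That is the same pattern seen in the unattested rows of the output above.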