
Calculates association metrics (pointwise mutual information (PMI), Dice's coefficient, and G-score) for the bigrams in a corpus.
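For orientation, the usual textbook definitions of two of these metrics are sketched below. This is an assumption about what the metrics conventionally mean, not a statement of how the package computes them: the logarithm base, the probability estimates, and the G-score formulation are not given on this page. Here P(x, y) stands for the probability of the bigram and P(x), P(y) for the probabilities of its two words.

```latex
\[
\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)},
\qquad
\mathrm{Dice}(x, y) = \frac{2\,P(x, y)}{P(x) + P(y)}
\]
```

Under these definitions a pair with P(x, y) = 0 has a PMI of -Inf and a Dice coefficient of 0, which is the pattern visible in the example output.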

Usage

calc_assoc_metrics(
  data,
  doc_index,
  token_index,
  type,
  association = "all",
  verbose = FALSE
)

Arguments

data

A data frame containing the corpus.

doc_index

Column in 'data' that holds the document index.

token_index

Column in 'data' that holds the token index.

type

Column in 'data' that holds the tokens or terms.

association

A character vector specifying which metrics to calculate: one or more of 'pmi', 'dice_coeff', and 'g_score', or 'all' to calculate all three. Default is 'all'.

verbose

A logical value indicating whether to keep the intermediate probability columns. Default is FALSE.
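A minimal sketch of how the optional arguments might be combined, using the example data that ships with the package. The object names `pmi_only` and `with_probs` are illustrative only, and the exact columns returned in each case are not shown here because this page does not document them.

```r
# Run only when qtkit is installed, so the snippet is safe to source anywhere.
if (requireNamespace("qtkit", quietly = TRUE)) {
  data_path <- system.file("extdata", "bigrams_data.rds", package = "qtkit")
  data <- readRDS(data_path)

  # Request a single metric instead of the default "all".
  pmi_only <- qtkit::calc_assoc_metrics(
    data, doc_index, token_index, type,
    association = "pmi"
  )

  # Request two metrics and keep the intermediate probability columns.
  with_probs <- qtkit::calc_assoc_metrics(
    data, doc_index, token_index, type,
    association = c("pmi", "dice_coeff"),
    verbose = TRUE
  )
}
```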

Value

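A data frame with one row per bigram: columns 'x' and 'y' identify the two words of the pair, followed by one column for each calculated metric. As the example output shows, pairs with no association to report still get a row, with -Inf (or 0 for 'dice_coeff') as the metric value.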

Examples

data_path <- system.file("extdata", "bigrams_data.rds", package = "qtkit")
data <- readRDS(data_path)

calc_assoc_metrics(data, doc_index, token_index, type)
#>        y     x       pmi dice_coeff    g_score
#> 1  word2 word1 1.8787708  0.7272727 -0.5191244
#> 2  word2 word2      -Inf  0.0000000       -Inf
#> 3  word2 word3 0.7801586  0.4363636 -1.6177367
#> 4  word2 word4      -Inf  0.0000000       -Inf
#> 5  word2 word5      -Inf  0.0000000       -Inf
#> 6  word3 word1      -Inf  0.0000000       -Inf
#> 7  word3 word2 1.4733057  0.8727273 -0.2314424
#> 8  word3 word3      -Inf  0.0000000       -Inf
#> 9  word3 word4 0.3746934  0.3636364 -2.0232018
#> 10 word3 word5      -Inf  0.0000000       -Inf
#> 11 word4 word1      -Inf  0.0000000       -Inf
#> 12 word4 word2      -Inf  0.0000000       -Inf
#> 13 word4 word3 1.0678406  0.7272727 -0.6369075
#> 14 word4 word4      -Inf  0.0000000       -Inf
#> 15 word4 word5 0.7801586  0.4363636 -1.6177367
#> 16 word5 word1      -Inf  0.0000000       -Inf
#> 17 word5 word2      -Inf  0.0000000       -Inf
#> 18 word5 word3      -Inf  0.0000000       -Inf
#> 19 word5 word4 1.4733057  0.8727273 -0.2314424
#> 20 word5 word5      -Inf  0.0000000       -Inf
#> 21 word6 word1      -Inf  0.0000000       -Inf
#> 22 word6 word2      -Inf  0.0000000       -Inf
#> 23 word6 word3      -Inf  0.0000000       -Inf
#> 24 word6 word4      -Inf  0.0000000       -Inf
#> 25 word6 word5 1.8787708  0.7272727 -0.5191244
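Many rows in this output carry -Inf, presumably pairs that never occur together in the corpus. A minimal base R sketch of one way to drop those rows and rank the rest by PMI follows. The data frame `assoc` is a hand-copied subset of the output above, used so the snippet runs without the package; `attested` and `ranked` are illustrative names.

```r
# A few rows copied from the output above (not the full result).
assoc <- data.frame(
  y = c("word2", "word2", "word2", "word3", "word3"),
  x = c("word1", "word2", "word3", "word2", "word4"),
  pmi = c(1.8787708, -Inf, 0.7801586, 1.4733057, 0.3746934),
  dice_coeff = c(0.7272727, 0, 0.4363636, 0.8727273, 0.3636364),
  g_score = c(-0.5191244, -Inf, -1.6177367, -0.2314424, -2.0232018)
)

# Keep only pairs with a finite PMI, i.e. drop the -Inf rows.
attested <- assoc[is.finite(assoc$pmi), ]

# Order the remaining pairs from highest to lowest PMI.
ranked <- attested[order(attested$pmi, decreasing = TRUE), ]
print(ranked)
```

Filtering on `is.finite()` rather than comparing against a threshold keeps negative but finite scores, which still describe observed pairs.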