R/3_2_textSimilarityTest.R
textSimilarityTest.Rd
EXPERIMENTAL: Test whether there is a significant difference in meaning between two sets of texts (i.e., between their word embeddings).
textSimilarityTest(
  x,
  y,
  similarity_method = "cosine",
  Npermutations = 10000,
  method = "paired",
  center = FALSE,
  scale = FALSE,
  alternative = "greater",
  output.permutations = TRUE,
  N_cluster_nodes = 1,
  seed = 1001
)
x: Set of word embeddings from textEmbed.
y: Set of word embeddings from textEmbed.
similarity_method: Character string describing the type of measure to be computed; default is "cosine" (see also the measures from textDistance, here computed as 1 - textDistance(), including "euclidean", "maximum", "manhattan", "canberra", "binary" and "minkowski").
Npermutations: Number of permutations (default 10000).
method: Compute a "paired" or an "unpaired" test.
center: (boolean; from base::scale) If center is TRUE, centering is done by subtracting the column means (omitting NAs) of x from their corresponding columns; if center is FALSE, no centering is done.
scale: (boolean; from base::scale) If scale is TRUE, scaling is done by dividing the (centered) columns of x by their standard deviations if center is TRUE, and by the root mean square otherwise.
alternative: Use a two-sided or a one-sided test (select one of: "two_sided", "less", "greater"; default "greater").
output.permutations: If TRUE, returns the permuted values in the output.
N_cluster_nodes: Number of cluster nodes to use (more makes computation faster; see the parallel package).
seed: Set a different seed (default 1001).
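The center and scale arguments inherit their semantics from base::scale; a minimal illustration of what those settings do to the columns of a numeric matrix (the matrix `m` here is a made-up stand-in for an embedding matrix):

```r
# Columns of m: (1, 2, 3) and (4, 5, 6).
m <- matrix(1:6, nrow = 3)

# center = TRUE subtracts each column's mean, so column means become 0.
centered <- scale(m, center = TRUE, scale = FALSE)
colMeans(centered)  # 0 0

# scale = TRUE (with center = TRUE) also divides each centered column by
# its standard deviation, so column standard deviations become 1.
z <- scale(m, center = TRUE, scale = TRUE)
apply(z, 2, sd)     # 1 1
```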
A list with the p-value, the similarity score estimate, and the permuted values (if output.permutations = TRUE).
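To make the test's logic concrete, here is a simplified, self-contained sketch of a paired permutation test on cosine similarity. The matrices `emb_x` and `emb_y` are hypothetical stand-ins for textEmbed output (one row per text, one column per embedding dimension), and the statistic shown (cosine similarity between the two aggregated embeddings) is an assumption for illustration; the package's exact internals may differ.

```r
# Cosine similarity between two numeric vectors.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

set.seed(1001)
emb_x <- matrix(rnorm(50), nrow = 10)  # 10 texts, 5 dimensions (made up)
emb_y <- matrix(rnorm(50), nrow = 10)

# Observed statistic: cosine similarity between the aggregated embeddings.
observed <- cosine_sim(colMeans(emb_x), colMeans(emb_y))

# Null distribution: randomly swap x/y membership within each pair and
# recompute the statistic.
null_estimates <- replicate(1000, {
  swap <- runif(nrow(emb_x)) < 0.5
  px <- emb_x
  py <- emb_y
  px[swap, ] <- emb_y[swap, ]
  py[swap, ] <- emb_x[swap, ]
  cosine_sim(colMeans(px), colMeans(py))
})

# One-sided ("greater") p-value: proportion of permuted estimates at least
# as large as the observed one.
p_value <- mean(null_estimates >= observed)
```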
x <- word_embeddings_4$texts$harmonywords
y <- word_embeddings_4$texts$satisfactionwords
textSimilarityTest(x,
  y,
  method = "paired",
  Npermutations = 100,
  N_cluster_nodes = 1,
  alternative = "two_sided"
)
#> $random.estimates.4.null
#> [1] 0.4833325 0.5452712 0.5288879 0.5523950 0.5062643 0.5056589 0.5299288
#> [8] 0.5294611 0.5041479 0.5638069 0.5444502 0.5143583 0.5869974 0.5448591
#> [15] 0.5665790 0.5951618 0.5209914 0.5354357 0.5298155 0.5636176 0.5076976
#> [22] 0.5027946 0.5477583 0.5202374 0.5805853 0.5102602 0.5081500 0.5649387
#> [29] 0.5302362 0.5318751 0.5368331 0.5172774 0.5243406 0.5460461 0.5398606
#> [36] 0.5332219 0.5531051 0.5641066 0.5174095 0.5381908 0.5483835 0.5433617
#> [43] 0.5427775 0.5554496 0.5259649 0.5084208 0.6094677 0.5571163 0.5698054
#> [50] 0.5936454 0.5773273 0.5322270 0.5134633 0.5458664 0.5348058 0.5241091
#> [57] 0.4600656 0.5288319 0.5120169 0.5505340 0.5359313 0.5515763 0.5001314
#> [64] 0.5304822 0.5226138 0.5504688 0.5121119 0.5488821 0.5652147 0.5097908
#> [71] 0.5927481 0.4933408 0.5266507 0.5426173 0.5244023 0.5334119 0.5670806
#> [78] 0.5722617 0.4648230 0.5310122 0.5302875 0.5128809 0.5556027 0.5753363
#> [85] 0.5827712 0.5493311 0.5374056 0.5789103 0.5447103 0.5477342 0.4953800
#> [92] 0.4887862 0.5565930 0.5247231 0.5654468 0.5284231 0.4555032 0.5716095
#> [99] 0.5198858 0.5346006
#>
#> $embedding_x
#> [1] "x : Information about the embeddings. textEmbedRawLayers: model: bert-base-uncased ; layers: 11 ; word_type_embeddings: TRUE ; max_token_to_sentence: 4 ; text_version: 0.9.99. textEmbedLayerAggregation: layers = 11 aggregation_from_layers_to_tokens = concatenate aggregation_from_tokens_to_texts = mean tokens_select = tokens_deselect = "
#>
#> $embedding_y
#> [1] "y : Information about the embeddings. textEmbedRawLayers: model: bert-base-uncased ; layers: 11 ; word_type_embeddings: TRUE ; max_token_to_sentence: 4 ; text_version: 0.9.99. textEmbedLayerAggregation: layers = 11 aggregation_from_layers_to_tokens = concatenate aggregation_from_tokens_to_texts = mean tokens_select = tokens_deselect = "
#>
#> $test_description
#> [1] "permutations = 100 similarity_method = cosine method = paired center = FALSE scale = FALSE alternative = two_sided"
#>
#> $time_date
#> [1] "Duration to run the test: 0.690457 secs; Date created: 2022-09-20 15:51:07"
#>
#> $cosine_estimate
#> [1] 0.606931
#>
#> $p.value
#> [1] 0.02
#>