R/3_2_textSimilarityTest.R
textSimilarityTest.Rd
EXPERIMENTAL: Test whether there is a significant difference in meaning between two sets of texts (i.e., between their word embeddings).
textSimilarityTest(
  x,
  y,
  similarity_method = "cosine",
  Npermutations = 10000,
  method = "paired",
  center = FALSE,
  scale = FALSE,
  alternative = "greater",
  output.permutations = TRUE,
  N_cluster_nodes = 1,
  seed = 1001
)
x: Set of word embeddings from textEmbed.
y: Set of word embeddings from textEmbed.
similarity_method: Character string describing the type of measure to be computed; default is "cosine" (see also the measures from textDistance, here computed as 1 - textDistance(), including "euclidean", "maximum", "manhattan", "canberra", "binary" and "minkowski").
Npermutations: Number of permutations (default 10000).
method: Compute a "paired" or an "unpaired" test.
center: (boolean; from base::scale) If center is TRUE, centering is done by subtracting the column means (omitting NAs) of x from their corresponding columns; if center is FALSE, no centering is done.
scale: (boolean; from base::scale) If scale is TRUE, scaling is done by dividing the (centered) columns of x by their standard deviations if center is TRUE, and by the root mean square otherwise.
alternative: Use a two-sided or a one-sided test (select one of: "two_sided", "less", "greater"; default "greater").
output.permutations: If TRUE, returns the permuted values in the output.
N_cluster_nodes: Number of cluster nodes to use (more makes computation faster; see the parallel package).
seed: Set a different seed (default 1001).
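The center and scale arguments inherit their semantics from base::scale; a minimal illustration of what those settings do to the columns of a numeric matrix (the matrix `m` here is a made-up stand-in for an embedding matrix):

```r
# Columns of m: (1, 2, 3) and (4, 5, 6).
m <- matrix(1:6, nrow = 3)

# center = TRUE subtracts each column's mean, so column means become 0.
centered <- scale(m, center = TRUE, scale = FALSE)
colMeans(centered)  # 0 0

# scale = TRUE (with center = TRUE) also divides each centered column by
# its standard deviation, so column standard deviations become 1.
z <- scale(m, center = TRUE, scale = TRUE)
apply(z, 2, sd)     # 1 1
```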
A list with the p-value, the similarity score estimate, and the permuted values (if output.permutations = TRUE).
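To make the test's logic concrete, here is a simplified, self-contained sketch of a paired permutation test on cosine similarity. The matrices `emb_x` and `emb_y` are hypothetical stand-ins for textEmbed output (one row per text, one column per embedding dimension), and the statistic shown (cosine similarity between the two aggregated embeddings) is an assumption for illustration; the package's exact internals may differ.

```r
# Cosine similarity between two numeric vectors.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

set.seed(1001)
emb_x <- matrix(rnorm(50), nrow = 10)  # 10 texts, 5 dimensions (made up)
emb_y <- matrix(rnorm(50), nrow = 10)

# Observed statistic: cosine similarity between the aggregated embeddings.
observed <- cosine_sim(colMeans(emb_x), colMeans(emb_y))

# Null distribution: randomly swap x/y membership within each pair and
# recompute the statistic.
null_estimates <- replicate(1000, {
  swap <- runif(nrow(emb_x)) < 0.5
  px <- emb_x
  py <- emb_y
  px[swap, ] <- emb_y[swap, ]
  py[swap, ] <- emb_x[swap, ]
  cosine_sim(colMeans(px), colMeans(py))
})

# One-sided ("greater") p-value: proportion of permuted estimates at least
# as large as the observed one.
p_value <- mean(null_estimates >= observed)
```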
x <- word_embeddings_4$texts$harmonywords
y <- word_embeddings_4$texts$satisfactionwords
textSimilarityTest(x,
  y,
  method = "paired",
  Npermutations = 100,
  N_cluster_nodes = 1,
  alternative = "two_sided"
)
#> $random.estimates.4.null
#> [1] 0.4833325 0.5452712 0.5288879 0.5523950 0.5062643 0.5056589 0.5299288
#> [8] 0.5294611 0.5041479 0.5638069 0.5444502 0.5143583 0.5869974 0.5448591
#> [15] 0.5665790 0.5951618 0.5209914 0.5354357 0.5298155 0.5636176 0.5076976
#> [22] 0.5027946 0.5477583 0.5202374 0.5805853 0.5102602 0.5081500 0.5649387
#> [29] 0.5302362 0.5318751 0.5368331 0.5172774 0.5243406 0.5460461 0.5398606
#> [36] 0.5332219 0.5531051 0.5641066 0.5174095 0.5381908 0.5483835 0.5433617
#> [43] 0.5427775 0.5554496 0.5259649 0.5084208 0.6094677 0.5571163 0.5698054
#> [50] 0.5936454 0.5773273 0.5322270 0.5134633 0.5458664 0.5348058 0.5241091
#> [57] 0.4600656 0.5288319 0.5120169 0.5505340 0.5359313 0.5515763 0.5001314
#> [64] 0.5304822 0.5226138 0.5504688 0.5121119 0.5488821 0.5652147 0.5097908
#> [71] 0.5927481 0.4933408 0.5266507 0.5426173 0.5244023 0.5334119 0.5670806
#> [78] 0.5722617 0.4648230 0.5310122 0.5302875 0.5128809 0.5556027 0.5753363
#> [85] 0.5827712 0.5493311 0.5374056 0.5789103 0.5447103 0.5477342 0.4953800
#> [92] 0.4887862 0.5565930 0.5247231 0.5654468 0.5284231 0.4555032 0.5716095
#> [99] 0.5198858 0.5346006
#>
#> $embedding_x
#> [1] "x : Information about the embeddings. textEmbedRawLayers: model: bert-base-uncased ; layers: 11 ; word_type_embeddings: TRUE ; max_token_to_sentence: 4 ; text_version: 0.9.99. textEmbedLayerAggregation: layers = 11 aggregation_from_layers_to_tokens = concatenate aggregation_from_tokens_to_texts = mean tokens_select = tokens_deselect = "
#>
#> $embedding_y
#> [1] "y : Information about the embeddings. textEmbedRawLayers: model: bert-base-uncased ; layers: 11 ; word_type_embeddings: TRUE ; max_token_to_sentence: 4 ; text_version: 0.9.99. textEmbedLayerAggregation: layers = 11 aggregation_from_layers_to_tokens = concatenate aggregation_from_tokens_to_texts = mean tokens_select = tokens_deselect = "
#>
#> $test_description
#> [1] "permutations = 100 similarity_method = cosine method = paired center = FALSE scale = FALSE alternative = two_sided"
#>
#> $time_date
#> [1] "Duration to run the test: 0.690457 secs; Date created: 2022-09-20 15:51:07"
#>
#> $cosine_estimate
#> [1] 0.606931
#>
#> $p.value
#> [1] 0.02
#>