R/1_1_textEmbed.R
textEmbedLayersOutput.Rd
Extract layers of hidden states (word embeddings) for all character variables in a given dataframe.
textEmbedLayersOutput(
x,
contexts = TRUE,
single_context_embeddings = FALSE,
decontexts = TRUE,
model = "bert-base-uncased",
layers = 11,
return_tokens = TRUE,
device = "cpu",
tokenizer_parallelism = FALSE,
model_max_length = NULL,
logging_level = "error"
)
x: A character variable or a tibble/dataframe with at least one character variable.

contexts: Provide word embeddings based on word contexts (standard method; default = TRUE).

single_context_embeddings: Aggregated contextualized word embeddings for each token in the text variables (only works for one text variable at a time; default = FALSE).

decontexts: Provide word embeddings of single words as input (embeddings used for plotting; default = TRUE).

model: Character string specifying the pre-trained language model (default = "bert-base-uncased"). For a full list of options, see the pretrained models at HuggingFace. For example, use "bert-base-multilingual-cased", "openai-gpt", "gpt2", "ctrl", "transfo-xl-wt103", "xlnet-base-cased", "xlm-mlm-enfr-1024", "distilbert-base-cased", "roberta-base", or "xlm-roberta-base".

layers: The layers that should be extracted (default = 11). It is more efficient to extract only the layers that you need (e.g., 11). You can also extract several layers (e.g., 11:12), or all layers by setting this parameter to "all". Layer 0 is the decontextualized input layer (i.e., it does not comprise hidden states) and should normally not be used. Extracted layers can then be aggregated with the textEmbedLayerAggregation function.

return_tokens: If TRUE, provide the tokens used in the specified transformer model.

device: Name of the device to use: "cpu", "gpu", or "gpu:k", where k is a specific device number.

tokenizer_parallelism: If TRUE, turn on tokenizer parallelism (default = FALSE).

model_max_length: The maximum length (in number of tokens) for the inputs to the transformer model (default = the value stored for the associated model).

logging_level: Set the logging level (default = "error"). Options, ordered from less to more logging: "critical", "error", "warning", "info", "debug".
A tibble with tokens, a column specifying the layer, and word embeddings. Note that layer 0 is the input embedding to the transformer and should normally not be used.
See also: textEmbedLayerAggregation and textEmbed.
\donttest{
x <- Language_based_assessment_data_8[1:2, 1:2]
word_embeddings_with_layers <- textEmbedLayersOutput(x, layers = 11:12)
}
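The extracted layer-wise output is typically passed on to textEmbedLayerAggregation, as noted above. The sketch below shows one plausible follow-up step; the exact argument names of textEmbedLayerAggregation (here `layers` and `aggregation`) are assumptions and may differ across versions of the text package, so check its help page before running.

```r
# Sketch (assumptions noted above): extract layers, then aggregate them
# into one word embedding per text with textEmbedLayerAggregation.
library(text)

x <- Language_based_assessment_data_8[1:2, 1:2]
word_embeddings_with_layers <- textEmbedLayersOutput(x, layers = 11:12)

# Aggregate layers 11 and 12 (e.g., by averaging) into a single embedding
# per text; argument names here are illustrative, see the package help.
aggregated_embeddings <- textEmbedLayerAggregation(
  word_embeddings_with_layers$context,
  layers = 11:12,
  aggregation = "mean"
)
```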