For each unique word in `words` the function:

  1. Computes the **mean value of `x`** (and optionally `y`) across all participants whose response contained that word.

  2. Looks up the **decontextualised embedding** for that word from `word_types_embeddings`.

  3. Trains a **ridge regression** model that predicts each word's mean `x` score from its embedding. The out-of-sample predictions become the `x_plotted` coordinate, allowing generalisation to words unseen in training.

  4. Optionally computes **permutation-based p-values** (see `n_models` and `n_permutations`) by shuffling the `x` values, refitting the model on each shuffled vector, and building a null distribution of prediction scores.

The returned `word_data` tibble has column names that match the expectations of textProjectionPlot: `x_plotted` (and `y_plotted`) for coordinates and `p_values_x` (and `p_values_y`) for significance.
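Steps 1 and 3 can be illustrated with a small base-R sketch. It is not the package's implementation: the responses, scores and two-dimensional embeddings are made up, tokenisation is a plain whitespace split, and the hypothetical helper `ridge_predict()` uses a closed-form ridge fit with leave-one-out prediction and a fixed `lambda = 1`, whereas the function itself relies on textTrainRegression.

```r
# Toy data (made up): three participants' responses and outcome scores.
responses <- c("happy calm", "calm tired", "happy")
x <- c(8, 4, 9)

# Step 1: mean of x across participants whose response contains each word.
tokens <- strsplit(tolower(responses), "\\s+")
vocab <- sort(unique(unlist(tokens)))
word_mean_x <- sapply(vocab, function(w) {
  contains_w <- vapply(tokens, function(tk) w %in% tk, logical(1))
  mean(x[contains_w])
})

# Hypothetical 2-d embeddings, one row per word (real embeddings are much wider).
emb <- matrix(c(0.1, 0.9,
                0.8, 0.2,
                0.2, 0.7),
              ncol = 2, byrow = TRUE, dimnames = list(vocab, NULL))

# Closed-form ridge fit with an unpenalised intercept (assumed lambda = 1).
ridge_predict <- function(X_train, y_train, X_new, lambda = 1) {
  X1 <- cbind(1, X_train)
  penalty <- diag(c(0, rep(lambda, ncol(X_train))))
  beta <- solve(crossprod(X1) + penalty, crossprod(X1, y_train))
  as.numeric(cbind(1, X_new) %*% beta)
}

# Step 3: each word is predicted by a model fitted on all the other words.
x_plotted <- vapply(seq_along(vocab), function(i) {
  ridge_predict(emb[-i, , drop = FALSE], word_mean_x[-i], emb[i, , drop = FALSE])
}, numeric(1))
names(x_plotted) <- vocab

word_mean_x
x_plotted
```

Leave-one-out is used here only because the toy vocabulary has three words; any scheme that keeps the predicted word out of the training set gives out-of-sample scores in the same sense.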

Usage

textWordPrediction(
  words,
  word_types_embeddings = word_types_embeddings_df,
  x,
  y = NULL,
  n_models = 25,
  n_permutations = 10000,
  seed = 1003,
  case_insensitive = TRUE,
  text_remove = "[()]",
  ...
)

Arguments

words

Character vector **or** single-column tibble of free-text responses (one per participant).

word_types_embeddings

Word-type embeddings from textEmbed — specifically the `$word_types` component. These are *decontextualised*: one fixed vector per unique word type.

x

Numeric vector (or single-column tibble) holding the outcome variable that words are projected onto along the x-axis (e.g., a well-being scale score).

y

Optional numeric vector for a second outcome to project onto the y-axis. Default NULL.

n_models

Number of null ridge regression models to fit, each trained on a *different* permuted x vector. Each null fit produces genuine cross-validated out-of-sample null scores — one per word. Determines p-value resolution: the minimum non-trivial p-value step is approximately 1/n_models (e.g., 0.04 with 25 models, which is just below alpha = 0.05). Default 25.
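The resolution limit can be checked with a short base-R sketch. The null scores and the observed score below are simulated stand-ins, and the two-sided "share of null scores at least as extreme" rule is an assumption chosen for illustration, not necessarily the exact rule the function applies.

```r
n_models <- 25
set.seed(1003)

# Stand-ins: one word's null scores (one per null model) and its observed score.
null_scores <- rnorm(n_models)
observed <- 2.5

# Empirical two-sided p-value: share of null scores at least as extreme.
p_value <- mean(abs(null_scores) >= abs(observed))

# With n_models null scores, only multiples of 1 / n_models can occur.
possible_p <- (0:n_models) / n_models
possible_p[2]  # smallest non-zero step: 1 / 25
```

Under this rule, lowering `n_models` to 5 would make the smallest non-zero step 1/5, so no word could fall between 0 and 0.2.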

n_permutations

Number of bootstrap samples drawn from the n_models null scores to smooth the null distribution. Does not require additional model fits. Set to 0 to skip p-values entirely. Default 10000.

seed

Integer seed for reproducibility. Default 1003.

case_insensitive

Logical. If TRUE (default), word matching ignores capitalisation.

text_remove

Regex pattern for characters to strip before processing (e.g., brackets). Default "[()]".
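As a rough illustration of what `text_remove` and `case_insensitive` imply, the base-R sketch below strips the default pattern and lower-cases two made-up responses. The function's internal cleaning steps may differ in order or detail.

```r
# Made-up responses containing brackets and mixed capitalisation.
responses <- c("Happy (very) calm", "CALM and content")

# Strip characters matching the default text_remove pattern "[()]".
cleaned <- gsub("[()]", "", responses)

# Lower-case so that "CALM" and "calm" count as the same word type.
cleaned <- tolower(cleaned)
cleaned
```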

...

Additional arguments forwarded to textTrainRegression.

Value

A named list:

model_x

The fitted textTrainRegression model for the x-axis.

model_y

(Only if `y` is supplied) Fitted model for the y-axis.

word_data

A tibble with one row per unique word containing: words, n (frequency), word_mean_value_x, x_plotted (embedding-based prediction), p_values_x; plus the y-equivalents when `y` is provided.
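A typical next step is to keep only the words that reach significance. The sketch below uses a made-up data frame with the column names listed above; the values are purely illustrative and a plain `data.frame` stands in for the tibble.

```r
# Made-up stand-in for result$word_data (values are illustrative only).
word_data <- data.frame(
  words             = c("calm", "happy", "tired"),
  n                 = c(12, 20, 7),
  word_mean_value_x = c(6.0, 8.5, 4.0),
  x_plotted         = c(6.2, 8.1, 4.4),
  p_values_x        = c(0.20, 0.04, 0.04)
)

# Keep words below alpha = 0.05 and order them along the x-axis.
significant <- word_data[word_data$p_values_x < 0.05, ]
significant <- significant[order(significant$x_plotted),
                           c("words", "x_plotted", "p_values_x")]
significant
```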

The comment attribute on the output stores a human-readable description of all call parameters for reproducibility.
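Base R's `comment()` reads such an attribute. The sketch below sets one on a toy list so that it runs without the package; the stored string is made up and does not reproduce the function's actual wording.

```r
# Toy stand-in for the returned list; the comment text is made up.
result <- list(word_data = data.frame(words = "calm"))
comment(result) <- "textWordPrediction: n_models = 25, seed = 1003"

# The comment attribute is not shown when the object is printed,
# so it has to be requested explicitly.
comment(result)
```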

Examples

if (FALSE) { # \dontrun{
library(text)

# --- Step 1: embed the text column (produces text-level + word-type embeddings)
embeddings <- textEmbed(Language_based_assessment_data_8["harmonywords"])

# --- Step 2: run textWordPrediction
result <- textWordPrediction(
  words                 = Language_based_assessment_data_8$harmonywords,
  word_types_embeddings = embeddings$word_types,
  x                     = Language_based_assessment_data_8$hilstotal,
  n_models              = 5,      # 5 null fits, each on a differently permuted x
  n_permutations        = 10000,  # 10 000 bootstrap draws from the null scores; no extra fits
  seed                  = 1003
)

# --- Step 3: inspect word-level scores
result$word_data

# --- Step 4: pass directly to textProjectionPlot
textProjectionPlot(result)
} # }
