
Compute word-level prediction scores for plotting with textProjectionPlot().
Source: R/4_4_textWordPrediction.R
For each unique word in `words`, the function:
1. Computes the **mean value of `x`** (and optionally `y`) across all participants whose response contained that word.
2. Looks up the **decontextualised embedding** for that word from `word_types_embeddings`.
3. Trains a **ridge regression** model: embedding → mean x score. The out-of-sample predictions become the `x_plotted` plotting coordinate, allowing generalisation to words unseen in training.
4. Optionally computes **permutation-based p-values** (see `n_permutations`) by shuffling `x` labels and building a null distribution of prediction scores.
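The first step above can be illustrated with a minimal base-R sketch. The data are invented, and the whitespace tokenisation and once-per-response counting are assumptions made for illustration; the package's own tokenisation may differ.

```r
# Toy data (made up): one free-text response and one score per participant.
responses <- c("calm happy", "happy", "tense calm", "tense")
x <- c(8, 9, 4, 2)

# Assumption: split on whitespace and count each word once per response.
tokens <- lapply(strsplit(tolower(responses), "\\s+"), unique)

# Long format: one row per (participant, word) pair.
long <- data.frame(
  words = unlist(tokens),
  x = rep(x, lengths(tokens))
)

# Mean of x across participants whose response contained the word.
means <- tapply(long$x, long$words, mean)
counts <- table(long$words)

word_means <- data.frame(
  words = names(means),
  n = as.integer(counts[names(means)]),
  word_mean_value_x = as.numeric(means)
)
word_means
```

Here "calm" appears in the responses scored 8 and 4, so its mean is 6; these per-word means are what the ridge model in the third step is trained to predict.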
The returned `word_data` tibble has column names that match the expectations
of textProjectionPlot: `x_plotted` (and `y_plotted`) for coordinates
and `p_values_x` (and `p_values_y`) for significance.
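To make the meaning of `x_plotted` concrete, the following sketch produces out-of-sample ridge predictions with a hand-written closed-form fit and 5-fold cross-validation. It is a stand-in under stated assumptions, not the package's implementation (which relies on `textTrainRegression`): the embeddings are random numbers, and `ridge_fit`, `ridge_predict`, the fixed penalty `lambda`, and the fold count are all hypothetical.

```r
set.seed(1003)

# Toy stand-ins: 40 "words", each with a 5-dimensional embedding and a mean score.
n_words <- 40
emb <- matrix(rnorm(n_words * 5), nrow = n_words)
word_mean <- as.numeric(emb %*% c(1, -0.5, 0, 0, 0.25) + rnorm(n_words, sd = 0.3))

# Closed-form ridge fit on centred data; lambda is a fixed, hypothetical penalty.
ridge_fit <- function(X, y, lambda = 1) {
  x_means <- colMeans(X)
  y_mean <- mean(y)
  Xc <- sweep(X, 2, x_means)
  beta <- solve(crossprod(Xc) + lambda * diag(ncol(X)), crossprod(Xc, y - y_mean))
  list(beta = beta, x_means = x_means, y_mean = y_mean)
}

ridge_predict <- function(fit, X) {
  as.numeric(sweep(X, 2, fit$x_means) %*% fit$beta + fit$y_mean)
}

# 5-fold cross-validation: each word is predicted by a model that never saw it.
folds <- sample(rep(1:5, length.out = n_words))
x_plotted <- numeric(n_words)
for (k in 1:5) {
  held_out <- folds == k
  fit <- ridge_fit(emb[!held_out, , drop = FALSE], word_mean[!held_out])
  x_plotted[held_out] <- ridge_predict(fit, emb[held_out, , drop = FALSE])
}

# Agreement between out-of-sample predictions and the observed word means.
cor(x_plotted, word_mean)
```

Because every prediction comes from a model that did not see that word, the plotted coordinate reflects what the embedding carries about the outcome rather than a memorised word mean.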
Usage
textWordPrediction(
  words,
  word_types_embeddings = word_types_embeddings_df,
  x,
  y = NULL,
  n_models = 25,
  n_permutations = 10000,
  seed = 1003,
  case_insensitive = TRUE,
  text_remove = "[()]",
  ...
)
Arguments
- words
Character vector **or** single-column tibble of free-text responses (one per participant).
- word_types_embeddings
Word-type embeddings from `textEmbed`, specifically the `$word_types` component. These are *decontextualised*: one fixed vector per unique word type.
- x
Numeric vector (or single-column tibble) of the outcome variable to project words onto the x-axis (e.g., a well-being scale score).
- y
Optional numeric vector for a second outcome to project onto the y-axis. Default `NULL`.
- n_models
Number of null ridge regression models to fit, each trained on a *different* permuted `x` vector. Each null fit produces genuine cross-validated out-of-sample null scores, one per word. Determines p-value resolution: the minimum non-trivial p-value step is approximately `1/n_models` (e.g., 0.04 with 25 models, which is just below alpha = 0.05). Default `25`.
- n_permutations
Number of bootstrap samples drawn from the `n_models` null scores to smooth the null distribution. Does not require additional model fits. Set to `0` to skip p-values entirely. Default `10000`.
- seed
Integer seed for reproducibility. Default `1003`.
- case_insensitive
Logical. If `TRUE` (default), word matching ignores capitalisation.
- text_remove
Regex pattern for characters to strip before processing (e.g., brackets). Default `"[()]"`.
- ...
Additional arguments forwarded to `textTrainRegression`.
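The relationship between `n_models`, `n_permutations`, and p-value resolution can be sketched as follows. This is an illustration only: the observed score and null scores are invented, and the two-sided empirical p-value used here is an assumption, not necessarily the formula the function applies.

```r
set.seed(1003)

n_models <- 25
observed <- 1.8                   # made-up observed score for one word
null_scores <- rnorm(n_models)    # made-up null scores, one per null model

# Assumed two-sided empirical p-value: share of null scores at least as extreme.
p_raw <- mean(abs(null_scores) >= abs(observed))

# Raw p-values can only take values that are multiples of 1 / n_models.
step <- 1 / n_models
step  # 0.04

# Bootstrap draws resample the existing null scores; no new models are fitted.
n_permutations <- 10000
boot_null <- sample(null_scores, n_permutations, replace = TRUE)
p_boot <- mean(abs(boot_null) >= abs(observed))

c(p_raw = p_raw, p_boot = p_boot)
```

Under these assumptions, raising `n_permutations` only resamples the same `n_models` scores, so finer p-value resolution requires more null model fits.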
Value
A named list:
- model_x
The fitted `textTrainRegression` model for the x-axis.
- model_y
(Only if `y` is supplied) Fitted model for the y-axis.
- word_data
A tibble with one row per unique word containing: `words`, `n` (frequency), `word_mean_value_x`, `x_plotted` (embedding-based prediction), `p_values_x`; plus the y-equivalents when `y` is provided.
The comment attribute on the output stores a human-readable description of all call parameters for reproducibility.
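A short sketch of how the returned list might be inspected, using a mock object with the documented column names. All values, the placeholder model, and the wording of the comment string are invented for illustration; only the base-R functions `comment()`, `subset()`, and `order()` are real.

```r
# Mock object with the documented shape; every value here is made up.
word_data <- data.frame(
  words = c("calm", "happy", "tense"),
  n = c(12L, 20L, 9L),
  word_mean_value_x = c(6.0, 8.5, 3.0),
  x_plotted = c(5.8, 8.1, 3.4),
  p_values_x = c(0.40, 0.01, 0.03)
)
result <- list(model_x = "fitted model placeholder", word_data = word_data)

# Hypothetical description string; the real format is set by the function.
comment(result) <- "n_models = 25; n_permutations = 10000; seed = 1003"

# Words below alpha = 0.05, ordered by their plotted x coordinate.
sig <- subset(result$word_data, p_values_x < 0.05)
sig <- sig[order(sig$x_plotted), ]
sig$words

# The comment attribute is not shown when printing; read it explicitly.
comment(result)
```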
Examples
if (FALSE) { # \dontrun{
library(text)
# --- Step 1: embed the text column (produces text-level + word-type embeddings)
embeddings <- textEmbed(Language_based_assessment_data_8["harmonywords"])
# --- Step 2: run textWordPrediction
result <- textWordPrediction(
  words = Language_based_assessment_data_8$harmonywords,
  word_types_embeddings = embeddings$word_types,
  x = Language_based_assessment_data_8$hilstotal,
  n_models = 5, # 5 null fits, each on a differently permuted x (p-value step of about 1/5)
  n_permutations = 10000, # bootstrap draws from the null scores; no extra model fits
  seed = 1003
)
# --- Step 3: inspect word-level scores
result$word_data
# --- Step 4: pass directly to textProjectionPlot
textProjectionPlot(result)
} # }
