PsychWordVec (version 2025.3)

tokenize: Tokenize raw text for training word embeddings.

Description

Tokenize raw text for training word embeddings.

Usage

tokenize(
  text,
  tokenizer = text2vec::word_tokenizer,
  split = " ",
  remove = "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.",
  encoding = "UTF-8",
  simplify = TRUE,
  verbose = TRUE
)

Value

  • simplify=TRUE: A tokenized character vector, with each element as a sentence.

  • simplify=FALSE: A list of tokenized character vectors, with each element as a vector of tokens in a sentence.
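For illustration, a minimal sketch of the two return shapes, assuming PsychWordVec is attached; the input sentences are made up and the comments describe the expected structure rather than verbatim output:

sents = c("Cats purr.", "Dogs bark loudly.")
flat = tokenize(sents)                    # character vector: one element per sentence,
                                          # with tokens joined by `split`
nested = tokenize(sents, simplify=FALSE)  # list: one character vector of tokens per sentence
is.character(flat)  # TRUE
is.list(nested)     # TRUE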

Arguments

text

A character vector of text, or a file path on disk containing text.
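A brief sketch of the file-path usage; the temporary file and its contents here are hypothetical:

tmp = tempfile(fileext=".txt")
writeLines(c("First sentence here.", "Second sentence here."), tmp)
tokenize(tmp, encoding="UTF-8")  # read the file from disk and tokenize its text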

tokenizer

Function used to tokenize the text. Defaults to text2vec::word_tokenizer.

split

Separator between tokens; used only when simplify=TRUE. Defaults to " ".
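As a hedged illustration, a non-default separator makes the token boundaries visible in the simplified output (the input string is made up):

tokenize("Word embeddings need clean tokens.", split="|")  # tokens joined by "|" within each sentence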

remove

Strings (as a regular expression) to be removed from the text. Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.". Turn this off by setting remove=NULL.
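A hedged sketch of custom removal; the input string and the alternative pattern below are illustrative only:

s = "e.g. HTML remnants<br/>and_underscores"
tokenize(s, remove=NULL)                # keep everything; no pattern is stripped
tokenize(s, remove="_|<br/>|e\\.g\\.")  # strip only a custom subset of patterns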

encoding

Text encoding (used only when text is a file path). Defaults to "UTF-8".

simplify

Return a character vector (TRUE) or a list of character vectors (FALSE). Defaults to TRUE.

verbose

Print information to the console? Defaults to TRUE.

See Also

train_wordvec
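tokenize() prepares raw text for train_wordvec(). A minimal, hedged sketch of that pipeline follows; it assumes train_wordvec() accepts the same raw-text input as tokenize() and relies on its defaults (check ?train_wordvec for the actual arguments):

# Assumption: train_wordvec() takes raw text (or a file path) as its first argument
# and handles tokenization internally; all other settings are left at their defaults.
embed = train_wordvec(text2vec::movie_review$review[1:100])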

Examples

library(PsychWordVec)

txt1 = c(
  "I love natural language processing (NLP)!",
  "I've been in this city for 10 years. I really like here!",
  "However, my computer is not among the \"Top 10\" list."
)
tokenize(txt1, simplify=FALSE)           # list of token vectors (one per sentence)
tokenize(txt1) %>% cat(sep="\n----\n")   # simplified: one string per sentence

txt2 = text2vec::movie_review$review[1:5]
texts = tokenize(txt2)

txt2[1]
texts[1:20]  # all sentences in txt2[1]
