PsychWordVec (version 2025.3)

tokenize: Tokenize raw text for training word embeddings.

Description

Tokenize raw text for training word embeddings.

Usage

tokenize(
  text,
  tokenizer = text2vec::word_tokenizer,
  split = " ",
  remove = "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.",
  encoding = "UTF-8",
  simplify = TRUE,
  verbose = TRUE
)

Value

  • simplify=TRUE: A tokenized character vector, with each element as a sentence.

  • simplify=FALSE: A list of tokenized character vectors, with each element as a vector of tokens in a sentence.
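For illustration, a minimal sketch of the two return shapes, assuming PsychWordVec is attached; the input sentences are made up and the comments describe the expected structure rather than verbatim output:

sents = c("Cats purr.", "Dogs bark loudly.")
flat = tokenize(sents)                    # character vector: one element per sentence,
                                          # with tokens joined by `split`
nested = tokenize(sents, simplify=FALSE)  # list: one character vector of tokens per sentence
is.character(flat)  # TRUE
is.list(nested)     # TRUE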

Arguments

text

A character vector of text, or a file path on disk containing text.
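A brief sketch of the file-path usage; the temporary file and its contents here are hypothetical:

tmp = tempfile(fileext=".txt")
writeLines(c("First sentence here.", "Second sentence here."), tmp)
tokenize(tmp, encoding="UTF-8")  # read the file from disk and tokenize its text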

tokenizer

Function used to tokenize the text. Defaults to text2vec::word_tokenizer.

split

Separator between tokens; used only when simplify=TRUE. Defaults to " ".
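As a hedged illustration, a non-default separator makes the token boundaries visible in the simplified output (the input string is made up):

tokenize("Word embeddings need clean tokens.", split="|")  # tokens joined by "|" within each sentence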

remove

Strings (as a regular expression) to be removed from the text. Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.". Turn this off by setting remove=NULL.
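A hedged sketch of custom removal; the input string and the alternative pattern below are illustrative only:

s = "e.g. HTML remnants<br/>and_underscores"
tokenize(s, remove=NULL)                # keep everything; no pattern is stripped
tokenize(s, remove="_|<br/>|e\\.g\\.")  # strip only a custom subset of patterns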

encoding

Text encoding (used only when text is a file path). Defaults to "UTF-8".

simplify

Return a character vector (TRUE) or a list of character vectors (FALSE). Defaults to TRUE.

verbose

Print information to the console? Defaults to TRUE.

See Also

train_wordvec
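tokenize() prepares raw text for train_wordvec(). A minimal, hedged sketch of that pipeline follows; it assumes train_wordvec() accepts the same raw-text input as tokenize() and relies on its defaults (check ?train_wordvec for the actual arguments):

# Assumption: train_wordvec() takes raw text (or a file path) as its first argument
# and handles tokenization internally; all other settings are left at their defaults.
embed = train_wordvec(text2vec::movie_review$review[1:100])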

Examples

library(PsychWordVec)

txt1 = c(
  "I love natural language processing (NLP)!",
  "I've been in this city for 10 years. I really like here!",
  "However, my computer is not among the \"Top 10\" list."
)
tokenize(txt1, simplify=FALSE)           # list of token vectors (one per sentence)
tokenize(txt1) %>% cat(sep="\n----\n")   # simplified: one string per sentence

txt2 = text2vec::movie_review$review[1:5]
texts = tokenize(txt2)

txt2[1]
texts[1:20]  # all sentences in txt2[1]
