This function transforms texts into words, calculates their frequencies, and suppresses stop words in a given language.
textTokenizer(
text,
exclude = NULL,
lang = NULL,
min_word_freq = 5,
min_word_len = 2,
keep_spaces = FALSE,
lowercase = TRUE,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_lettt = TRUE,
laughs = TRUE,
utf = TRUE,
df = FALSE,
h2o = FALSE,
quiet = FALSE
)
data.frame. Tokenized words with counters.
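To give a feel for the kind of result described here, the base R sketch below splits texts into words and counts them. It does not call textTokenizer() and is not the package's implementation: the helper name `count_words` and the column names `word` and `n` are assumptions made for illustration, and only the `min_word_freq` and `min_word_len` filters are mimicked.

```r
# Hypothetical helper for illustration only; not part of the documented package.
count_words <- function(text, min_word_freq = 1, min_word_len = 2) {
  clean <- tolower(text)                     # lowercase everything
  clean <- gsub("[[:punct:]]", " ", clean)   # replace punctuation with spaces
  clean <- gsub("[[:digit:]]", " ", clean)   # replace digits with spaces
  words <- unlist(strsplit(clean, "[[:space:]]+"))
  words <- words[nchar(words) >= min_word_len]   # drop short (and empty) tokens
  counts <- table(words)
  counts <- counts[counts >= min_word_freq]      # drop rare words
  counts <- sort(counts, decreasing = TRUE)
  data.frame(
    word = as.character(names(counts)),
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

texts <- c("The cat sat.", "The cat ran!", "A dog ran 2 miles")
print(count_words(texts, min_word_freq = 2))
```

Raising `min_word_freq` shrinks the table to the most common words, while `min_word_len` removes very short tokens such as single letters.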
text: Character vector. Sentences or texts you wish to tokenize.
exclude: Character vector. Words you wish to exclude.
lang: Character. Language of the text (used for stop words). Example: "spanish" or "english". Set to NA to ignore.
min_word_freq: Integer. Discard words that appear fewer than <int> times. Defaults to 5. Set to NA to ignore.
min_word_len: Integer. Discard words that have fewer than <int> characters. Defaults to 2. Set to NA to ignore.
keep_spaces: Boolean. Set to TRUE to keep the spaces in each line so that compound words separated by spaces stay together. For example, 'one two' will be set as 'one_two' and treated as a single word.
lowercase, remove_numbers, remove_punct: Boolean.
remove_lettt: Boolean. Remove repeated letters (more than 3 consecutive)?
laughs: Boolean. Try to unify all laugh texts?
utf: Boolean. Transform all characters to UTF (no accents or unusual symbols)?
df: Boolean. Return a data.frame with one-hot-encoding-style results? Each word is a column that indicates whether the text contains that word.
h2o: Boolean. Return an H2OFrame?
quiet: Boolean. Keep quiet? If not, print messages.
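The descriptions of compound words, repeated letters, and one-hot results above can be sketched in base R as follows. These helpers (`compound_token`, `squeeze_letters`, `one_hot_words`) are hypothetical names introduced for illustration, and the exact rules the package applies may differ. In particular, collapsing a run of four or more identical letters to a single letter is an assumed reading of "more than 3 consecutive".

```r
# Hypothetical helpers for illustration only; not the package's implementation.

# Join the words of each text with underscores so the whole line acts as one token.
compound_token <- function(text) {
  gsub("[[:space:]]+", "_", trimws(text))
}

# Assumption: a run of 4 or more identical letters collapses to a single letter.
squeeze_letters <- function(text) {
  gsub("([[:alpha:]])\\1{3,}", "\\1", text, perl = TRUE)
}

# One row per text, one column per vocabulary word, 1 if the text contains the word.
one_hot_words <- function(text, vocab) {
  tokens <- strsplit(tolower(text), "[^[:alpha:]]+")
  m <- do.call(rbind, lapply(tokens, function(tk) vocab %in% tk))
  storage.mode(m) <- "integer"
  colnames(m) <- vocab
  as.data.frame(m)
}

print(compound_token("one two"))
print(squeeze_letters("goooooal"))
print(one_hot_words(c("good morning", "good night"), c("good", "night")))
```

The one-hot layout is convenient when the tokens feed a model, since each text becomes a fixed-width row regardless of its length.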
Other Data Wrangling: balance_data(), categ_reducer(), cleanText(), date_cuts(), date_feats(), file_name(), formatHTML(), holidays(), impute(), left(), normalize(), num_abbr(), ohe_commas(), ohse(), quants(), removenacols(), replaceall(), replacefactor(), textFeats(), vector2text(), year_month(), zerovar()
Other Text Mining: cleanText(), ngrams(), remove_stopwords(), replaceall(), sentimentBreakdown(), textCloud(), textFeats(), topics_rake()