Method new()
Usage
CountVectorizer$new(
min_df,
max_df,
max_features,
ngram_range,
regex,
remove_stopwords,
split,
lowercase
)
Arguments
min_df
numeric. When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. The value must lie between 0 and 1.
max_df
numeric. When building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold. The value must lie between 0 and 1.
max_features
integer. Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
ngram_range
vector. The lower and upper boundaries of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
regex
character. Regular expression to use for text cleaning.
remove_stopwords
list. A list of stopwords to use. By default, the built-in list of standard English stopwords is used.
split
character. Splitting criterion for strings. Default: " "
lowercase
logical. Convert all characters to lowercase before tokenizing. Default: TRUE
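The ngram_range description above can be illustrated with a small base R sketch. The helper `word_ngrams()` is hypothetical and is not part of the class; joining tokens with a single space is also an assumption made for illustration, not a statement about how `CountVectorizer` builds its terms internally.

```r
# Hypothetical helper (not part of the package): build word n-grams for
# every n with ngram_range[1] <= n <= ngram_range[2].
word_ngrams <- function(tokens, ngram_range = c(1, 1)) {
  out <- character(0)
  for (n in seq(ngram_range[1], ngram_range[2])) {
    # Skip n-values longer than the token sequence itself.
    if (n > length(tokens)) next
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- strsplit("alone in the dark", " ", fixed = TRUE)[[1]]

unigrams <- word_ngrams(tokens, c(1, 1))  # only unigrams
uni_bi   <- word_ngrams(tokens, c(1, 2))  # unigrams and bigrams
bigrams  <- word_ngrams(tokens, c(2, 2))  # only bigrams
```

For a four-token sentence this gives four unigrams and three bigrams, so c(1, 2) produces seven terms in total.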
Details
Create a new `CountVectorizer` object.
Returns
A `CountVectorizer` object.
Examples
cv = CountVectorizer$new(min_df=0.1)
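To make the min_df and max_df thresholds concrete, here is a minimal base R sketch of document-frequency filtering. It assumes document frequency means the share of sentences containing a term at least once. The threshold values are made up for illustration and punctuation has been stripped from the sentences by hand. This is not the package's own implementation.

```r
sents <- c("i am alone in dark", "mother_mary a lot",
           "alone in the dark", "many mothers in the lot")
token_lists <- strsplit(sents, " ", fixed = TRUE)

# Document frequency: fraction of sentences in which each term occurs.
# unique() ensures a term is counted once per sentence.
doc_terms <- lapply(token_lists, unique)
df_counts <- table(unlist(doc_terms))
doc_freq <- as.numeric(df_counts) / length(sents)
names(doc_freq) <- names(df_counts)

# Illustrative thresholds (assumed values, not package defaults).
min_df <- 0.3
max_df <- 0.7

# Drop terms strictly below min_df or strictly above max_df.
kept <- names(doc_freq)[doc_freq >= min_df & doc_freq <= max_df]
```

Under these assumptions, "in" occurs in three of four sentences (0.75) and is dropped by max_df, while terms that occur in a single sentence (0.25) are dropped by min_df. With only four sentences, no term can have a document frequency below 0.25, so a threshold such as min_df = 0.1 would not remove anything under this definition.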
Method fit()
Usage
CountVectorizer$fit(sentences)
Arguments
sentences
a list of text sentences
Details
Fits the `CountVectorizer` model on the given sentences.
Examples
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
Method fit_transform()
Usage
CountVectorizer$fit_transform(sentences)
Arguments
sentences
a list of text sentences
Details
Fits the `CountVectorizer` model and returns a sparse matrix of token counts.
Returns
a sparse matrix containing the token counts for each given sentence
Examples
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df=0.1)
cv_count_matrix <- cv$fit_transform(sents)
Method transform()
Usage
CountVectorizer$transform(sentences)
Arguments
sentences
a list of new text sentences
Details
Returns a sparse matrix of token counts for the given sentences.
Returns
a sparse matrix containing the token counts for each given sentence
Examples
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
cv_count_matrix <- cv$transform(new_sents)
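The fit-then-transform pattern in the example above can be sketched in base R. This sketch assumes the vocabulary is learned during fitting and that words unseen at fit time are ignored when transforming. It builds a dense matrix for simplicity, whereas the method is documented to return a sparse matrix. The names `tokenize`, `vocab`, and `count_row` are hypothetical.

```r
train <- c("alone in the dark", "many mothers in the lot")
new_sents <- c("dark at night", "mothers day")

# Hypothetical tokenizer: lowercase, then split on single spaces.
tokenize <- function(x) strsplit(tolower(x), " ", fixed = TRUE)

# "fit": learn the vocabulary from the training sentences only.
vocab <- sort(unique(unlist(tokenize(train))))

# "transform": count vocabulary terms only; tokens outside the
# vocabulary become NA in the factor and are excluded by table().
count_row <- function(tokens) {
  as.integer(table(factor(tokens, levels = vocab)))
}

# One row per new sentence, one column per vocabulary term.
counts <- t(vapply(tokenize(new_sents), count_row, integer(length(vocab))))
colnames(counts) <- vocab
```

Here "at", "night", and "day" never appeared during fitting, so each new sentence contributes a single count ("dark" and "mothers" respectively), and the column layout stays fixed by the fitted vocabulary.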
Method clone()
The objects of this class are cloneable with this method.
Usage
CountVectorizer$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.