This function takes a character vector of texts, or a mixture of files and folders. It automatically detects file encodings, segments Chinese texts, applies the specified modifications, removes stop words, and then generates a corpus or a DTM (TDM). Since tm does not support Chinese well, this function works around some of its problems. See Details.
corp_or_dtm(
...,
from = "dir",
type = "corpus",
enc = "auto",
mycutter = DEFAULT_cutter,
stop_word = NULL,
stop_pattern = NULL,
control = "auto",
myfun1 = NULL,
myfun2 = NULL,
special = "",
use_stri_replace_all = FALSE
)
...: names of folders, files, or a mixture of the two. It can also be a character vector of texts to be processed when from is set to "v"; see below.
from: should be "dir" or "v". If your inputs are filenames, it should be "dir" (default). If the input is a character vector of texts, it should be "v". However, if it is set to "v", make sure that no element is identical to a filename in your working directory; if one is, the function will raise an error. This check is needed because jiebaR::segment would otherwise take such an element as a file to read!
type: the kind of result you want. It is case insensitive: values starting with "c" or "C" give a corpus, values starting with "d" or "D" give a document term matrix, and values starting with "t" or "T" give a term document matrix. Any other input gives a corpus. The default value is "corpus".
enc: a character vector of length 1 specifying the encoding to use when reading files. If your files may have different encodings, or you do not know their encodings, set it to "auto" (default) to let the function auto-detect the encoding of each file.
mycutter: the jiebaR cutter used to segment text. If you do not supply one, a default cutter is used. See Details.
stop_word: a character vector of stop words that should be removed. If it is NULL, nothing is removed. If it is "jiebar", "jiebaR" or "auto", the stop words used by jiebaR are used; see make_stoplist. Please note that the default value is NULL. Texts are transformed to lower case before stop words are removed, so your stop words only need to contain lower case characters.
stop_pattern: a vector of regular expressions. These patterns work like stop words: terms that match the patterns will be removed. Note: the function automatically adds "^" and "$" to each pattern, which means that, first, the pattern you provide should not contain these two; second, the matching is complete matching. That is to say, a word is removed not when it merely contains the pattern (which is what grepl would check), but only when the whole word matches the pattern.
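The difference between partial and complete matching can be sketched in base R. This is only an illustration of the behaviour described above, not the package's actual code; the names terms and pat are made up for the example.

```r
# Hypothetical segmented terms and a user-supplied pattern
# (written without "^" or "$", as the argument requires).
terms <- c("drink", "drinks", "milk", "d")
pat <- "d.*k"

# Partial matching: a term is hit if it merely contains the pattern.
partial <- grepl(pat, terms)

# Complete matching: anchoring the pattern means the whole term must match.
complete <- grepl(paste0("^", pat, "$"), terms)

# Terms that survive when only complete matches are removed.
kept <- terms[!complete]
```

Here "drinks" contains the pattern but does not match it as a whole, so under complete matching it is kept while "drink" is removed.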
control: a named list similar to the one used by DocumentTermMatrix or TermDocumentMatrix to create a DTM or TDM, but with some significant differences. Most of the time you do not need to set this value because a default value is used. When you set the argument to NULL, the default value is still used. See Details.
myfun1: a function used to modify each text after it is read by scancn and before it is segmented.
myfun2: a function used to modify each text after it is segmented.
special: a character vector of length 1, or a regular expression, to be passed to dir_or_file to specify what pattern filenames should match. The default is to read all files. See dir_or_file.
use_stri_replace_all: default is FALSE. If it is TRUE, stringi::stri_replace_all is used to delete stop words, which is slightly faster. This is still experimental.
a corpus, or document term matrix, or term document matrix.
Package tm sometimes tries to segment an already segmented Chinese corpus and puts together terms that should not be put together. This function is meant to deal with that problem.
It calls scancn to read files and auto-detect file encodings, calls jiebaR::segment to segment Chinese text, and finally calls tm::Corpus to generate a corpus.
When creating a DTM/TDM, it partially depends on tm::DocumentTermMatrix and tm::TermDocumentMatrix, but there are some significant differences in how the control argument is set.
Users should provide their jiebaR cutter through mycutter. Otherwise, the function uses DEFAULT_cutter, which is created when the package is loaded. The DEFAULT_cutter is simply worker(write = FALSE). See jiebaR::worker. As long as you have not manually created another variable called "DEFAULT_cutter", you can directly use jiebaR::new_user_word(DEFAULT_cutter...) to add new words. Whether or not you manually create an object called "DEFAULT_cutter", the original DEFAULT_cutter loaded with the package, which the functions in this package use by default, is not removed. So, whenever you want to use the default cutter, simply leave mycutter at its default.
The argument control looks very similar to the argument used by tm::DocumentTermMatrix, but it works quite differently and is not passed to it! The permitted elements are listed below:
(1) wordLengths: a positive integer vector of length 2. 0 and Inf are not allowed. If you only want words of length 4 to 10, set it to c(4, 10). If you do not want an upper limit, just choose a large value, e.g., c(4, 100). In package tm (>= 0.7), 1 Chinese character is roughly of length 2 (but not always computed by multiplying by 2), so if a Chinese word has 4 characters, the min value of wordLengths is 8. But here in corp_or_dtm, word length is exactly what you see on the screen. So, a Chinese word with 4 characters is of length 4 rather than 8.
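The "length as seen on the screen" rule corresponds to counting characters, which can be sketched in base R. This only illustrates the idea and is not the package's internal code; treating both ends of the range as inclusive is an assumption, and the names words and word_lengths are made up for the example.

```r
# A four-character Chinese word, written with \u escapes so that the
# example does not depend on the encoding of the source file.
w <- "\u4e2d\u6587\u5206\u8bcd"

# Number of characters, i.e. the length as seen on the screen.
n_chars <- nchar(w, type = "chars")

# Keep only the words whose character count lies within c(min, max)
# (assumed inclusive at both ends).
words <- c("a", "milk", w, "coffee")
word_lengths <- c(4, 5)
len <- nchar(words, type = "chars")
kept <- words[len >= word_lengths[1] & len <= word_lengths[2]]
```

With these inputs, "milk" and the four-character Chinese word are kept, while "a" and "coffee" fall outside the range.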
(2) dictionary: a character vector of the words that will appear in the DTM/TDM when you do not want a full one. If none of the words in the dictionary appears in the corpus, a blank DTM/TDM will be created. The vector should not contain NA; if it does, only non-NA elements will be kept. Make sure at least 1 element is not NA. Note: if both dictionary and wordLengths appear in your control list, wordLengths will be ignored.
(3) bounds: an integer vector of length 2 which limits the term frequency of words. Only words whose total frequencies are in this range will appear in the DTM/TDM. 0 and Inf are not allowed. Use a large enough value to indicate that there is no upper limit.
(4) have: an integer vector of length 2 which limits the number of documents in which a word appears. Suppose a word appears 3 times in the 1st article, 2 times in the 2nd article, and 0 times in the 3rd; then its bounds value = 3 + 2 + 0 = 5, but its have value = 1 + 1 + 0 = 2.
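The arithmetic behind bounds and have can be reproduced in base R on a small count matrix. This is a sketch of the two quantities only, not the package's implementation; the matrix m and its row and column names are hypothetical.

```r
# Rows are documents, columns are terms (a tiny hypothetical DTM).
# matrix() fills column by column, so "drink" gets 3, 2, 0.
m <- matrix(c(3, 2, 0,
              1, 0, 0),
            nrow = 3,
            dimnames = list(c("doc1", "doc2", "doc3"), c("drink", "milk")))

# Total frequency of each term: the value that bounds is compared against.
total_freq <- colSums(m)      # drink: 5, milk: 1

# Number of documents containing each term: the value that have is compared against.
doc_freq <- colSums(m > 0)    # drink: 2, milk: 1
```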
(5) weighting: a function to compute word weights. The default is to compute term frequency. But you can use other weighting functions, typically tm::weightBin or tm::weightTfIdf.
(6) tokenizer: this value is temporarily deprecated and it cannot be modified by users.
By default, the argument control is set to "auto", "auto1", or DEFAULT_control1, which are all the same. This control list is created when the package is loaded. It is simply list(wordLengths = c(1, 25)). Alternatively, DEFAULT_control2 (or "auto2") is also created when the package is loaded; it sets word length to 2 to 25.
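A custom control list is an ordinary named list built from the elements described above. The sketch below only constructs such a list; the particular values are arbitrary illustrations, and my_control is a made-up name. It would then be supplied through the control argument of corp_or_dtm.

```r
# A hypothetical control list combining three of the permitted elements.
my_control <- list(
  wordLengths = c(2, 25),  # keep words of 2 to 25 characters
  bounds = c(2, 1000),     # total frequency at least 2; large ceiling means no real upper limit
  have = c(1, 1000)        # must appear in at least 1 document
)

# Every element is a length 2 vector with no 0 or Inf, as the text requires.
ok <- all(vapply(my_control,
                 function(v) length(v) == 2 && all(v > 0) && all(is.finite(v)),
                 logical(1)))
```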
# NOT RUN {
x <- c(
"Hello, what do you want to drink?",
"drink a bottle of milk",
"drink a cup of coffee",
"drink some water")
# The simplest argument setting
dtm <- corp_or_dtm(x, from = "v", type = "dtm")
# Modify argument control to see what happens
dtm <- corp_or_dtm(x, from = "v", type = "d", control = list(wordLengths = c(3, 20)))
tdm <- corp_or_dtm(x, from = "v", type = "T", stop_word = c("you", "to", "a", "of"))
# }