Parse a character vector of documents into both sentences and a clean vector of tokens. The resulting output includes document and sentence IDs for use in other lexRank functions.
sentenceTokenParse(text, docId = "create", removePunc = TRUE,
removeNum = TRUE, toLower = TRUE, stemWords = TRUE,
rmStopWords = TRUE)
text: A character vector of documents to be parsed into sentences and tokenized.
docId: A character vector of document IDs the same length as text. If docId == "create", document IDs will be created.
removePunc: TRUE or FALSE indicating whether or not to remove punctuation from text while tokenizing. If TRUE, punctuation will be removed. Defaults to TRUE.
removeNum: TRUE or FALSE indicating whether or not to remove numbers from text while tokenizing. If TRUE, numbers will be removed. Defaults to TRUE.
toLower: TRUE or FALSE indicating whether or not to coerce all of text to lowercase while tokenizing. If TRUE, text will be coerced to lowercase. Defaults to TRUE.
stemWords: TRUE or FALSE indicating whether or not to stem the resulting tokens. If TRUE, the output tokens will be stemmed using SnowballC::wordStem(). Defaults to TRUE.
rmStopWords: TRUE, FALSE, or a character vector of stopwords to remove from tokens. If TRUE, words in lexRankr::smart_stopwords will be removed prior to stemming. If FALSE, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to TRUE.
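As a brief illustration of the stopword options, here is a sketch that passes a custom stopword vector via rmStopWords; the documents and stopword list are made up purely for illustration, and all other arguments keep their defaults.

library(lexRankr)

# Hypothetical documents, for illustration only
docs <- c("The cat sat on the mat.", "A dog barked at the cat.")

# A custom stopword vector is removed from the tokens prior to stemming
parsed <- sentenceTokenParse(text = docs,
                             docId = "create",
                             rmStopWords = c("the", "a", "on", "at"))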
A list of dataframes. The first element of the list returned is the sentences dataframe; this dataframe has columns docId, sentenceId, & sentence (the actual text of the sentence). The second element of the list returned is the tokens dataframe; this dataframe has columns docId, sentenceId, & token (the actual text of the token).
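A minimal sketch of working with the returned list, pulling out the two dataframes by position as described above (accessing by position avoids assuming any particular element names):

result <- sentenceTokenParse(c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
                             docId = c("d1", "d2"))

sentences <- result[[1]]  # columns: docId, sentenceId, sentence
tokens    <- result[[2]]  # columns: docId, sentenceId, token

head(sentences)
head(tokens)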
sentenceTokenParse(c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."),
docId=c("d1","d2"))