quanteda (version 0.9.9-50)

corpus_trimsentences: remove sentences based on their token lengths or a pattern match


Removes sentences from a corpus or a character vector that are shorter than a minimum length, longer than a maximum length, or that match an exclusion pattern.


corpus_trimsentences(x, min_length = 1, max_length = 10000,
  exclude_pattern = NULL, return_tokens = FALSE)

char_trimsentences(x, min_length = 1, max_length = 10000, exclude_pattern = NULL)


x
corpus or character object whose sentences will be selected.
min_length, max_length
minimum and maximum lengths in word tokens (excluding punctuation)
exclude_pattern
a stringi regular expression whose match (at the sentence level) will be used to exclude sentences
return_tokens
if TRUE, return a tokens object of the sentences after trimming; otherwise return the input object type with the trimmed sentences removed.


a corpus or character vector equal in length to the input, or a tokenized set of sentences if return_tokens = TRUE. If the input was a corpus, then all docvars and metadata are preserved. For documents whose sentences have been removed entirely, an empty string ("") will be returned.
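
The trimming rules described above can be approximated in base R. The following is only a rough sketch under stated assumptions, not quanteda's implementation: `trim_sentences_sketch` is a hypothetical helper, sentences are split naively after `.`, `!` or `?`, words are counted as runs of letters and digits, and the pattern is matched with base R's PCRE engine rather than stringi. quanteda's own sentence and word tokenizers may segment text differently.

```r
# Hypothetical helper, NOT part of quanteda: a base-R approximation of
# sentence trimming by word count and by an exclusion pattern.
trim_sentences_sketch <- function(x, min_length = 1, max_length = 10000,
                                  exclude_pattern = NULL) {
  vapply(x, function(doc) {
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sents <- strsplit(doc, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
    # Count word tokens as runs of letters/digits, so punctuation is ignored
    n_words <- lengths(regmatches(sents, gregexpr("[[:alnum:]]+", sents)))
    keep <- n_words >= min_length & n_words <= max_length
    if (!is.null(exclude_pattern)) {
      # Drop any sentence in which the pattern matches
      keep <- keep & !grepl(exclude_pattern, sents, perl = TRUE)
    }
    # Rejoin surviving sentences; a fully trimmed document becomes ""
    paste(sents[keep], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

demo <- c("PAGE 1. This is a single sentence.  Short sentence. Three word sentence.",
          "PAGE 2. Very short! Shorter.")
trim_sentences_sketch(demo, min_length = 3)
# [1] "This is a single sentence. Three word sentence." ""
trim_sentences_sketch(demo, exclude_pattern = "^PAGE \\d+")[2]
# [1] "Very short! Shorter."
```

As in the return value described above, the result has the same length as the input, and a document whose sentences are all removed is returned as an empty string.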


library(quanteda)

txt <- c("PAGE 1. This is a single sentence.  Short sentence. Three word sentence.",
         "PAGE 2. Very short! Shorter.",
         "Very long sentence, with multiple parts, separated by commas.  PAGE 3.")
mycorp <- corpus(txt, docvars = data.frame(serial = 1:3))

# exclude sentences shorter than 3 tokens
texts(corpus_trimsentences(mycorp, min_length = 3))
# exclude sentences that start with "PAGE <digit(s)>"
texts(corpus_trimsentences(mycorp, exclude_pattern = "^PAGE \\d+"))

# on a character
char_trimsentences(txt, min_length = 3)
char_trimsentences(txt, exclude_pattern = "sentence\\.")