## Not run:
# this code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
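  # a quick look at the result: taggedText() returns the token
  # data frame of a tokenized object (a sketch; the exact column
  # layout may vary between koRpus versions):
  head(taggedText(tokenized.obj))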
  ## character manipulation
  # this is useful if you know of problematic characters in your
  # raw text files but don't want to touch them directly. you
  # don't have to, as you can substitute them, even with regular
  # expressions. a simple example: replace all single quotes with
  # double quotes throughout the text:
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en",
    clean.raw=list("'"='\"')
  )
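  # as a sketch for verifying the substitution, count the single
  # quotes left in the tokens (this assumes the "token" column of
  # taggedText(), as in current koRpus versions):
  sum(grepl("'", taggedText(tokenized.obj)[["token"]]))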
  # now replace all occurrences of the letter A followed
  # by two digits with the letter B, followed by the same
  # two digits:
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en",
    clean.raw=list("(A)([[:digit:]]{2})"="B\\2"),
    perl=TRUE
  )
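  # a similar sketch to confirm the replacement; note that the
  # sample text may not contain such patterns in the first place:
  any(grepl("A[[:digit:]]{2}", taggedText(tokenized.obj)[["token"]]))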
  ## enabling stopword detection and stemming
  if(all(
    requireNamespace("tm", quietly=TRUE),
    requireNamespace("SnowballC", quietly=TRUE)
  )){
    # if you also installed the packages tm and SnowballC,
    # you can use some of their features with koRpus:
    tokenized.obj <- tokenize(
      txt=sample_file,
      lang="en",
      stopwords=tm::stopwords("en"),
      stemmer=SnowballC::wordStem
    )
    # removing all stopwords is now simple:
    tokenized.noStopWords <- filterByClass(tokenized.obj, "stopword")
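    # comparing the token counts before and after removal shows
    # the effect (a sketch, assuming taggedText() as above):
    nrow(taggedText(tokenized.obj))
    nrow(taggedText(tokenized.noStopWords))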
  }
}
## End(Not run)