koRpus (version 0.04-40)

tokenize: A simple tokenizer

Description

This tokenizer can be used as a substitute for TreeTagger. Its results are not as detailed when it comes to word classes, and no lemmatization is done; for most cases, however, this should suffice.

Usage

tokenize(txt, format = "file", fileEncoding = NULL,
    split = "[[:space:]]", ign.comp = "-",
    heuristics = "abbr",
    heur.fix = list(pre = c("’", "'"), suf = c("’", "'")),
    abbrev = NULL, tag = TRUE, lang = "kRp.env",
    sentc.end = c(".", "!", "?", ";", ":"),
    detect = c(parag = FALSE, hline = FALSE),
    clean.raw = NULL, perl = FALSE, stopwords = NULL,
    stemmer = NULL)

Arguments

txt
Either an open connection, the path to a directory with txt files to read and tokenize, or a vector object already holding the text corpus.
format
Either "file" or "obj", depending on whether you want to scan files or analyze the given object.
fileEncoding
A character string naming the encoding of all files.
split
A regular expression to define the basic split method. Should only need refinement for languages that don't separate words by space.
ign.comp
A character vector defining punctuation that might be used in compound words and should therefore not be split on.
heuristics
A vector to indicate if the tokenizer should use some heuristics. Can be none, one or several of the following:
  • "abbr": Assume that "letter-dot-letter-dot" combinations are abbreviations and leave them intact.

heur.fix
A list with the named character vectors pre and suf, defining characters (like apostrophes) to be treated as word prefixes or suffixes by the matching heuristics.
abbrev
Path to a text file with abbreviations (one per line) to be left intact.
tag
Logical. If TRUE (the default), the tokens are tagged and an object of class kRp.tagged is returned; if FALSE, only the tokenized text is returned.
lang
A character string naming the language of the text. The default "kRp.env" takes the language from the current koRpus environment.
sentc.end
A character vector of tokens to be treated as sentence-ending punctuation.
detect
A named logical vector controlling whether paragraphs ("parag") and headlines ("hline") should be detected; see the Details section.
clean.raw
A named list defining character replacements to apply to the raw text before tokenizing: each name is replaced by its value (see the Examples section).
perl
Logical. If TRUE, the patterns in clean.raw are treated as Perl-compatible regular expressions.
stopwords
A character vector of words to be flagged as stopwords, e.g. tm::stopwords("en").
stemmer
A function or method to call for stemming, e.g. Snowball::SnowballStemmer.

Value

If tag=TRUE, an object of class kRp.tagged; if tag=FALSE, a character vector with the tokenized text.
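
As a quick illustration of the two return modes (a minimal sketch; the sample sentence is made up):

# tag=FALSE returns a plain character vector of tokens
tokens <- tokenize("A short example sentence.", format="obj",
    tag=FALSE, lang="en")
# tag=TRUE (the default) returns an object of class kRp.tagged
tagged.obj <- tokenize("A short example sentence.", format="obj",
    lang="en")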

Details

tokenize can try to guess what's a headline and where a paragraph was inserted (via the detect parameter). A headline is assumed if a line of text without sentence ending punctuation is found, a paragraph if two blocks of text are separated by space. This will add extra tags into the text: "&lt;kRp.h&gt;" (headline starts), "&lt;/kRp.h&gt;" (headline ends) and "&lt;kRp.p/&gt;" (paragraph), respectively. This can be useful in two cases: "&lt;/kRp.h&gt;" will be treated like a sentence ending, which gives you more control for automatic analyses. And adding to that, kRp.text.paste can replace these tags, which probably preserves more of the original layout.
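
For instance, a minimal sketch combining both detection switches with kRp.text.paste (the file path is made up):

tokenized.obj <- tokenize("~/my.data/speech.txt",
    detect=c(parag=TRUE, hline=TRUE))
# re-assemble the text, replacing the added tags, which
# should preserve more of the detected layout
pasted.text <- kRp.text.paste(tokenized.obj)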

Examples

tokenized.obj <- tokenize("~/mydata/corpora/russian_corpus/")

## character manipulation
# this is useful if you know of problematic characters in your
# raw text files, but don't want to touch them directly. you
# don't have to, as you can substitute them, even using regular
# expressions. a simple example: replace all single quotes by
# double quotes throughout the text:
tokenized.obj <- tokenize("~/my.data/speech.txt",
   clean.raw=list("'"="\""))
# now replace all occurrences of the letter A followed
# by two digits with the letter B, followed by the same
# two digits:
tokenized.obj <- tokenize("~/my.data/speech.txt",
   clean.raw=list("(A)([[:digit:]]{2})"="B\\2"),
   perl=TRUE)

## enabling stopword detection and stemming
# if you also installed the packages tm and Snowball,
# you can use some of their features with koRpus:
tokenized.obj <- tokenize("~/my.data/speech.txt",
   stopwords=tm::stopwords("en"),
   stemmer=Snowball::SnowballStemmer)
# removing all stopwords now is simple:
tokenized.noStopWords <- kRp.filter.wclass(tokenized.obj, "stopword")
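
# the tagged results can also be inspected directly; a sketch,
# assuming kRp.tagged objects store them in the TT.res slot:
head(tokenized.noStopWords@TT.res)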
