make_lemma_dictionary: Generate a Lemma Dictionary

Description

Given a set of text strings, the function generates a dictionary of lemmas corresponding to words that are not in base form.

Usage

make_lemma_dictionary(..., engine = "hunspell", path = NULL,
  lang = switch(engine, hunspell = {"en_US"}, treetagger = {"en"},
  lexicon = {NULL}, stop("engine not found")))

Arguments

engine

One of "hunspell", "treetagger", or "lexicon". The lexicon and hunspell choices use the lexicon and hunspell packages, which may be faster than TreeTagger and require no external tools, but are likely less accurate. TreeTagger is likely more accurate but requires installing the TreeTagger program (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger).

path

Path to the TreeTagger program if engine = "treetagger". If NULL, textstem will attempt to locate TreeTagger.

lang

A character string naming the language to be used in koRpus (treetagger) or hunspell. The default language is 'en' for koRpus (treetagger) and 'en_US' for hunspell. See ?koRpus::treetag or ?hunspell::dictionary for details. Note that for koRpus::treetag, lang is passed to both lang and preset in the TT.options argument.

…

A vector of texts to generate lemmas for.

Value

Returns a two-column data.frame of tokens and their corresponding lemmas.

Examples

# NOT RUN {
x <- c('the dirtier dog has eaten the pies',
    'that shameful pooch is tricky and sneaky',
    "He opened and then reopened the food bag",
    'There are skies of blue and red roses too!'
)
make_lemma_dictionary(x)
# }
# NOT RUN {
make_lemma_dictionary(x, engine = 'treetagger')
# }
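
The two-column result described under Value can be applied as a simple lookup table. The following is a minimal base R sketch, not the package's own implementation: the dictionary is built by hand for illustration, and the column names token and lemma are an assumption about the returned data.frame.

```r
# Hypothetical hand-built dictionary in the two-column shape described under
# Value. The column names `token` and `lemma` are assumed for illustration.
dict <- data.frame(
    token = c("dirtier", "eaten", "pies", "reopened", "skies", "roses"),
    lemma = c("dirty", "eat", "pie", "reopen", "sky", "rose"),
    stringsAsFactors = FALSE
)

# Split one text into lowercase word tokens
words <- strsplit(tolower("the dirtier dog has eaten the pies"), "[^a-z']+")[[1]]

# Look up each token; tokens absent from the dictionary are kept unchanged
idx <- match(words, dict$token)
lemmas <- ifelse(is.na(idx), words, dict$lemma[idx])

# Reassemble the lemmatized text
paste(lemmas, collapse = " ")
```

Words already in base form, or simply missing from the dictionary, pass through untouched, which is why the dictionary only needs entries for inflected tokens. In practice the dictionary would come from make_lemma_dictionary(x) and would likely be handed to a lemmatizing function such as textstem's lemmatize_strings() through its dictionary argument rather than matched by hand.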
