detectRareWords: Looking up word frequencies

Description

This function checks, for each word in a text, how frequently it occurs in a given language. This is useful for eliminating rare words to make a text more accessible to an audience with a limited vocabulary. If the text is an HTML file, htmlParse and xpathSApply from the XML package are used to process it. textToWords is a helper function that simply breaks a character vector down into a vector of words.

Usage

detectRareWords(textFile = NULL,
                wordFrequencyFile = "Dutch",
                output = c("file", "show", "return"),
                outputFile = NULL,
                wordCol = "Word", freqCol = "FREQlemma",
                textToWordsFunction = "textToWords",
                encoding = "ASCII",
                xPathSelector = "/text()",
                silent = FALSE)
textToWords(characterVector)

Arguments

textFile

If NULL, a dialog will be shown that enables users to select a file. If not NULL, this has to be either a filename or a character vector. An HTML file can be provided; this will be parsed using htmlParse and xpathSApply from the XML package (see xPathSelector).

wordFrequencyFile

The file with word frequencies to use. If 'Dutch' or 'Polish', files from the Center for Reading Research (http://crr.ugent.be/) are downloaded.

output

How to provide the output, as a character vector containing one or more of "file", "show", and "return". If it contains "file", the filename to write to should be provided in outputFile. If it contains "show", the output is shown; and if it contains "return", the output is returned invisibly.

outputFile

The name of the file to store the output in.

wordCol

The name of the column in the wordFrequencyFile that contains the words.

freqCol

The name of the column in the wordFrequencyFile that contains the frequency with which each word occurs.

textToWordsFunction

The function to use to split a character vector, where each element contains one or more words, into a vector where each element is a word.

encoding

The encoding used to read and write files.

xPathSelector

If the file provided is an HTML file, xpathSApply is used to extract the content. xPathSelector specifies which content to extract (the default value extracts all text content).

silent

Whether to suppress detailed feedback about the process.

characterVector

A character vector, the elements of which are to be broken down into words.
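
As an illustration of the kind of function textToWordsFunction expects, the following is a minimal sketch of a splitter in base R. The name simpleSplitter and its splitting rule are assumptions made for this example; they are not part of the package and need not match what textToWords does internally. Given that the default value is the string "textToWords", a custom function would presumably be passed by name in the same way.

```r
# Hypothetical splitter for illustration only; not the package's textToWords.
simpleSplitter <- function(characterVector) {
  # Lower-case the input, then split on runs of characters that are
  # neither letters nor apostrophes (spaces, digits, punctuation).
  words <- unlist(strsplit(tolower(characterVector), "[^[:alpha:]']+"))
  # strsplit() yields an empty string when an element starts with a
  # separator; drop those.
  words[nzchar(words)]
}

# One element per word, across all elements of the input vector
simpleSplitter(c("Dit is een tekst.", "Nog een zin!"))
```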

Value

detectRareWords returns a data frame (invisibly) if output contains "return". Otherwise, NULL is returned (invisibly), and the output is printed and/or written to a file, depending on the value of output.

textToWords returns a vector of words.
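
The lookup that underlies the result can be pictured with a small base R sketch. This is not the package's implementation: the frequency table below is a toy with made-up counts, and the names freqTable, words, and lookup are hypothetical. Only the column names Word and FREQlemma are taken from the defaults of wordCol and freqCol.

```r
# Toy frequency table; the counts are made up for illustration only.
freqTable <- data.frame(Word = c("de", "een", "tekst"),
                        FREQlemma = c(1000, 800, 20),
                        stringsAsFactors = FALSE)

# Words as they might come out of a splitter such as textToWords
words <- c("de", "tekst", "functie")

# match() gives the row of each word in the table, or NA if it is absent,
# so words missing from the table end up with an NA frequency.
lookup <- data.frame(word = words,
                     frequency = freqTable$FREQlemma[match(words, freqTable$Word)],
                     stringsAsFactors = FALSE)

# Sort from rarest to most common, with unknown words (NA) listed first
lookup <- lookup[order(lookup$frequency, na.last = FALSE), ]
print(lookup)
```

Words that sort to the top of such a table are the candidates for replacement when simplifying a text.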

Examples

# NOT RUN {
detectRareWords(paste('Dit is een tekst om de',
                      'werking van de detectRareWords',
                      'functie te demonstreren.'),
                output='show');
# }