This function checks, for each word in a text, how frequently it occurs in a given language. This is useful for eliminating rare words to make a text more accessible to an audience with a limited vocabulary. htmlParse and xpathSApply from the XML package are used to process HTML files, if necessary. textToWords is a helper function that simply breaks a character vector down into a vector of words.
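As a rough illustration of the HTML handling described above, the sketch below parses an HTML string with htmlParse and pulls out text nodes with xpathSApply. The sample HTML and the XPath expression //p/text() are assumptions chosen for this example; they are not necessarily what detectRareWords uses internally.

```r
library(XML)

# Hypothetical HTML input, made up for this illustration
html <- "<html><body><p>Dit is een tekst.</p><p>Nog een zin.</p></body></html>"

# Parse the string as HTML (asText = TRUE: the input is content, not a filename)
doc <- htmlParse(html, asText = TRUE)

# Extract the value of every text node directly inside a <p> element
paragraphs <- xpathSApply(doc, "//p/text()", xmlValue)
print(paragraphs)
```

The extracted strings can then be split into words and looked up in a frequency table.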
detectRareWords(textFile = NULL,
                wordFrequencyFile = "Dutch",
                output = c("file", "show", "return"),
                outputFile = NULL,
                wordCol = "Word", freqCol = "FREQlemma",
                textToWordsFunction = "textToWords",
                encoding = "ASCII",
                xPathSelector = "/text()",
                silent = FALSE)
textToWords(characterVector)
textFile: If NULL, a dialog is shown that lets the user select a file. If not NULL, this has to be either a filename or a character vector. An HTML file can also be provided; it will be parsed before the text is extracted.
wordFrequencyFile: The file with word frequencies to use. If 'Dutch' or 'Polish', files from the Center for Reading Research (http://crr.ugent.be/) are downloaded.
output: How to provide the output, as a character vector. If file, the output is written to the file named in outputFile. If show, the output is shown; and if return, the output is returned invisibly.
outputFile: The name of the file to store the output in.
wordCol: The name of the column in the wordFrequencyFile that contains the words.
freqCol: The name of the column in the wordFrequencyFile that contains the frequency with which each word occurs.
textToWordsFunction: The function to use to split a character vector, where each element contains one or more words, into a vector where each element is a word.
encoding: The encoding used to read and write files.
xPathSelector: If the file provided is an HTML file, xpathSApply is used to extract the content. xPathSelector specifies which content to extract (the default value extracts all text content).
silent: Whether to suppress detailed feedback about the process.
characterVector: A character vector, the elements of which are to be broken down into words.
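To make the role of the word and frequency columns more concrete, here is a minimal sketch of a frequency lookup in base R. It is not the actual implementation of detectRareWords: the frequency table, its counts, and the threshold of 1000 are all invented for illustration, and only the column names Word and FREQlemma are taken from the defaults above.

```r
# Toy frequency table; the counts are invented for illustration only
freqTable <- data.frame(
  Word = c("dit", "is", "een", "tekst"),
  FREQlemma = c(5000, 9000, 12000, 300),
  stringsAsFactors = FALSE
)

# Words to check, already split and lowercased
words <- c("dit", "is", "een", "tekst", "detectrarewords")

# Look up each word's frequency; words missing from the table get NA
lookup <- data.frame(
  word = words,
  frequency = freqTable$FREQlemma[match(words, freqTable$Word)],
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: flag unknown words and words below the threshold
threshold <- 1000
lookup$rare <- is.na(lookup$frequency) | lookup$frequency < threshold
print(lookup)
```

Treating words that are absent from the frequency table as rare is a design choice made for this sketch; a different policy may suit other texts.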
detectRareWords returns a data frame (invisibly) if output contains return. Otherwise, NULL is returned (invisibly), but the output is printed and/or written to a file, depending on the value of output.
textToWords returns a vector of words.
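The kind of splitting a helper like textToWords performs can be sketched in base R as follows. The function name splitIntoWords is hypothetical, and the choices to split on non-letter characters and to lowercase the result are assumptions made for this example; the real textToWords may behave differently.

```r
# Hypothetical stand-in for a word-splitting helper
splitIntoWords <- function(characterVector) {
  # Split each element on runs of characters that are not letters
  pieces <- unlist(strsplit(characterVector, "[^[:alpha:]]+"))
  # Drop empty strings (from leading separators) and lowercase the rest
  tolower(pieces[nzchar(pieces)])
}

words <- splitIntoWords(c("Dit is een tekst,", "om de werking te demonstreren."))
print(words)
```

A function with this shape, taking a character vector and returning one word per element, is the kind of function the textToWordsFunction argument expects.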
# NOT RUN {
detectRareWords(paste('Dit is een tekst om de',
                      'werking van de detectRareWords',
                      'functie te demonstreren.'),
                output = 'show')
# }