count_ngrams: An experimental function to efficiently generate a vocabulary in parallel from output produced by the ngrams() function. Cores > 1 will only work for users with GNU coreutils > 8.13 as the sort --parallel option is used. If you have an older version use cores = 1.

Description

An experimental function that efficiently builds a vocabulary in parallel from the output of the ngrams() function. Setting cores > 1 only works with GNU coreutils > 8.13, because the sort --parallel option is used. If you have an older version, use cores = 1.
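
The version requirement can be checked before choosing a value for cores. The following is a minimal sketch, not part of the package: the helper names parse_coreutils_version() and safe_cores() are hypothetical, and it assumes the first line printed by sort --version looks like "sort (GNU coreutils) 8.32".

```r
# Hypothetical helpers for illustration; they are not part of the package.

# Extract the version number from the first line of `sort --version`,
# e.g. "sort (GNU coreutils) 8.32" gives "8.32".
parse_coreutils_version <- function(version_line) {
  m <- regmatches(version_line, regexpr("[0-9]+(\\.[0-9]+)+", version_line))
  if (length(m) == 0) NA_character_ else m
}

# Fall back to a single core unless the version is newer than 8.13,
# the threshold stated in the description.
safe_cores <- function(version_line, requested = 2) {
  v <- parse_coreutils_version(version_line)
  if (!is.na(v) && numeric_version(v) > numeric_version("8.13")) requested else 1
}
```

On a system where sort is on the PATH, the version line could be obtained with system2("sort", "--version", stdout = TRUE)[1]. The comparison uses numeric_version() so that, for example, 8.5 is correctly treated as older than 8.13.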

Usage

count_ngrams(ngrams = NULL, input_directory = NULL, file_list = NULL,
  combine_ngrams = FALSE, cores = 2, mac_brew = FALSE)

Arguments

ngrams

An optional list object output by the ngrams() function.

input_directory

An optional input directory where blocked output from the ngrams() function is stored as .Rdata files.

file_list

An optional vector of file names to be used. Useful if you only want to work on a subset of the input.

combine_ngrams

Logical indicating whether simple ngrams should be combined together when forming the vocabulary. If FALSE, then separate vocabularies will be generated for each ngram length. Defaults to FALSE.
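
The difference between the two settings can be illustrated with base R on toy data. This sketch does not reproduce how the function counts terms internally, and the toy_ngrams list is invented for illustration rather than real ngrams() output; it only shows the two shapes of result.

```r
# Toy n-grams grouped by length; illustrative data only.
toy_ngrams <- list(
  "1" = c("the", "cat", "the", "dog"),
  "2" = c("the_cat", "the_dog", "the_cat")
)

# Count terms and sort them by decreasing frequency.
count_sorted <- function(terms) {
  sort(table(terms), decreasing = TRUE)
}

# Analogue of combine_ngrams = FALSE: one vocabulary per n-gram length.
separate_vocab <- lapply(toy_ngrams, count_sorted)

# Analogue of combine_ngrams = TRUE: a single vocabulary pooling all lengths.
combined_vocab <- count_sorted(unlist(toy_ngrams, use.names = FALSE))
```

Here separate_vocab is a list with one frequency table per n-gram length, while combined_vocab is a single table covering unigrams and bigrams together.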

cores

The number of cores to be used for parallelization. Defaults to 2. Use 1 if your GNU coreutils version does not support the sort --parallel option.

mac_brew

An option to use alternate versions of shell commands that are compatible with GNU coreutils as installed via "brew install coreutils". Simply adds a "g" in front of commands. Defaults to FALSE.
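
The prefixing can be sketched as follows. The helper shell_command() is hypothetical and only illustrates the naming rule; it is not the package's internal code.

```r
# Hypothetical helper, not part of the package: choose the command name
# depending on whether Homebrew's g-prefixed GNU tools should be used.
shell_command <- function(command, mac_brew = FALSE) {
  if (mac_brew) paste0("g", command) else command
}
```

With mac_brew = TRUE, shell_command("sort", mac_brew = TRUE) gives the name gsort, which is how Homebrew exposes GNU sort on macOS.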

Value

Returns a list object with the vocabulary (sorted by frequency) and word counts.