generate_blocked_document_term_vectors: A function to generate and save blocks of document term vectors to coherently named files from a variety of inputs.

Description

A function to generate and save blocks of document term vectors to coherently named files from a variety of inputs.

Usage

generate_blocked_document_term_vectors(input, output_stem, data_directory,
  output_directory = NULL, block_size = 100, data_type = c("string",
  "term vector", "raw text", "csv", "ngrams"), ngram_type = NULL,
  tokenization_method = c("RegEx"), csv_separator = ",",
  csv_word_column = NULL, csv_count_column = NULL, csv_header = FALSE,
  keep_sequence = FALSE)

Arguments

input

A list of strings, term vectors, raw documents, or csv files you wish to turn into document term vectors.

output_stem

The stem of the file name to give each block of document term vector list objects generated by this function.

data_directory

The directory where the input data is stored.

output_directory

Optional directory to store blocked document term vector output.

block_size

The number of documents to group together in a single block to save. Defaults to 100.
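
As an illustration of what grouping by block_size means, the base R sketch below splits 250 hypothetical documents into blocks of at most 100. The document names and the splitting code are assumptions made for this example; they are not taken from the package and may differ from how the function partitions its input.

```r
# Hypothetical document names; not part of the package.
documents <- paste0("doc_", 1:250)
block_size <- 100

# Assign each document to a block holding at most block_size documents.
block_index <- ceiling(seq_along(documents) / block_size)
blocks <- split(documents, block_index)

# Number of documents in each block.
block_lengths <- lengths(blocks)
print(block_lengths)
```

Under this scheme the last block simply holds whatever documents remain, so it can be smaller than block_size.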

data_type

The type of data provided to the function. One of "string", "term vector", "raw text", "csv", or "ngrams".

ngram_type

The type of ngram to use when generating document term vectors. Can be one of "jk_filtered", "verb_filtered", "phrases", or any of "x_grams", where x is a number specifying the ngram length. Can only be used with input generated by the ngrams() function.

tokenization_method

Currently not available.

csv_separator

Defaults to "," but can be set to "\t" for tab separated values.

csv_word_column

If you are providing one csv file per document, then you must specify the index of the column that contains the words. Defaults to NULL.

csv_count_column

For memory efficiency, you may want to store only the counts of unique words in csv files. If your data include counts, then you must specify the index of the column that contains the counts. Defaults to NULL.

csv_header

Logical indicating whether the csv files provided have a header. Defaults to FALSE.
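
To make the csv arguments concrete, the sketch below reads one made-up per-document csv in base R, with words in the first column and counts in the second. The file contents and the variable term_counts are assumptions for illustration only; the function's own csv handling is not shown in this page and may differ.

```r
# Made-up csv contents for a single document: word, count (no header row).
csv_text <- "the,3\ncat,1\nsat,1"

# These mirror the documented arguments of the same names.
csv_separator <- ","
csv_header <- FALSE
csv_word_column <- 1
csv_count_column <- 2

# Read the csv; the text argument stands in for a file path here.
doc <- read.csv(text = csv_text, header = csv_header, sep = csv_separator,
                stringsAsFactors = FALSE)

# Build a named vector of counts keyed by word.
term_counts <- setNames(doc[[csv_count_column]], doc[[csv_word_column]])
print(term_counts)
```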

keep_sequence

Logical indicating whether document term vectors should be condensed into counts (FALSE) or whether the full sequence of terms should be maintained for storage (TRUE). Defaults to FALSE, as counts can be a much more memory efficient representation.
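
The difference between the two representations can be sketched in base R with a toy document. This is only an illustration of condensed counts versus a full sequence; the object layout the package actually saves is not described here, so the variable names below are hypothetical.

```r
# A toy tokenized document.
tokens <- c("the", "cat", "saw", "the", "dog")

# Full sequence (analogous to keep_sequence = TRUE):
# every token is kept, in order.
full_sequence <- tokens

# Condensed counts (analogous to keep_sequence = FALSE):
# one entry per unique term, word order is lost.
condensed <- table(tokens)

print(full_sequence)
print(condensed)
```

The condensed form stores one entry per unique term, which is why it tends to be smaller for long documents with many repeated words.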

Value

Saves blocks of document term vectors to file, with file names built from output_stem.