A function to generate and save blocks of document term vectors to coherently named files from a variety of inputs.
generate_blocked_document_term_vectors(input, output_stem, data_directory,
output_directory = NULL, block_size = 100, data_type = c("string",
"term vector", "raw text", "csv", "ngrams"), ngram_type = NULL,
tokenization_method = c("RegEx"), csv_separator = ",",
csv_word_column = NULL, csv_count_column = NULL, csv_header = FALSE,
keep_sequence = FALSE)
A list of strings, term vectors, raw documents, or csv files you wish to turn into document term vectors.
The stem of the file name to give each block of output document term vector list objects generated by this function.
The directory where the input data is stored.
Optional directory to store blocked document term vector output.
The number of documents to group together in a single block of text to save. Defaults to 100.
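As a rough illustration of what blocking means here, the sketch below splits a list of documents into blocks of block_size and saves each block under a name built from output_stem. The placeholder documents, the stem "dtv_block", and the "stem_number.Rdata" naming scheme are assumptions made for illustration; the names the function actually writes may differ.

```r
# Hypothetical input: 250 placeholder documents and an illustrative stem.
documents <- as.list(paste("document", 1:250))
block_size <- 100
output_stem <- "dtv_block"

# Assign each document to a block: 1-100 -> 1, 101-200 -> 2, 201-250 -> 3.
block_id <- ceiling(seq_along(documents) / block_size)
blocks <- split(documents, block_id)

# Assumed naming scheme (stem, underscore, block number); the function's
# real file names may be different.
file_names <- paste0(output_stem, "_", seq_along(blocks), ".Rdata")

# Save each block to its own file in a temporary directory.
out_dir <- tempdir()
for (i in seq_along(blocks)) {
  block <- blocks[[i]]
  save(block, file = file.path(out_dir, file_names[i]))
}
```

With 250 documents and a block size of 100, the last block holds only the remaining 50 documents, so a block may be smaller than block_size.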
The type of data provided to the function. One of "string", "term vector", "raw text", "csv", or "ngrams".
The type of ngram to use when generating document term vectors. Can be one of "jk_filtered", "verb_filtered", "phrases", or any "x_grams", where x is a number specifying the ngram length. Can only be used with input generated by the ngrams() function.
Currently not available.
Defaults to "," but can be set to "\t" for tab-separated values.
If you are providing one csv file per document, then you must specify the index of the column that contains the words. Defaults to NULL.
For memory efficiency, you may want to store only the counts of unique words in csv files. If your data include counts, then you must specify the index of the column that contains the counts. Defaults to NULL.
Logical indicating whether the csv files provided have a header. Defaults to FALSE.
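To make the csv arguments concrete, here is a minimal sketch of how one document stored as a tab-separated file of words and counts might be read with base R. The file contents are invented for illustration, and the use of read.csv() is an assumption; it is not necessarily how the function reads its input.

```r
# Write a small tab-separated file with a header to a temporary location.
# The words and counts are made up for this example.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("word\tcount", "the\t3", "cat\t1", "sat\t2"), csv_path)

# Settings that mirror the csv_* arguments described above.
csv_separator <- "\t"
csv_word_column <- 1
csv_count_column <- 2
csv_header <- TRUE

# Read the file; whether the function itself uses read.csv() is an assumption.
doc <- read.csv(csv_path, sep = csv_separator, header = csv_header,
                stringsAsFactors = FALSE)

# Pull out the word and count columns by index.
words <- doc[[csv_word_column]]
counts <- doc[[csv_count_column]]

# Combine them into a named vector of counts, one entry per unique word.
term_counts <- setNames(counts, words)
```

If the files hold one word per row with no counts, csv_count_column would stay NULL and only the word column would be read.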
Logical indicating whether document term vectors should be condensed into unique terms and their counts (FALSE) or whether the full sequence of terms should be maintained for storage (TRUE). Defaults to FALSE, as this can be a much more memory-efficient representation.
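The difference between the two representations can be sketched in base R as follows. The token vector and the list layout with "terms" and "counts" elements are assumptions for illustration only; the objects the function stores may be structured differently.

```r
# A short, made-up token sequence for one document.
tokens <- c("the", "cat", "saw", "the", "other", "cat")

# keep_sequence = TRUE: every token is kept in its original order.
full_sequence <- tokens

# keep_sequence = FALSE: unique terms and their counts only.
# table() sorts the unique terms alphabetically.
tab <- table(tokens)
condensed <- list(terms = names(tab), counts = as.integer(tab))
```

The condensed form stores four terms and four counts rather than six tokens. The saving grows with document length and term repetition, but word order is lost.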
Saves blocks of text to file.