LargeDataSetForText: Abstract class for large data sets containing raw texts

Description

This object stores raw texts. The data of this object is not held in memory directly. Through memory mapping, objects of this class make it possible to work with data sets that do not fit into memory (RAM).

Value

Returns a new object of this class.

Super class

aifeducation::LargeDataSetBase -> LargeDataSetForText

Methods

Public methods

Inherited methods

Method `new()`

Method for creating a LargeDataSetForText instance. If init_data is passed, the instance is initialized with it (the add_from_data.frame() method is used if init_data is a data.frame).

Usage

LargeDataSetForText$new(init_data = NULL)

Arguments

init_data: Initial data.frame for the data set.

Returns

A new instance of this class initialized with init_data if passed.

Method `add_from_files_txt()`

Method for adding raw texts saved within .txt files to the data set. Please note that the directory should contain one folder for each .txt file. To create an informative data set, every folder can contain the following additional files:

bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text such as "CC BY".
url_license.txt: containing the url/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the url/link to the source on the internet.

The id of every .txt file is the file name without the file extension, so please make sure that the file names are unique. Id and raw texts are mandatory; bibliographic and license information are optional.

Usage

LargeDataSetForText$add_from_files_txt(
  dir_path,
  batch_size = 500,
  log_file = NULL,
  log_write_interval = 2,
  log_top_value = 0,
  log_top_total = 1,
  log_top_message = NA,
  trace = TRUE
)

Arguments

dir_path: Path to the directory where the files are stored.

batch_size

int determining the number of files to process at once.

log_file

string Path to the file where the log should be saved. If no logging is desired set this argument to NULL.

log_write_interval

int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.

log_top_value

int indicating the current iteration of the process.

log_top_total

int determining the maximal number of iterations.

log_top_message

string providing additional information about the process.

trace

bool If TRUE information on the progress is printed to the console.

Returns

The method does not return anything. It adds new raw texts to the data set.
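The directory layout described for add_from_files_txt() can be sketched as follows. This is a minimal illustration, not the package's own example: the folder names, file contents, and the `run_aifeducation` switch are assumptions made for this sketch. The call into aifeducation is only executed when the switch is set to TRUE, since it requires the package and its backend to be set up.

```r
# Build a small example directory: one sub-folder per text, each holding the
# raw text plus optional metadata files (all names and contents are illustrative).
root <- file.path(tempdir(), "txt_corpus")
texts <- c(doc_a = "First raw text.", doc_b = "Second raw text.")

for (id in names(texts)) {
  folder <- file.path(root, id)
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  # The file name without extension becomes the id of the text.
  writeLines(texts[[id]], file.path(folder, paste0(id, ".txt")))
  # Optional metadata files; leave them out if the information is unknown.
  writeLines("CC BY", file.path(folder, "license.txt"))
  writeLines("Hypothetical bibliographic entry", file.path(folder, "bib_entry.txt"))
}

# Inspect the resulting layout (paths are relative to root).
created_files <- list.files(root, recursive = TRUE)
print(created_files)

# Assumption: set this switch to TRUE only if aifeducation is installed and configured.
run_aifeducation <- FALSE
if (run_aifeducation) {
  data_set <- aifeducation::LargeDataSetForText$new()
  data_set$add_from_files_txt(dir_path = root, batch_size = 500, trace = TRUE)
}
```

Because the ids are derived from the file names, using the same name for the folder and the .txt file keeps the layout easy to audit.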

Method `add_from_files_pdf()`

Method for adding raw texts saved within .pdf files to the data set. Please note that the directory should contain one folder for each .pdf file. To create an informative data set, every folder can contain the following additional files:

bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text such as "CC BY".
url_license.txt: containing the url/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the url/link to the source on the internet.

The id of every .pdf file is the file name without the file extension, so please make sure that the file names are unique. Id and raw texts are mandatory; bibliographic and license information are optional.

Usage

LargeDataSetForText$add_from_files_pdf(
  dir_path,
  batch_size = 500,
  log_file = NULL,
  log_write_interval = 2,
  log_top_value = 0,
  log_top_total = 1,
  log_top_message = NA,
  trace = TRUE
)

Arguments

dir_path: Path to the directory where the files are stored.

batch_size

int determining the number of files to process at once.

log_file

string Path to the file where the log should be saved. If no logging is desired set this argument to NULL.

log_write_interval

int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.

log_top_value

int indicating the current iteration of the process.

log_top_total

int determining the maximal number of iterations.

log_top_message

string providing additional information about the process.

trace

bool If TRUE information on the progress is printed to the console.

Returns

The method does not return anything. It adds new raw texts to the data set.
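Since add_from_files_pdf() expects one folder per .pdf file, it can help to check a directory before importing it. The helper below is a hypothetical base R sketch and not part of aifeducation. It only counts the .pdf files in each sub-folder, and the empty placeholder file stands in for a real PDF.

```r
# Hypothetical helper (not part of aifeducation): report how many .pdf files
# each direct sub-folder of dir_path contains.
count_pdfs_per_folder <- function(dir_path) {
  folders <- list.dirs(dir_path, recursive = FALSE)
  n_pdf <- vapply(
    folders,
    function(folder) {
      length(list.files(folder, pattern = "\\.pdf$", ignore.case = TRUE))
    },
    integer(1)
  )
  data.frame(folder = basename(folders), n_pdf = unname(n_pdf))
}

# Illustrative directory: paper_a holds a placeholder .pdf, paper_b holds none.
pdf_root <- file.path(tempdir(), "pdf_corpus")
dir.create(file.path(pdf_root, "paper_a"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pdf_root, "paper_b"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(pdf_root, "paper_a", "paper_a.pdf"))

layout <- count_pdfs_per_folder(pdf_root)
# Folders that do not follow the one-pdf-per-folder layout.
problem_folders <- layout[layout$n_pdf != 1L, ]
print(problem_folders)
```

Folders listed in `problem_folders` would need to be fixed before passing the directory as dir_path.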

Method `add_from_files_xlsx()`

Method for adding raw texts saved within .xlsx files to the data set. The method assumes that every row holds one text and that the columns store the id and the raw text. In addition, columns for the bibliographic information and the license can be added. The names of these columns must be specified with the following arguments. They must be the same for all .xlsx files in the chosen directory. Id and raw texts are mandatory; bibliographic information, license, license's url, license's text, and source's url are optional. Additional columns are dropped.

Usage

LargeDataSetForText$add_from_files_xlsx(
  dir_path,
  trace = TRUE,
  id_column = "id",
  text_column = "text",
  bib_entry_column = "bib_entry",
  license_column = "license",
  url_license_column = "url_license",
  text_license_column = "text_license",
  url_source_column = "url_source",
  log_file = NULL,
  log_write_interval = 2,
  log_top_value = 0,
  log_top_total = 1,
  log_top_message = NA
)

Arguments

dir_path: Path to the directory where the files are stored.

trace

bool If TRUE prints information on the progress to the console.

id_column

string Name of the column storing the ids for the texts.

text_column

string Name of the column storing the raw text.

bib_entry_column

string Name of the column storing the bibliographic information of the texts.

license_column

string Name of the column storing information about the licenses.

url_license_column

string Name of the column storing the url to the license on the internet.

text_license_column

string Name of the column storing the license as text.

url_source_column

string Name of the column storing the url to the source on the internet.

log_file

string Path to the file where the log should be saved. If no logging is desired set this argument to NULL.

log_write_interval

int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.

log_top_value

int indicating the current iteration of the process.

log_top_total

int determining the maximal number of iterations.

log_top_message

string providing additional information about the process.

Returns

The method does not return anything. It adds new raw texts to the data set.
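The column arguments of add_from_files_xlsx() can be illustrated with a sheet whose id and text columns do not use the default names. This is a sketch under assumptions: the column names `doc_id` and `content`, the file name, and the `run_aifeducation` switch are invented for illustration, and the writexl package is assumed as one possible way to write the .xlsx file.

```r
# Illustrative table: one row per text, with non-default names for the id and
# text columns. The optional metadata columns other than license are left out.
sheet <- data.frame(
  doc_id = c("doc_a", "doc_b"),
  content = c("First raw text.", "Second raw text."),
  license = c("CC BY", "CC BY"),
  stringsAsFactors = FALSE
)

xlsx_dir <- file.path(tempdir(), "xlsx_corpus")
dir.create(xlsx_dir, showWarnings = FALSE)

# Assumption: set this switch to TRUE only if aifeducation is installed and configured.
run_aifeducation <- FALSE
if (run_aifeducation && requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(sheet, file.path(xlsx_dir, "texts.xlsx"))
  data_set <- aifeducation::LargeDataSetForText$new()
  # Map the custom column names onto the id and text arguments.
  data_set$add_from_files_xlsx(
    dir_path = xlsx_dir,
    id_column = "doc_id",
    text_column = "content"
  )
}
```

The license column already uses the default name, so license_column does not need to be set here.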

Method `add_from_data.frame()`

Method for adding raw texts from a data.frame.

Usage

LargeDataSetForText$add_from_data.frame(data_frame)

Arguments

data_frame: Object of class data.frame with at least the following columns: "id", "text", "bib_entry", "license", "url_license", "text_license", and "url_source". If "id" and/or "text" is missing, an error occurs. If the other columns are not present in the data.frame, they are added with empty values (NA). Additional columns are dropped.

Returns

The method does not return anything. It adds new raw texts to the data set.
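The column handling described for data_frame can be mimicked in base R to see what ends up in the data set. The function `normalize_text_frame()` below is a hypothetical illustration of the documented behavior, not the package's implementation. The `run_aifeducation` switch is likewise an assumption of this sketch.

```r
expected_cols <- c(
  "id", "text", "bib_entry", "license",
  "url_license", "text_license", "url_source"
)

# Hypothetical re-implementation of the documented rules for illustration only.
normalize_text_frame <- function(data_frame) {
  missing_mandatory <- setdiff(c("id", "text"), colnames(data_frame))
  if (length(missing_mandatory) > 0) {
    stop("Missing mandatory column(s): ", paste(missing_mandatory, collapse = ", "))
  }
  # Optional columns that are absent are added and filled with NA.
  for (col in setdiff(expected_cols, colnames(data_frame))) {
    data_frame[[col]] <- rep(NA_character_, nrow(data_frame))
  }
  # Additional columns are dropped by selecting only the expected ones.
  data_frame[, expected_cols, drop = FALSE]
}

raw <- data.frame(
  id = c("doc_a", "doc_b"),
  text = c("First raw text.", "Second raw text."),
  notes = c("extra column", "extra column"),
  stringsAsFactors = FALSE
)
normalized <- normalize_text_frame(raw)
print(colnames(normalized))

# Assumption: set this switch to TRUE only if aifeducation is installed and configured.
run_aifeducation <- FALSE
if (run_aifeducation) {
  data_set <- aifeducation::LargeDataSetForText$new()
  data_set$add_from_data.frame(raw)
}
```

According to the description of new(), passing the same data.frame as init_data should have the same effect as calling add_from_data.frame() on a fresh instance.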

Method `get_private()`

Method for requesting all private fields and methods. Used for loading and updating an object.

Usage

LargeDataSetForText$get_private()

Returns

Returns a list with all private fields and methods.

Method `clone()`

The objects of this class are cloneable with this method.

Usage

LargeDataSetForText$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.
