This object stores raw texts. The data of these objects is not held in memory directly. Through memory mapping, these objects make it possible to work with data sets that do not fit into memory (RAM).
Returns a new object of this class.
aifeducation::LargeDataSetBase -> LargeDataSetForText
Inherited methods
aifeducation::LargeDataSetBase$get_all_fields()
aifeducation::LargeDataSetBase$get_colnames()
aifeducation::LargeDataSetBase$get_dataset()
aifeducation::LargeDataSetBase$get_ids()
aifeducation::LargeDataSetBase$load()
aifeducation::LargeDataSetBase$load_from_disk()
aifeducation::LargeDataSetBase$n_cols()
aifeducation::LargeDataSetBase$n_rows()
aifeducation::LargeDataSetBase$reduce_to_unique_ids()
aifeducation::LargeDataSetBase$save()
aifeducation::LargeDataSetBase$select()
new()
Method for creating a LargeDataSetForText instance. If init_data is passed, the instance is initialized with it (uses the add_from_data.frame() method if init_data is a data.frame).
LargeDataSetForText$new(init_data = NULL)
init_data
Initial data.frame for the data set.
A new instance of this class, initialized with init_data if passed.
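A minimal sketch of initializing the data set from a data.frame. The ids and texts are made up for illustration, and the call is guarded because it assumes that the aifeducation package and its backend are installed and configured.

```r
# Illustrative data; only "id" and "text" are mandatory.
init_df <- data.frame(
  id = c("doc_1", "doc_2"),
  text = c("First raw text.", "Second raw text."),
  stringsAsFactors = FALSE
)

# Guarded call: assumes aifeducation and its backend are available.
if (requireNamespace("aifeducation", quietly = TRUE)) {
  tryCatch(
    {
      text_data <- aifeducation::LargeDataSetForText$new(init_data = init_df)
      # n_rows() is inherited from LargeDataSetBase.
      print(text_data$n_rows())
    },
    error = function(e) message("Skipped: ", conditionMessage(e))
  )
}
```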
add_from_files_txt()
Method for adding raw texts saved in .txt files to the data set. Please note that the directory should contain one folder for each .txt file. To create an informative data set, every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text such as "CC BY".
url_license.txt: containing the URL/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the URL/link to the source on the internet.
The id of every .txt file is the file name without the file extension. Please make sure to provide unique file names. Id and raw texts are mandatory; bibliographic and license information are optional.
LargeDataSetForText$add_from_files_txt(
dir_path,
batch_size = 500,
log_file = NULL,
log_write_interval = 2,
log_top_value = 0,
log_top_total = 1,
log_top_message = NA,
trace = TRUE
)
dir_path
Path to the directory where the files are stored.
batch_size
int determining the number of files to process at once.
log_file
string. Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval
int. Time in seconds determining the interval at which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value
int indicating the current iteration of the process.
log_top_total
int determining the maximal number of iterations.
log_top_message
string providing additional information about the process.
trace
bool. If TRUE, information on the progress is printed to the console.
The method does not return anything. It adds new raw texts to the data set.
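The directory layout described above can be sketched as follows. The folder names, file contents, and the temporary location are illustrative assumptions. The call to add_from_files_txt() is guarded because it assumes a working aifeducation installation.

```r
# Build an illustrative directory: one folder per .txt file.
root <- file.path(tempdir(), "raw_texts")
for (doc in c("doc_1", "doc_2")) {
  doc_dir <- file.path(root, doc)
  dir.create(doc_dir, recursive = TRUE, showWarnings = FALSE)
  # The file name without its extension becomes the id of the text.
  writeLines(paste("Raw text of", doc), file.path(doc_dir, paste0(doc, ".txt")))
  # Optional metadata files, using the file names listed in the description.
  writeLines("Author (Year). Title.", file.path(doc_dir, "bib_entry.txt"))
  writeLines("CC BY", file.path(doc_dir, "license.txt"))
}

# Guarded call: assumes aifeducation and its backend are available.
if (requireNamespace("aifeducation", quietly = TRUE)) {
  tryCatch(
    {
      text_data <- aifeducation::LargeDataSetForText$new()
      text_data$add_from_files_txt(
        dir_path = root,
        batch_size = 500,
        log_file = NULL, # no logging
        trace = FALSE
      )
      # get_ids() is inherited from LargeDataSetBase.
      print(text_data$get_ids())
    },
    error = function(e) message("Skipped: ", conditionMessage(e))
  )
}
```

The same layout applies to add_from_files_pdf(), with a .pdf file in each folder instead of a .txt file.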
add_from_files_pdf()
Method for adding raw texts saved in .pdf files to the data set. Please note that the directory should contain one folder for each .pdf file. To create an informative data set, every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text such as "CC BY".
url_license.txt: containing the URL/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the URL/link to the source on the internet.
The id of every .pdf file is the file name without the file extension. Please make sure to provide unique file names. Id and raw texts are mandatory; bibliographic and license information are optional.
LargeDataSetForText$add_from_files_pdf(
dir_path,
batch_size = 500,
log_file = NULL,
log_write_interval = 2,
log_top_value = 0,
log_top_total = 1,
log_top_message = NA,
trace = TRUE
)
dir_path
Path to the directory where the files are stored.
batch_size
int determining the number of files to process at once.
log_file
string. Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval
int. Time in seconds determining the interval at which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value
int indicating the current iteration of the process.
log_top_total
int determining the maximal number of iterations.
log_top_message
string providing additional information about the process.
trace
bool. If TRUE, information on the progress is printed to the console.
The method does not return anything. It adds new raw texts to the data set.
add_from_files_xlsx()
Method for adding raw texts saved in .xlsx files to the data set. The method assumes that each row holds one text and that the columns store the id and the raw text. In addition, columns for the bibliographic information and the license can be added. The names of these columns must be specified with the following arguments. They must be the same for all .xlsx files in the chosen directory. Id and raw texts are mandatory; bibliographic information, license, the license's URL, the license's text, and the source's URL are optional. Additional columns are dropped.
LargeDataSetForText$add_from_files_xlsx(
dir_path,
trace = TRUE,
id_column = "id",
text_column = "text",
bib_entry_column = "bib_entry",
license_column = "license",
url_license_column = "url_license",
text_license_column = "text_license",
url_source_column = "url_source",
log_file = NULL,
log_write_interval = 2,
log_top_value = 0,
log_top_total = 1,
log_top_message = NA
)
dir_path
Path to the directory where the files are stored.
trace
bool. If TRUE, prints information on the progress to the console.
id_column
string. Name of the column storing the ids for the texts.
text_column
string. Name of the column storing the raw text.
bib_entry_column
string. Name of the column storing the bibliographic information of the texts.
license_column
string. Name of the column storing information about the licenses.
url_license_column
string. Name of the column storing the URL to the license on the internet.
text_license_column
string. Name of the column storing the license as text.
url_source_column
string. Name of the column storing the URL to the source on the internet.
log_file
string. Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval
int. Time in seconds determining the interval at which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value
int indicating the current iteration of the process.
log_top_total
int determining the maximal number of iterations.
log_top_message
string providing additional information about the process.
The method does not return anything. It adds new raw texts to the data set.
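A sketch of a spreadsheet that matches the default column arguments. The data, the "notes" column, and the use of the writexl package to write the file are assumptions for illustration. Any tool that produces an .xlsx file with these column names should work equally well. Both the writing step and the import are guarded.

```r
# One row per text; column names match the default arguments.
sheet <- data.frame(
  id = c("doc_1", "doc_2"),
  text = c("First raw text.", "Second raw text."),
  bib_entry = c("Author (Year). Title.", NA),
  license = c("CC BY", "CC BY"),
  notes = c("internal", "internal"), # additional column; documented as dropped
  stringsAsFactors = FALSE
)

xlsx_dir <- file.path(tempdir(), "xlsx_texts")
dir.create(xlsx_dir, recursive = TRUE, showWarnings = FALSE)

# Writing the file assumes the writexl package is installed.
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(sheet, path = file.path(xlsx_dir, "texts.xlsx"))
}

# Guarded import: assumes aifeducation is available and a file was written.
has_xlsx <- length(list.files(xlsx_dir, pattern = "\\.xlsx$")) > 0
if (has_xlsx && requireNamespace("aifeducation", quietly = TRUE)) {
  tryCatch(
    {
      text_data <- aifeducation::LargeDataSetForText$new()
      text_data$add_from_files_xlsx(
        dir_path = xlsx_dir,
        trace = FALSE,
        id_column = "id",
        text_column = "text"
      )
      print(text_data$n_rows())
    },
    error = function(e) message("Skipped: ", conditionMessage(e))
  )
}
```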
add_from_data.frame()
Method for adding raw texts from a data.frame.
LargeDataSetForText$add_from_data.frame(data_frame)
data_frame
Object of class data.frame with at least the following columns: "id", "text", "bib_entry", "license", "url_license", "text_license", and "url_source". If "id" and/or "text" is missing, an error occurs. If the other columns are not present in the data.frame, they are added with empty values (NA). Additional columns are dropped.
The method does not return anything. It adds new raw texts to the data set.
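The column handling described for data_frame can be sketched in base R. The helper normalize_text_frame() is hypothetical and not part of the package. It only mirrors the documented behavior (error on missing mandatory columns, NA for missing optional columns, additional columns dropped) and makes no claim about how the method is implemented internally.

```r
expected_cols <- c(
  "id", "text", "bib_entry", "license",
  "url_license", "text_license", "url_source"
)

# Hypothetical helper mirroring the documented behavior.
normalize_text_frame <- function(df) {
  # "id" and "text" are mandatory.
  missing_required <- setdiff(c("id", "text"), names(df))
  if (length(missing_required) > 0) {
    stop("Missing mandatory column(s): ", paste(missing_required, collapse = ", "))
  }
  # Optional columns that are absent are added and filled with NA.
  for (col in setdiff(expected_cols, names(df))) {
    df[[col]] <- rep(NA_character_, nrow(df))
  }
  # Additional columns are dropped; the remaining ones get a fixed order.
  df[, expected_cols, drop = FALSE]
}

raw_df <- data.frame(
  id = "doc_1",
  text = "Some raw text.",
  extra = 1,
  stringsAsFactors = FALSE
)
clean_df <- normalize_text_frame(raw_df)
print(names(clean_df))
```

Since the method is documented to apply these rules itself, passing a data.frame like raw_df directly to add_from_data.frame() should not require this preprocessing.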
get_private()
Method for requesting all private fields and methods. Used for loading and updating an object.
LargeDataSetForText$get_private()
Returns a list with all private fields and methods.
clone()
The objects of this class are cloneable with this method.
LargeDataSetForText$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other Data Management: DataManagerClassifier, EmbeddedText, LargeDataSetForTextEmbeddings