This object stores text embeddings, which are usually produced by an object of class TextEmbeddingModel. The data of these objects is not held in memory directly. By using memory mapping, these objects make it possible to work with data sets that do not fit into memory/RAM.
Objects of class LargeDataSetForTextEmbeddings are used for storing and managing the text embeddings created with objects of class TextEmbeddingModel. They serve as input for objects of class TEClassifierRegular, TEClassifierProtoNet, and TEFeatureExtractor. The main aim of this class is to provide a structured link between embedding models and classifiers. Since objects of this class save information on the text embedding model that created the text embeddings, they ensure that only embeddings generated with the same embedding model are combined. Furthermore, the stored information allows objects to check whether embeddings of the correct text embedding model are used for training and predicting.
Returns a new object of this class.
aifeducation::LargeDataSetBase -> LargeDataSetForTextEmbeddings
Inherited methods
aifeducation::LargeDataSetBase$get_all_fields()
aifeducation::LargeDataSetBase$get_colnames()
aifeducation::LargeDataSetBase$get_dataset()
aifeducation::LargeDataSetBase$get_ids()
aifeducation::LargeDataSetBase$load()
aifeducation::LargeDataSetBase$n_cols()
aifeducation::LargeDataSetBase$n_rows()
aifeducation::LargeDataSetBase$reduce_to_unique_ids()
aifeducation::LargeDataSetBase$save()
aifeducation::LargeDataSetBase$select()
configure()
Creates a new object representing text embeddings.
LargeDataSetForTextEmbeddings$configure(
model_name = NA,
model_label = NA,
model_date = NA,
model_method = NA,
model_version = NA,
model_language = NA,
param_seq_length = NA,
param_chunks = NULL,
param_features = NULL,
param_overlap = NULL,
param_emb_layer_min = NULL,
param_emb_layer_max = NULL,
param_emb_pool_type = NULL,
param_aggregation = NULL
)
model_name
string
Name of the model that generates this embedding.
model_label
string
Label of the model that generates this embedding.
model_date
string
Date when the embedding generating model was created.
model_method
string
Method of the underlying embedding model.
model_version
string
Version of the model that generated this embedding.
model_language
string
Language of the model that generated this embedding.
param_seq_length
int
Maximum number of tokens that the generating model processes per chunk.
param_chunks
int
Maximum number of chunks which are supported by the generating model.
param_features
int
Number of dimensions of the text embeddings.
param_overlap
int
Number of tokens that this model added at the beginning of the sequence for the next chunk.
param_emb_layer_min
int
or string
determining the first layer to be included in the creation of
embeddings.
param_emb_layer_max
int
or string
determining the last layer to be included in the creation of
embeddings.
param_emb_pool_type
string
determining the method for pooling the token embeddings within each layer.
param_aggregation
string
Aggregation method of the hidden states. Deprecated. Only included for backward
compatibility.
The method returns a new object of this class.
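A minimal sketch of how configure() might be called is given below. All metadata values are made-up placeholders, the helper name configure_demo_dataset is hypothetical, and the constructor call LargeDataSetForTextEmbeddings$new() mentioned in the comment is an assumption based on the usual R6 pattern rather than something stated on this page.

```r
# Hypothetical helper: configures `dataset` with placeholder metadata.
# `dataset` is assumed to expose the configure() and is_configured() methods
# documented on this page, e.g. an object created with
# LargeDataSetForTextEmbeddings$new() (assumed R6 constructor).
configure_demo_dataset <- function(dataset) {
  dataset$configure(
    model_name = "demo_embedding_model", # placeholder values throughout
    model_label = "Demo embedding model",
    model_date = "2024-01-01",
    model_method = "bert",
    model_version = "0.0.1",
    model_language = "english",
    param_seq_length = 512L,
    param_chunks = 4L,
    param_features = 768L,
    param_overlap = 30L,
    param_emb_layer_min = 6L,
    param_emb_layer_max = 8L,
    param_emb_pool_type = "average"
  )
  # An object can only be used once is_configured() reports TRUE.
  if (!isTRUE(dataset$is_configured())) {
    stop("Configuration did not succeed.")
  }
  invisible(dataset)
}
```

In practice these values would normally be taken from the TextEmbeddingModel that produced the embeddings rather than typed by hand.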
is_configured()
Method for checking if the model was successfully configured. An object can only be used if this value is TRUE.
LargeDataSetForTextEmbeddings$is_configured()
bool
TRUE if the model is fully configured, FALSE if not.
get_text_embedding_model_name()
Method for requesting the name (unique id) of the underlying text embedding model.
LargeDataSetForTextEmbeddings$get_text_embedding_model_name()
Returns a string describing the name of the text embedding model.
get_model_info()
Method for retrieving information about the model that generated this embedding.
LargeDataSetForTextEmbeddings$get_model_info()
list
containing all saved information about the underlying text embedding model.
load_from_disk()
Loads an object of class LargeDataSetForTextEmbeddings from disk and updates the object to the current version of the package.
LargeDataSetForTextEmbeddings$load_from_disk(dir_path)
dir_path
Path where the data set is stored.
Method does not return anything. It loads an object from disk.
get_model_label()
Method for retrieving the label of the model that generated this embedding.
LargeDataSetForTextEmbeddings$get_model_label()
string
Label of the corresponding text embedding model.
add_feature_extractor_info()
Method for setting information on the TEFeatureExtractor that was used to reduce the number of dimensions of the text embeddings. This information should only be set if a TEFeatureExtractor was applied.
LargeDataSetForTextEmbeddings$add_feature_extractor_info(
model_name,
model_label = NA,
features = NA,
method = NA,
noise_factor = NA,
optimizer = NA
)
model_name
string
Name of the underlying TextEmbeddingModel.
model_label
string
Label of the underlying TextEmbeddingModel.
features
int
Number of dimensions (features) of the compressed text embeddings.
method
string
Method that the TEFeatureExtractor applies for generating the compressed text embeddings.
noise_factor
double
Noise factor of the TEFeatureExtractor.
optimizer
string
Optimizer used during training the TEFeatureExtractor.
Method does not return anything. It sets information on a TEFeatureExtractor.
get_feature_extractor_info()
Method for retrieving information on the TEFeatureExtractor that was used to reduce the number of dimensions of the text embeddings.
LargeDataSetForTextEmbeddings$get_feature_extractor_info()
Returns a list with information on the TEFeatureExtractor. If no TEFeatureExtractor was used, it returns NULL.
is_compressed()
Checks if the text embeddings were reduced by a TEFeatureExtractor.
LargeDataSetForTextEmbeddings$is_compressed()
Returns TRUE if the number of dimensions was reduced by a TEFeatureExtractor. If not, returns FALSE.
get_times()
Number of chunks/times of the text embeddings.
LargeDataSetForTextEmbeddings$get_times()
Returns an int
describing the number of chunks/times of the text embeddings.
get_features()
Number of actual features/dimensions of the text embeddings. If a TEFeatureExtractor was used, the number of features is smaller than the original number of features. To obtain the original number of features (the number of features before applying a TEFeatureExtractor), use the method get_original_features of this class.
LargeDataSetForTextEmbeddings$get_features()
Returns an int
describing the number of features/dimensions of the text embeddings.
get_original_features()
Number of original features/dimensions of the text embeddings.
LargeDataSetForTextEmbeddings$get_original_features()
Returns an int
describing the number of features/dimensions if no TEFeatureExtractor is used or
before a TEFeatureExtractor is applied.
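The relationship between get_times(), get_features(), get_original_features() and is_compressed() can be inspected together. The sketch below collects them in one list; the helper name summarize_embedding_dims is hypothetical, and dataset is assumed to be a configured object of this class.

```r
# Hypothetical helper: gathers the dimension-related getters in one list.
# `dataset` is assumed to expose the methods documented on this page.
summarize_embedding_dims <- function(dataset) {
  list(
    times = dataset$get_times(),                         # chunks/times per case
    features = dataset$get_features(),                   # current feature count
    original_features = dataset$get_original_features(), # before compression
    compressed = dataset$is_compressed()                 # TRUE after a TEFeatureExtractor
  )
}
```

For a data set that was not reduced by a TEFeatureExtractor, features and original_features would be expected to be equal.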
add_embeddings_from_array()
Method for adding new data to the data set from an array. Please note that the method does not check if cases already exist in the data set. To reduce the data set to unique cases call the method reduce_to_unique_ids.
LargeDataSetForTextEmbeddings$add_embeddings_from_array(embedding_array)
embedding_array
array
containing the text embeddings.
The method does not return anything. It adds new data to the data set.
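Because the method does not check for existing cases, it can help to look for repeated ids before adding an array. The sketch below builds a small array in base R and lists its duplicated ids. The layout (cases x chunks/times x features, with case ids as row names) is an assumption that is not stated on this page, and all values are made up.

```r
# Assumed layout (not confirmed here): cases x chunks/times x features,
# with the case ids stored as the names of the first dimension.
n_cases <- 3L
n_times <- 2L
n_features <- 4L

embedding_array <- array(
  data = seq_len(n_cases * n_times * n_features) / 100, # made-up values
  dim = c(n_cases, n_times, n_features),
  dimnames = list(c("doc_1", "doc_2", "doc_2"), NULL, NULL)
)

# Collect every id that occurs more than once.
ids <- dimnames(embedding_array)[[1]]
duplicated_ids <- unique(ids[duplicated(ids)])

print(dim(embedding_array))
print(duplicated_ids)

# On a configured object (not run here):
# dataset$add_embeddings_from_array(embedding_array)
# dataset$reduce_to_unique_ids()
```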
add_embeddings_from_EmbeddedText()
Method for adding new data to the data set from an EmbeddedText. Please note that the method does not check if cases already exist in the data set. To reduce the data set to unique cases call the method reduce_to_unique_ids.
LargeDataSetForTextEmbeddings$add_embeddings_from_EmbeddedText(EmbeddedText)
EmbeddedText
Object of class EmbeddedText.
The method does not return anything. It adds new data to the data set.
add_embeddings_from_LargeDataSetForTextEmbeddings()
Method for adding new data to the data set from a LargeDataSetForTextEmbeddings. Please note that the method does not check if cases already exist in the data set. To reduce the data set to unique cases call the method reduce_to_unique_ids.
LargeDataSetForTextEmbeddings$add_embeddings_from_LargeDataSetForTextEmbeddings(
dataset
)
dataset
Object of class LargeDataSetForTextEmbeddings.
The method does not return anything. It adds new data to the data set.
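One way to combine two data sets and remove repeated cases afterwards is sketched below. The helper name merge_embedding_datasets is hypothetical. The explicit comparison of model names is only an illustrative precaution, and reduce_to_unique_ids() is assumed to take no required arguments.

```r
# Hypothetical helper: appends `source` to `target`, then drops repeated cases.
# Both arguments are assumed to be configured objects of this class.
merge_embedding_datasets <- function(target, source) {
  same_model <- identical(
    target$get_text_embedding_model_name(),
    source$get_text_embedding_model_name()
  )
  if (!same_model) {
    stop("The data sets were created with different text embedding models.")
  }
  target$add_embeddings_from_LargeDataSetForTextEmbeddings(dataset = source)
  # Adding data does not check for existing cases, so deduplicate afterwards.
  target$reduce_to_unique_ids()
  invisible(target)
}
```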
convert_to_EmbeddedText()
Method for converting this object to an object of class EmbeddedText.
Attention: This object uses memory mapping to allow the usage of data sets that do not fit into memory. Calling this method loads the complete data set into memory/RAM, which may lead to an out-of-memory error.
LargeDataSetForTextEmbeddings$convert_to_EmbeddedText()
Returns an object of class EmbeddedText which is stored in memory/RAM.
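Since the conversion loads everything into RAM, a rough size estimate beforehand can be useful. The sketch below multiplies cases, chunks/times and features by 8 bytes per value. The helper names are hypothetical, the 8 bytes assume R double precision, and the inherited n_rows() is assumed to return the number of cases. The real memory footprint may differ.

```r
# Hypothetical helper: rough lower bound for the in-memory size in bytes.
estimate_embedding_bytes <- function(dataset, bytes_per_value = 8) {
  as.numeric(dataset$n_rows()) *
    dataset$get_times() *
    dataset$get_features() *
    bytes_per_value
}

# Hypothetical guard: converts only if the estimate stays below `max_bytes`.
convert_if_small <- function(dataset, max_bytes = 2 * 1024^3) {
  needed <- estimate_embedding_bytes(dataset)
  if (needed > max_bytes) {
    stop("Estimated size of ", needed, " bytes exceeds the limit.")
  }
  dataset$convert_to_EmbeddedText()
}
```

The default limit of 2 GiB is arbitrary and should be adapted to the memory available on the machine.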
clone()
The objects of this class are cloneable with this method.
LargeDataSetForTextEmbeddings$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other Data Management: DataManagerClassifier, EmbeddedText, LargeDataSetForText