Abstract class for neural nets with 'keras'/'tensorflow' and 'pytorch'.
Objects of this class are used for assigning texts to classes/categories. For the creation and training of a classifier an object of class EmbeddedText or LargeDataSetForTextEmbeddings on the one hand and a factor on the other hand are necessary.
The object of class EmbeddedText or LargeDataSetForTextEmbeddings contains the numerical text representations (text embeddings) of the raw texts generated by an object of class TextEmbeddingModel. For supporting large data sets it is recommended to use LargeDataSetForTextEmbeddings instead of EmbeddedText.
The factor contains the classes/categories for every text. Missing values (unlabeled cases) are supported and can be used for pseudo-labeling.
For predictions an object of class EmbeddedText or LargeDataSetForTextEmbeddings has to be used which was created with the same TextEmbeddingModel as for training.
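As a minimal sketch of the target data described above, the following base R snippet builds a named factor in which NA marks an unlabeled case. The identifiers (doc_1, ...) and the category names are hypothetical; the naming requirement comes from the data_targets argument of train(), where the names must match those of the embeddings.

```r
# Hypothetical case identifiers; in practice these would be the names
# of the cases stored in the embeddings object.
embedding_ids <- c("doc_1", "doc_2", "doc_3", "doc_4")

# One class/category per text; NA marks an unlabeled case.
targets <- factor(
  c("positive", NA, "negative", "positive"),
  levels = c("negative", "positive")
)
names(targets) <- embedding_ids

# Split the identifiers into labeled and unlabeled cases.
labeled_ids <- names(targets)[!is.na(targets)]
unlabeled_ids <- names(targets)[is.na(targets)]
```

Here unlabeled_ids holds the cases that could serve as candidates for pseudo-labeling.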
aifeducation::AIFEBaseModel
-> TEClassifierRegular
feature_extractor
('list()')
List for storing information and objects about the feature_extractor.
reliability
('list()')
List for storing central reliability measures of the last training.
reliability$test_metric
: Array containing the reliability measures for the test data for
every fold and step (in case of pseudo-labeling).
reliability$test_metric_mean
: Array containing the reliability measures for the test data.
The values represent the mean values for every fold.
reliability$raw_iota_objects
: List containing all iota objects generated with the package iotarelr
for every fold at the end of the last training for the test data.
reliability$raw_iota_objects$iota_objects_end
: List of objects with class iotarelr_iota2
containing the
estimated iota reliability of the second generation for the final model for every fold for the test data.
reliability$raw_iota_objects$iota_objects_end_free
: List of objects with class iotarelr_iota2
containing
the estimated iota reliability of the second generation for the final model for every fold for the test data.
Please note that the model is estimated without forcing the Assignment Error Matrix to be in line with the
assumption of weak superiority.
reliability$iota_object_end
: Object of class iotarelr_iota2
as a mean of the individual objects
for every fold for the test data.
reliability$iota_object_end_free
: Object of class iotarelr_iota2
as a mean of the individual objects
for every fold. Please note that the model is estimated without forcing the Assignment Error Matrix to be in
line with the assumption of weak superiority.
reliability$standard_measures_end
: Object of class list
containing the final measures for precision,
recall, and f1 for every fold.
reliability$standard_measures_mean
: matrix
containing the mean measures for precision, recall, and f1.
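To illustrate how the per-fold measures relate to the mean measures, here is a small base R sketch. The layout (one matrix per fold, classes in rows, measures in columns) and all numbers are assumptions made for illustration; the actual structure of reliability$standard_measures_end may differ.

```r
# Helper that builds a hypothetical per-fold matrix:
# classes in rows, measures in columns.
make_fold <- function(values) {
  matrix(
    values,
    nrow = 2, byrow = TRUE,
    dimnames = list(
      c("class_a", "class_b"),
      c("precision", "recall", "f1")
    )
  )
}

# Made-up results for two folds.
measures_per_fold <- list(
  make_fold(c(0.80, 0.70, 0.75, 0.60, 0.90, 0.72)),
  make_fold(c(0.90, 0.80, 0.85, 0.70, 0.80, 0.75))
)

# Element-wise mean over all folds.
measures_mean <- Reduce(`+`, measures_per_fold) / length(measures_per_fold)
```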
Inherited methods
aifeducation::AIFEBaseModel$count_parameter()
aifeducation::AIFEBaseModel$get_all_fields()
aifeducation::AIFEBaseModel$get_documentation_license()
aifeducation::AIFEBaseModel$get_ml_framework()
aifeducation::AIFEBaseModel$get_model_description()
aifeducation::AIFEBaseModel$get_model_info()
aifeducation::AIFEBaseModel$get_model_license()
aifeducation::AIFEBaseModel$get_package_versions()
aifeducation::AIFEBaseModel$get_private()
aifeducation::AIFEBaseModel$get_publication_info()
aifeducation::AIFEBaseModel$get_sustainability_data()
aifeducation::AIFEBaseModel$get_text_embedding_model()
aifeducation::AIFEBaseModel$get_text_embedding_model_name()
aifeducation::AIFEBaseModel$is_configured()
aifeducation::AIFEBaseModel$load()
aifeducation::AIFEBaseModel$set_documentation_license()
aifeducation::AIFEBaseModel$set_model_description()
aifeducation::AIFEBaseModel$set_model_license()
aifeducation::AIFEBaseModel$set_publication_info()
configure()
Creating a new instance of this class.
TEClassifierRegular$configure(
ml_framework = "pytorch",
name = NULL,
label = NULL,
text_embeddings = NULL,
feature_extractor = NULL,
target_levels = NULL,
dense_size = 4,
dense_layers = 0,
rec_size = 4,
rec_layers = 2,
rec_type = "gru",
rec_bidirectional = FALSE,
self_attention_heads = 0,
intermediate_size = NULL,
attention_type = "fourier",
add_pos_embedding = TRUE,
rec_dropout = 0.1,
repeat_encoder = 1,
dense_dropout = 0.4,
recurrent_dropout = 0.4,
encoder_dropout = 0.1,
optimizer = "adam"
)
ml_framework
string
Framework to use for training and inference. ml_framework="tensorflow" for 'tensorflow' and ml_framework="pytorch" for 'pytorch'.
name
string
Name of the new classifier. Please follow common naming conventions. Free text can be used with the parameter label.
label
string
Label for the new classifier. Here you can use free text.
text_embeddings
An object of class EmbeddedText or LargeDataSetForTextEmbeddings.
feature_extractor
Object of class TEFeatureExtractor which should be used in order to reduce the number
of dimensions of the text embeddings. If no feature extractor should be applied set NULL
.
target_levels
vector
containing the levels (categories or classes) within the target data. Please note
that the order matters for ordinal data: ensure that the levels are sorted correctly, with later levels
indicating a higher category/class. For nominal data the order does not matter.
dense_size
int
Number of neurons for each dense layer.
dense_layers
int
Number of dense layers.
rec_size
int
Number of neurons for each recurrent layer.
rec_layers
int
Number of recurrent layers.
rec_type
string
Type of the recurrent layers. rec_type="gru"
for Gated Recurrent Unit and
rec_type="lstm"
for Long Short-Term Memory.
rec_bidirectional
bool
If TRUE
a bidirectional version of the recurrent layers is used.
self_attention_heads
int
determining the number of attention heads for a self-attention layer. Only
relevant if attention_type="multihead"
intermediate_size
int
determining the size of the projection layer within each transformer encoder.
attention_type
string
Choose the relevant attention type. Possible values are "fourier" and "multihead". Please note
that you may see different values for a case for different input orders if you choose "fourier" on Linux.
add_pos_embedding
bool
TRUE
if positional embedding should be used.
rec_dropout
double
ranging from 0 up to, but not including, 1, determining the dropout between bidirectional recurrent
layers.
repeat_encoder
int
determining how many times the encoder should be added to the network.
dense_dropout
double
ranging from 0 up to, but not including, 1, determining the dropout between dense layers.
recurrent_dropout
double
ranging from 0 up to, but not including, 1, determining the recurrent dropout for each
recurrent layer. Only relevant for keras models.
encoder_dropout
double
ranging from 0 up to, but not including, 1, determining the dropout for the dense projection
within the encoder layers.
optimizer
string
"adam"
or "rmsprop"
.
Returns an object of class TEClassifierRegular which is ready for training.
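The note on target_levels above can be illustrated with a short base R sketch. The level names are hypothetical. The point is only that the levels are passed in ascending order for ordinal data, instead of relying on the alphabetical default of factor().

```r
# Hypothetical ordinal categories, sorted from lowest to highest.
target_levels <- c("low", "medium", "high")

raw_labels <- c("high", "low", "medium", "low")

# Explicit levels keep the intended order: low < medium < high.
targets <- factor(raw_labels, levels = target_levels)
level_positions <- as.integer(targets)

# Without explicit levels, factor() sorts the labels alphabetically,
# which would place "high" before "low".
default_levels <- levels(factor(raw_labels))
```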
train()
Method for training a neural net.
Training includes a routine for early stopping: training stops as soon as loss<0.0001, Accuracy=1.00, and Average Iota=1.00. In that case, the history uses the values of the last trained epoch for the remaining epochs.
After training the model with the best values for Average Iota, Accuracy, and Loss on the validation data set is used as the final model.
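The early stopping rule described above can be sketched in base R as follows. The function names should_stop and pad_history are hypothetical and not part of the package. The exact comparison with 1 mirrors the wording of the rule and is a simplification.

```r
# TRUE if the stopping condition from the description is met.
should_stop <- function(loss, accuracy, avg_iota) {
  loss < 0.0001 && accuracy == 1 && avg_iota == 1
}

# Fill the history for the epochs that were skipped with the value
# of the last trained epoch.
pad_history <- function(history, epochs) {
  n_trained <- length(history)
  stopifnot(n_trained > 0)
  if (n_trained >= epochs) {
    return(history[seq_len(epochs)])
  }
  c(history, rep(history[n_trained], epochs - n_trained))
}

# Example: training stopped after 3 of 5 planned epochs.
loss_history <- pad_history(c(0.5, 0.2, 0.00005), epochs = 5)
```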
TEClassifierRegular$train(
data_embeddings,
data_targets,
data_folds = 5,
data_val_size = 0.25,
balance_class_weights = TRUE,
balance_sequence_length = TRUE,
use_sc = TRUE,
sc_method = "dbsmote",
sc_min_k = 1,
sc_max_k = 10,
use_pl = TRUE,
pl_max_steps = 3,
pl_max = 1,
pl_anchor = 1,
pl_min = 0,
sustain_track = TRUE,
sustain_iso_code = NULL,
sustain_region = NULL,
sustain_interval = 15,
epochs = 40,
batch_size = 32,
dir_checkpoint,
trace = TRUE,
ml_trace = 1,
log_dir = NULL,
log_write_interval = 10,
n_cores = auto_n_cores()
)
data_embeddings
Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
data_targets
factor
containing the labels for the cases stored in data_embeddings. The factor must be named, and its names must match the names used in data_embeddings.
data_folds
int
determining the number of cross-fold samples.
data_val_size
double
between 0 and 1, indicating the proportion of cases of each class which should be
used for the validation sample during the estimation of the model. The remaining cases are part of the training
data.
balance_class_weights
bool
If TRUE
class weights are generated based on the frequencies of the
training data with the method 'Inverse Class Frequency'. If FALSE
each class has the weight 1.
balance_sequence_length
bool
If TRUE
sample weights are generated for the length of sequences based on
the frequencies of the training data with the method 'Inverse Class Frequency'. If FALSE
each sequence length
has the weight 1.
use_sc
bool
TRUE
if the estimation should integrate synthetic cases. FALSE
if not.
sc_method
vector
containing the method for generating synthetic cases. Possible values are sc_method="adas", sc_method="smote", and sc_method="dbsmote".
sc_min_k
int
determining the minimal number of k which is used for creating synthetic units.
sc_max_k
int
determining the maximal number of k which is used for creating synthetic units.
use_pl
bool
TRUE
if the estimation should integrate pseudo-labeling. FALSE
if not.
pl_max_steps
int
determining the maximum number of steps during pseudo-labeling.
pl_max
double
between 0 and 1, setting the maximal level of confidence for considering a case for
pseudo-labeling.
pl_anchor
double
between 0 and 1 indicating the reference point for sorting the new cases of every
label. See notes for more details.
pl_min
double
between 0 and 1, setting the minimal level of confidence for considering a case for
pseudo-labeling.
sustain_track
bool
If TRUE
energy consumption is tracked during training via the python library
'codecarbon'.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if
sustainability should be tracked. A list can be found on Wikipedia:
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information. https://mlco2.github.io/codecarbon/parameters.html
sustain_interval
int
Interval in seconds for measuring power usage.
epochs
int
Number of training epochs.
batch_size
int
Size of the batches for training.
dir_checkpoint
string
Path to the directory where checkpoints should be saved during training. If the
directory does not exist, it is created.
trace
bool
TRUE
, if information about the estimation phase should be printed to the console.
ml_trace
int
ml_trace=0
does not print any information about the training process from pytorch on the
console.
log_dir
string
Path to the directory where the log files should be saved. If no logging is desired set
this argument to NULL
.
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update
the log files. Only relevant if log_dir
is not NULL
.
n_cores
int
Number of cores which should be used during the calculation of synthetic cases. Only relevant if
use_sc=TRUE
.
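The balance_class_weights argument above names the method 'Inverse Class Frequency' without giving a formula. The base R sketch below shows one common variant, weight = n / (number of classes * class count). This is an assumption for illustration, and the normalization used by the package may differ. The function name is hypothetical.

```r
# One common form of inverse class frequency weights.
# Unlabeled cases (NA) are ignored by table() by default.
inverse_class_frequency <- function(targets) {
  counts <- table(targets)
  counts <- counts[counts > 0]
  weights <- sum(counts) / (length(counts) * counts)
  setNames(as.numeric(weights), names(counts))
}

# Three cases of class "a" and one case of class "b":
# the rare class receives the larger weight.
targets <- factor(c("a", "a", "a", "b"))
class_weights <- inverse_class_frequency(targets)
```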
sc_max_k
: All values from sc_min_k up to sc_max_k are successively used. If
sc_max_k is too high, the value is reduced to a number that allows the calculation of synthetic
units.
pl_anchor
: With the help of this value, the new cases are sorted. To
this end, the distance from the anchor is calculated and all cases are arranged in ascending order of that distance.
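The sorting described in the pl_anchor note can be sketched in base R for the cases of a single label. The function name and the confidence values are hypothetical. Filtering by pl_min and pl_max before sorting is an assumption about how the three parameters interact.

```r
# Keep cases whose confidence lies within [pl_min, pl_max] and sort
# them by their distance from pl_anchor in ascending order.
select_pseudo_labels <- function(confidence, pl_min = 0, pl_max = 1,
                                 pl_anchor = 1) {
  keep <- confidence >= pl_min & confidence <= pl_max
  candidates <- confidence[keep]
  candidates[order(abs(candidates - pl_anchor))]
}

# Made-up confidence values for four unlabeled cases.
confidence <- c(doc_1 = 0.95, doc_2 = 0.55, doc_3 = 0.80, doc_4 = 0.30)

# With an anchor of 1, the most certain cases come first.
ranked_high <- select_pseudo_labels(confidence, pl_min = 0.5, pl_anchor = 1)

# With an anchor of 0.8, cases closest to 0.8 come first.
ranked_mid <- select_pseudo_labels(confidence, pl_min = 0.5, pl_anchor = 0.8)
```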
Function does not return a value. It changes the object into a trained classifier.
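The note on sc_max_k in this section can be sketched as follows. The documentation does not state how the upper bound is determined, so the argument k_limit is a hypothetical stand-in for the largest k that still allows synthetic units to be calculated. The function name is hypothetical as well.

```r
# All values of k from sc_min_k up to sc_max_k, with the upper end
# reduced to k_limit if sc_max_k is too high.
sc_k_values <- function(sc_min_k, sc_max_k, k_limit) {
  upper <- min(sc_max_k, k_limit)
  if (upper < sc_min_k) {
    return(integer(0))
  }
  as.integer(seq.int(sc_min_k, upper))
}

# sc_max_k = 10 is too high for a limit of 4, so k runs from 1 to 4.
k_used <- sc_k_values(sc_min_k = 1, sc_max_k = 10, k_limit = 4)
```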
predict()
Method for predicting new data with a trained neural net.
TEClassifierRegular$predict(newdata, batch_size = 32, ml_trace = 1)
newdata
Object of class EmbeddedText or LargeDataSetForTextEmbeddings for which predictions
should be made. In addition, this method accepts objects of class array and datasets.arrow_dataset.Dataset. However, these should be used only by developers.
batch_size
int
Size of batches.
ml_trace
int
ml_trace=0
does not print any information on the process from the machine learning
framework.
Returns a data.frame
containing the predictions and the probabilities of the different labels for each
case.
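To illustrate the kind of result described above, the base R sketch below derives a prediction from per-label probabilities by picking the label with the highest probability. The column name expected_category, the label names, and the numbers are assumptions for illustration; the actual layout of the returned data.frame may differ.

```r
# Made-up probabilities for two cases and three labels.
probabilities <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc_1", "doc_2"), c("low", "medium", "high"))
)

# Label with the highest probability for every case.
best_label <- colnames(probabilities)[
  max.col(probabilities, ties.method = "first")
]

# Combine probabilities and predictions in one data.frame.
predictions <- data.frame(
  probabilities,
  expected_category = factor(best_label, levels = colnames(probabilities))
)
```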
check_embedding_model()
Method for checking if the provided text embeddings are created with the same TextEmbeddingModel as the classifier.
TEClassifierRegular$check_embedding_model(
text_embeddings,
require_compressed = FALSE
)
text_embeddings
Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
require_compressed
TRUE
if a compressed version of the embeddings is necessary. Compressed embeddings
are created by an object of class TEFeatureExtractor.
Returns TRUE if the underlying TextEmbeddingModel is the same and FALSE if the models differ.
check_feature_extractor_object_type()
Method for checking an object of class TEFeatureExtractor.
TEClassifierRegular$check_feature_extractor_object_type(feature_extractor)
feature_extractor
Object of class TEFeatureExtractor
This method does not return anything. It raises an error if
the object is NULL,
the object does not rely on the same machine learning framework as the classifier, or
the object is not trained.
requires_compression()
Method for checking if provided text embeddings must be compressed via a TEFeatureExtractor before processing.
TEClassifierRegular$requires_compression(text_embeddings)
text_embeddings
Object of class EmbeddedText, LargeDataSetForTextEmbeddings, array
or
datasets.arrow_dataset.Dataset
.
Returns TRUE if compression is necessary and FALSE if not.
save()
Method for saving a model to disk.
TEClassifierRegular$save(dir_path, folder_name)
dir_path
string
Path of the directory where the model should be saved.
folder_name
string
Name of the folder that should be created within the directory.
Function does not return a value. It saves the model to disk.
load_from_disk()
Loads an object from disk and updates the object to the current version of the package.
TEClassifierRegular$load_from_disk(dir_path)
dir_path
Path where the object is stored.
Method does not return anything. It loads an object from disk.
clone()
The objects of this class are cloneable with this method.
TEClassifierRegular$clone(deep = FALSE)
deep
Whether to make a deep clone.
Other Classification:
TEClassifierProtoNet