Final data preparation before applying ML algorithms. The function returns the final data set along with highlights of the data preparation steps.
autoDataprep(
  data,
  target = NULL,
  missimpute = "default",
  auto_mar = FALSE,
  mar_object = NULL,
  dummyvar = TRUE,
  char_var_limit = 12,
  aucv = 0.02,
  corr = 0.99,
  outlier_flag = FALSE,
  interaction_var = FALSE,
  frequent_var = FALSE,
  uid = NULL,
  onlykeep = NULL,
  drop = NULL,
  verbose = FALSE
)
The output is a list containing the following objects:
complete_data
complete dataset including new derived features based on the functional understanding of the dataset
master_data
filtered dataset based on the input parameters
final_var_list
list of master variables
auc_var
list of auc variables
cor_var
list of correlation variables
overall_var
all variables in the dataset
zerovariance
variables with zero variance in the dataset
Arguments
data
[data.frame | Required] data frame or data.table
target
[character | Required] name of the dependent variable (binary or multiclass), e.g. "target_var"
missimpute
[text | Optional] missing value imputation using the mlr impute function. Please refer to the "details" section to know more
auto_mar
[logical | Optional] identify whether any variables with missing values are missing completely at random or not (default is FALSE). If TRUE, this will call autoMAR()
mar_object
[character | Optional] object created by the autoMAR function
dummyvar
[logical | Optional] categorical feature engineering, i.e. one hot encoding (default is TRUE)
char_var_limit
[integer | Optional] limit on the number of levels for dummy variable preparation (default is 12). For example, if a gender variable has the two values "M" and "F", then gender has 2 levels
aucv
[numeric | Optional] cut off value for AUC based variable selection (default is 0.02)
corr
[numeric | Optional] cut off value for correlation based variable selection (default is 0.99)
outlier_flag
[logical | Optional] add outlier features (default is FALSE)
interaction_var
[logical | Optional] bulk interactions transformer for numerical features (default is FALSE)
frequent_var
[logical | Optional] frequent transformer for categorical features (default is FALSE)
uid
[character | Optional] unique identifier column, if any, to keep in the final data set
onlykeep
[character | Optional] only consider the selected variables for data preparation
drop
[character | Optional] exclude variables from the dataset
verbose
[logical | Optional] display execution steps on the console (default is FALSE)
Missing imputation using impute function from MLR
The MLR package provides several methods to impute missing values:
mean value for integer variable
median value for numeric variable
mode value for character or factor variable
optional: you may prefer to impute missing values using an ML method. The learners in the MLR package that can handle missing values are listed by listLearners("classif", check.packages = TRUE, properties = "missings")[c("class", "package")]
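The mean, median and mode rules listed above can be illustrated in base R. This is only a minimal sketch under stated assumptions: `impute_default()` and the `toy` data are hypothetical names made up for illustration, and the code does not reproduce how the function or the MLR package implements imputation internally.

```r
# Hypothetical helper (not part of the package): fills missing values column by
# column using mean for integer, median for numeric, mode for character/factor.
impute_default <- function(df) {
  stat_mode <- function(x) {
    tab <- table(x)                  # table() ignores NA by default
    names(tab)[which.max(tab)]       # most frequent value (first one on ties)
  }
  for (col in names(df)) {
    x <- df[[col]]
    if (!any(is.na(x))) next
    if (is.integer(x)) {
      x[is.na(x)] <- mean(x, na.rm = TRUE)    # column becomes double
    } else if (is.numeric(x)) {
      x[is.na(x)] <- median(x, na.rm = TRUE)
    } else if (is.character(x) || is.factor(x)) {
      x[is.na(x)] <- stat_mode(x)
    }
    df[[col]] <- x
  }
  df
}

# Toy data, invented for this sketch
toy <- data.frame(
  age  = c(40L, NA, 50L),
  chol = c(200.5, 180.0, NA),
  sex  = c("M", NA, "M"),
  stringsAsFactors = FALSE
)
filled <- impute_default(toy)
print(filled)
```

Checking `is.integer()` before `is.numeric()` matters here, because integer vectors are also numeric in R.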
Feature engineering
variables that are missing not completely at random, identified using the autoMAR function
date transformer such as year, month, quarter, week
frequent transformer counts each categorical value in the dataset
interaction transformer using multiplication
one hot dummy coding for categorical value
outlier flag and capping variable for numerical value
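The transformers listed above can be sketched in base R to show the kind of derived columns involved. This is an illustration only: the `toy` data and all column names are hypothetical, the week number uses the `%U` convention, and the 1.5 x IQR rule for outliers is an assumption, since the thresholds the function actually uses are not stated here.

```r
# Toy data, invented for this sketch
toy <- data.frame(
  visit = as.Date(c("2021-01-15", "2021-04-02", "2021-11-30", "2021-04-20")),
  city  = c("A", "B", "A", "A"),
  x1    = c(1, 2, 3, 100),
  x2    = c(2, 4, 6, 8),
  stringsAsFactors = FALSE
)

# Date transformer: year, month, quarter, week
toy$visit_year    <- as.integer(format(toy$visit, "%Y"))
toy$visit_month   <- as.integer(format(toy$visit, "%m"))
toy$visit_quarter <- (toy$visit_month - 1L) %/% 3L + 1L
toy$visit_week    <- as.integer(format(toy$visit, "%U"))  # weeks starting Sunday

# Frequent transformer: how often each categorical value occurs
city_counts   <- table(toy$city)
toy$city_freq <- as.integer(city_counts[toy$city])

# Interaction transformer: product of two numeric features
toy$x1_x2 <- toy$x1 * toy$x2

# One hot dummy coding: one 0/1 column per level
dummies <- model.matrix(~ city - 1, data = toy)

# Outlier flag and capping (assumed rule: 1.5 x IQR fences)
q     <- unname(quantile(toy$x1, c(0.25, 0.75)))
lower <- q[1] - 1.5 * (q[2] - q[1])
upper <- q[2] + 1.5 * (q[2] - q[1])
toy$x1_outlier <- as.integer(toy$x1 < lower | toy$x1 > upper)
toy$x1_capped  <- pmin(pmax(toy$x1, lower), upper)

print(cbind(toy, dummies))
```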
Feature reduction
zero variance using the nearZeroVar function from caret
Pearson's correlation value
auc with target variable
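The three reduction steps above can be sketched in base R as follows. This is a rough illustration under several assumptions: a plain unique-value check stands in for caret's nearZeroVar, the later variable of a highly correlated pair is the one dropped, and a variable passes the AUC filter when its AUC differs from 0.5 by at least the cutoff. How the function actually applies `corr` and `aucv` is not stated here, and `auc_rank()` and the `toy` data are hypothetical.

```r
# Hypothetical helper: AUC of a numeric feature against a 0/1 target,
# computed from ranks (Mann-Whitney form).
auc_rank <- function(x, y) {
  r  <- rank(x)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data, invented for this sketch
toy <- data.frame(
  const  = rep(1, 8),
  strong = 1:8 + 0,
  copy   = (1:8) * 2 + 0.5,             # perfectly correlated with strong
  noise  = c(1, 2, 2, 1, 1, 2, 2, 1)
)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)
corr_cut <- 0.99
auc_cut  <- 0.02

# 1. Zero variance: columns with a single distinct value
zero_var <- names(toy)[sapply(toy, function(v) length(unique(v)) <= 1)]
keep <- setdiff(names(toy), zero_var)

# 2. Pearson correlation: drop the later variable of each highly correlated pair
cm <- abs(cor(toy[keep]))
cm[upper.tri(cm, diag = TRUE)] <- 0     # row = later variable, column = earlier one
cor_drop <- rownames(cm)[apply(cm, 1, function(r) any(r > corr_cut))]
keep <- setdiff(keep, cor_drop)

# 3. AUC with the target (assumed rule: keep if |AUC - 0.5| >= cutoff)
auc_vals   <- sapply(toy[keep], auc_rank, y = y)
final_vars <- names(auc_vals)[abs(auc_vals - 0.5) >= auc_cut]

print(list(zero_var = zero_var, cor_drop = cor_drop,
           auc = auc_vals, final = final_vars))
```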
See also: impute (MLR package)
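# Auto data prep
library(DriveML)  # assumed source package of autoDataprep() and the heart data
traindata <- autoDataprep(heart, target = "target_var", missimpute = "default",
                          dummyvar = TRUE, aucv = 0.02, corr = 0.98,
                          outlier_flag = TRUE, interaction_var = TRUE,
                          frequent_var = TRUE)
# Filtered dataset based on the input parameters
train <- traindata$master_data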