training_model: Training model

Description

training_model is a model builder.

Usage

training_model(model_name = "mymodel", dat_train, dat_test = NULL,
  target = NULL, occur_time = NULL, obs_id = NULL, x_list = NULL,
  ex_cols = NULL, pos_flag = NULL, prop = 0.7, preproc = TRUE,
  miss_values = NULL, outlier_proc = TRUE, missing_proc = TRUE,
  default_miss = FALSE, feature_filter = list(filter = c("IV", "PSI",
  "COR", "XGB"), iv_cp = 0.02, psi_cp = 0.1, xgb_cp = 0, cv_folds = 1,
  hopper = FALSE), algorithm = list("LR", "XGB"),
  LR.params = lr_params(), XGB.params = xgb_params(),
  GBM.params = gbm_params(), RF.params = rf_params(),
  breaks_list = NULL, parallel = FALSE, cores_num = NULL,
  save_pmml = FALSE, plot_show = FALSE, model_path = tempdir(),
  seed = 46, ...)

Arguments

model_name

A string, the name of the project. Default is "mymodel".

dat_train

A data.frame with independent variables and target variable.

dat_test

A data.frame of test data. Default is NULL.

target

The name of target variable.

occur_time

The name of the variable that represents the time at which each observation takes place. Default is NULL.

obs_id

The name of ID of observations or key variable of data. Default is NULL.

x_list

Names of independent variables. Default is NULL.

ex_cols

Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL.

pos_flag

The value of the positive class of the target variable. Default: "1".

prop

Proportion of the data assigned to the training set after the partition. Default: 0.7.

preproc

Logical. If TRUE, preprocess the data. Default is TRUE.

miss_values

Other extreme values that may be used to represent missing values, e.g. -9999, -9998. These miss_values will be encoded as -1 or "Unknown".
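
The recoding described for miss_values can be pictured with a small base R sketch. This is only an illustration under assumptions: recode_missing is a hypothetical helper, not a function of the package, and the package's own handling may differ in detail (for example in how NA is treated).

```r
# Hypothetical helper (not part of the package): recode sentinel values
# to -1 for numeric vectors or "Unknown" for everything else.
recode_missing <- function(x, miss_values) {
  hit <- x %in% miss_values          # TRUE where x holds a sentinel value
  if (is.numeric(x)) {
    x[hit] <- -1
  } else {
    x <- as.character(x)
    x[hit] <- "Unknown"
  }
  x
}

income <- c(3200, -9999, 4100, -9998)
city   <- c("Paris", "", "Lyon")

recode_missing(income, c(-9999, -9998))
recode_missing(city, c("", "-9999"))
```

Values that are already NA are left untouched by this sketch.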

outlier_proc

Logical. If TRUE, outliers are processed using K-means and Local Outlier Factor. Default is TRUE.

missing_proc

Logical. If TRUE, missing values are analyzed and then processed by kNN imputation, central imputation, or random imputation. Default is TRUE.

default_miss

Logical. If TRUE, missing values are assigned -1 or "Unknown"; otherwise, missing values are processed according to the results of the missing-value analysis. Default is FALSE. See details at: process_nas

feature_filter

Parameters for selecting important and stable features. See details at: feature_select_wrapper
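
The default filter list names PSI together with psi_cp = 0.1. As an illustration of what such a stability check measures, the base R sketch below computes a population stability index with the usual formula sum((a - e) * log(a / e)) over bin proportions. The function name psi and the bin counts are hypothetical, and reading psi_cp as an upper cutoff on this value is an assumption that this page does not confirm; feature_select_wrapper remains the authority on what the package actually does.

```r
# Hypothetical helper: population stability index between two binned
# distributions, given the observation counts per bin.
psi <- function(expected, actual) {
  e <- expected / sum(expected)   # bin proportions in the reference sample
  a <- actual / sum(actual)       # bin proportions in the comparison sample
  sum((a - e) * log(a / e))
}

train_counts <- c(200, 300, 500)  # made-up bin counts for one variable
test_counts  <- c(180, 320, 500)

psi_value <- psi(train_counts, test_counts)
psi_value <= 0.1                  # compare with a cutoff such as psi_cp = 0.1
```

Bins with zero counts would need smoothing before taking the logarithm; the sketch leaves that out.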

algorithm

Algorithms for training a model. Available options are list("LR", "XGB", "GBDT", "RF"). Default is list("LR", "XGB").

LR.params

Parameters of logistic regression & scorecard. See details at: lr_params.

tree_control The list of parameters that control cutting initial breaks by decision tree. See details at: get_tree_breaks
bins_control The list of parameters that control merging initial breaks. See details at: select_best_breaks, select_best_class
best_lambda Method for choosing the best lambda used to filter variables by LASSO. There are four methods: "lambda.min", "lambda.1se", "lambda.05se", "lambda.sim_sign". Default is "lambda.sim_sign". See details at: get_best_lambda
obsweight An optional vector of 'prior weights' to be used in the fitting process. Should be NULL or a numeric vector. If you oversample, under-sample, or cluster different datasets to train the LR model, set this parameter so that the probability output by the logistic regression is the same as before resampling or segmentation. For example, if there are 10,000 good obs and 500 bad obs before resampling, and 5,000 good obs and 3,000 bad obs after it, this parameter should be set to c(10000/5000, 500/3000). Default is NULL.
forced_in Names of forced input variables. Default is NULL.
sp_values Values to be placed in separate bins, e.g. list(-1, "Unknown") means that -1 and "Unknown" are treated as special values. Default is NULL.
step_wise Logical, use the stepwise method. Default is TRUE.
score_card Logical, transform WOE into a standard scorecard. If TRUE, output the scorecard and score predictions; otherwise output probabilities. Default is TRUE.
cor_p The maximum threshold of correlation. 0 <= cor_p <= 1; 0.5 to 0.8 usually works. Default: 0.7.
iv_i The minimum threshold of IV. 0 < iv_i; 0.01 to 0.1 usually works. Default: 0.01.
psi_i The maximum threshold of PSI. 0 <= psi_i <= 1; 0.05 to 0.2 usually works. Default: 0.1.
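
The obsweight figures above can be reproduced with a few lines of base R. This is a sketch only: the variable names are hypothetical, and the check at the end merely confirms the arithmetic (weighting the resampled counts recovers the original bad rate); it does not show how the package applies the weights internally.

```r
# Class counts before and after resampling (figures from the obsweight text).
n_before <- c(good = 10000, bad = 500)
n_after  <- c(good = 5000,  bad = 3000)

# Weight per class = original count / resampled count,
# i.e. c(10000/5000, 500/3000).
obsweight <- n_before / n_after
print(obsweight)

# Sanity check: the weighted resampled counts give back the original bad rate.
weighted_counts   <- obsweight * n_after
weighted_bad_rate <- weighted_counts[["bad"]] / sum(weighted_counts)
original_bad_rate <- n_before[["bad"]] / sum(n_before)
stopifnot(isTRUE(all.equal(weighted_bad_rate, original_bad_rate)))
```

Under these assumptions the resulting vector would be passed as lr_params(obsweight = ...), as the parameter description suggests.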

XGB.params

Parameters of xgboost. See details at: xgb_params.

GBM.params

Parameters of GBM. See details at: gbm_params.

RF.params

Parameters of Random Forest. See details at: rf_params.

breaks_list

A table containing a list of splitting points for each independent variable. Default is NULL.

parallel

Logical, whether to use parallel computing. Default is FALSE.

cores_num

The number of CPU cores to use. Default is NULL.

save_pmml

Logical, save the model in PMML format. Default is FALSE.

plot_show

Logical, show model performance in current graphic device. Default is FALSE.

model_path

The path where data files are saved periodically. Default is tempdir().

seed

Random number seed. Default is 46.

...

Other parameters.

Value

A list containing Model Objects.

Examples

# NOT RUN {
library(creditmodel)

# Use one of 40 folds as a small sample of the data
sub <- cv_split(UCICreditCard, k = 40)[[1]]
dat <- UCICreditCard[sub, ]

# Rename the target variable and clean the data
dat <- re_name(dat, "default.payment.next.month", "target")
dat <- cleaning_data(dat, target = "target", obs_id = "ID",
                     occur_time = "apply_date",
                     miss_values = list("", -1, -2))

# Split into training and test sets (split_type = "OOT")
train_test <- train_test_split(dat, split_type = "OOT", prop = 0.7,
                               occur_time = "apply_date")
dat_train <- train_test$train
dat_test <- train_test$test

# Train a logistic regression model on five selected variables
x_list <- c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
B_model <- training_model(dat_train = dat_train,
                          model_name = "UCICreditCard",
                          target = "target", x_list = x_list,
                          occur_time = "apply_date", obs_id = "ID",
                          dat_test = dat_test,
                          preproc = FALSE,
                          feature_filter = NULL,
                          algorithm = list("LR"),
                          LR.params = lr_params(lasso = FALSE,
                                                step_wise = FALSE,
                                                vars_plot = FALSE),
                          XGB.params = xgb_params(),
                          breaks_list = NULL,
                          parallel = FALSE, cores_num = NULL,
                          save_pmml = FALSE, plot_show = FALSE,
                          model_path = tempdir(),
                          seed = 46)
# }
