psi_iv_filter selects important and stable features using information value (IV) and the population stability index (PSI).
psi_iv_filter(
dat,
dat_test = NULL,
target,
x_list = NULL,
breaks_list = NULL,
pos_flag = NULL,
ex_cols = NULL,
occur_time = NULL,
best = FALSE,
equal_bins = TRUE,
g = 10,
sp_values = NULL,
tree_control = list(p = 0.05, cp = 1e-06, xval = 5, maxdepth = 10),
bins_control = list(bins_num = 10, bins_pct = 0.05, b_chi = 0.05, b_odds = 0.1,
b_psi = 0.05, b_or = 0.15, mono = 0.3, odds_psi = 0.2, kc = 1),
oot_pct = 0.7,
psi_i = 0.1,
iv_i = 0.01,
cos_i = 0.7,
vars_name = FALSE,
note = TRUE,
parallel = FALSE,
save_data = FALSE,
file_name = NULL,
dir_path = tempdir(),
...
)
dat
A data.frame with independent variables and the target variable.
dat_test
A data.frame of test data. Default is NULL.
target
The name of the target variable.
x_list
Names of independent variables. Default is NULL.
breaks_list
A table containing a list of splitting points for each independent variable. Default is NULL.
pos_flag
The value of the positive class of the target variable. Default: "1".
ex_cols
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL.
occur_time
The name of the variable that represents the time at which each observation takes place.
best
Logical. If TRUE, merge initial breaks to get optimal breaks for binning.
equal_bins
Logical. If TRUE, initial breaks are generated with equal sample sizes. If FALSE, breaks are generated using a decision tree.
g
Integer, number of initial bins for equal_bins.
sp_values
A list of missing values.
tree_control
A list of tree parameters.
bins_control
A list of binning parameters.
oot_pct
Percentage of observations retained for overtime test (especially to calculate PSI). Default is 0.7.
psi_i
The maximum threshold of PSI. 0 <= psi_i <= 1; values from 0.05 to 0.2 usually work. Default: 0.1.
iv_i
The minimum threshold of IV. 0 < iv_i; values from 0.01 to 0.1 usually work. Default: 0.01.
cos_i
cos_similarity of the positive rate of train and test. Values from 0.7 to 0.9 usually work. Default: 0.7.
vars_name
Logical, output a list of filtered variables or a table with the detailed IV and PSI value of each variable. Default is FALSE.
note
Logical, outputs info. Default is TRUE.
parallel
Logical, parallel computing. Default is FALSE.
save_data
Logical, save results in a locally specified folder. Default is FALSE.
file_name
The name for periodically saved results files. Default is "Feature_importance_IV_PSI".
dir_path
The path for periodically saved results files. Default is tempdir().
...
Other parameters.
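To make the psi_i and iv_i thresholds concrete, here is a minimal base R sketch of the textbook IV and PSI formulas applied to one binned variable. The helpers iv_sketch and psi_sketch are hypothetical names introduced only for illustration; they are not part of the package, the data are simulated, and the function's internal computation (for example how it uses occur_time and oot_pct) may differ.

```r
# Hypothetical helpers for illustration only; not the package's implementation.

# Information value of one binned feature against a binary target.
# bins: vector of bin labels; y: 0/1 target with 1 as the positive class.
iv_sketch <- function(bins, y, eps = 1e-6) {
  tab <- table(bins, factor(y, levels = c(0, 1)))
  pct_neg <- pmax(tab[, "0"] / sum(tab[, "0"]), eps)  # share of negatives per bin
  pct_pos <- pmax(tab[, "1"] / sum(tab[, "1"]), eps)  # share of positives per bin
  sum((pct_pos - pct_neg) * log(pct_pos / pct_neg))
}

# Population stability index between two samples binned with the same breaks.
psi_sketch <- function(bins_expected, bins_actual, eps = 1e-6) {
  lev <- union(as.character(bins_expected), as.character(bins_actual))
  p_exp <- as.numeric(prop.table(table(factor(as.character(bins_expected), levels = lev))))
  p_act <- as.numeric(prop.table(table(factor(as.character(bins_actual), levels = lev))))
  p_exp <- pmax(p_exp, eps)  # avoid log(0) for empty bins
  p_act <- pmax(p_act, eps)
  sum((p_act - p_exp) * log(p_act / p_exp))
}

# Simulated toy data: one predictive feature, slightly shifted in the test sample.
set.seed(1)
n <- 2000
x_train <- rnorm(n)
y_train <- rbinom(n, 1, plogis(-1 + x_train))
x_test <- rnorm(n, mean = 0.1)

# Ten equal-frequency bins derived from the training sample.
breaks <- c(-Inf, quantile(x_train, probs = seq(0.1, 0.9, by = 0.1)), Inf)
b_train <- cut(x_train, breaks)
b_test <- cut(x_test, breaks)

iv <- iv_sketch(b_train, y_train)
psi <- psi_sketch(b_train, b_test)

# Keep the feature only if it is predictive enough and stable enough,
# mirroring the documented defaults iv_i = 0.01 and psi_i = 0.1.
keep <- iv > 0.01 && psi <= 0.1
```

Under this reading, a variable passes when its IV exceeds iv_i and its PSI does not exceed psi_i; whether the package treats the boundaries as strict or inclusive is an assumption here.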
A list with the following elements:
Feature
Selected variables.
IV
IV of variables.
PSI
PSI of variables.
COS
cos_similarity of the positive rate of train and test.
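The COS element can be read as a cosine similarity between two vectors of per-bin positive rates, one from the training sample and one from the test sample. The sketch below shows that calculation in base R. The helper cos_sim_sketch is a hypothetical name, the rates are made-up toy values, and treating cos_i as a minimum threshold is an assumption rather than something the documentation states.

```r
# Hypothetical helper for illustration only; not the package's implementation.
cos_sim_sketch <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Made-up positive rates per bin (same bins, same order) for train and test.
rate_train <- c(0.02, 0.05, 0.09, 0.15)
rate_test  <- c(0.03, 0.04, 0.10, 0.14)

cos_val <- cos_sim_sketch(rate_train, rate_test)

# Assumed reading: a variable is considered stable when the similarity
# reaches the cos_i threshold (0.7 in the usage signature).
stable <- cos_val >= 0.7
```

A value close to 1 means the positive rate follows the same pattern across bins in both samples, even if the overall level differs.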
# NOT RUN {
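library(creditmodel)
# Filter features on the first 1000 rows of the bundled UCICreditCard data
psi_iv_filter(
  dat = UCICreditCard[1:1000, c(2, 4, 8:9, 26)],
  target = "default.payment.next.month",
  occur_time = "apply_date",
  parallel = FALSE
)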
# }