creditmodel (version 1.0)

cleaning_data: Data Cleaning

Description

The cleaning_data function is a simple wrapper around several data cleaning functions. It checks the format of dat and target; deletes variables whose values are all NA; deletes low-variance variables; replaces null, NULL, or blank values with NA; and encodes variables whose NA and missing-value rate is more than 95%.
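As a rough illustration of two of the steps listed above, the following base R sketch replaces null-like strings with NA and then drops variables that are entirely NA. The helper name blank_to_na_sketch and the toy data are hypothetical; this is not the package's implementation.

```r
# Hypothetical helper, not part of creditmodel: illustrates two cleaning steps.
blank_to_na_sketch <- function(dat) {
  for (col in names(dat)) {
    x <- dat[[col]]
    if (is.character(x)) {
      # Treat "null", "NULL", and blank strings as missing
      x[trimws(x) %in% c("null", "NULL", "")] <- NA
      dat[[col]] <- x
    }
  }
  # Keep only variables that have at least one non-missing value
  keep <- vapply(dat, function(x) !all(is.na(x)), logical(1))
  dat[, keep, drop = FALSE]
}

toy <- data.frame(
  id = 1:3,
  city = c("NULL", "Paris", " "),
  empty = c(NA, NA, NA),
  stringsAsFactors = FALSE
)
cleaned <- blank_to_na_sketch(toy)
print(cleaned)
```

In this toy data, the empty column is dropped and the "NULL" and blank entries of city become NA.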

Usage

cleaning_data(dat, target = NULL, x_list = NULL, obs_id = NULL,
  occur_time = NULL, pos_flag = NULL, miss_values = NULL,
  ex_cols = NULL, outlier_proc = TRUE, missing_proc = TRUE,
  default_miss = TRUE, low_var = TRUE, parallel = FALSE,
  note = FALSE, save_data = FALSE, dir_path = tempdir(),
  file_name = NULL)

Arguments

dat

A data frame with x and target.

target

The name of the target variable.

x_list

A list of x variables.

obs_id

The name of the ID variable of observations. Default is NULL.

occur_time

The name of the occurrence time variable of observations. Default is NULL.

pos_flag

The value of the positive class of the target variable. The usage above shows NULL as the argument default; "1" is the documented default positive class.

miss_values

Other extreme values that might be used to represent missing values, e.g. -9999, -9998. These miss_values are encoded as -1 or "Unknown".
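The encoding described for miss_values can be sketched in base R as follows. The helper encode_miss_values_sketch and the sample vectors are hypothetical, and the sketch assumes that numeric variables receive -1 and all other variables receive "Unknown"; the package may handle types differently.

```r
# Hypothetical helper, not part of creditmodel.
# Assumption: numeric variables get -1, everything else gets "Unknown".
encode_miss_values_sketch <- function(x, miss_values) {
  if (is.numeric(x)) {
    x[x %in% miss_values] <- -1
  } else {
    x <- as.character(x)
    x[x %in% as.character(miss_values)] <- "Unknown"
  }
  x
}

income <- c(5200, -9999, 4100, -9998)
grade <- c("A", "-9999", "B")

income_enc <- encode_miss_values_sketch(income, miss_values = c(-9999, -9998))
grade_enc <- encode_miss_values_sketch(grade, miss_values = c(-9999, -9998))
print(income_enc)
print(grade_enc)
```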

ex_cols

A list of excluded variables. Default is NULL.
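The example further down passes ex_cols = c("PAY_6|BILL_"), which suggests that entries are matched against column names as regular expressions. Under that assumption, the selection might behave like this base R sketch; the column names here are illustrative only.

```r
# Assumption: each entry of ex_cols is treated as a regular expression
# matched against the column names. Column names are illustrative.
col_names <- c("ID", "PAY_5", "PAY_6", "BILL_AMT1", "BILL_AMT2", "LIMIT_BAL")
ex_cols <- c("PAY_6|BILL_")

# Combine all patterns into one alternation and find matching names
pattern <- paste(ex_cols, collapse = "|")
excluded <- col_names[grepl(pattern, col_names)]
kept <- setdiff(col_names, excluded)

print(excluded)
print(kept)
```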

outlier_proc

Logical, process outliers or not. Default is TRUE.

missing_proc

Logical, process NAs or not. Default is TRUE.

default_miss

Logical. If TRUE, missing values are assigned -1 or "Unknown"; otherwise, missing values are processed according to the results of the missing analysis. Default is TRUE.
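A minimal base R sketch of the default_miss = TRUE behaviour described above, assuming numeric variables are filled with -1 and other variables with "Unknown". The helper fill_default_sketch and the sample data are hypothetical and do not reproduce the package's internals.

```r
# Hypothetical helper, not part of creditmodel.
# Assumption: numeric NAs become -1, all other NAs become "Unknown".
fill_default_sketch <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- -1
  } else {
    x <- as.character(x)
    x[is.na(x)] <- "Unknown"
  }
  x
}

applicants <- data.frame(
  age = c(34, NA, 51),
  job = c("clerk", NA, "teacher"),
  stringsAsFactors = FALSE
)
# Apply the fill column by column while keeping the data frame structure
applicants[] <- lapply(applicants, fill_default_sketch)
print(applicants)
```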

low_var

Logical, delete low variance variables or not. Default is TRUE.
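One common way to define a low-variance variable is that a single value accounts for nearly all observations. The base R sketch below uses that definition with a hypothetical max_share threshold of 0.95; the criterion and threshold that cleaning_data actually applies are not stated here, so treat both as assumptions.

```r
# Hypothetical helper, not part of creditmodel.
# Assumption: a variable is "low variance" when its most frequent value
# (counting NA as a value) covers more than max_share of the rows.
low_variance_sketch <- function(dat, max_share = 0.95) {
  top_share <- vapply(dat, function(x) {
    counts <- table(x, useNA = "ifany")
    max(counts) / length(x)
  }, numeric(1))
  dat[, top_share <= max_share, drop = FALSE]
}

toy_lv <- data.frame(
  flag = c(rep(0, 99), 1),  # one value covers 99% of rows
  amount = seq_len(100)     # every value is distinct
)
kept_vars <- names(low_variance_sketch(toy_lv))
print(kept_vars)
```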

parallel

Logical, parallel computing or not. Default is FALSE.

note

Logical, output processing information or not. Default is FALSE.

save_data

Logical, save the result or not. Default is FALSE.

dir_path

The path for the periodically saved data file. Default is tempdir().

file_name

The name for the periodically saved data file. Default is NULL.

Value

A preprocessed data.frame

See Also

remove_duplicated, null_blank_na, entry_rate_max, entry_rate_na, low_variance_filter, process_nas, process_outliers

Examples

# NOT RUN {
library(creditmodel)
# Data cleaning on the first 2000 rows of UCICreditCard
dat_cl <- cleaning_data(dat = UCICreditCard[1:2000,],
                       target = "default.payment.next.month",
                       x_list = NULL,
                       obs_id = "ID",
                       occur_time = "apply_date",
                       ex_cols = c("PAY_6|BILL_"),
                       outlier_proc = TRUE,
                       missing_proc = TRUE,
                       default_miss = FALSE,
                       low_var = TRUE,
                       save_data = FALSE)

# }
