This research examined customers' default payments in Taiwan and compared the predictive accuracy of the probability of default among six data mining methods. It employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. After reviewing the literature, the study used the following 24 variables as explanatory variables.
A data frame with 30000 rows and 26 variables.
ID: Customer ID
apply_date: A fake (artificially generated) occurrence time.
LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
SEX: Gender (male; female).
EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).
AGE: Age (year).
History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows:
PAY_0: the repayment status in September
PAY_2: the repayment status in August
PAY_3: ...
PAY_4: ...
PAY_5: ...
PAY_6: the repayment status in April
The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
Amount of bill statement (NT dollar)
BILL_AMT1: amount of bill statement in September
BILL_AMT2: amount of bill statement in August
BILL_AMT3: ...
BILL_AMT4: ...
BILL_AMT5: ...
BILL_AMT6: amount of bill statement in April
Amount of previous payment (NT dollar)
PAY_AMT1: amount paid in September
PAY_AMT2: amount paid in August
PAY_AMT3: ...
PAY_AMT4: ...
PAY_AMT5: ...
PAY_AMT6: amount paid in April
default.payment.next.month: default payment (Yes = 1, No = 0); this is the response variable.
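
The variable list and coding schemes above can be sketched in Python as a quick sanity check. This is only an illustrative sketch, not part of the dataset itself: the constant and function names (ALL_COLUMNS, EXPLANATORY, describe_repayment_status, and so on) are hypothetical, and treating every column except ID and the response as an explanatory variable is an assumption made so that the count matches the 24 stated above. Only the codes documented here are decoded; any other value is reported as undocumented rather than guessed.

```python
# Column names exactly as documented above; the grouping is illustrative.
PAY_STATUS = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
BILL_AMT = [f"BILL_AMT{i}" for i in range(1, 7)]
PAY_AMT = [f"PAY_AMT{i}" for i in range(1, 7)]
RESPONSE = "default.payment.next.month"

# Assumption: everything except ID and the response is explanatory.
EXPLANATORY = (
    ["apply_date", "LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]
    + PAY_STATUS
    + BILL_AMT
    + PAY_AMT
)
ALL_COLUMNS = ["ID"] + EXPLANATORY + [RESPONSE]

# Category codes as listed in the descriptions above.
# SEX is omitted because its numeric codes are not documented here.
EDUCATION_LABELS = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "others"}


def describe_repayment_status(code: int) -> str:
    """Translate a PAY_* code into the documented measurement scale."""
    if code == -1:
        return "pay duly"
    if code == 1:
        return "payment delay for 1 month"
    if 2 <= code <= 8:
        return f"payment delay for {code} months"
    if code == 9:
        return "payment delay for 9 months and above"
    # Values outside the documented scale are flagged, not interpreted.
    return "undocumented code"


if __name__ == "__main__":
    print(len(ALL_COLUMNS), "columns,", len(EXPLANATORY), "explanatory")
    for code in (-1, 1, 3, 9, 0):
        print(code, "->", describe_repayment_status(code))
```

Checking a loaded table's column names against ALL_COLUMNS is one simple way to confirm that it matches the 26 variables described here.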