nproc (version 2.1.5)

npc: Construct a Neyman-Pearson Classifier from a sample of class 0 and class 1.

Description

Given a type I error upper bound alpha and a violation upper bound delta, npc constructs a Neyman-Pearson classifier whose type I error is at most alpha with probability at least 1 - delta.
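The guarantee above can be made concrete with the order-statistic rule described in the cited reference (Tong, Feng, and Li, 2018). The sketch below assumes that the threshold is the k-th smallest of n held-out class 0 scores and that the violation probability is bounded by a binomial tail. The function names np_violation_bound and np_order_index are hypothetical, and this is not necessarily how npc computes its threshold internally.

```r
# Upper bound on P(type I error > alpha) when the k-th smallest of n
# held-out class 0 scores is used as the classification threshold:
#   sum_{j = k}^{n} choose(n, j) * (1 - alpha)^j * alpha^(n - j)
# which equals P(Binomial(n, 1 - alpha) >= k).
np_violation_bound <- function(k, n, alpha) {
  pbinom(k - 1, size = n, prob = 1 - alpha, lower.tail = FALSE)
}

# Smallest order index k whose bound is at most delta, or NA when even
# the largest score (k = n) cannot satisfy the bound.
np_order_index <- function(n, alpha = 0.05, delta = 0.05) {
  k <- seq_len(n)
  ok <- np_violation_bound(k, n, alpha) <= delta
  if (!any(ok)) return(NA_integer_)
  min(k[ok])
}

np_order_index(58)  # too few held-out class 0 observations
np_order_index(59)  # the bound is met only by the largest score
```

Under these assumptions, alpha = 0.05 and delta = 0.05 require at least 59 class 0 observations for threshold estimation, because (1 - alpha)^n must not exceed delta.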

Usage

npc(x = NULL, y, method = c("logistic", "penlog", "svm", "randomforest",
  "lda", "slda", "nb", "nnb", "ada", "tree"), alpha = 0.05, delta = 0.05,
  split = 1, split.ratio = 0.5, n.cores = 1, band = FALSE,
  nfolds = 10, randSeed = 0, warning = TRUE, ...)

Arguments

x

n * p observation matrix. n observations, p covariates.

y

vector of n 0/1 class labels.

method

base classification method.

  • logistic: Logistic regression. glm function with family = 'binomial'

  • penlog: Penalized logistic regression with LASSO penalty. glmnet in glmnet package

  • svm: Support Vector Machines. svm in e1071 package

  • randomforest: Random Forest. randomForest in randomForest package

  • lda: Linear Discriminant Analysis. lda in MASS package

  • slda: Sparse Linear Discriminant Analysis with LASSO penalty.

  • nb: Naive Bayes. naiveBayes in e1071 package

  • nnb: Nonparametric Naive Bayes. naive_bayes in naivebayes package

  • ada: Ada-Boost. ada in ada package

  • tree: Classification tree.

alpha

the desired upper bound on the type I error. Default = 0.05.

delta

the upper bound on the violation rate, i.e., the probability that the type I error exceeds alpha. Default = 0.05.

split

the number of random splits of the class 0 sample. Default = 1. For the ensemble version, choose split > 1.

split.ratio

the proportion of the class 0 sample used to train the base classifier. The rest is used to estimate the threshold. Can also be set to "adaptive", in which case the ratio is chosen by a data-driven method implemented in find.optim.split. Default = 0.5.

n.cores

number of cores used for parallel computing. Default = 1. WARNING: Windows machines are not supported.

band

whether to generate both lower and upper bounds of type II error. Default = FALSE.

nfolds

number of folds for performing adaptive split ratio selection. Default = 10.

randSeed

the random seed used in the algorithm. Default = 0.

warning

whether to show various warnings in the program. Default = TRUE.

...

additional arguments.
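To illustrate what split.ratio controls, here is a minimal sketch of one random split of the class 0 sample. It assumes, following the cited reference, that all class 1 observations go to the training set; the variable names are hypothetical and the code is not taken from the package.

```r
set.seed(0)

# Toy labels: 100 class 0 and 80 class 1 observations.
y <- c(rep(0, 100), rep(1, 80))
split.ratio <- 0.5

ind0 <- which(y == 0)
ind1 <- which(y == 1)

# A proportion split.ratio of class 0 is used to train the base classifier.
n0.train <- round(length(ind0) * split.ratio)
train0 <- sample(ind0, n0.train)

# The remaining class 0 observations are held out to estimate the threshold.
thresh0 <- setdiff(ind0, train0)

# Training set (assumption): the class 0 training part plus all of class 1.
train.idx <- c(train0, ind1)

length(train0)
length(thresh0)
length(train.idx)
```

With split > 1, a random split of this kind would be repeated and the resulting classifiers combined into an ensemble.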

Value

An object with S3 class npc.

fits

a list of length max(1, split) containing the fit from each split.

method

the base classification method.

split

the number of splits used.
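The components listed above can be inspected directly on a fitted object. The following sketch simulates data as in the Examples section and is guarded so that it only runs when the nproc package is installed; the printed values are not shown because they depend on the installed version.

```r
if (requireNamespace("nproc", quietly = TRUE)) {
  library(nproc)

  # Simulated data, generated as in the Examples section.
  set.seed(1)
  n <- 1000
  x <- matrix(rnorm(n * 2), n, 2)
  y <- rbinom(n, 1, 1 / (1 + exp(-(1 + 3 * x[, 1]))))

  fit <- npc(x, y, method = "lda", split = 3)

  print(class(fit))        # S3 class of the returned object
  print(length(fit$fits))  # one fit per split, i.e. max(1, split)
  print(fit$method)        # base classification method
  print(fit$split)         # number of splits used
}
```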

References

Xin Tong, Yang Feng, and Jingyi Jessica Li (2018), Neyman-Pearson (NP) classification algorithms and NP receiver operating characteristic (NP-ROC), Science Advances, 4, 2, eaao1659.

See Also

nproc and predict.npc

Examples

# NOT RUN {
set.seed(1)
n = 1000
x = matrix(rnorm(n*2),n,2)
c = 1+3*x[,1]
y = rbinom(n,1,1/(1+exp(-c)))
xtest = matrix(rnorm(n*2),n,2)
ctest = 1+3*xtest[,1]
ytest = rbinom(n,1,1/(1+exp(-ctest)))

##Use lda classifier and the default type I error control with alpha=0.05, delta=0.05
fit = npc(x, y, method = 'lda')
pred = predict(fit,xtest)
fit.score = predict(fit,x)
accuracy = mean(pred$pred.label==ytest)
cat('Overall Accuracy: ',  accuracy,'\n')
ind0 = which(ytest==0)
typeI = mean(pred$pred.label[ind0]!=ytest[ind0]) #type I error on test set
cat('Type I error: ', typeI, '\n')

# }
# NOT RUN {
##Ensembled lda classifier with split = 11,  alpha=0.05, delta=0.05
fit = npc(x, y, method = 'lda', split = 11)
pred = predict(fit,xtest)
accuracy = mean(pred$pred.label==ytest)
cat('Overall Accuracy: ',  accuracy,'\n')
ind0 = which(ytest==0)
typeI = mean(pred$pred.label[ind0]!=ytest[ind0]) #type I error on test set
cat('Type I error: ', typeI, '\n')

##Now, change the method to logistic regression and change alpha to 0.1
fit = npc(x, y, method = 'logistic', alpha = 0.1)
pred = predict(fit,xtest)
accuracy = mean(pred$pred.label==ytest)
cat('Overall Accuracy: ',  accuracy,'\n')
ind0 = which(ytest==0)
typeI = mean(pred$pred.label[ind0]!=ytest[ind0]) #type I error on test set
cat('Type I error: ', typeI, '\n')

##Now, change the method to adaboost
fit = npc(x, y, method = 'ada', alpha = 0.1)
pred = predict(fit,xtest)
accuracy = mean(pred$pred.label==ytest)
cat('Overall Accuracy: ',  accuracy,'\n')
ind0 = which(ytest==0)
typeI = mean(pred$pred.label[ind0]!=ytest[ind0]) #type I error on test set
cat('Type I error: ', typeI, '\n')

##Now, try the adaptive splitting ratio
fit = npc(x, y, method = 'ada', alpha = 0.1, split.ratio = 'adaptive')
pred = predict(fit,xtest)
accuracy = mean(pred$pred.label==ytest)
cat('Overall Accuracy: ',  accuracy,'\n')
ind0 = which(ytest==0)
typeI = mean(pred$pred.label[ind0]!=ytest[ind0]) #type I error on test set
cat('Type I error: ', typeI, '\n')
cat('Splitting ratio:', fit$split.ratio)
# }
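The examples above repeat the same accuracy and type I error computation after every fit. A small helper can collect these in one place. The function name np_errors is hypothetical and not part of nproc; it also reports the type II error, computed in the same way on the class 1 observations.

```r
# Summarise predicted 0/1 labels against true 0/1 labels.
np_errors <- function(pred.label, y) {
  ind0 <- which(y == 0)
  ind1 <- which(y == 1)
  c(accuracy = mean(pred.label == y),
    typeI    = mean(pred.label[ind0] != y[ind0]),   # class 0 predicted as 1
    typeII   = mean(pred.label[ind1] != y[ind1]))   # class 1 predicted as 0
}

# Toy check with hand-made labels.
y.true <- c(0, 0, 0, 0, 1, 1, 1, 1)
y.hat  <- c(0, 0, 0, 1, 1, 1, 0, 0)
res <- np_errors(y.hat, y.true)
print(res)
```

With the objects from the examples, the call would be np_errors(pred$pred.label, ytest).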
