optimal.thresholds
calculates optimal thresholds for Presence/Absence data by any of several methods.
optimal.thresholds(DATA = NULL, threshold = 101, which.model = 1:(ncol(DATA)-2),
model.names = NULL, na.rm = FALSE, opt.methods = NULL, req.sens, req.spec,
obs.prev = NULL, smoothing = 1, FPC, FNC)
If DATA is not provided, the function returns a vector of the possible optimization methods.
Otherwise, returns a dataframe where:
[,1] | Method - names of optimization methods |
[,2] | optimal thresholds for the first model |
[,3] | optimal thresholds for the second model, etc... |
DATA
a matrix or dataframe of observed and predicted values, where each row represents one plot and the columns are:
DATA[,1] | plot ID | text |
DATA[,2] | observed values | zero-one values |
DATA[,3] | predicted probabilities from first model | numeric (between 0 and 1) |
DATA[,4] | predicted probabilities from second model, etc... |
threshold
cutoff values between zero and one used for translating predicted probabilities into 0/1 values. It can be a single value between zero and one, a vector of values between zero and one, or a positive integer giving the number of evenly spaced thresholds to calculate. Defaults to 101 (that is, 101 evenly spaced thresholds). To get reasonably good optimizations, use a large number of thresholds.
which.model
a number or vector indicating which models from DATA should be used
model.names
a vector of the names of each model included in DATA, to be used as column names
na.rm
a logical indicating whether missing values should be removed
opt.methods
which methods should be used to optimize thresholds. Given either as a vector of method names or as a vector of method numbers. Possible values are:
1 | Default | threshold=0.5 |
2 | Sens=Spec | sensitivity=specificity |
3 | MaxSens+Spec | maximizes (sensitivity+specificity)/2 |
4 | MaxKappa | maximizes Kappa |
5 | MaxPCC | maximizes PCC (percent correctly classified) |
6 | PredPrev=Obs | predicted prevalence=observed prevalence |
7 | ObsPrev | threshold=observed prevalence |
8 | MeanProb | mean predicted probability |
9 | MinROCdist | minimizes distance between ROC plot and (0,1) |
10 | ReqSens | user defined required sensitivity |
11 | ReqSpec | user defined required specificity |
12 | Cost | user defined relative costs ratio |
req.sens
a value between zero and one giving the user defined required sensitivity. Only used if opt.methods includes "ReqSens". Note that req.sens = (1 - maximum allowable errors for points with positive observations).
req.spec
a value between zero and one giving the user defined required specificity. Only used if opt.methods includes "ReqSpec". Note that req.spec = (1 - maximum allowable errors for points with negative observations).
obs.prev
observed prevalence for opt.methods = "PredPrev=Obs" and "ObsPrev". Defaults to the observed prevalence from DATA.
smoothing
smoothing factor for maximizing/minimizing. Only used by the optimization methods that maximize or minimize a statistic. Instead of finding the single threshold that gives the max/min value, the function averages the thresholds of the given number of max/min values.
FPC
False Positive Costs, or for C/B ratio C = 'net costs of treating nondiseased individuals'.
FNC
False Negative Costs, or for C/B ratio B = 'net benefits of treating diseased individuals'.
Elizabeth Freeman eafreeman@fs.fed.us
The 'opt.methods' argument allows the user to choose optimization methods. The methods can be specified by number (opt.methods = 1:12 or opt.methods = c(1,2,4)) or by name (opt.methods = c("Default","Sens=Spec","MaxKappa")).
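As a brief illustration of the two ways to select methods, the following sketch calls the function once by number and once by name. It assumes the PresenceAbsence package (which provides optimal.thresholds and the SIM3DATA example data) is installed; the guard simply skips the example otherwise. The object names by_number and by_name are made up for this example.

```r
# Skip quietly if the package is not installed (assumption: it provides
# optimal.thresholds() and the SIM3DATA dataset used in the examples).
if (requireNamespace("PresenceAbsence", quietly = TRUE)) {
  library(PresenceAbsence)
  data(SIM3DATA)

  # Select methods by number ...
  by_number <- optimal.thresholds(SIM3DATA, opt.methods = c(1, 2, 4))

  # ... or by name; both calls request the same three methods.
  by_name <- optimal.thresholds(
    SIM3DATA,
    opt.methods = c("Default", "Sens=Spec", "MaxKappa")
  )

  print(by_number)
  print(by_name)
}
```

Each result should have one row per requested method and one column per model, as described in the return value above.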
There are currently twelve optimization criteria available:
"Default"
First, the default criterion of setting 'threshold = 0.5'.
"Sens=Spec"
The second criterion chooses the threshold where sensitivity equals specificity. In other words, it finds the threshold where positive observations are just as likely to be wrong as negative observations.
Note: when the threshold is optimized by the criterion "Sens=Spec", it is correlated with prevalence, so that rare species are given much lower thresholds than widespread species. As a result, rare species may give the appearance of inflated distribution if maps are made with thresholds that have been optimized by this method (Manel, 2001).
"MaxSens+Spec"
The third criterion chooses the threshold that maximizes the sum of sensitivity and specificity. In other words, it minimizes the mean of the error rate for positive observations and the error rate for negative observations. This is equivalent to maximizing (sensitivity + specificity - 1), otherwise known as Youden's index, or the True Skill Statistic. Note that while Youden's index is independent of prevalence, using Youden's index to select a threshold does have an effect on the predicted prevalence, causing the distribution of rare species to be overpredicted.
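The equivalence between maximizing the sum and maximizing Youden's index can be checked with a small base R sketch. The data here are simulated and hypothetical, and the rule that a plot is predicted present when its probability is at or above the threshold is an assumption of this sketch, not a statement about the package internals.

```r
# Simulated presence/absence data (hypothetical, not from the package).
set.seed(1)
obs  <- rbinom(500, 1, 0.2)                                   # observed 0/1
pred <- plogis(rnorm(500, mean = ifelse(obs == 1, 1, -1)))    # probabilities

thresholds <- seq(0, 1, length.out = 101)

# Sensitivity and specificity at one threshold
# (assumption: predicted present when pred >= threshold).
sens_spec <- function(th) {
  pred01 <- as.integer(pred >= th)
  c(sens = sum(pred01 == 1 & obs == 1) / sum(obs == 1),
    spec = sum(pred01 == 0 & obs == 0) / sum(obs == 0))
}

acc <- t(sapply(thresholds, sens_spec))   # 101 x 2 matrix: sens, spec

# Youden's index differs from the sum only by a constant,
# so both are maximized at the same threshold.
youden      <- acc[, "sens"] + acc[, "spec"] - 1
best_youden <- thresholds[which.max(youden)]
best_sum    <- thresholds[which.max(acc[, "sens"] + acc[, "spec"])]
```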
"MaxKappa"
The fourth criterion chooses the threshold that gives the maximum value of Kappa. Kappa makes full use of the information in the confusion matrix to assess the improvement over chance prediction.
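For readers unfamiliar with Kappa, here is a minimal worked example using the standard Cohen's Kappa formula on a single confusion matrix. The counts are invented for illustration; "MaxKappa" would repeat this calculation at each candidate threshold and keep the threshold with the largest value.

```r
# Hypothetical confusion matrix counts (not taken from any dataset).
tp <- 40    # predicted present, observed present
fp <- 10    # predicted present, observed absent
fn <- 20    # predicted absent,  observed present
tn <- 130   # predicted absent,  observed absent
n  <- tp + fp + fn + tn

# Observed agreement (this is also PCC expressed as a proportion).
p_observed <- (tp + tn) / n

# Agreement expected by chance, from the row and column totals.
p_chance <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2

# Kappa: improvement over chance, scaled by the maximum possible improvement.
kappa <- (p_observed - p_chance) / (1 - p_chance)
```

With these counts the observed agreement is 0.85, the chance agreement is 0.6, and Kappa works out to 0.625.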
"MaxPCC"
The fifth criterion is to maximize the total accuracy (PCC - Percent Correctly Classified).
Note: It may seem like maximizing total accuracy would be the obvious goal; however, there are many problems with using PCC to assess model accuracy. For example, for species with very low prevalence, it is possible to maximize PCC simply by declaring the species absent at all locations -- not a very useful prediction!
"PredPrev=Obs"
The sixth criterion is to find the threshold where the predicted prevalence is equal to the observed prevalence. This is a useful method when preserving prevalence is of prime importance.
"ObsPrev"
The seventh criterion is an even simpler variation, where you simply set the threshold to the observed prevalence. It is nearly as good as method six at preserving prevalence and requires no computation.
"MeanProb"
The eighth criterion also requires no threshold computation. Method eight sets the threshold to the mean probability of occurrence from the model results.
"MinROCdist"
The ninth criterion is to find the threshold that minimizes the distance between the ROC plot and the upper left corner of the unit square.
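A rough base R sketch of this criterion follows. It uses simulated, hypothetical data and assumes a plot is predicted present when its probability is at or above the threshold; the package's own implementation may differ in detail.

```r
# Simulated presence/absence data (hypothetical).
set.seed(2)
obs  <- rbinom(500, 1, 0.3)
pred <- plogis(rnorm(500, mean = ifelse(obs == 1, 1, -1)))

thresholds <- seq(0, 1, length.out = 101)

# Sensitivity: share of observed presences predicted present.
sens <- sapply(thresholds, function(th) mean(pred[obs == 1] >= th))
# Specificity: share of observed absences predicted absent.
spec <- sapply(thresholds, function(th) mean(pred[obs == 0] < th))

# A ROC point is (1 - specificity, sensitivity); the ideal corner is (0, 1).
roc_dist <- sqrt((1 - spec)^2 + (1 - sens)^2)

min_roc_dist_threshold <- thresholds[which.min(roc_dist)]
```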
"ReqSens"
The tenth criterion allows the user to set a required sensitivity, and then finds the highest threshold that will meet this requirement. In other words, the user can decide that the model must miss no more than, for example, 15 percent of the plots where the species is observed to be present. Therefore, the user requires a sensitivity of at least 0.85. This may be useful if, for example, the goal is to define a management area for a rare species, and the user wants to be certain that the management area doesn't leave too many populations unprotected.
"ReqSpec"
The eleventh criterion allows the user to set a required specificity, and then finds the lowest threshold that will meet this requirement. In other words, the user can decide that the model must misclassify no more than, for example, 15 percent of the plots where the species is observed to be absent. Therefore, the user requires a specificity of at least 0.85. This may be useful if, for example, the goal is to determine if a species is threatened, and the user wants to be certain not to inflate the population by declaring too many true absences as predicted presences.
Note: for "ReqSens" and "ReqSpec", if your model is poor and your requirement is too strict, it is possible that the only way to meet it will be by declaring every single plot to be Present (for ReqSens) or Absent (for ReqSpec) -- not a very useful method of prediction! Conversely, if the model is good and the requirement too lax, the resulting thresholds will result in unnecessary levels of inaccuracy. If a threshold exists where sensitivity equals specificity at a value greater than the required accuracy, then the user can raise the required specificity (or sensitivity) without sacrificing sensitivity (or specificity).
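The "highest threshold meeting a required sensitivity" and "lowest threshold meeting a required specificity" rules can be sketched in base R as below. The data are simulated and hypothetical, the variable names are chosen for this example, and the prediction rule (present when the probability is at or above the threshold) is an assumption.

```r
# Simulated presence/absence data (hypothetical).
set.seed(4)
obs  <- rbinom(500, 1, 0.3)
pred <- plogis(rnorm(500, mean = ifelse(obs == 1, 1, -1)))

thresholds <- seq(0, 1, length.out = 101)
sens <- sapply(thresholds, function(th) mean(pred[obs == 1] >= th))
spec <- sapply(thresholds, function(th) mean(pred[obs == 0] < th))

req.sens <- 0.85
req.spec <- 0.85

# ReqSens: the highest threshold whose sensitivity still meets the requirement.
# Threshold 0 always qualifies, since every plot is then predicted present.
req_sens_threshold <- max(thresholds[sens >= req.sens])

# ReqSpec: the lowest threshold whose specificity meets the requirement.
# Threshold 1 always qualifies here, since no simulated probability reaches 1.
req_spec_threshold <- min(thresholds[spec >= req.spec])
```

If req_sens_threshold comes out at 0 (or req_spec_threshold at 1), the requirement was only met by the degenerate "everything present" (or "everything absent") prediction described in the note above.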
"Cost"
The twelfth criterion balances the relative costs of false positive predictions and false negative predictions. A slope is calculated as (FPC/FNC) * ((1 - prevalence)/prevalence). To determine the threshold, a line of this slope is moved from the top left of the ROC plot until it first touches the ROC curve.
Note: the criterion "Cost" can also be used for C/B ratio analysis of diagnostic tests. In this case FPC = C (the net costs of treating nondiseased individuals) and FNC = B (the net benefits of treating diseased individuals). For further information on "Cost" see Wilson et al. (2005) and Cantor et al. (1999).
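The slope formula and the "line touching the ROC curve" step can be illustrated with a short base R sketch. The cost values and data are hypothetical. Finding the touch point by maximizing sensitivity - slope * (1 - specificity) is a standard geometric restatement (the line with the largest intercept is the first to touch the curve when moved down from the top left); it is offered here as an assumption about how the step could be computed, not as the package's actual code.

```r
# Hypothetical costs and prevalence.
FPC <- 1
FNC <- 2
prevalence <- 0.2   # a fixed value, in the spirit of the obs.prev argument

# Slope from the formula above: (1/2) * (0.8/0.2) = 2.
slope <- (FPC / FNC) * ((1 - prevalence) / prevalence)

# Simulated presence/absence data (hypothetical).
set.seed(3)
obs  <- rbinom(500, 1, prevalence)
pred <- plogis(rnorm(500, mean = ifelse(obs == 1, 1, -1)))

thresholds <- seq(0, 1, length.out = 101)
sens <- sapply(thresholds, function(th) mean(pred[obs == 1] >= th))
spec <- sapply(thresholds, function(th) mean(pred[obs == 0] < th))

# A line of the given slope through ROC point (1 - spec, sens) has this
# intercept on the sensitivity axis; the largest intercept marks the point
# the descending line touches first.
intercept <- sens - slope * (1 - spec)
cost_threshold <- thresholds[which.max(intercept)]
```

Raising FPC relative to FNC steepens the slope, which pushes the touch point toward the lower left of the ROC plot, that is, toward higher thresholds and fewer predicted presences.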
For all the criteria that depend on observed prevalence ("PredPrev=Obs", "ObsPrev" and "Cost"), the default is to use the observed prevalence from DATA. However, the argument obs.prev can be used to substitute a predetermined value for observed prevalence, for example, the prevalence from a larger dataset.
error.threshold.plot is a rough and ready function. It optimizes thresholds simply by calculating a large number of evenly spaced thresholds and looking for the best ones. This is good enough for graphs, but finding the theoretically 'best' thresholds would require calculating every possible unique threshold (not necessarily evenly spaced!).
Details on the smoothing argument: when the statistic being maximized (e.g. Kappa) is relatively flat but erratic, just picking the threshold that gives the single maximum value is somewhat arbitrary. smoothing compensates for this by taking an average of the thresholds that give a set number of the highest values (e.g. the 10 highest Kappas, or the 20 highest Kappas).
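The averaging idea can be shown with a tiny base R sketch. The helper smooth_max and the Kappa values are hypothetical and exist only for this illustration; how the package breaks ties or handles minimization is not covered here.

```r
# Hypothetical helper: average the thresholds belonging to the
# `smoothing` highest values of a statistic.
smooth_max <- function(thresholds, stat, smoothing = 1) {
  top <- order(stat, decreasing = TRUE)[seq_len(smoothing)]
  mean(thresholds[top])
}

# Invented example: Kappa is nearly flat between thresholds 0.2 and 0.4.
thresholds <- c(0.1, 0.2, 0.3, 0.4, 0.5)
kappa      <- c(0.10, 0.42, 0.40, 0.41, 0.20)

single_max   <- smooth_max(thresholds, kappa, smoothing = 1)
smoothed_max <- smooth_max(thresholds, kappa, smoothing = 3)
```

With smoothing = 1 the result is the single best threshold (0.2). With smoothing = 3 the thresholds of the three highest Kappas (0.2, 0.4 and 0.3) are averaged, giving approximately 0.3, the middle of the flat region.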
S.B. Cantor, C.C. Sun, G. Tortolero-Luna, R. Richards-Kortum, and M. Follen. A comparison of C/B ratios from studies using receiver operating characteristic curve analysis. Journal of Clinical Epidemiology, 52(9):885-892, 1999.
S. Manel, H.C. Williams, and S.J. Ormerod. Evaluating presence-absence models in ecology: the need to account for prevalence. Journal of Applied Ecology, 38:921-931, 2001.
K.A. Wilson, M.I. Westphal, H.P. Possingham, and J. Elith. Sensitivity of conservation planning to different approaches to using predicted species distribution data. Biological Conservation, 122(1):99-112, 2005.
error.threshold.plot, presence.absence.accuracy, roc.plot.calculate, presence.absence.summary
data(SIM3DATA)
optimal.thresholds(SIM3DATA)