Calculate the false discovery rate (the expected proportion of type I errors) under repeated testing, and determine which variables to select for, and which to exclude from, multivariate analysis.
FDR(data = NULL, sp.cols = NULL, var.cols = NULL, pvalues = NULL,
model.type = NULL, family = "auto", correction = "fdr", q = 0.05,
verbose = TRUE, simplif = FALSE)
data: a data frame containing the response and predictor variables (one in each column).
sp.cols: index number of the column containing the response variable (currently implemented for only one response variable at a time).
var.cols: index numbers of the columns containing the predictor variables.
pvalues: optionally, instead of data, sp.cols and var.cols, a data frame with the names of the predictor variables in the first column and their bivariate p-values (obtained elsewhere) in the second column. Example: pvalues <- data.frame(var = letters[1:5], pval = c(0.02, 0.004, 0.07, 0.03, 0.05)).
model.type: this argument (previously a character value, either "LM" or "GLM") is now deprecated and is ignored with a warning if provided. The information is now conveyed through the family argument - e.g., if you want linear models (LM), set family = "gaussian".
family: the error distribution and (optionally) the link function to use (see glm or family for details). The default, "auto", automatically uses the "binomial" family for response variables containing only the values 0 and 1, "poisson" for positive integer responses (i.e. count data), and "gaussian" (i.e. linear models) otherwise.
correction: the correction procedure to apply to the p-values; see p.adjust.methods for available options and p.adjust for more information. The default is "fdr".
q: the threshold value of FDR-corrected significance above which to reject variables. Defaults to 0.05.
verbose: logical value indicating whether to display messages.
simplif: logical value indicating whether simplified results should be provided (see Value).
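The family = "auto" rule described above can be sketched in base R. guess_family below is a hypothetical helper written for illustration, not part of fuzzySim; treating zero as a valid count value is an assumption:

```r
# Hypothetical helper mimicking the family = "auto" rule described above
guess_family <- function(response) {
  if (all(response %in% c(0, 1))) return("binomial")  # presence-absence data
  if (all(response >= 0 & response %% 1 == 0)) return("poisson")  # count data
  "gaussian"  # otherwise, a linear model
}

guess_family(c(0, 1, 1, 0))    # "binomial"
guess_family(c(3, 0, 7, 2))    # "poisson"
guess_family(c(1.2, -0.4, 5))  # "gaussian"
```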
If simplif = TRUE, this function returns a data frame with the variables' names as row names and five columns containing, respectively: their individual (bivariate) coefficients against the response; their individual AIC (Akaike's Information Criterion; Akaike, 1973); their BIC (Bayesian Information Criterion, also known as Schwarz criterion, SBC or SBIC; Schwarz, 1978); their p-value; and their adjusted p-value according to the applied correction.
If simplif = FALSE (the default), the result is a list of two such data frames:
exclude: with the variables to exclude.
select: with the variables to select (under the given q value).
It is common in ecology to search for statistical relationships between species' occurrence and a set of predictor variables. However, when a large number of variables is analysed (compared to the number of observations), false findings may arise due to repeated testing. Garcia (2003) recommended controlling the false discovery rate (FDR; Benjamini & Hochberg 1995) in ecological studies. The p.adjust R function performs this and other corrections to the significance (p) values of variables under repeated testing. The FDR function performs repeated regressions (either linear or binary logistic) or uses already-obtained p-values for a set of variables; calculates the FDR with p.adjust; and shows which variables should be retained for, or excluded from, further multivariate analysis according to their corrected p-values (see, for example, Barbosa, Real & Vargas 2009).
The FDR function uses the Benjamini & Hochberg ("BH", alias "fdr") correction by default, but check the p.adjust documentation for other available methods, namely "BY", which allows for non-independent data. Input data may be the response variable (for example, the presence-absence or abundance of a species) and the predictors (a table with one independent variable in each column, with the same number of rows and in the same order as the response); there should be no missing values in the data. Alternatively, you may already have performed the univariate regressions and have a set of variables and corresponding p-values which you want to correct with FDR; in this case, build a table with your variables' names in the first column and their p-values in the second column, and supply it as the pvalues argument (no need to provide the response or predictors in this case).
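To illustrate what the pvalues pathway computes, here is a base-R sketch (not the fuzzySim code itself) that corrects a set of bivariate p-values with p.adjust and splits the variables at q = 0.05; the exclusion rule (adjusted p > q) is an assumed convention mirroring the description of q above:

```r
# Bivariate p-values obtained elsewhere (same toy example as in Arguments)
pvalues <- data.frame(var = letters[1:5],
                      pval = c(0.02, 0.004, 0.07, 0.03, 0.05))

# FDR ("BH") correction of the raw p-values
pvalues$p.adjusted <- p.adjust(pvalues$pval, method = "fdr")

# Split at the q threshold (adjusted p > q rejected; an assumed convention)
q <- 0.05
exclude <- pvalues[pvalues$p.adjusted > q, ]
select  <- pvalues[pvalues$p.adjusted <= q, ]

select$var  # "a" "b" "d"
```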
Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov B.N. & Csaki F., 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, September 2-8, 1971, Budapest: Akademiai Kiado, p. 267-281.
Barbosa A.M., Real R. & Vargas J.M. (2009) Transferability of environmental favourability models in geographic space: The case of the Iberian desman (Galemys pyrenaicus) in Portugal and Spain. Ecological Modelling 220: 747-754.
Benjamini Y. & Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57: 289-300.
Garcia L.V. (2003) Controlling the false discovery rate in ecological research. Trends in Ecology and Evolution 18: 553-554.
Schwarz, G.E. (1978) Estimating the dimension of a model. Annals of Statistics, 6 (2): 461-464.
# NOT RUN {
data(rotif.env)
names(rotif.env)
FDR(data = rotif.env, sp.cols = 18, var.cols = 5:17)
FDR(data = rotif.env, sp.cols = 18, var.cols = 5:17, simplif = TRUE)
# }
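As a hedged illustration of working with the result structure described under Value (assuming the list elements are named exclude and select, with variable names as row names), the retained variable names can be pulled out for a subsequent multivariate model. A mock result is used here so the snippet is self-contained:

```r
# In practice: res <- FDR(data = rotif.env, sp.cols = 18, var.cols = 5:17)
# Mock result with the documented structure (two data frames keyed by variable name):
res <- list(exclude = data.frame(p.adjusted = c(0.07, 0.0625),
                                 row.names = c("c", "e")),
            select  = data.frame(p.adjusted = c(0.05, 0.02, 0.05),
                                 row.names = c("a", "b", "d")))

sel <- row.names(res$select)  # variables passing the FDR threshold
sel  # "a" "b" "d"
# These names could then feed a multivariate model, e.g.:
# glm(reformulate(sel, response = "species"), family = binomial, data = mydata)
```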