modifiedBisquareWeights: Modified Bisquare Weights

Description

Calculation of bisquare weights and the intermediate weight factors similar to those used in the calculation of biweight midcovariance and midcorrelation. The weights are designed such that outliers get smaller weights; the weights become zero for data points more than 9 (modified) median absolute deviations from the median.

Usage

modifiedBisquareWeights(
  x,
  removedCovariates = NULL,
  pearsonFallback = TRUE,
  maxPOutliers = 0.05,
  outlierReferenceWeight = 0.1,
  groupsForMinWeightRestriction = NULL,
  minWeightInGroups = 0,
  maxPropUnderMinWeight = 1,
  defaultWeight = 1,
  getFactors = FALSE)

Value

When the input getFactors is TRUE, a list with two components:

weights: A matrix of the same dimensions and dimnames as the input x giving the weights of the individual observations in x.
factors: A matrix of the same form as weights giving the weight factors.

When the input getFactors is FALSE, the function returns the matrix of weights.

Arguments

x: A matrix of numeric observations with variables (features) in columns and observations (samples) in rows. Weights will be calculated separately for each column.
removedCovariates: Optional matrix or data frame of variables that are to be regressed out of each column of x before calculating the weights. If given, must have the same number of rows as x.
pearsonFallback: Logical: for columns of x that have zero median absolute deviation (MAD), should the appropriately scaled standard deviation be used instead?
maxPOutliers: Optional numeric scalar between 0 and 1. Specifies the maximum proportion of outliers in each column, i.e., observations whose weights fall below outlierReferenceWeight (see below).
outlierReferenceWeight: A number between 0 and 1 specifying what is to be considered an outlier when calculating the proportion of outliers.
groupsForMinWeightRestriction: An optional vector with length equal to the number of samples (rows) in x giving a categorical variable. The output factors and weights are adjusted such that in samples at each level of the variable, the weight is below minWeightInGroups in a fraction of samples that is at most maxPropUnderMinWeight.
minWeightInGroups: A threshold weight, see groupsForMinWeightRestriction and details.
maxPropUnderMinWeight: A proportion (number between 0 and 1). See groupsForMinWeightRestriction and details.
defaultWeight: Value used for weights that would be undefined or not finite, for example, when a column in x is constant.
getFactors: Logical: should the intermediate weight factors be returned as well?
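
A minimal call might look as follows. This is a sketch rather than an official example: it assumes the function is available from the WGCNA package, and the simulated matrix x and the planted outlier are made up for illustration. The call is guarded so the code does nothing if the package is not installed.

```r
# Simulated data (illustrative only): 50 samples in rows, 4 features in columns.
set.seed(1)
x <- matrix(rnorm(200), nrow = 50, ncol = 4,
            dimnames = list(NULL, paste0("feature", 1:4)))
x[1, 1] <- 25  # plant one obvious outlier in the first column

# Assumption: modifiedBisquareWeights is provided by the WGCNA package.
if (requireNamespace("WGCNA", quietly = TRUE)) {
  # With getFactors = TRUE the result is a list with components
  # 'weights' and 'factors'; with the default FALSE it is the weight matrix.
  res <- WGCNA::modifiedBisquareWeights(x, getFactors = TRUE)
  str(res)
}
```

All other arguments are left at the defaults shown under Usage.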

Author

Peter Langfelder

Details

Weights are calculated independently for each column of x. Denoting a column of x as y, the weights are calculated as \((1-u^2)^2\) where u is defined as \(\min(1, |y-m|/(9\,MMAD))\). Here m is the median of the column y and MMAD is the modified median absolute deviation. We call the expression \(|y-m|/(9\,MMAD)\) the weight factors. Note that outliers are observations with high (>1) weight factors but low (zero) weights.
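
The formula above can be illustrated in base R for a single column. This is a sketch, not the package's implementation: the helper name bisquareSketch is hypothetical, and it assumes the simple case in which MMAD equals the plain unscaled MAD (no adjustment for the outlier-proportion or group conditions).

```r
# Hypothetical helper (not part of any package): bisquare weights for one
# column y, assuming MMAD equals the unscaled median absolute deviation.
bisquareSketch <- function(y) {
  m <- median(y)
  mmad <- mad(y, constant = 1)        # constant = 1 turns off the 1.4826 scaling
  factors <- abs(y - m) / (9 * mmad)  # weight factors; may exceed 1 for outliers
  u <- pmin(1, factors)               # cap at 1 so outliers get weight zero
  weights <- (1 - u^2)^2
  list(weights = weights, factors = factors)
}

y <- c(1, 2, 3, 4, 5, 100)
res <- bisquareSketch(y)
print(round(res$weights, 3))
print(round(res$factors, 3))
```

In this toy vector the value 100 lies far more than 9 MADs from the median, so its weight factor exceeds 1 and its weight is zero, while the remaining values keep weights close to 1.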

The calculation of MMAD starts with calculating the (unscaled) median absolute deviation of the column y. If the median absolute deviation is zero and pearsonFallback is TRUE, it is replaced by the standard deviation (multiplied by qnorm(0.75) to make it asymptotically consistent with MAD). The following two conditions are then checked: (1) the proportion of weights below outlierReferenceWeight is at most maxPOutliers, and (2) if groupsForMinWeightRestriction has non-zero length, then for each individual level in groupsForMinWeightRestriction, the proportion of samples with weights less than minWeightInGroups is at most maxPropUnderMinWeight. (If groupsForMinWeightRestriction is zero-length, the second condition is considered trivially satisfied.) If both conditions are met, MMAD equals the median absolute deviation, MAD. If either condition is not met, MMAD equals the lowest number for which both conditions are met.
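
To make condition (1) concrete, the following base R sketch finds the smallest MMAD for which at most a proportion maxPOutliers of the weights fall below outlierReferenceWeight. It is an illustration under stated assumptions, not the package's code: the helper mmadSketch is hypothetical, it ignores the group restriction in condition (2) and the pearsonFallback step, and it relies on the observation that a weight \((1-u^2)^2\) is below a reference weight r exactly when u exceeds \(\sqrt{1-\sqrt{r}}\).

```r
# Hypothetical helper (not part of any package): smallest MMAD that satisfies
# condition (1) only. Condition (2) and pearsonFallback are not handled.
mmadSketch <- function(y, maxPOutliers = 0.05, outlierReferenceWeight = 0.1) {
  dev <- abs(y - median(y))
  madY <- median(dev)                    # unscaled MAD
  # At most k observations may have a weight below the reference weight.
  k <- floor(maxPOutliers * length(y))
  if (k >= length(y)) return(madY)       # every observation may be an outlier
  # A weight is below the reference weight exactly when u exceeds uRef.
  uRef <- sqrt(1 - sqrt(outlierReferenceWeight))
  # The (k+1)-th largest deviation must map to a factor of at most uRef.
  devLimit <- sort(dev, decreasing = TRUE)[k + 1]
  max(madY, devLimit / (9 * uRef))
}

y <- c(1:10, 50, 60)
mmad <- mmadSketch(y)
w <- (1 - pmin(1, abs(y - median(y)) / (9 * mmad))^2)^2
print(round(w, 3))
```

With the plain MAD, both 50 and 60 would receive weight zero, so 2 of 12 weights would fall below 0.1, more than the 5 percent allowed. The enlarged MMAD keeps the smallest weight at the reference weight, up to floating-point rounding.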

References

A full description of the weight calculation can be found, e.g., in the Methods section of

Wang N, Langfelder P, et al (2022) Mapping brain gene coexpression in daytime transcriptomes unveils diurnal molecular networks and deciphers perturbation gene signatures. Neuron, 110(20):3318-3338.e9. PMID: 36265442; PMCID: PMC9665885. doi:10.1016/j.neuron.2022.09.028

Other references include, in reverse chronological order,

Peter Langfelder, Steve Horvath (2012) Fast R Functions for Robust Correlations and Hierarchical Clustering. Journal of Statistical Software, 46(11), 1-17. https://www.jstatsoft.org/v46/i11/

"Introduction to Robust Estimation and Hypothesis Testing", Rand Wilcox, Academic Press, 1997.

"Data Analysis and Regression: A Second Course in Statistics", Mosteller and Tukey, Addison-Wesley, 1977, pp. 203-209.