In their 2015 JPM paper "Backtesting", Campbell Harvey and Yan Liu (HL) discuss the common practice of adjusting or 'haircutting' a set of backtest results to correct for assumed overfitting. In the industry, this haircut is often assumed to be 50% of reported performance. HL propose and demonstrate three methods of adjusting for potential multiple testing bias, adapted from procedures in the statistical literature for adjusting confidence over multiple trials.
haircutSharpe(portfolios, ..., strategy = NULL, trials = NULL,
  audit = NULL, env = .GlobalEnv)

.haircutSR(sm_fre, num_obs, SR, ind_an, ind_aut, rho, num_test, RHO)
portfolios: string name of portfolio, or optionally a vector of portfolios, see DETAILS
...: any other passthrough parameters
strategy: optional strategy specification that would contain more information on the process, default NULL
trials: optional number of trials, default NULL
audit: optional audit environment containing the results of parameter optimization or walk forward, default NULL
env: optional environment in which to find market data, if required, default .GlobalEnv
sm_fre: sampling frequency; [1,2,3,4,5] = [Daily, Weekly, Monthly, Quarterly, Annual]
num_obs: number of observations in the frequency specified by sm_fre
SR: Sharpe ratio, either annualized or in the frequency specified by sm_fre
ind_an: indicator; if the Sharpe ratio is annualized, ind_an = 1, otherwise 0
ind_aut: indicator; if the Sharpe ratio is already adjusted for autocorrelations, ind_aut = 0, otherwise 1
rho: autocorrelation coefficient at the specified frequency
num_test: number of tests allowed, e.g. Harvey, Liu and Zhu (2014) find 315 published equity risk factors
RHO: average correlation among contemporaneous strategy returns
an object of type haircutSR containing:
- four data.frames, each containing slots haircut_SR, adj_pvalue, and pct_adj
- output frequency
- sampling frequency; [1,2,3,4,5] = [Daily, Weekly, Monthly, Quarterly, Annual]
- number of observations in the frequency specified above
- observed Sharpe ratio
- observed Sharpe ratio corrected for autocorrelation
- indicator; if annualized, ind_an = 1, otherwise 0
- indicator; if adjusted for autocorrelations, ind_aut = 0, otherwise 1
- autocorrelation coefficient at the specified frequency
- number of trials
- average correlation among contemporaneous strategy returns
- the call used for SharpeRatio.haircut
- the call used to call .haircutSR
To explain the link between the Sharpe ratio and the t-statistic, and the application of a multiple testing p-value adjustment, HL use the simplest case of an individual investment strategy. Assume a null hypothesis that the strategy's mean return is zero, tested against a two-sided alternative that the mean return differs from zero. A two-sided test is appropriate because a strategy can be regarded as profitable whether its mean return is positive or negative, since investors can generally go long or short. Since returns will be at least asymptotically normally distributed (thanks to the Central Limit Theorem), a t-statistic following a t-distribution can be constructed and tested for significance. Because of the link between the Sharpe ratio and the t-statistic, the significance of a strategy's excess returns can be assessed directly from its Sharpe ratio. Let \(\hat{\mu}\) denote the mean of the sample of historical returns (daily, weekly, etc.) and \(\hat{\sigma}\) the standard deviation; then:
$$t\mbox{-statistic} = \frac{\hat{\mu}}{\hat{\sigma}/\sqrt{T}}$$
where \(T\) is the number of observations (so the t-statistic has \(T-1\) degrees of freedom), and since
$$\widehat{SR} = \frac{\hat{\mu}}{\hat{\sigma}}$$
it may be shown that
$$\widehat{SR} = \frac{t\mbox{-statistic}}{\sqrt{T}}$$
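As a quick check of this identity, here is a minimal R sketch on simulated returns (the simulated series and its parameters are purely illustrative):

  # simulate roughly five years of daily returns (illustrative values only)
  set.seed(42)
  x <- rnorm(1250, mean = 0.0004, sd = 0.01)
  T <- length(x)
  t_stat <- mean(x) / (sd(x) / sqrt(T))   # t-statistic of the mean return
  SR <- mean(x) / sd(x)                   # per-period Sharpe ratio
  all.equal(SR, t_stat / sqrt(T))         # TRUE: SR-hat equals t-statistic / sqrt(T)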
By implication, a higher Sharpe ratio equates to a higher t-ratio, implying greater significance (a lower p-value) for an investment strategy. If we denote the p-value of the single test as \(p^s\), we can write it as:
$${p^s} = \Pr(|r| > t\mbox{-ratio})$$
or
$${p^s} = \Pr(|r| > \widehat{SR} \cdot \sqrt{T})$$
If the researcher was exploring a particular economic theory then this p-value might make sense, but what if the researcher has tested multiple strategies and presents only the most profitable one? In this case the p-value of the single test may severely overstate the actual significance. A more truthful p-value would be an adjusted multiple testing p-value, which, if we denote it \(p^m\), can be represented as:
$${p^m} = \Pr\left(\max\{|r_i|,\; i = 1,\ldots,N\} > t\mbox{-ratio}\right)$$
or
$${p^m} = 1 - (1 - {p^s})^N$$
By equating the p-value of a single test to the multiple testing p-value we obtain the defining equation of the haircut Sharpe ratio \(\widehat{HSR}\), which is
$${p^m} = \Pr(|r| > \widehat{HSR} \cdot \sqrt{T})$$
where
$$p^m = 1 - (1 - {p^s})^N$$
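A hedged sketch of these two expressions, using a normal approximation to the t-distribution and purely illustrative inputs (the values of SR, T and N below are assumptions, not recommendations):

  SR <- 0.1     # per-period (e.g. daily) Sharpe ratio, illustrative
  T  <- 1250    # number of return observations
  N  <- 200     # number of independent trials, assumed known
  p_single   <- 2 * pnorm(-abs(SR) * sqrt(T))  # p^s = Pr(|r| > SR-hat * sqrt(T))
  p_multiple <- 1 - (1 - p_single)^N           # p^m = 1 - (1 - p^s)^N
  c(p_single, p_multiple)  # significant as a single test, but not after 200 trials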
This function replicates the methods proposed by Harvey and Liu to adjust an observed Sharpe ratio for the number of trials performed, the autocorrelation of the returns, the overall level of performance, and the presumed or observed correlation among trials.
We will refer to these methods as:
1. Bonferroni (BON)
2. Holm
3. Benjamini, Hochberg and Yekutieli (BHY)
Full details on the calculations and adjustments can be found in Harvey and Liu (2015). This documentation is just an overview to aid use of the R functions.
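As a minimal usage sketch of the lower-level interface shown in the usage section (all argument values below are illustrative assumptions, not recommendations):

  # annualized SR of 1.0 from 1250 daily observations, not yet adjusted for
  # autocorrelation, assuming 300 prior tests and an average cross-trial
  # correlation of 0.2 (all values illustrative)
  .haircutSR(sm_fre = 1, num_obs = 1250, SR = 1.0, ind_an = 1, ind_aut = 1,
             rho = 0.1, num_test = 300, RHO = 0.2)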
HL draw on 3 well-known adjustment methods from the statistics literature, originally applied to this problem in the paper "...and the Cross-Section of Expected Returns" by Harvey, Liu and Zhu. These are Bonferroni; Holm; and Benjamini, Hochberg, and Yekutieli (BHY).
1. Bonferroni (BON)
$$p_i^{Bonferroni} = \min[M \cdot p_i,\; 1]$$
Bonferroni applies the same adjustment to the p-value of each test, inflating the p-value by the number of tests. The multiple testing p-value is the minimum of each inflated p-value and 1 where 1 (or 100% if you prefer) is the upper bound of probability. HL use the example of p-values from 6 strategies where the p-values are (0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06). According to a 5% significance cutoff the first 5 tests would be considered significant. Using the p.adjust function in R we can get the multiple adjusted p-values and according to Bonferroni only the first test would be considered significant.
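HL's six-strategy example can be reproduced with base R's p.adjust:

  p <- c(0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06)
  p.adjust(p, method = "bonferroni")   # equivalent to pmin(6 * p, 1)
  # only the first adjusted p-value remains below the 5% cutoff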
2. Holm
p-value adjustments fall into 2 categories: single-step and sequential. Single-step corrections adjust all p-values equally, as in Bonferroni. Sequential adjustments are an adaptive procedure based on the distribution of p-values. Sequential methods gained prominence after a seminal paper by Schweder & Spjotvoll (1982), and section 7.3 of that paper gives a useful example of multiple testing hypothesis diagnostic plotting in R. Holm is an example of a sequential multiple testing procedure. For Holm, the equivalent adjusted p-value is
$$p_{(i)}^{Holm} = \min\left[\max_{j \le i}\left\{(M - j + 1)\, p_{(j)}\right\},\; 1\right]$$
Bonferroni adjusts each test equally, whereas Holm applies a sequential approach. It should therefore not be surprising that haircut Sharpe ratios under Bonferroni will be lower than under Holm. At this point it is useful to note that both Holm and Bonferroni attempt to prevent even a single Type I error, controlling what is called the family-wise error rate (FWER). The next adjustment proposed by HL is BHY, and the main difference from the previous 2 methods is that BHY attempts to control the false discovery rate (FDR), implying more lenience than Holm and Bonferroni and therefore higher adjusted Sharpe ratios.
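The contrast between the two FWER procedures is easy to see on the same example p-values:

  p <- c(0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06)
  rbind(bonferroni = p.adjust(p, method = "bonferroni"),
        holm       = p.adjust(p, method = "holm"))
  # Holm's adjusted p-values are never larger than Bonferroni's, so haircut
  # Sharpe ratios under Holm are never lower than under Bonferroni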
3. BHY
BHY's formulation of the FDR can be represented as follows. First all p-values are sorted in descending order and the adjusted p-value sequence is defined by pairwise comparisons.
$$p_{(i)}^{BHY} = \begin{cases} p_{(M)} & \mbox{if } i = M, \\ \min\left[p_{(i+1)}^{BHY},\; c(M)\,\frac{M}{i}\, p_{(i)}\right] & \mbox{if } i \le M-1, \end{cases}$$
where \(c(M) = \sum_{j=1}^{M} \frac{1}{j}\).
We expect BHY to be more lenient because it controls the false discovery rate, whereas Holm and Bonferroni control the family-wise error rate, attempting to avoid even a single false discovery. Bonferroni is more stringent than Holm since it is a single-step adjustment versus the sequential approach of Holm. With these 3 methods HL adjust p-values to account for multiple testing and then convert the adjusted p-values to haircut Sharpe ratios, thereby controlling for data mining. Both Holm and BHY require the empirical distribution of p-values from previously tried strategies.
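For a rough comparison, base R's p.adjust also implements the Benjamini-Yekutieli FDR adjustment (method "BY"), which corresponds to the BHY formulation above; on the same illustrative example it is visibly more lenient:

  p <- c(0.005, 0.009, 0.0128, 0.0135, 0.045, 0.06)
  p.adjust(p, method = "BY")
  # the first four adjusted p-values fall just below 5%, versus one significant
  # test under Bonferroni and two under Holm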
Empirical Study
Harvey, Liu and Zhu (2016, HLZ) provide a large study of multiple testing bias by examining market anomalies or risk factors previously published in major peer-reviewed journals. In constructing such a large study, they needed to correct for multiple potential issues in the analysis, including lack of complete data on all trials, an unknown number of failed trials, correlation among the published trials, and data snooping or look-ahead bias as later researchers learned features of the data from prior studies.
HLZ model over 300 risk factors documented in the finance literature. However, this sample is incomplete as a model of the distribution of p-values, since many tried strategies will not have been documented (publication bias), and the documented factors are potentially correlated, violating the requirement of independence between tests. HLZ propose a new distribution to overcome these shortfalls.
HLZ publish the list of resources they studied, over 300 factors for explaining the cross section of return patterns. See http://faculty.fuqua.duke.edu/~charvey/Factor-List.xlsx. There is a clear pattern of increasing factor discovery with each decade (HLZ, Figure 2: Factors and Publications, p.20). Assuming statistical and economic soundness of published t-statistics, HLZ conduct the 3 multiple testing procedures described earlier. Their conclusion, assuming all tried factors are published, is that an appropriate minimum threshold t-statistic for 5% significance is 2.8. This equates to a p-value of only 0.50% for single tests. Of course the assumption that all tried factors are published is not reasonable, so the analysis suggests a minimum threshold for accepting the significance of future tests, i.e. a single-test p-value of at most 0.50%.
HLZ limit their sample of factors to unique factors thereby minimizing test dependence which is a requirement for the 3 multiple testing procedures they propose. Since we know the requirements for being published are fairly stringent, HLZ estimate that 71% of tried tests are not published. See appendix B of HLZ for details. Using this number of tested factors together with the 3 multiple testing procedures they propose a benchmark t-statistic of 3.18. This required threshold is intuitively larger than the 2.8 threshold generated assuming a lower number of tests.
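The mapping from these benchmark t-statistics to single-test p-values is a one-line calculation (two-sided, using a normal approximation):

  2 * pnorm(-2.8)    # roughly 0.005, the 0.50% single-test threshold quoted above
  2 * pnorm(-3.18)   # roughly 0.0015 for the higher benchmark of 3.18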
Acknowledging the inevitable presence of test dependence and correlation
among published test statistics (think of the many price multiple factors for
instance) HLZ propose a "direct modeling approach" in which only t-statistics
are required to account for this correlation. Correction for correlation in
multiple testing procedures has only recently been documented in the
statistics literature, and methods typically include simulating the entire
time series to construct an empirical distribution for the range of test
statistics (see e.g. mcsim and txnsim). Of course the luxury of access to the
entire dataset is not generally available to the risk factor researcher or
potential investor being presented with a backtest, so HLZ propose a
"Truncated Exponential Distribution" for modelling the t-statistic sample of
published and unpublished results. The intuitive reasoning for a
monotonically decreasing exponential distribution for modelling t-statistics
is that finding factors with small t-statistics should be easier than larger
ones.
HLZ conclude that threshold cutoffs are increasing through time, imposing higher scrutiny on data mining today than on data mining in the past. Their justification is summarized by 3 reasons:
1. The easily discovered factors have already been discovered.
2. In finance there is a limited amount of data, compared with particle physics for example, where an experiment can create trillions of new observations.
3. The relative costs of data mining in the past were much higher than they are today, implying that the factors tested earlier were more likely to be grounded in sound economic principles.
In multiple hypothesis testing the challenge is to guard against false discoveries. HL argue that the appropriate haircut to the Sharpe ratio is non-linear: the highest Sharpe ratios (SRs) are only moderately penalized, while marginal SRs are penalized more heavily. The implication is that high SRs are more likely to be true discoveries in a multiple hypothesis testing framework.
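A hedged sketch of that non-linearity, using only the single-test/multiple-test relation given earlier (normal approximation; the helper haircut_sr and all parameter values are illustrative, not the package's implementation):

  haircut_sr <- function(SR, T, N) {
    p_s <- 2 * pnorm(-abs(SR) * sqrt(T))   # single-test p-value
    p_m <- 1 - (1 - p_s)^N                 # multiple-testing p-value
    qnorm(1 - p_m / 2) / sqrt(T)           # Sharpe ratio implied by p_m as a single test
  }
  T <- 240; N <- 100                       # 20 years of monthly data, 100 trials (illustrative)
  sr <- c(0.2, 0.4)                        # marginal vs. high monthly Sharpe ratios
  1 - haircut_sr(sr, T, N) / sr            # percentage haircut is far larger for the marginal SR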
HL mention 5 caveats to their framework, namely:
1. Sharpe ratios may not be appropriate metrics for strategies with negatively skewed expected payoffs, such as option strategies.
2. Sharpe ratios normalize returns based on their volatility (i.e. market risk), which may not be the most appropriate reflection of risk for a strategy.
3. Determining the appropriate significance level for multiple testing (where in single tests 5% is the normal cutoff).
4. Which multiple testing method you choose could yield different conclusions. HL propose 3 methods together with an average.
5. The number of trials used to adjust for multiple tests.
Harvey, Campbell R., and Yan Liu. 2015. "Backtesting." The Journal of Portfolio Management 41 (1): 13-28.
Harvey, Campbell R., Yan Liu, and Heqing Zhu. 2016. "... and the Cross-Section of Expected Returns." The Review of Financial Studies 29 (1): 5-68.
Mackie, Jasen. 2016. "R-view: Backtesting - Harvey & Liu (2015)." https://opensourcequant.wordpress.com/2016/11/17/r-view-backtesting-harvey-liu-2015/