In practice, market microstructure noise leads to a departure from the pure semimartingale model. We consider the process \(Y\) in period \(\tau\):
$$
Y_{\tau} = X_{\tau} + \epsilon_{\tau},
$$
where the observed \(d\)-dimensional log-prices \(Y_{\tau}\) are the sum of an underlying Brownian semimartingale process \(X\) and a noise term \(\epsilon_{\tau}\).
The noise \(\epsilon_{\tau}\) is an i.i.d. process independent of \(X\).
It is intuitive that, under mean-zero i.i.d. microstructure noise, some form of smoothing of the observed log-price should diminish the impact of the noise.
Effectively, we are going to approximate a continuous function by an average of observations of \(Y\) in a neighborhood, the noise being averaged away.
Assume there are \(N\) equispaced returns in period \(\tau\) for each series in the list (after refreshing the data).
Let \(r_{\tau_i}\), with \(i=1, \ldots,N\), be a return of an asset in period \(\tau\), and assume there are \(d\) assets.
In order to define the univariate pre-averaging estimator, we first define the pre-averaged returns as
$$
\bar{r}_{\tau_j}^{(k)}= \sum_{h=1}^{k_N-1}g\left(\frac{h}{k_N}\right)r_{\tau_{j+h}}^{(k)}
$$
where \(g\) is a non-zero real-valued function \(g:[0,1] \rightarrow \mathbb{R}\) given by \(g(x) = \min(x,1-x)\), and \(k_N\) is a sequence of integers satisfying \(k_N = \lfloor\theta N^{1/2}\rfloor\).
We use \(\theta = 0.8\) as recommended in Hautsch and Podolskij (2013). The pre-averaged returns are simply a weighted average over the returns in a local window.
This averaging diminishes the influence of the noise. The order of the window size \(k_N\) is chosen to lead to optimal convergence rates.
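The pre-averaging step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular package: the helper names `g`, `window_size`, and `pre_average` are hypothetical, and `returns` is assumed to be a plain list holding \(r_{\tau_1}, \ldots, r_{\tau_N}\) for a single asset.

```python
import math


def g(x):
    """Weight function g(x) = min(x, 1 - x)."""
    return min(x, 1.0 - x)


def window_size(n, theta=0.8, delta=0.0):
    """Window length k_N = floor(theta * N^(1/2 + delta))."""
    return math.floor(theta * n ** (0.5 + delta))


def pre_average(returns, k):
    """Pre-averaged returns for j = 0, ..., N - k + 1.

    returns[0] holds r_{tau_1}, so r_{tau_{j+h}} is returns[j + h - 1].
    """
    n = len(returns)
    weights = [g(h / k) for h in range(1, k)]  # h = 1, ..., k - 1
    return [
        sum(w * returns[j + h - 1] for h, w in enumerate(weights, start=1))
        for j in range(n - k + 2)
    ]


# Example: N = 390 returns (an assumed sample size) gives a window of k_N = 15.
k = window_size(390)
```

With \(k_N = 4\) the weights are \(0.25, 0.5, 0.25\), so each pre-averaged return is a triangular-weighted average of three consecutive returns.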
The pre-averaging estimator is then simply the analogue of the realized variance, but based on pre-averaged returns and with an additional term to remove the bias due to noise:
$$
\hat{C}= \frac{N^{-1/2}}{\theta \psi_2}\sum_{i=0}^{N-k_N+1} (\bar{r}_{\tau_i})^2-\frac{\psi_1^{k_N}N^{-1}}{2\theta^2\psi_2^{k_N}}\sum_{i=1}^{N}r_{\tau_i}^2
$$
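with
$$
\psi_1^{k_N}= k_N \sum_{j=1}^{k_N}\left(g\left(\frac{j+1}{k_N}\right)-g\left(\frac{j}{k_N}\right)\right)^2,
$$
$$
\psi_2^{k_N}= \frac{1}{k_N}\sum_{j=1}^{k_N-1}g^2\left(\frac{j}{k_N}\right),
$$
and \(\psi_2\) the limit of \(\psi_2^{k_N}\) as \(k_N \rightarrow \infty\), which for \(g(x) = \min(x,1-x)\) is
$$
\psi_2= \int_0^1 g^2(x)\,dx = \frac{1}{12}.
$$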
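The univariate estimator and its constants can be sketched as follows. This is an illustrative sketch that applies the formulas above literally, not a reference implementation; the function names (`psi1_k`, `psi2_k`, `pre_average`, `preaveraged_variance`) are hypothetical, and the simulated returns are an assumed noise-free example with integrated variance 1.

```python
import math
import random


def g(x):
    return min(x, 1.0 - x)


def psi1_k(k):
    # Applied literally: the j = k term evaluates min(x, 1 - x) just outside [0, 1].
    return k * sum((g((j + 1) / k) - g(j / k)) ** 2 for j in range(1, k + 1))


def psi2_k(k):
    return sum(g(j / k) ** 2 for j in range(1, k)) / k


def pre_average(returns, k):
    n = len(returns)
    w = [g(h / k) for h in range(1, k)]
    return [
        sum(w[h - 1] * returns[j + h - 1] for h in range(1, k))
        for j in range(n - k + 2)
    ]


def preaveraged_variance(returns, theta=0.8):
    """Pre-averaging estimator: scaled sum of squared pre-averaged returns
    minus the noise bias correction."""
    n = len(returns)
    k = math.floor(theta * math.sqrt(n))
    if k < 2:
        raise ValueError("too few returns for this theta")
    psi2 = 1.0 / 12.0
    signal = sum(rb ** 2 for rb in pre_average(returns, k))
    noise = sum(r ** 2 for r in returns)
    scale = n ** -0.5 / (theta * psi2)
    bias = psi1_k(k) / (2 * theta ** 2 * psi2_k(k) * n)
    return scale * signal - bias * noise


# Assumed toy data: i.i.d. Gaussian returns whose variances sum to 1.
random.seed(0)
n = 10_000
returns = [random.gauss(0.0, 1.0) / math.sqrt(n) for _ in range(n)]
c_hat = preaveraged_variance(returns)  # expected to land near 1, with sampling error
```

For large \(k_N\), `psi2_k(k)` approaches \(1/12\), which is why the limit \(\psi_2\) can be used in the leading term.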
The multivariate counterpart is very similar. The estimator is called the Modulated Realized Covariance (rMRCov) and is defined as
$$
\mbox{MRC}= \frac{N}{N-k_N+2}\frac{1}{\psi_2k_N}\sum_{i=0}^{N-k_N+1}\bar{\boldsymbol{r}}_{\tau_i}\cdot \bar{\boldsymbol{r}}'_{\tau_i} -\frac{\psi_1^{k_N}}{\theta^2\psi_2^{k_N}}\hat{\Psi}_N
$$
where \(\hat{\Psi}_N = \frac{1}{2N}\sum_{i=1}^N \boldsymbol{r}_{\tau_i}(\boldsymbol{r}_{\tau_i})'\) is a bias correction that makes the estimator consistent.
However, because of this correction, the estimator is not guaranteed to be positive semidefinite (PSD).
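A sketch of the bias-corrected multivariate estimator in plain Python follows. It is an illustration of the formula above under stated assumptions, not the code of any library: the names `pre_average_rows`, `outer_sum`, and `mrc` are hypothetical, and `rows` is assumed to be a list of \(N\) already synchronized return vectors of length \(d\).

```python
import math


def g(x):
    return min(x, 1.0 - x)


def pre_average_rows(rows, k):
    """Pre-average each asset; rows[i] is the d-vector of returns r_{tau_{i+1}}."""
    n, d = len(rows), len(rows[0])
    w = [g(h / k) for h in range(1, k)]
    return [
        [sum(w[h - 1] * rows[j + h - 1][a] for h in range(1, k)) for a in range(d)]
        for j in range(n - k + 2)
    ]


def outer_sum(vectors):
    """Sum of outer products v v' over a list of d-vectors."""
    d = len(vectors[0])
    total = [[0.0] * d for _ in range(d)]
    for v in vectors:
        for a in range(d):
            for b in range(d):
                total[a][b] += v[a] * v[b]
    return total


def mrc(rows, theta=0.8):
    """Bias-corrected modulated realized covariance (d x d nested list)."""
    n, d = len(rows), len(rows[0])
    k = math.floor(theta * math.sqrt(n))
    if k < 2:
        raise ValueError("too few returns for this theta")
    psi1_k = k * sum((g((j + 1) / k) - g(j / k)) ** 2 for j in range(1, k + 1))
    psi2_k = sum(g(j / k) ** 2 for j in range(1, k)) / k
    psi2 = 1.0 / 12.0
    signal = outer_sum(pre_average_rows(rows, k))
    raw = outer_sum(rows)  # equals 2N times the bias-correction matrix
    scale = n / (n - k + 2) / (psi2 * k)
    bias = psi1_k / (theta ** 2 * psi2_k) / (2 * n)
    return [
        [scale * signal[a][b] - bias * raw[a][b] for b in range(d)]
        for a in range(d)
    ]


# Assumed toy data: the second asset's return is exactly twice the first's.
xs = [0.01 * math.sin(i) for i in range(100)]
m = mrc([[x, 2.0 * x] for x in xs])
```

In this toy example the second asset scales the first by 2, so the off-diagonal entry is twice `m[0][0]` and `m[1][1]` is four times `m[0][0]`, which gives a quick consistency check on the matrix arithmetic.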
An alternative is to slightly enlarge the bandwidth so that \(k_N = \lfloor\theta N^{1/2+\delta}\rfloor\). Setting \(\delta = 0.1\) yields an estimate that is consistent without the bias correction and is PSD, in which case:
$$
\mbox{MRC}^{\delta}= \frac{N}{N-k_N+2}\frac{1}{\psi_2k_N}\sum_{i=0}^{N-k_N+1}\bar{\boldsymbol{r}}_{\tau_i}\cdot \bar{\boldsymbol{r}}'_{\tau_i}
$$
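Because this variant is a positively scaled sum of outer products, every quadratic form \(x'\,\mbox{MRC}^{\delta}\,x\) is non-negative. The sketch below illustrates this numerically; it is a spot check rather than a proof, and the names `mrc_delta` and `quad_form`, as well as the simulated two-asset returns, are assumptions made for the example.

```python
import math
import random


def g(x):
    return min(x, 1.0 - x)


def mrc_delta(rows, theta=0.8, delta=0.1):
    """MRC with enlarged bandwidth k_N = floor(theta * N^(1/2 + delta)),
    computed without the bias-correction term."""
    n, d = len(rows), len(rows[0])
    k = math.floor(theta * n ** (0.5 + delta))
    if k < 2 or k > n:
        raise ValueError("window size incompatible with sample size")
    w = [g(h / k) for h in range(1, k)]
    scale = n / (n - k + 2) / (k / 12.0)  # 1 / (psi_2 * k_N) with psi_2 = 1/12
    out = [[0.0] * d for _ in range(d)]
    for j in range(n - k + 2):
        # Pre-averaged return vector for window j.
        rbar = [
            sum(w[h - 1] * rows[j + h - 1][a] for h in range(1, k))
            for a in range(d)
        ]
        for a in range(d):
            for b in range(d):
                out[a][b] += scale * rbar[a] * rbar[b]
    return out


def quad_form(m, x):
    """Quadratic form x' m x."""
    size = len(x)
    return sum(x[a] * m[a][b] * x[b] for a in range(size) for b in range(size))


# Assumed toy data: two independent Gaussian return series.
rng = random.Random(42)
rows = [[rng.gauss(0.0, 0.01), rng.gauss(0.0, 0.01)] for _ in range(400)]
m = mrc_delta(rows)
checks = [quad_form(m, x) for x in ([1, 0], [0, 1], [1, -1], [2, 1])]
```

Each entry of `checks` is a quadratic form of the estimate; non-negative values for these test vectors are consistent with the PSD property.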