quotesCleanup: Cleans quote data

Description

This is a wrapper function for cleaning the quote data in the entire folder dataSource. The result is saved in the folder dataDestination.

In case you supply the argument qDataRaw, the on-disk functionality is ignored and the function returns the cleaned quotes as xts or data.table object (see examples).

The following cleaning functions are performed sequentially: noZeroQuotes, exchangeHoursOnly, autoSelectExchangeQuotes or selectExchange, rmNegativeSpread, rmLargeSpread, mergeQuotesSameTimestamp, rmOutliersQuotes.

Usage

quotesCleanup(
  dataSource = NULL,
  dataDestination = NULL,
  exchanges = "auto",
  qDataRaw = NULL,
  report = TRUE,
  selection = "median",
  maxi = 50,
  window = 50,
  type = "standard",
  marketOpen = "09:30:00",
  marketClose = "16:00:00",
  rmoutliersmaxi = 10,
  printExchange = TRUE,
  saveAsXTS = FALSE,
  tz = NULL
)

Value

The function converts every (compressed) csv (or rds) file in dataSource into multiple xts or data.table files.

In dataDestination, there will be one folder for each symbol containing .rds files with cleaned data stored either in data.table or xts format.

In case you supply the argument qDataRaw, the on-disk functionality is ignored and the function returns a list with the cleaned quotes as an xts or data.table object depending on input (see examples).

Arguments

dataSource

character indicating the folder in which the original data is stored.

dataDestination

character indicating the folder in which the cleaned data is stored.

exchanges

vector of stock exchange symbols for all data in dataSource, e.g. exchanges = c("T","N") retrieves all stock market data from both NYSE and NASDAQ. The possible exchange symbols are:

A: AMEX
N: NYSE
B: Boston
P: Arca
C: NSX
T/Q: NASDAQ
D: NASD ADF and TRF
X: Philadelphia
I: ISE
M: Chicago
W: CBOE
Z: BATS

The default value is "auto", which automatically selects the exchange for each stock and day independently using the autoSelectExchangeQuotes function.

qDataRaw

xts or data.table object containing raw quote data, possibly for multiple symbols over multiple days. This argument is NULL by default. When it is supplied, the arguments dataSource and dataDestination are ignored. This is only advisable for small chunks of data.

report

logical, TRUE by default. When it is TRUE and the on-disk functionality is not used, the function also returns a vector indicating how many quotes were deleted by each cleaning step.

selection

argument to be passed on to the cleaning routine mergeQuotesSameTimestamp. The default is "median".

maxi

quotes whose spread is greater than maxi times the median spread of the day are excluded. The default is 50.
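
The maxi rule can be illustrated with a minimal base R sketch. This is not the package's implementation; the toy data, the column names BID and OFR, and the variable names are assumptions made for illustration only.

```r
# Hypothetical toy quotes for a single day (not the package's sample data)
quotes <- data.frame(
  BID = c(100.00, 100.01, 100.02, 100.00, 100.01),
  OFR = c(100.02, 100.03, 100.04, 105.00, 100.03)
)
maxi <- 50

# Spread of each quote and the median spread of the day
spread <- quotes$OFR - quotes$BID
medianSpread <- median(spread)

# Keep only quotes whose spread does not exceed maxi times the median spread
keep <- spread <= maxi * medianSpread
cleaned <- quotes[keep, ]
nrow(cleaned)
```

Here the fourth quote has a spread of about 5, far above 50 times the median spread of about 0.02, so it is the only row dropped.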

window

argument to be passed on to the cleaning routine rmOutliersQuotes.

type

argument to be passed on to the cleaning routine rmOutliersQuotes.

marketOpen

passed to exchangeHoursOnly. A character in the format of "HH:MM:SS", specifying the opening hour, minute and second of an exchange.

marketClose

passed to exchangeHoursOnly. A character in the format of "HH:MM:SS", specifying the closing hour, minute and second of an exchange.
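
As a rough illustration of what restricting quotes to exchange hours means, the base R sketch below filters timestamps by time of day. It is an assumption-laden sketch, not the code of exchangeHoursOnly: the toy timestamps are invented, and whether the package treats the open and close boundaries as inclusive is not stated in this page.

```r
marketOpen <- "09:30:00"
marketClose <- "16:00:00"

# Hypothetical timestamps around the open and the close
timestamps <- as.POSIXct(
  c("2018-01-02 09:29:59", "2018-01-02 09:30:00",
    "2018-01-02 15:59:59", "2018-01-02 16:00:01"),
  tz = "UTC"
)

# Zero-padded "HH:MM:SS" strings sort in chronological order,
# so they can be compared directly with the open and close times
timeOfDay <- format(timestamps, "%H:%M:%S", tz = "UTC")
keep <- timeOfDay >= marketOpen & timeOfDay <= marketClose  # inclusive bounds assumed
timestamps[keep]
```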

rmoutliersmaxi

argument to be passed on to the cleaning routine rmOutliersQuotes.

printExchange

argument passed to autoSelectExchangeQuotes indicating whether the chosen exchange is printed on the console. The default is TRUE. This is only used when exchanges is "auto".

saveAsXTS

indicates whether data should be saved in xts format instead of data.table when using on-disk functionality. FALSE by default, which means we save as data.table.

tz

fallback time zone used in case we are unable to identify the time zone of the data, by default: tz = NULL. With the non-disk functionality, we attempt to extract the time zone from the DT column (or index) of the data, which may fail. In case of failure we use tz if specified, and if it is not specified, we use "UTC". In the on-disk functionality, if tz is not specified, the time zone used will be the system default.
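
The fallback order described for the non-disk functionality can be sketched in base R as follows. The helper resolveTz is hypothetical and is not part of the package; it only mirrors the order stated above (time zone of the data first, then tz, then "UTC").

```r
# Hypothetical helper mirroring the documented fallback order
resolveTz <- function(timestamps, tz = NULL) {
  found <- attr(timestamps, "tzone")
  # 1. Use the time zone stored on the data if there is one
  if (!is.null(found) && nzchar(found[1])) return(found[1])
  # 2. Otherwise use the user-supplied fallback
  if (!is.null(tz)) return(tz)
  # 3. Otherwise default to UTC
  "UTC"
}

withTz <- as.POSIXct("2018-01-02 09:30:00", tz = "America/New_York")
withoutTz <- as.POSIXct("2018-01-02 09:30:00")  # no explicit time zone stored

resolveTz(withTz)
resolveTz(withoutTz, tz = "Europe/Brussels")
resolveTz(withoutTz)
```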

Author

Jonathan Cornelissen, Kris Boudt, Onno Kleen, and Emil Sjoerup.

Details

Using the on-disk functionality with .csv.zip files, which is the standard format from the WRDS database, will write temporary files on your machine. We try to clean up after it, but cannot guarantee that no files slip through the cracks if the permission settings on your machine do not match ours.

If the input data.table does not contain a DT column but it does contain DATE and TIME_M columns, we create the DT column by REFERENCE, altering the data.table that may be in the user's environment!
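
The effect of modifying a data.table by reference, and one way to guard against it, can be sketched as follows. The helper addDT is hypothetical and only mimics the behaviour described above; it is not the package's code. The toy columns follow the DATE and TIME_M names mentioned in the text.

```r
library(data.table)

# Hypothetical helper that adds a DT column by reference, as described above
addDT <- function(dt) {
  dt[, DT := as.POSIXct(paste(DATE, TIME_M), tz = "UTC")]
  invisible(dt)
}

raw <- data.table(DATE = "2018-01-02", TIME_M = c("09:30:00", "09:30:01"))

# Working on a copy leaves the original table untouched
addDT(copy(raw))
"DT" %in% names(raw)  # FALSE

# Passing the table itself alters it in the calling environment
addDT(raw)
"DT" %in% names(raw)  # TRUE
```

By the same reasoning, passing data.table::copy() of your data as qDataRaw should leave the original object unchanged, at the cost of extra memory.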

References

Barndorff-Nielsen, O. E., Hansen, P. R., Lunde, A., and Shephard, N. (2009). Realized kernels in practice: Trades and quotes. Econometrics Journal 12, C1-C32.

Brownlees, C.T. and Gallo, G.M. (2006). Financial econometric analysis at ultra-high frequency: Data handling concerns. Computational Statistics & Data Analysis, 51, pages 2232-2245.

Falkenberry, T.N. (2002). High frequency data filtering. Unpublished technical report.

Examples

data.table::setDTthreads(2)
# Consider you have raw quote data for 1 stock for 2 days
head(sampleQDataRaw)
dim(sampleQDataRaw)
qDataAfterCleaning <- quotesCleanup(qDataRaw = sampleQDataRaw, exchanges = "N")
qDataAfterCleaning$report
dim(qDataAfterCleaning$qData)

# In case you have more data it is advised to use the on-disk functionality
# via "dataSource" and "dataDestination" arguments
