This is a wrapper function for cleaning the quote data in the entire folder dataSource. The result is saved in the folder dataDestination.
If you supply the argument qDataRaw, the on-disk functionality is ignored and the function returns the cleaned quotes as an xts or data.table object (see examples).
The following cleaning functions are performed sequentially: noZeroQuotes, exchangeHoursOnly, autoSelectExchangeQuotes or selectExchange, rmNegativeSpread, rmLargeSpread, mergeQuotesSameTimestamp, rmOutliersQuotes.
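To make the sequential filtering concrete, here is a minimal base-R sketch of three of the steps on a toy data frame. It only illustrates the ideas behind noZeroQuotes, rmNegativeSpread and rmLargeSpread; it is not the package's implementation, and the data, the column values and the variable names (quotes, step1, removed, and so on) are made up for illustration.

```r
# Toy quotes for one symbol on one day (made-up numbers, not package data).
quotes <- data.frame(
  DT  = as.POSIXct("2020-01-02 09:30:00", tz = "UTC") + 0:5,
  BID = c(100.00, 0.00, 100.02, 100.05, 100.01, 100.03),
  OFR = c(100.02, 100.03, 100.01, 100.07, 103.00, 100.05)
)

# Step 1: drop quotes with a zero bid or offer (the idea behind noZeroQuotes).
step1 <- quotes[quotes$BID != 0 & quotes$OFR != 0, ]

# Step 2: drop quotes with a negative spread (the idea behind rmNegativeSpread).
step2 <- step1[step1$OFR - step1$BID >= 0, ]

# Step 3: drop spreads larger than maxi times the day's median spread
# (the idea behind rmLargeSpread; maxi = 50 mirrors the default argument).
maxi <- 50
spread <- step2$OFR - step2$BID
step3 <- step2[spread <= maxi * median(spread), ]

# Quotes removed at each step, similar in spirit to the report vector.
removed <- c(
  zeroQuotes     = nrow(quotes) - nrow(step1),
  negativeSpread = nrow(step1) - nrow(step2),
  largeSpread    = nrow(step2) - nrow(step3)
)
print(removed)
```

Because each step works on the output of the previous one, the order matters: the median spread in step 3 is computed only over quotes that survived the earlier filters.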
quotesCleanup(
  dataSource = NULL,
  dataDestination = NULL,
  exchanges = "auto",
  qDataRaw = NULL,
  report = TRUE,
  selection = "median",
  maxi = 50,
  window = 50,
  type = "standard",
  marketOpen = "09:30:00",
  marketClose = "16:00:00",
  rmoutliersmaxi = 10,
  printExchange = TRUE,
  saveAsXTS = FALSE,
  tz = NULL
)
The function converts every (compressed) csv (or rds) file in dataSource into multiple xts or data.table files.
In dataDestination, there will be one folder for each symbol containing .rds files with cleaned data stored either in data.table or xts format.
If you supply the argument qDataRaw, the on-disk functionality is ignored and the function returns a list with the cleaned quotes as an xts or data.table object, depending on the input (see examples).
dataSource: character indicating the folder in which the original data is stored.
dataDestination: character indicating the folder in which the cleaned data is stored.
exchanges: vector of stock exchange symbols for all data in dataSource, e.g. exchanges = c("T","N") retrieves all stock market data from both NYSE and NASDAQ.
The possible exchange symbols are:
A: AMEX
N: NYSE
B: Boston
P: Arca
C: NSX
T/Q: NASDAQ
D: NASD ADF and TRF
X: Philadelphia
I: ISE
M: Chicago
W: CBOE
Z: BATS
The default value is "auto", which automatically selects the exchange for the stocks and days independently using the autoSelectExchangeQuotes function.
qDataRaw: xts or data.table object containing raw quote data, possibly for multiple symbols over multiple days. This argument is NULL by default. Supplying it means the arguments dataSource and dataDestination will be ignored (only advisable for small chunks of data).
report: boolean, TRUE by default. If it is TRUE and the on-disk functionality is not used, the function also returns a vector indicating how many quotes were deleted by each cleaning step.
selection: argument to be passed on to the cleaning routine mergeQuotesSameTimestamp. The default is "median".
maxi: spreads greater than maxi times the median spread of the day are excluded.
window: argument to be passed on to the cleaning routine rmOutliersQuotes.
type: argument to be passed on to the cleaning routine rmOutliersQuotes.
marketOpen: passed to exchangeHoursOnly. A character in the format "HH:MM:SS", specifying the opening hour, minute and second of an exchange.
marketClose: passed to exchangeHoursOnly. A character in the format "HH:MM:SS", specifying the closing hour, minute and second of an exchange.
rmoutliersmaxi: argument to be passed on to the cleaning routine rmOutliersQuotes.
printExchange: argument passed to autoSelectExchangeQuotes; indicates whether the chosen exchange is printed on the console. The default is TRUE. This is only used when exchanges is "auto".
saveAsXTS: indicates whether data should be saved in xts format instead of data.table when using the on-disk functionality. FALSE by default, which means the data is saved as data.table.
tz: fallback time zone used in case we are unable to identify the timezone of the data; by default, tz = NULL. With the non-disk functionality, we attempt to extract the timezone from the DT column (or index) of the data, which may fail.
In case of failure we use tz if specified, and if it is not specified, we use "UTC".
In the on-disk functionality, if tz is not specified, the timezone used will be the system default.
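The fallback order described for tz in the non-disk case can be sketched in base R as follows. The helper resolveTz is hypothetical and not part of the package; it assumes the timestamps are POSIXct values whose timezone, if any, is stored in the "tzone" attribute.

```r
# Hypothetical helper (not part of the package) illustrating the fallback
# order: timezone found in the data, then the tz argument, then "UTC".
resolveTz <- function(timestamps, tz = NULL) {
  found <- attr(timestamps, "tzone")
  if (!is.null(found) && length(found) > 0 && nzchar(found[1])) {
    return(found[1])
  }
  if (!is.null(tz)) {
    return(tz)
  }
  "UTC"
}

# Timestamps that carry a timezone, and timestamps that do not.
withTz <- as.POSIXct("2020-01-02 09:30:00", tz = "America/New_York")
noTz   <- as.POSIXct("2020-01-02 09:30:00")

resolveTz(withTz)                         # timezone taken from the data
resolveTz(noTz, tz = "Europe/Brussels")   # falls back to the tz argument
resolveTz(noTz)                           # falls back to "UTC"
```

Setting tz explicitly is the simplest way to avoid surprises when the input timestamps carry no timezone information.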
Jonathan Cornelissen, Kris Boudt, Onno Kleen, and Emil Sjoerup.
Using the on-disk functionality with .csv.zip files, which is the standard from the WRDS database, will write temporary files on your machine. We try to clean up afterwards, but cannot guarantee that no files will slip through the cracks if the permission settings on your machine do not match ours.
If the input data.table does not contain a DT column but it does contain DATE and TIME_M columns, we create the DT column by REFERENCE, altering the data.table that may be in the user's environment!
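The by-reference behaviour can be illustrated with a small sketch that assumes the data.table package is installed. The function addDT below is a hypothetical stand-in that adds a DT column with `:=`; it is not the package's own code, and the toy data is made up. Working on a copy() of the input leaves the original object untouched.

```r
library(data.table)

# Toy raw quotes with DATE and TIME_M but no DT column (made-up values).
raw <- data.table(
  DATE   = c("2020-01-02", "2020-01-02"),
  TIME_M = c("09:30:00", "09:30:01"),
  BID    = c(100.00, 100.01),
  OFR    = c(100.02, 100.03)
)

# Hypothetical stand-in for a function that adds DT by reference.
addDT <- function(dt) {
  dt[, DT := as.POSIXct(paste(DATE, TIME_M), tz = "UTC")]
  invisible(dt)
}

# A deep copy protects the original from modification by reference.
protected <- copy(raw)
addDT(protected)

"DT" %in% names(protected)  # the copy gained the DT column
"DT" %in% names(raw)        # the original did not
```

In the same spirit, one could pass a copy of the data as qDataRaw when the original data.table must stay unchanged.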
Barndorff-Nielsen, O. E., Hansen, P. R., Lunde, A., and Shephard, N. (2009). Realized kernels in practice: Trades and quotes. Econometrics Journal 12, C1-C32.
Brownlees, C.T. and Gallo, G.M. (2006). Financial econometric analysis at ultra-high frequency: Data handling concerns. Computational Statistics & Data Analysis, 51, pages 2232-2245.
Falkenberry, T.N. (2002). High frequency data filtering. Unpublished technical report.
data.table::setDTthreads(2)
# Consider you have raw quote data for 1 stock for 2 days
head(sampleQDataRaw)
dim(sampleQDataRaw)
qDataAfterCleaning <- quotesCleanup(qDataRaw = sampleQDataRaw, exchanges = "N")
qDataAfterCleaning$report
dim(qDataAfterCleaning$qData)
# In case you have more data it is advised to use the on-disk functionality
# via "dataSource" and "dataDestination" arguments