flag-L-F: Fast Lags and Leads for Time-Series and Panel Data

Description

flag is an S3 generic to compute (sequences of) lags and leads. L and F are wrappers around flag representing the lag- and lead-operators, such that L(x,-1) = F(x,1) = F(x) and L(x,-3:3) = F(x,3:-3). L & F provide more flexibility than flag when applied to data frames (i.e. column subsetting, formula input and id-variable-preservation capabilities...), but are otherwise identical.

(flag is more of a programmer's function in the style of the Fast Statistical Functions, while L & F are more practical to use in regression formulas or for computations on data frames.)
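The operator identities above can be checked on a small vector. This is a minimal sketch: the vector x is made up for illustration, and the lead is written as a negative lag so that the example relies only on L and flag.

```r
library(collapse)

x <- c(10, 20, 30, 40, 50)    # illustrative data, not from the package

lag1  <- L(x)                 # one lag: values move one position down
lead1 <- L(x, -1)             # a negative lag is a lead, i.e. F(x, 1)

print(lag1)                   # NA 10 20 30 40
print(lead1)                  # 20 30 40 50 NA

# On a plain vector, flag() and the L wrapper should agree
print(all.equal(flag(x, 1), lag1))
print(all.equal(flag(x, -1), lead1))
```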

Usage

flag(x, n = 1, …)
L(x, n = 1, …)
F(x, n = 1, …)

# S3 method for default
flag(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, …)
# S3 method for default
L(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, …)
# S3 method for default
F(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, …)

# S3 method for matrix
flag(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, …)
# S3 method for matrix
L(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, …)
# S3 method for matrix
F(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, …)

# S3 method for data.frame
flag(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, …)
# S3 method for data.frame
L(x, n = 1, by = NULL, t = NULL, cols = is.numeric,
  fill = NA, stubs = TRUE, keep.ids = TRUE, …)
# S3 method for data.frame
F(x, n = 1, by = NULL, t = NULL, cols = is.numeric,
  fill = NA, stubs = TRUE, keep.ids = TRUE, …)

# Methods for compatibility with plm:
# S3 method for pseries
flag(x, n = 1, fill = NA, stubs = TRUE, …)
# S3 method for pseries
L(x, n = 1, fill = NA, stubs = TRUE, …)
# S3 method for pseries
F(x, n = 1, fill = NA, stubs = TRUE, …)

# S3 method for pdata.frame
flag(x, n = 1, fill = NA, stubs = TRUE, …)
# S3 method for pdata.frame
L(x, n = 1, cols = is.numeric, fill = NA, stubs = TRUE,
  keep.ids = TRUE, …)
# S3 method for pdata.frame
F(x, n = 1, cols = is.numeric, fill = NA, stubs = TRUE,
  keep.ids = TRUE, …)

# Methods for compatibility with dplyr:
# S3 method for grouped_df
flag(x, n = 1, t = NULL, fill = NA, stubs = TRUE, keep.ids = TRUE, …)
# S3 method for grouped_df
L(x, n = 1, t = NULL, fill = NA, stubs = TRUE, keep.ids = TRUE, …)
# S3 method for grouped_df
F(x, n = 1, t = NULL, fill = NA, stubs = TRUE, keep.ids = TRUE, …)

Arguments

x

a vector, matrix, data.frame, panel-series (plm::pseries), panel-data.frame (plm::pdata.frame) or grouped tibble (dplyr::grouped_df). Note: Data need not be numeric.

n

an integer vector indicating the lags/leads to compute.

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

by

data.frame method: Same as g, but also allows one- or two-sided formulas, e.g. ~ group1 or var1 + var2 ~ group1 + group2. See Examples.

t

same input as g, to indicate the time-variable. For safe computation of lags/leads on unordered time-series and panels. Note: The data.frame method also allows a one-sided formula, e.g. ~time, and the grouped_df method also allows lazy evaluation, e.g. time (no quotes).

cols

data.frame method: Select columns to lag/lead using a function, column names or indices. Default: All numeric variables. Note: cols is ignored if a two-sided formula is passed to by.

fill

value to insert when vectors are shifted. Default is NA.

stubs

logical. TRUE will rename all lagged / leaded columns by adding a stub or prefix "Ln." / "Fn.".

keep.ids

data.frame / pdata.frame / grouped_df methods: Logical. If TRUE (the default), panel-identifiers are kept in the output; if FALSE, all panel-identifiers are dropped (which includes all variables passed to by or t). Note: For panel-data.frames and grouped tibbles, keep.ids = FALSE drops the identifier columns, but the 'index' / 'groups' attributes are kept.

…

arguments to be passed to or from other methods.
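The data.frame arguments described above can be tried on a toy panel. This is only a sketch: the data frame d and its columns id, time, y and z are hypothetical and not part of the package.

```r
library(collapse)

# Hypothetical two-unit panel, already ordered by id and time
d <- data.frame(id   = rep(c("a", "b"), each = 3),
                time = rep(1:3, 2),
                y    = c(1, 2, 3, 10, 20, 30),
                z    = c(5, 6, 7, 50, 60, 70))

# by and t as one-sided formulas; identifiers are kept (keep.ids = TRUE is the default)
res_all <- L(d, 1, ~ id, ~ time)
print(res_all)

# A two-sided formula selects the column to lag; keep.ids = FALSE drops id and time
res_y <- L(d, 1, y ~ id, ~ time, keep.ids = FALSE)
print(res_y)

# cols selects columns by name, fill replaces the default NA
res_z <- L(d, 1, ~ id, ~ time, cols = "z", fill = 0, keep.ids = FALSE)
print(res_z[[1]])                # 0 5 6 0 50 60
```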

Value

x lagged / leaded n-times, grouped by g/by, ordered by t. See Details and Examples.

Details

If a single integer is passed to n, and g/by and t are left empty, flag/L/F just returns x with all columns lagged / leaded by n. If length(n)>1, and x is an atomic vector, flag/L/F returns a matrix with lags / leads computed in the same order as passed to n. If instead x is a matrix / data.frame, a matrix / data.frame with ncol(x)*length(n) columns is returned where columns are sorted first by variable and then by lag (so all lags computed on a variable are grouped together). x can be of any standard data type.
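A short sketch of the output shapes described in this paragraph, using made-up inputs (x and df are illustrative only):

```r
library(collapse)

x <- 1:5                               # illustrative vector
m <- flag(x, -1:1)                     # one lead, the series itself, one lag
print(dim(m))                          # 5 3
print(m)                               # columns follow the order of n

df  <- data.frame(a = 1:3, b = 4:6)    # illustrative data frame
res <- flag(df, 0:1)                   # each column plus its first lag
print(ncol(res))                       # 4, i.e. ncol(df) * length(n)
print(names(res))                      # grouped by variable, then by lag
```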

With groups/panel-identifiers supplied to g/by, flag/L/F efficiently computes a panel-lag by shifting the entire vector(s) but inserting fill elements in the right places. If t is left empty, the data needs to be ordered such that all values belonging to a group are consecutive and in the right order. It is not necessary that the groups themselves occur in the right order. If a time-variable is supplied to t (or a list of time-variables uniquely identifying the time-dimension), the panel is fully identified and lags / leads can be securely computed even if the data is completely unordered: data is shifted around and fill values are inserted in such a way that, if the data were sorted afterwards, the result would be identical to computing lags / leads on sorted data. Internally this works by using the grouping- and time-variable(s) to create an ordering and then accessing the panel-vector(s) through this ordering. If the data is only slightly unordered, such computations are nearly as fast as computations on ordered data (without t); if the data is very unordered, they can take significantly longer. Since most panel data come perfectly or almost perfectly ordered, I recommend always supplying t to be on the safe side.
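The following sketch illustrates the claim about unordered panels. The toy panel d and the row permutation are assumptions made for the example. Supplying t on shuffled data should give, after sorting, the same result as lagging the ordered data.

```r
library(collapse)

# Hypothetical panel, ordered by id and time
d <- data.frame(id   = rep(c("a", "b"), each = 3),
                time = rep(1:3, 2),
                y    = c(1, 2, 3, 10, 20, 30))

lag_ordered <- flag(d$y, 1, g = d$id)              # no t needed: data is ordered
print(lag_ordered)                                 # NA 1 2 NA 10 20

s <- d[c(5, 1, 6, 3, 2, 4), ]                      # shuffle the rows
lag_shuffled <- flag(s$y, 1, g = s$id, t = s$time) # t identifies the time order
print(lag_shuffled)                                # 10 NA 20 2 1 NA

o <- order(s$id, s$time)                           # sort the result afterwards
print(lag_shuffled[o])                             # NA 1 2 NA 10 20
```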

It is also possible to compute lags / leads on unordered time-series (thus utilizing t but leaving g/by empty), although this is probably more rare to encounter than unordered panels.
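A minimal sketch of this case, with a made-up series x and time index tm stored out of order:

```r
library(collapse)

x  <- c(30, 20, 10)          # illustrative values, stored out of time order
tm <- c(3L, 2L, 1L)          # hypothetical time index belonging to each value

lag_ts <- flag(x, 1, t = tm) # lag along the time dimension, no grouping
print(lag_ts)                # 20 10 NA
```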

The methods applying to plm objects (panel-series and panel-data.frames) automatically utilize the panel-identifiers attached to these objects and thus securely compute fully identified panel-lags. If these objects have > 2 panel-identifiers attached to them, the last identifier is assumed to be the time-variable, and the others are taken as grouping-variables and interacted. I note that flag/L/F is significantly faster than plm::lag/plm::lead since the latter is written in R and based on a Split-Apply-Combine logic.

Examples

# NOT RUN {
## Simple Time-Series: AirPassengers
L(AirPassengers)                      # 1 lag
F(AirPassengers)                      # 1 lead

all_identical(L(AirPassengers),       # 3 identical ways of computing 1 lag
              flag(AirPassengers),
              F(AirPassengers,-1))

L(AirPassengers,-1:3)                 # 1 lead, the series itself and 3 lags - output as matrix

## Time-Series Matrix of 4 EU Stock Market Indicators, 1991-1998
tsp(EuStockMarkets)                                     # Data is recorded on 260 days per year
freq <- frequency(EuStockMarkets)
plot(stl(EuStockMarkets[,"DAX"], freq))                 # There is some obvious seasonality
L(EuStockMarkets,-1:3*freq)                             # 1 annual lead, the series itself and 3 annual lags
summary(lm(DAX ~., data = L(EuStockMarkets,-1:3*freq))) # DAX regressed on its own annual lead,
                                                        # lags and the lead/lags of the other series

## World Development Panel Data
head(flag(wlddev, 1, wlddev$iso3c, wlddev$year))        # This lags all variables,
head(L(wlddev, 1, ~iso3c, ~year))                       # This lags all numeric variables
head(L(wlddev, 1, ~iso3c))                              # Without t: Works because data is ordered
head(L(wlddev, 1, PCGDP + LIFEEX ~ iso3c, ~year))       # This lags GDP per Capita & Life Expectancy
head(L(wlddev, 0:2, ~ iso3c, ~year, cols = 9:10))       # Same, also retaining original series
head(L(wlddev, 1:2, PCGDP + LIFEEX ~ iso3c, ~year,      # Two lags, dropping id columns
       keep.ids = FALSE))

# Different ways of regressing GDP on its lags and life expectancy and its lags
summary(lm(PCGDP ~ ., L(wlddev, 0:2, ~iso3c, ~year, 9:10, keep.ids = FALSE)))     # 1 - Precomputing
summary(lm(PCGDP ~ L(PCGDP,1:2,iso3c,year) + L(LIFEEX,0:2,iso3c,year), wlddev))   # 2 - Ad-hoc
summary(lm(PCGDP ~ L(PCGDP,1:2,iso3c) + L(LIFEEX,0:2,iso3c), wlddev))             # 3 - same no year
g = qF(wlddev$iso3c); t = qF(wlddev$year)                                         # 4 - Precomputing
summary(lm(PCGDP ~ L(PCGDP,1:2,g,t) + L(LIFEEX,0:2,g,t), wlddev))                 # panel-ids

## Using plm:
pwlddev <- plm::pdata.frame(wlddev, index = c("iso3c","year"))
head(L(pwlddev, 0:2, 9:10))                                     # Again 2 lags of GDP and LIFEEX
PCGDP <- pwlddev$PCGDP                                          # A panel-Series of GDP per Capita
L(PCGDP)                                                        # Lagging the panel series
summary(lm(PCGDP ~ ., L(pwlddev, 0:2, 9:10, keep.ids = FALSE))) # Running the lm again: WORKS!
# THIS DOES NOT WORK: Unfortunately lm drops the attributes of the columns,
# so L.default is used here and ordinary lags are computed. (with and attach don't retain attr.)
summary(lm(PCGDP ~ L(PCGDP,1:2) + L(LIFEEX,0:2), pwlddev))
LIFEEX <- pwlddev$LIFEEX                                        # To make it work, create pseries
summary(lm(PCGDP ~ L(PCGDP,1:2) + L(LIFEEX,0:2)))               # THIS WORKS !!

## Using dplyr:
library(dplyr)
wlddev %>% group_by(iso3c) %>% select(PCGDP,LIFEEX) %>% L(0:2)
wlddev %>% group_by(iso3c) %>% select(year,PCGDP,LIFEEX) %>% L(0:2,year) # Also using t (safer)
# }
