flag: Fast Lags and Leads for Time Series and Panel Data

Description

flag is an S3 generic to compute (sequences of) lags and leads. L and F are wrappers around flag representing the lag- and lead-operators, such that L(x,-1) = F(x,1) = F(x) and L(x,-3:3) = F(x,3:-3). L and F provide more flexibility than flag when applied to data frames (i.e. column subsetting, formula input and id-variable-preservation capabilities…), but are otherwise identical.

(flag is more of a programmer's function in the style of the Fast Statistical Functions, while L and F are more practical to use in regression formulas or for computations on data frames.)

Usage

flag(x, n = 1, …)
L(x, n = 1, …)
F(x, n = 1, …)
# S3 method for default
flag(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, …)
# S3 method for default
L(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, …)
# S3 method for default
F(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, …)
# S3 method for matrix
flag(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = length(n) > 1L, …)
# S3 method for matrix
L(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, …)
# S3 method for matrix
F(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = TRUE, …)
# S3 method for data.frame
flag(x, n = 1, g = NULL, t = NULL, fill = NA, stubs = length(n) > 1L, …)
# S3 method for data.frame
L(x, n = 1, by = NULL, t = NULL, cols = is.numeric,
  fill = NA, stubs = TRUE, keep.ids = TRUE, …)
# S3 method for data.frame
F(x, n = 1, by = NULL, t = NULL, cols = is.numeric,
  fill = NA, stubs = TRUE, keep.ids = TRUE, …)
# Methods for compatibility with plm:
# S3 method for pseries
flag(x, n = 1, fill = NA, stubs = TRUE, …)
# S3 method for pseries
L(x, n = 1, fill = NA, stubs = TRUE, …)
# S3 method for pseries
F(x, n = 1, fill = NA, stubs = TRUE, …)
# S3 method for pdata.frame
flag(x, n = 1, fill = NA, stubs = length(n) > 1L, …)
# S3 method for pdata.frame
L(x, n = 1, cols = is.numeric, fill = NA, stubs = TRUE,
  keep.ids = TRUE, …)
# S3 method for pdata.frame
F(x, n = 1, cols = is.numeric, fill = NA, stubs = TRUE,
  keep.ids = TRUE, …)
# Methods for grouped data frame / compatibility with dplyr:
# S3 method for grouped_df
flag(x, n = 1, t = NULL, fill = NA, stubs = length(n) > 1L, keep.ids = TRUE, …)
# S3 method for grouped_df
L(x, n = 1, t = NULL, fill = NA, stubs = TRUE, keep.ids = TRUE, …)
# S3 method for grouped_df
F(x, n = 1, t = NULL, fill = NA, stubs = TRUE, keep.ids = TRUE, …)

Arguments

x

a vector / time series, (time series) matrix, data frame, panel series (plm::pseries), panel data frame (plm::pdata.frame) or grouped data frame (class 'grouped_df'). Data need not be numeric, i.e. you can also lag a date variable, character data etc.

n

integer. A vector indicating the lags / leads to compute (passing negative integers to flag or L computes leads, passing negative integers to F computes lags).

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

by

data.frame method: Same as g, but also allows one- or two-sided formulas, i.e. ~ group1 or var1 + var2 ~ group1 + group2. See Examples.

t

same input as g/by, to indicate the time-variable(s). For safe computation of lags / leads on unordered time series and panels. The data.frame method also allows a one-sided formula, i.e. ~time. The grouped_df method supports lazy evaluation, i.e. time (no quotes).

cols

data.frame method: Select columns to lag / lead using a function, column names, indices or a logical vector. Default: All numeric variables. Note: cols is ignored if a two-sided formula is passed to by.

fill

value to insert when vectors are shifted. Default is NA.

stubs

logical. TRUE will rename all lagged / leaded columns by adding a stub or prefix "Ln." / "Fn.".

keep.ids

data.frame / pdata.frame / grouped_df methods: Logical. TRUE (the default) keeps all panel-identifiers in the output; FALSE drops them (which includes all variables passed to by or t). Note: For grouped / panel data frames identifiers are dropped, but the 'groups' / 'index' attributes are kept.

…

arguments to be passed to or from other methods.
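
The following is a minimal sketch of how fill, stubs and keep.ids behave, assuming the collapse package is installed. The data frame d is a hypothetical toy panel made up for illustration, not part of the package.

```r
library(collapse)

# Hypothetical toy panel: two individuals, two periods each
d <- data.frame(id   = c("a", "a", "b", "b"),
                year = c(1, 2, 1, 2),
                y    = c(1, 2, 10, 20))

# fill: the value inserted where no lagged observation exists
flag(d$y, 1, g = d$id, fill = 0)                  # 0 1 0 10

# Data frame method of L: identifiers are kept and the lagged column gets a stub
names(L(d, 1, y ~ id, ~year))

# keep.ids = FALSE returns only the lagged column
names(L(d, 1, y ~ id, ~year, keep.ids = FALSE))

# stubs = FALSE leaves the name of the lagged column unchanged
names(L(d, 1, y ~ id, ~year, stubs = FALSE))
```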

Value

x lagged / leaded n-times, grouped by g/by, ordered by t. See Details and Examples.

Details

If a single integer is passed to n, and g/by and t are left empty, flag/L/F just returns x with all columns lagged / leaded by n. If length(n)>1, and x is an atomic vector (time series), flag/L/F returns a (time series) matrix with lags / leads computed in the same order as passed to n. If instead x is a matrix / data frame, a matrix / data frame with ncol(x)*length(n) columns is returned where columns are sorted first by variable and then by lag (so all lags computed on a variable are grouped together). x can be of any standard data type.
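
As a small illustration of the shapes described above (a sketch on made-up data, assuming the collapse package is installed):

```r
library(collapse)

x <- c(10, 20, 30, 40, 50)

flag(x)             # one lag:  NA 10 20 30 40
flag(x, -1)         # one lead: 20 30 40 50 NA

# Several lags / leads at once give a matrix with one column per element of n,
# in the order in which they were requested (here: lead, original, lag)
m <- flag(x, -1:1)
dim(m)              # 5 3

# For a matrix every column is expanded, and all lags of one variable
# are placed next to each other
X <- cbind(a = 1:5, b = 6:10)
dim(flag(X, 0:2))   # 5 6, i.e. ncol(X) * length(n) columns
```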

With groups/panel-identifiers supplied to g/by, flag/L/F efficiently computes a panel-lag/lead by shifting the entire vector(s) but inserting fill elements in the right places. If t is left empty, the data needs to be ordered such that all values belonging to a group are consecutive and in the right order. It is not necessary that the groups themselves occur in the right order. If a time-variable is supplied to t (or a list of time-variables uniquely identifying the time-dimension), the panel is fully identified and lags / leads can be securely computed even if the data is unordered.
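
The difference between supplying only g and supplying both g and t can be sketched on a hypothetical toy panel (the data frame d and its shuffled copy s are made up for illustration; collapse is assumed to be installed):

```r
library(collapse)

# Hypothetical toy panel: two individuals observed in years 1 to 3
d <- data.frame(id   = c("a", "a", "a", "b", "b", "b"),
                year = c(1, 2, 3, 1, 2, 3),
                y    = c(1, 2, 3, 10, 20, 30))

# Rows are consecutive and ordered within each group, so g alone suffices.
# The first observation of every group receives the fill value.
flag(d$y, 1, g = d$id)                        # NA 1 2 NA 10 20

# After shuffling the rows, t is needed to identify the time dimension
s     <- d[c(5, 1, 6, 3, 2, 4), ]
lag_s <- flag(s$y, 1, g = s$id, t = s$year)

# Putting the result back into panel order reproduces the lag from above
lag_s[order(s$id, s$year)]                    # NA 1 2 NA 10 20
```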

It is also possible to lag unordered or irregular time series utilizing only the t argument to identify the temporal dimension of the data.

Since v1.5.0 flag/L/F provide full built-in support for irregular time series and unbalanced panels. The suggested workaround using the seqid function is therefore no longer necessary.
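
A minimal sketch of lagging an irregular and an unordered series through t alone, assuming collapse version 1.5.0 or later. The vectors tt and x are hypothetical; period 3 is deliberately missing.

```r
library(collapse)

# Hypothetical irregular series: there is no observation for period 3
tt <- c(1, 2, 4, 5)
x  <- c(10, 20, 40, 50)

# Without t the series is treated as regular and simply shifted by one position
flag(x, 1)                      # NA 10 20 40

# With t the lag of period 4 is period 3, which does not exist, so fill is used
flag(x, 1, t = tt)              # NA 10 NA 40

# The same works when the observations are not in temporal order
idx <- c(3, 1, 4, 2)
flag(x[idx], 1, t = tt[idx])    # NA NA 40 10
```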

Computationally, if both g/by and t are supplied, flag/L/F uses two initial passes to create an ordering through which the data are accessed. First pass: calculate the minimum and maximum time-value for each individual. Second pass: generate the ordering by placing the current element index into the vector slot obtained by adding the cumulative group size to the current time-value minus its individual minimum. This method of computation is faster than any sort-based method and delivers optimal performance if the panel-id supplied to g/by is already a factor variable, and if t is either an integer or factor variable. If t is not a factor or integer but instead is.double(t) && !is.object(t), it is assumed to be an integer represented as double and is converted using as.integer(t). For other objects such as dates, t is grouped using qG or GRP (for multiple time identifiers). Similarly, if g/by is not a factor or 'GRP' object, qG or GRP will be called to group the respective identifier. Since grouping is more expensive than computing lags, prepare the data for optimal performance (or use plm classes). See also the Note.
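
The two-pass idea can be sketched in base R. This is an illustration only, not the package's compiled implementation: panel_ordering and panel_lag are hypothetical helper names, group ids are assumed to be integers 1..G with every id present, time values are assumed to be integers, and the number of slots reserved per individual is taken to be its time range (an assumption, so that gaps in time stay empty).

```r
# Build the access ordering in two passes, without sorting
panel_ordering <- function(g, t) {
  ng   <- max(g)
  tmin <- rep(.Machine$integer.max, ng)
  tmax <- rep(-.Machine$integer.max, ng)
  # First pass: minimum and maximum time value of each individual
  for (i in seq_along(g)) {
    tmin[g[i]] <- min(tmin[g[i]], t[i])
    tmax[g[i]] <- max(tmax[g[i]], t[i])
  }
  span   <- tmax - tmin + 1L                    # slots reserved per individual
  offset <- cumsum(c(0L, span))[seq_len(ng)]    # cumulative size of preceding groups
  # Second pass: place each row index into slot offset + (time - individual minimum)
  ord <- rep(NA_integer_, sum(span))
  for (i in seq_along(g)) {
    ord[offset[g[i]] + t[i] - tmin[g[i]] + 1L] <- i
  }
  list(ord = ord, offset = offset, span = span)
}

# Use the ordering to look up the observation n periods earlier (or later if n < 0)
panel_lag <- function(x, g, t, n = 1L, fill = NA) {
  o   <- panel_ordering(g, t)
  out <- rep(fill, length(x))
  for (grp in seq_along(o$span)) {
    slots <- o$offset[grp] + seq_len(o$span[grp])
    for (k in seq_along(slots)) {
      i   <- o$ord[slots[k]]                    # row sitting in the current slot
      src <- k - n                              # slot holding its lag / lead
      if (is.na(i) || src < 1L || src > length(slots)) next
      j <- o$ord[slots[src]]
      if (!is.na(j)) out[i] <- x[j]             # empty slots keep the fill value
    }
  }
  out
}

# Unordered toy panel: individual 1 and 2, periods 1 to 3
id   <- c(2L, 1L, 2L, 1L, 1L, 2L)
time <- c(2L, 1L, 3L, 3L, 2L, 1L)
y    <- c(20, 1, 30, 3, 2, 10)

panel_ordering(id, time)$ord   # 2 5 4 6 1 3, the same as order(id, time) here
panel_lag(y, id, time)         # 10 NA 20 2 1 NA
```

Each row is visited a fixed number of times and no comparison sort is needed, which is the reason the text gives for this approach being faster than sort-based methods.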

The methods applying to plm objects (panel series and panel data frames) automatically utilize the factor panel-identifiers attached to these objects and thus securely and efficiently compute fully identified panel-lags. If these objects have > 2 panel-identifiers attached to them, the last identifier is assumed to be the time-variable, and the others are taken as grouping-variables and interacted. Note that flag/L/F is significantly faster than plm::lag/plm::lead since the latter is written in R and based on a Split-Apply-Combine logic.

Examples

# NOT RUN {
library(collapse)                     # provides flag, L, F and the wlddev data used below

## Simple Time Series: AirPassengers
L(AirPassengers)                      # 1 lag
F(AirPassengers)                      # 1 lead

all_identical(L(AirPassengers),       # 3 identical ways of computing 1 lag
              flag(AirPassengers),
              F(AirPassengers, -1))

head(L(AirPassengers, -1:3))          # 1 lead and 3 lags - output as matrix

## Time Series Matrix of 4 EU Stock Market Indicators, 1991-1998
tsp(EuStockMarkets)                                     # Data is recorded on 260 days per year
freq <- frequency(EuStockMarkets)
plot(stl(EuStockMarkets[,"DAX"], freq))                 # There is some obvious seasonality
head(L(EuStockMarkets, -1:3 * freq))                    # 1 annual lead and 3 annual lags
summary(lm(DAX ~., data = L(EuStockMarkets,-1:3*freq))) # DAX regressed on its own annual lead,
                                                        # lags and the leads / lags of the other series

## World Development Panel Data
head(flag(wlddev, 1, wlddev$iso3c, wlddev$year))        # This lags all variables
head(L(wlddev, 1, ~iso3c, ~year))                       # This lags all numeric variables
head(L(wlddev, 1, ~iso3c))                              # Without t: Works because data is ordered
head(L(wlddev, 1, PCGDP + LIFEEX ~ iso3c, ~year))       # This lags GDP per Capita & Life Expectancy
head(L(wlddev, 0:2, ~ iso3c, ~year, cols = 9:10))       # Same, also retaining original series
head(L(wlddev, 1:2, PCGDP + LIFEEX ~ iso3c, ~year,      # Two lags, dropping id columns
       keep.ids = FALSE))

# Different ways of regressing GDP on its lags and life expectancy and its lags
summary(lm(PCGDP ~ ., L(wlddev, 0:2, ~iso3c, ~year, 9:10, keep.ids = FALSE)))     # 1 - Precomputing
summary(lm(PCGDP ~ L(PCGDP,1:2,iso3c,year) + L(LIFEEX,0:2,iso3c,year), wlddev))   # 2 - Ad-hoc
summary(lm(PCGDP ~ L(PCGDP,1:2,iso3c) + L(LIFEEX,0:2,iso3c), wlddev))             # 3 - same no year
g = qF(wlddev$iso3c); t = qF(wlddev$year)                                         # 4 - Precomputing
summary(lm(PCGDP ~ L(PCGDP,1:2,g,t) + L(LIFEEX,0:2,g,t), wlddev))                 # panel ids
# }
# NOT RUN {
## Using plm:
pwlddev <- plm::pdata.frame(wlddev, index = c("iso3c","year"))
head(L(pwlddev, 0:2, 9:10))                                     # Again 2 lags of GDP and LIFEEX
PCGDP <- pwlddev$PCGDP                                          # A panel-Series of GDP per Capita
head(L(PCGDP))                                                  # Lagging the panel series
summary(lm(PCGDP ~ ., L(pwlddev, 0:2, 9:10, keep.ids = FALSE))) # Running the lm again
# This does not compute panel lags: a pseries is only created when subsetting the pdata.frame using $ or [[
summary(lm(PCGDP ~ L(PCGDP,1:2) + L(LIFEEX,0:2), pwlddev))      # ..so L.default is used here..
LIFEEX <- pwlddev$LIFEEX                                        # To make it work, create pseries
summary(lm(PCGDP ~ L(PCGDP,1:2) + L(LIFEEX,0:2)))               # THIS WORKS !

## Using dplyr:
library(dplyr)
wlddev %>% group_by(iso3c) %>% select(PCGDP,LIFEEX) %>% L(0:2)
wlddev %>% group_by(iso3c) %>% select(year,PCGDP,LIFEEX) %>% L(0:2,year) # Also using t (safer)
# }
