fdiff: Fast (Quasi-, Log-) Differences for Time Series and Panel Data

Description

fdiff is an S3 generic to compute (sequences of) suitably lagged / leaded and iterated differences, quasi-differences, log-differences or quasi-log-differences. The difference and log-difference operators D and Dlog also exist as parsimonious wrappers around fdiff. Apart from being more parsimonious, they provide more flexibility than fdiff when applied to data frames.

Usage

fdiff(x, n = 1, diff = 1, …)
      D(x, n = 1, diff = 1, …)
   Dlog(x, n = 1, diff = 1, …)
# S3 method for default
fdiff(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, log = FALSE, rho = 1,
      stubs = TRUE, …)
# S3 method for default
D(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, rho = 1,
  stubs = TRUE, …)
# S3 method for default
Dlog(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, rho = 1, stubs = TRUE, …)
# S3 method for matrix
fdiff(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, log = FALSE, rho = 1,
      stubs = length(n) + length(diff) > 2L, …)
# S3 method for matrix
D(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, rho = 1,
  stubs = TRUE, …)
# S3 method for matrix
Dlog(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, rho = 1, stubs = TRUE, …)
# S3 method for data.frame
fdiff(x, n = 1, diff = 1, g = NULL, t = NULL, fill = NA, log = FALSE, rho = 1,
      stubs = length(n) + length(diff) > 2L, …)
# S3 method for data.frame
D(x, n = 1, diff = 1, by = NULL, t = NULL, cols = is.numeric,
  fill = NA, rho = 1, stubs = TRUE, keep.ids = TRUE, …)
# S3 method for data.frame
Dlog(x, n = 1, diff = 1, by = NULL, t = NULL, cols = is.numeric,
     fill = NA, rho = 1, stubs = TRUE, keep.ids = TRUE, …)
# Methods for compatibility with plm:
# S3 method for pseries
fdiff(x, n = 1, diff = 1, fill = NA, log = FALSE, rho = 1, stubs = TRUE, …)
# S3 method for pseries
D(x, n = 1, diff = 1, fill = NA, rho = 1, stubs = TRUE, …)
# S3 method for pseries
Dlog(x, n = 1, diff = 1, fill = NA, rho = 1, stubs = TRUE, …)
# S3 method for pdata.frame
fdiff(x, n = 1, diff = 1, fill = NA, log = FALSE, rho = 1,
      stubs = length(n) + length(diff) > 2L, …)
# S3 method for pdata.frame
D(x, n = 1, diff = 1, cols = is.numeric, fill = NA, rho = 1, stubs = TRUE,
  keep.ids = TRUE, …)
# S3 method for pdata.frame
Dlog(x, n = 1, diff = 1, cols = is.numeric, fill = NA, rho = 1, stubs = TRUE,
     keep.ids = TRUE, …)
# Methods for compatibility with dplyr:
# S3 method for grouped_df
fdiff(x, n = 1, diff = 1, t = NULL, fill = NA, log = FALSE, rho = 1,
      stubs = length(n) + length(diff) > 2L, keep.ids = TRUE, …)
# S3 method for grouped_df
D(x, n = 1, diff = 1, t = NULL, fill = NA, rho = 1, stubs = TRUE,
  keep.ids = TRUE, …)
# S3 method for grouped_df
Dlog(x, n = 1, diff = 1, t = NULL, fill = NA, rho = 1, stubs = TRUE,
     keep.ids = TRUE, …)

Arguments

x

a numeric vector / time series, (time series) matrix, data frame, panel series (plm::pseries), panel data frame (plm::pdata.frame) or grouped tibble (dplyr::grouped_df).

n

integer. A vector indicating the number of lags or leads.

diff

integer. A vector of integers >= 1 indicating the order of differencing / log-differencing.

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

by

data.frame method: Same as g, but also allows one- or two-sided formulas, e.g. ~ group1 or var1 + var2 ~ group1 + group2. See Examples.

t

same input as g/by, to indicate the time-variable. For safe computation of differences on unordered time series and panels. The data.frame method also allows a one-sided formula, e.g. ~time. The grouped_df method supports lazy evaluation, e.g. time (no quotes).

cols

data.frame method: Select columns to difference using a function, column names, indices or a logical vector. Default: All numeric variables. Note: cols is ignored if a two-sided formula is passed to by.

fill

value to insert when vectors are shifted. Default is NA.

log

logical. TRUE computes log-differences instead of differences. See Details.

rho

double. Autocorrelation parameter. Set to a value between 0 and 1 for quasi-differencing. Any numeric value can be supplied, however.

stubs

logical. TRUE will rename all differenced columns by adding prefixes "LnDdiff." / "FnDdiff." for differences and "LnDlogdiff." / "FnDlogdiff." for log-differences, replacing "D" / "Dlog" with "QD" / "QDlog" for quasi-differences.

keep.ids

data.frame / pdata.frame / grouped_df methods: Logical. FALSE drops all panel-identifiers from the output (which includes all variables passed to by or t). Note: For panel data frames and grouped tibbles identifiers are dropped, but the 'index' / 'groups' attributes are kept.

…

arguments to be passed to or from other methods.

Value

x differenced diff times using lags n of itself. Quasi and log-differences are toggled by the rho and log arguments or the Dlog operator. Computations can be grouped by g/by and/or ordered by t. See Details and Examples.

Details

By default, fdiff/D/Dlog return x with all columns differenced / log-differenced. Differences are computed as repeat(diff) x[i] - rho*x[i-n], and log-differences as repeat(diff) log(x[i]) - rho*log(x[i-n]). If rho < 1, this becomes quasi- (or partial) differencing, which is a technique suggested by Cochrane and Orcutt (1949) to deal with serial correlation in regression models, where rho is typically estimated by running a regression of the model residuals on the lagged residuals. Setting diff = 2 returns differences of differences etc… and setting n = 2 returns simple differences computed by subtracting twice-lagged x from x. It is also possible to compute forward differences by passing negative n values. n also supports arbitrary vectors of integers (lags), and diff supports positive sequences of integers (differences):
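
As a rough illustration of the formula above, the following base R sketch applies x[i] - rho*x[i-n] repeatedly. The helper qdiff_sketch is hypothetical and not part of collapse. It assumes a single positive lag n, takes logs once before differencing when log = TRUE, and applies rho at every iteration. That is one literal reading of the formula, not a statement about the package's internals.

```r
# Hypothetical helper (not part of collapse): iterated (quasi-/log-) differences
# for a single positive lag n, following x[i] - rho * x[i - n].
qdiff_sketch <- function(x, n = 1L, diff = 1L, rho = 1, log = FALSE, fill = NA_real_) {
  stopifnot(n >= 1L, diff >= 1L, n < length(x))
  if (log) x <- base::log(x)              # assumption: logs are taken once, up front
  len <- length(x)
  for (k in seq_len(diff)) {              # repeat(diff)
    lagged <- c(rep(fill, n), x[seq_len(len - n)])  # x[i - n], padded with fill
    x <- x - rho * lagged
  }
  x
}

x <- c(1, 3, 6, 10, 15)
qdiff_sketch(x)              # NA  2  3  4  5
qdiff_sketch(x, diff = 2)    # NA NA  1  1  1
qdiff_sketch(x, n = 2)       # NA NA  5  7  9
qdiff_sketch(x, rho = 0.5)   # NA 2.5 4.5 7.0 10.0
```

With rho = 1 and log = FALSE the non-missing part of the result agrees with base R's diff(x, lag = n, differences = diff).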

If more than one value is passed to n and/or diff, the data is expanded wide as follows: If x is an atomic vector or time series, a (time series) matrix is returned with columns ordered first by lag, then by difference. If x is a matrix or data frame, each column is expanded in like manner such that the output has ncol(x)*length(n)*length(diff) columns ordered first by column name, then by lag, then by difference.
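
The wide expansion can be checked on a small made-up vector and matrix. This is a minimal sketch that assumes the collapse package is installed; the objects x, m, wide_vec and wide_mat are illustrative names only.

```r
library(collapse)

x <- c(2, 3, 5, 8, 13, 21, 34, 55)

# Vector input: two lags times two difference orders gives a 4-column matrix
wide_vec <- fdiff(x, n = 1:2, diff = 1:2)
dim(wide_vec)

# Matrix input: each of the 2 columns is expanded in the same way
m <- cbind(a = x, b = rev(x))
wide_mat <- fdiff(m, n = 1:2, diff = 1:2)
ncol(wide_mat)      # ncol(m) * length(n) * length(diff)
colnames(wide_mat)  # inspect the ordering: column, then lag, then difference
```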

With groups/panel-identifiers supplied to g/by, fdiff/D/Dlog efficiently compute panel-differences. If t is left empty, the data needs to be ordered such that all values belonging to a group are consecutive and in the right order. It is not necessary that the groups themselves occur in the right order. If time-variable(s) are supplied to t, the panel is fully identified and differences can be securely computed even if the data is completely unordered.
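
The effect of supplying t can be seen on a small made-up panel whose rows are deliberately shuffled. This is a minimal sketch assuming the collapse package is installed; the data frame d and the permutation are invented for illustration.

```r
library(collapse)

# Toy balanced panel: 2 individuals observed over 4 consecutive years
d <- data.frame(id   = rep(c("a", "b"), each = 4),
                time = rep(2001:2004, 2),
                y    = c(1, 2, 4, 7, 10, 20, 40, 70))

# Differences on the ordered data, grouped by id and ordered by time
d_ordered <- fdiff(d$y, g = d$id, t = d$time)

# Shuffle the rows with a fixed permutation and difference again
s <- d[c(6, 1, 8, 3, 5, 2, 7, 4), ]
d_shuffled <- fdiff(s$y, g = s$id, t = s$time)

# Restoring the (id, time) order recovers the result from the ordered data
all.equal(as.numeric(d_shuffled[order(s$id, s$time)]), as.numeric(d_ordered))
```

Without t, the shuffled input would be differenced in row order within each group, so the time-variable is what makes the computation safe here.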

fdiff/D/Dlog support balanced panels and unbalanced panels where individuals are observed over different time-sequences (start, end and duration of observation can all differ between individuals), but do not natively support irregularly spaced time series and panels. For computational details and efficiency considerations see the help page for flag. Irregular panels can be differenced with a work-around based on seqid.
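
One possible form of the seqid work-around, sketched for a single irregular series: seqid starts a new id whenever the integer time sequence is interrupted, and that id is then used as the grouping variable so that no difference is taken across a gap. The sketch assumes the collapse package is installed; the vectors time_id, x and run are made up for illustration.

```r
library(collapse)

time_id <- c(1L, 2L, 3L, 5L, 6L, 8L)   # gaps after 3 and after 6
x       <- c(1, 3, 6, 10, 15, 21)

# New id for every uninterrupted run of consecutive time values
run <- seqid(time_id)
as.integer(run)

# Differencing within runs: the first observation of each run is filled with NA
fdiff(x, g = run)
```

For a panel, the same idea would apply within individuals, for example by grouping on both the individual identifier and the run id.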

It is also possible to compute differences on unordered vectors / time series (thus utilizing t but leaving g/by empty).

The methods applying to plm objects (panel series and panel data frames) automatically utilize the panel-identifiers attached to these objects and thus securely compute fully identified panel-differences. If these objects have > 2 panel-identifiers attached to them, the last identifier is assumed to be the time-variable, and the others are taken as grouping-variables and interacted.

References

Cochrane, D.; Orcutt, G. H. (1949). Application of Least Squares Regression to Relationships Containing Auto-Correlated Error Terms. Journal of the American Statistical Association. 44 (245): 32-61.

Prais, S. J. & Winsten, C. B. (1954). Trend Estimators and Serial Correlation. Cowles Commission Discussion Paper No. 383. Chicago.

Examples

# NOT RUN {
## Simple Time Series: AirPassengers
D(AirPassengers)                      # 1st difference, same as fdiff(AirPassengers)
D(AirPassengers, -1)                  # Forward difference
Dlog(AirPassengers)                   # Log-difference
D(AirPassengers, 1, 2)                # Second difference
Dlog(AirPassengers, 1, 2)             # Second log-difference
D(AirPassengers, 12)                  # Seasonal difference (data is monthly)
D(AirPassengers,                      # Quasi-difference, see a better example below
  rho = pwcor(AirPassengers, L(AirPassengers)))

head(D(AirPassengers, -2:2, 1:3))     # Sequence of leaded/lagged and iterated differences

# let's do some visual analysis
plot(AirPassengers)                   # Plot the series - seasonal pattern is evident
plot(stl(AirPassengers, "periodic"))  # Seasonal decomposition
plot(D(AirPassengers,c(1,12),1:2))    # Plotting ordinary and seasonal first and second differences
plot(stl(window(D(AirPassengers,12),  # Taking seasonal differences removes most seasonal variation
                1950), "periodic"))


## Time Series Matrix of 4 EU Stock Market Indicators, recorded 260 days per year
plot(D(EuStockMarkets, c(0, 260)))                      # Plot series and annual differences
mod <- lm(DAX ~., L(EuStockMarkets, c(0, 260)))         # Regressing the DAX on its annual lag
summary(mod)                                            # and the levels and annual lags of the others
r <- residuals(mod)                                     # Obtain residuals
pwcor(r, L(r))                                          # Residual Autocorrelation
fFtest(r, L(r))                                         # F-test of residual autocorrelation
                                                        # (better use lmtest::bgtest)
modCO <- lm(QD1.DAX ~., D(L(EuStockMarkets, c(0, 260)), # Cochrane-Orcutt (1949) estimation
                        rho = pwcor(r, L(r))))
summary(modCO)
rCO <- residuals(modCO)
fFtest(rCO, L(rCO))                                     # No more autocorrelation

## World Development Panel Data
head(fdiff(num_vars(wlddev), 1, 1,                      # Computes differences of numeric variables
             wlddev$country, wlddev$year))              # fdiff requires external inputs..
head(D(wlddev, 1, 1, ~country, ~year))                  # Differences of numeric variables
head(D(wlddev, 1, 1, ~country))                         # Without t: Works because data is ordered
head(D(wlddev, 1, 1, PCGDP + LIFEEX ~ country, ~year))  # Difference of GDP & Life Expectancy
head(D(wlddev, 0:1, 1, ~ country, ~year, cols = 9:10))  # Same, also retaining original series
head(D(wlddev, 0:1, 1, ~ country, ~year, 9:10,          # Dropping id columns
       keep.ids = FALSE))

# Dynamic Panel Data Models:
summary(lm(D(PCGDP,1,1,iso3c,year) ~                    # Diff. GDP regressed on its lagged level
             L(PCGDP,1,iso3c,year) +                    # and the difference of Life Expectancy
             D(LIFEEX,1,1,iso3c,year), data = wlddev))

g <- qF(wlddev$country)                                 # Omitting t and precomputing g allows for
summary(lm(D(PCGDP,1,1,g) ~ L(PCGDP,1,g) +              # a bit more parsimonious specification
                            D(LIFEEX,1,1,g), wlddev))

summary(lm(D1.PCGDP ~.,                                 # Now adding level and lagged level of
L(D(wlddev,0:1,1, ~ country, ~year,9:10),0:1,           # LIFEEX and lagged differences rates
  ~ country, ~year, keep.ids = FALSE)[-1]))

## Using plm can make things easier, but avoid attaching or 'with' calls:
pwlddev <- plm::pdata.frame(wlddev, index = c("country","year"))
head(D(pwlddev, 0:1, 1, 9:10))                          # Again differences of LIFEEX and PCGDP
PCGDP <- pwlddev$PCGDP                                  # A panel-Series of GDP per Capita
head(D(PCGDP))                                          # Differencing the panel series
summary(lm(D1.PCGDP ~.,                                 # Running the dynamic model again ->
           data = L(D(pwlddev,0:1,1,9:10),0:1,          # code becomes a bit simpler
                    keep.ids = FALSE)[-1]))

# One could be tempted to also do something like this, but THIS DOES NOT WORK!!:
# -> a pseries is only created when subsetting the pdata.frame using $ or [[
summary(lm(D(PCGDP) ~ L(D(PCGDP,0:1)) + L(D(LIFEEX,0:1),0:1), pwlddev))

# To make it work, one needs to create pseries
LIFEEX <- pwlddev$LIFEEX
summary(lm(D(PCGDP) ~ L(D(PCGDP,0:1)) + L(D(LIFEEX,0:1),0:1))) # THIS WORKS !

## Using dplyr:
library(dplyr)
wlddev %>% group_by(country) %>%
             select(PCGDP,LIFEEX) %>% fdiff(0:1,1:2)       # Adding a first and second difference
wlddev %>% group_by(country) %>%
             select(year,PCGDP,LIFEEX) %>% D(0:1,1:2,year) # Also using t (safer)
wlddev %>% group_by(country) %>%                           # Dropping id's
             select(year,PCGDP,LIFEEX) %>% D(0:1,1:2,year, keep.ids = FALSE)

# }
