datastep: Step through data row-by-row

Description

The datastep function allows you to perform row-wise conditional processing on a data frame, data table, or tibble. The function contains parameters to drop, keep, or rename variables, perform by-group processing, and perform row-wise or column-wise calculations.

Usage

datastep(
  data,
  steps,
  keep = NULL,
  drop = NULL,
  rename = NULL,
  by = NULL,
  calculate = NULL,
  retain = NULL,
  attrib = NULL,
  arrays = NULL,
  sort_check = TRUE
)

Arguments

data

The data to step through.

steps

The operations to perform on the data. This parameter is typically specified as a set of R statements contained within curly braces.

keep

A vector of quoted variable names to keep in the output data set. By default, all variables are kept.

drop

A vector of quoted variable names to drop from the output data set. By default, no variables are dropped.

rename

A named vector of quoted variables to rename. The current variable name should be on the left hand side of the name/value pair, and the new variable name should be on the right. The rename operation is performed after the data step, the keep, and the drop. Therefore, the data steps should use the input variable name. By default, all variables retain their original names.

A vector of quoted variable names to use for by-group processing. This parameter will activate the first. and last. automatic variables, that indicate the first or last rows in a group. These automatic variables are useful for conditional processing on groups.

calculate

Steps to set up calculated variables. Calculated variables are commonly generated with summary functions such as mean, median, min, max, etc. It is more efficient to set up calculated variables with the calculate parameter and then use those variables in the data step, rather than perform the summary function inside the data step. The calculate block will be executed immediately before the data step.

retain

A list of variable names and initial values to retain. Retained variables will begin the data step with the initial value. Then for each iteration of the data step, the variable will be populated with the ending value from the previous step. The retain functionality allows you to perform cumulative operations or decisions based on the value of the previous iteration of the data step. Initial values should be of the expected data type for the column. For example, for a numeric column set the initial value to a zero, and for a character column, set the initial value to an empty string, i.e. retain = list(col1 = 0, col2 = ""). There is no default initial value for a variable. You must supply an initial value for each retained variable.

attrib

A named list of attributes. The list can be either dsattr objects or single default values. The dsattr object allows you to set more attributes on each column. The single default value is convenient if you simply want to create a variable. By default, variables will be created on the fly with no attributes.

arrays

A named list of dsarray objects. The dsarray is a list of columns which you can iterate over inside the data step. You can iterate over a dsarray either with a for loop, or with a vectorized function. The default value of the arrays parameter is NULL, meaning no arrays are defined.

sort_check

Checks to see if the input data is sorted according to the by variable parameter. The sort check will give an error if the input data is not sorted according to the by variable. The check is turned on if the value of sort_check is TRUE, and turned off if FALSE. The default value is TRUE. Turn the sort check off if you want to perform by-group processing on unsorted data, or data that is not sorted according to the by-group.

Value

The processed data frame, tibble, or data table.

Automatic Variables

The datastep function provides five automatic variables. These variables are generated for every data step, and can be accessed at any point within the data step:

data: Represents the entire input data frame.
rw: Represents the current row.
n.: Contains the row number.
first.: Indicates the beginning of a by-group.
last.: Indicates the end of a by-group.

Automatic variables will be dropped from the data frame at the end of the data step. If you wish to keep the automatic variable values, assign the automatic variable to a new variable and keep that variable.

Column Attributes

To set attributes for a column on your data, use the attrib parameter. Example attributes include 'label', 'description', and 'format'. These types of attributes are set using a named list and a dsattr object. The name of the list item is the column name you want to set attributes on. The value of the list item is the dsattr object. For a complete list of available attributes, see the dsattr documentation.

It should be mentioned that the dsattr object is not required. You can also set attributes with a name and a default value. The default value can be any valid data value, such as a number or string.

Optional Parameters

Optional parameters on the datastep allow you to shape the output dataset or enhance the operation of the datastep. Some parameters are classified as input parameters, and others as output parameters. Input parameters modify the data before the data step operations takes place. Output parameters operate on the data after the data step.

The keep, drop, and rename parameters are output parameters. These parameters will be applied after the data step statements are executed. Therefore, within the data step, refer to variables using the input variable name. New variables may be created on the fly, just by assigning a value to the new variable name.

The keep, drop, and rename parameters require quoted variable names, as the variables may not yet exist at the time they are passed into the function. Within a data step or calculate block, however, variable names do not need to be quoted.

The calculate parameter is used to perform vectorized functions on the data prior to executing the data step. For example, you may want to determine a mean for a variable in the calculate block, and then make decisions on that mean in the data step block.

The retain parameter allows you to access the prior row value. At the start of the data step, the retained variable is seeded with the initial value. For each subsequent step, the variable is seeded with the value of the prior step/row. This functionality allows you to increment values or perform cumulative operations.

calculate and retain are both input parameters.

Data Step Arrays

There are times you may want to iterate over columns in your data step. Such iteration is particularly useful when you have a wide dataset, and wish to perform the same operation on several columns. For instance, you may want to calculate the mean for 10 different variables on your dataset.

The arrays parameter allows you to iterate across columns. This parameter accepts a named list of dsarray objects. The dsarray is essentially a list of columns. You can use a for loop to iterate over the dsarray, and also send it into a vectorized function. Data step arrays allow to you to perform row-wise calculations. For instance, you can calculate a sum or mean by row for the variables in your array.

Output Column Order

By default, the data step will retain the column order of any variables that already exist on the input data set. New variables created in a data step will be appended to the right of existing variables. Yet these new variables can sometimes appear in an order that is unexpected or undesirable.

There are two ways to control the order of output columns: the keep parameter and the attrib parameter.

Columns names included on the keep parameter will appear in the order indicated on the keep vector. This ordering mechanism is appropriate when you have a small number of columns and can easily pass the entire keep list.

To control the order of new variables only, use the attrib parameter. New variables for which attributes are defined will appear in the order indicated on the attrib list. The attrib list is useful when you are adding a relatively small number of columns to an existing data set, and don't want to pass all the column names.

Remember that you can supply an attribute list with default values only, such as attrib = list(column1 = 0, column2 = ""). This style of attribute definition is convenient if you are only trying to control the order of columns.

If the above two mechanisms to control column order are not sufficient, use the data frame subset operators or column ordering functions provided by other packages.

Datastep Performance

The datastep is intended to be used on small and medium-sized datasets. It is not recommended for large datasets. If your dataset is greater than one million rows, you should consider other techniques for processing your data. While there is no built-in restriction on the number of rows, performance of the datastep can become unacceptable with a large number of rows.

Details

Two parameters on the datastep function are required: data and steps. The data parameter is the input data to the data step. The steps parameter contains the code statements you want to apply to the data. The steps should be wrapped in curly braces. When running, the data step will loop through the input data row-by-row, and execute the steps for each row. Variables inside the data step can be accessed using non-standard evaluation (meaning they do not have to be quoted).

Note that the data step is pipe-friendly. It can be used within a dplyr pipeline. The data step allows you to perform deeply nested and complex conditionals within the pipeline. The data step is also very readable compared to other pipeline conditionals.

Examples

Run this code

# NOT RUN {
# Example #1: Simple Data Step
df <- datastep(mtcars[1:10,], 
               keep = c("mpg", "cyl", "disp", "mpgcat", "recdt"), {
                 
  if (mpg >= 20) 
    mpgcat <- "High"
  else 
    mpgcat <- "Low"
                 
  recdt <- as.Date("1974-06-10")
                 
  if (cyl == 8)
    is8cyl <- TRUE
                 
})

df
#                    mpg cyl  disp mpgcat      recdt
# Mazda RX4         21.0   6 160.0   High 1974-06-10
# Mazda RX4 Wag     21.0   6 160.0   High 1974-06-10
# Datsun 710        22.8   4 108.0   High 1974-06-10
# Hornet 4 Drive    21.4   6 258.0   High 1974-06-10
# Hornet Sportabout 18.7   8 360.0    Low 1974-06-10
# Valiant           18.1   6 225.0    Low 1974-06-10
# Duster 360        14.3   8 360.0    Low 1974-06-10
# Merc 240D         24.4   4 146.7   High 1974-06-10
# Merc 230          22.8   4 140.8   High 1974-06-10
# Merc 280          19.2   6 167.6    Low 1974-06-10

# Example #2: By-group Processing
df <- datastep(mtcars[1:10,], 
               keep = c("mpg", "cyl", "gear", "grp"), 
               by = c("gear"), sort_check = FALSE, {
                 
  if (first.)
    grp <- "Start"
  else if (last.)
    grp <- "End"
  else 
    grp <- "-"
                 
})

df
#                    mpg cyl gear   grp
# Mazda RX4         21.0   6    4 Start
# Mazda RX4 Wag     21.0   6    4     -
# Datsun 710        22.8   4    4   End
# Hornet 4 Drive    21.4   6    3 Start
# Hornet Sportabout 18.7   8    3     -
# Valiant           18.1   6    3     -
# Duster 360        14.3   8    3   End
# Merc 240D         24.4   4    4 Start
# Merc 230          22.8   4    4     -
# Merc 280          19.2   6    4   End 

# Example #3: Calculate Block
df <- datastep(mtcars, 
               keep = c("mpg", "cyl", "mean_mpg", "mpgcat"), 
               calculate = { mean_mpg = mean(mpg) }, {
                 
  if (mpg >= mean_mpg)
    mpgcat <- "High"
  else 
    mpgcat <- "Low"
                 
})

df[1:10,]
#                    mpg cyl mean_mpg mpgcat
# Mazda RX4         21.0   6 20.09062   High
# Mazda RX4 Wag     21.0   6 20.09062   High
# Datsun 710        22.8   4 20.09062   High
# Hornet 4 Drive    21.4   6 20.09062   High
# Hornet Sportabout 18.7   8 20.09062    Low
# Valiant           18.1   6 20.09062    Low
# Duster 360        14.3   8 20.09062    Low
# Merc 240D         24.4   4 20.09062   High
# Merc 230          22.8   4 20.09062   High
# Merc 280          19.2   6 20.09062    Low

# Example #4: Data pipeline
library(dplyr)
library(magrittr)

# Add datastep to dplyr pipeline
df <- mtcars %>% 
  select(mpg, cyl, gear) %>% 
  mutate(mean_mpg = mean(mpg)) %>% 
  datastep({
    
    if (mpg >= mean_mpg)
      mpgcat <- "High"
    else 
      mpgcat <- "Low"
    
  }) %>% 
  filter(row_number() <= 10)

df
#     mpg cyl gear mean_mpg mpgcat
# 1  21.0   6    4 20.09062   High
# 2  21.0   6    4 20.09062   High
# 3  22.8   4    4 20.09062   High
# 4  21.4   6    3 20.09062   High
# 5  18.7   8    3 20.09062    Low
# 6  18.1   6    3 20.09062    Low
# 7  14.3   8    3 20.09062    Low
# 8  24.4   4    4 20.09062   High
# 9  22.8   4    4 20.09062   High
# 10 19.2   6    4 20.09062    Low

# Example #5: Drop, Retain and Rename
df <- datastep(mtcars[1:10, ], 
               drop = c("disp", "hp", "drat", "qsec", 
                        "vs", "am", "gear", "carb"), 
               retain = list(cumwt = 0 ),
               rename = c(mpg = "MPG", cyl = "Cylinders", wt = "Wgt", 
                          cumwt = "Cumulative Wgt"), {
                 
  cumwt <- cumwt + wt
                 
})

df
#                    MPG Cylinders   Wgt Cumulative Wgt
# Mazda RX4         21.0         6 2.620          2.620
# Mazda RX4 Wag     21.0         6 2.875          5.495
# Datsun 710        22.8         4 2.320          7.815
# Hornet 4 Drive    21.4         6 3.215         11.030
# Hornet Sportabout 18.7         8 3.440         14.470
# Valiant           18.1         6 3.460         17.930
# Duster 360        14.3         8 3.570         21.500
# Merc 240D         24.4         4 3.190         24.690
# Merc 230          22.8         4 3.150         27.840
# Merc 280          19.2         6 3.440         31.280

# Example #6: Attributes and Arrays

# Create sample data
dat <- read.table(header = TRUE, text = '
   Year  Q1   Q2  Q3  Q4
   2000 125  137 152 140
   2001 132  145 138  87
   2002 101  104 115 121')
 
# Use attrib list to control column order and add labels
# Use array to calculate row sums and means, and get best quarter
df <- datastep(dat,
               attrib = list(Tot = dsattr(0, label = "Year Total"),
                             Avg = dsattr(0, label = "Year Average"),
                             Best = dsattr(0, label = "Best Quarter")),
               arrays = list(qtrs = dsarray("Q1", "Q2", "Q3", "Q4")),
               drop = "q",
               steps = {
               
                 # Empty brackets return all array values
                 Tot <- sum(qtrs[])
                 Avg <- mean(qtrs[])
                 
                 # Iterate to find best quarter
                 for (q in qtrs) {
                   if (qtrs[q] == max(qtrs[]))
                     Best <- q
                 }
               })
               
df
#   Year  Q1  Q2  Q3  Q4 Tot    Avg Best
# 1 2000 125 137 152 140 554 138.50   Q3
# 2 2001 132 145 138  87 502 125.50   Q2
# 3 2002 101 104 115 121 441 110.25   Q4

dictionary(df)
#   A tibble: 8 x 10
#   Name  Column Class     Label        Description Format Width Justify  Rows   NAs
#   <chr> <chr>  <chr>     <chr>        <chr>       <lgl>  <int> <chr>   <int> <int>
# 1 df    Year   integer   NA           NA          NA        NA NA          3     0
# 2 df    Q1     integer   NA           NA          NA        NA NA          3     0
# 3 df    Q2     integer   NA           NA          NA        NA NA          3     0
# 4 df    Q3     integer   NA           NA          NA        NA NA          3     0
# 5 df    Q4     integer   NA           NA          NA        NA NA          3     0
# 6 df    Tot    integer   Year Total   NA          NA        NA NA          3     0
# 7 df    Avg    numeric   Year Average NA          NA        NA NA          3     0
# 8 df    Best   character Best Quarter NA          NA         2 NA          3     0
# }

Run the code above in your browser using DataLab