Hmisc (version 2.2-3)

upData: Update a Data Frame or Cleanup a Data Frame after Importing

Description

cleanup.import will correct errors and shrink the size of data frames created by the S-Plus File ... Import dialog or by other methods such as scan and read.table. By default, double precision numeric variables are changed to single precision (S-Plus only) or to integer when they contain no fractional components. Infinite values or values greater than 1e20 in absolute value are set to NA. This solves problems of importing Excel spreadsheets that contain occasional character values for numeric columns, as S-Plus converts these to Inf without warning. There is also an option to convert variable names to lower case and to add labels to variables. The latter can be made easier by importing a CNTLOUT dataset created by SAS PROC FORMAT and using the sasdict option as shown in the example below. cleanup.import can also transform character or factor variables to dates.
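
The numeric cleanup rules described above can be approximated in base R. The following is a minimal sketch of those rules, not the Hmisc implementation: the helper name shrink_numeric is hypothetical, and only its big argument mirrors an argument of cleanup.import.

```r
# Hypothetical helper illustrating the rules described for cleanup.import;
# this is NOT Hmisc code.
shrink_numeric <- function(x, big = 1e20) {
  if (!is.numeric(x)) return(x)
  # Infinite values, or values beyond `big` in absolute value, become NA
  x[is.infinite(x) | (!is.na(x) & abs(x) > big)] <- NA
  # Store as integer when no fractional components remain
  ok <- x[!is.na(x)]
  if (is.double(x) && all(ok == floor(ok)) &&
      all(abs(ok) <= .Machine$integer.max)) {
    x <- as.integer(x)
  }
  x
}

dat <- data.frame(id = c(1, 2, 3), wt = c(70.5, Inf, 1e25), g = c("a", "b", "c"))
dat[] <- lapply(dat, shrink_numeric)
str(dat)
```

In this sketch id ends up stored as integer, while the Inf and 1e25 entries of wt become NA and wt stays double because 70.5 has a fractional part.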

upData is a function facilitating the updating of a data frame without attaching it in search position one. New variables can be added, old variables can be modified, variables can be removed or renamed, and "labels" and "units" attributes can be provided. Various checks are made for errors and inconsistencies, with warnings issued to help the user. Levels of factor variables can be replaced, especially using the list notation of the standard merge.levels function. Unless force.single is set to FALSE, upData also converts double precision vectors to single precision (if not under R), or to integer if no fractional values are present in a vector.
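
To make the individual operations concrete, here is a rough base-R counterpart of the changes upData bundles into one call. It is an illustration only: it performs none of the checks upData makes, and the "kg" units value is made up for the example.

```r
# Manual base-R counterpart of the operations upData bundles;
# an illustration only, not how upData is implemented.
dat <- data.frame(a = (1:3) / 7, y = c("a", "b1", "b2"), z = 1:3)

names(dat)[names(dat) == "a"] <- "x"   # rename a -> x
dat$x <- dat$x^2                       # modify an existing variable
dat$m <- dat$x / 10                    # add a new variable
dat$z <- NULL                          # remove a variable
attr(dat$x, "label") <- "X"            # "label" attribute
attr(dat$x, "units") <- "kg"           # "units" attribute (made-up value)

# Replace factor levels using list notation: b1 and b2 are merged into b
dat$y <- factor(dat$y)
levels(dat$y) <- list(a = "a", b = c("b1", "b2"))
str(dat)
```

Every step here has to name dat explicitly, which is the repetition upData is meant to remove.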

Both cleanup.import and upData will fix a problem with data frames created under S-Plus before version 5 that are used in S-Plus 5 or later. The problem was caused by use of the label function to set a variable's class to "labelled". These classes are removed as the S version 4 language does not support multiple inheritance. Failure to run data frames through one of the two functions when these conditions apply will result in simple numeric variables being set to factor in some cases. Extraneous "AsIs" classes are also removed.

For S-Plus, a function exportDataStripped is provided that allows exporting of data to other systems by removing attributes label, imputed, format, units, and comment. It calls exportData after stripping these attributes. Otherwise exportData will fail.
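
The attribute-stripping step can be sketched in base R as follows. exportDataStripped itself is S-Plus only, so the helper strip_attrs below is hypothetical and covers only the stripping, not the export.

```r
# Hypothetical helper; mimics only the attribute removal described above.
strip_attrs <- function(data,
                        drop = c("label", "imputed", "format", "units", "comment")) {
  data[] <- lapply(data, function(x) {
    for (a in drop) attr(x, a) <- NULL   # removing an absent attribute is harmless
    x
  })
  data
}

d <- data.frame(age = c(30, 41))
attr(d$age, "label") <- "Age"
attr(d$age, "units") <- "years"

d2 <- strip_attrs(d)
attributes(d2$age)   # the label and units attributes are gone
```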

csv.get reads comma-separated text data files, allowing optional translation to lower case for variable names after making them valid S names. Original possibly non-legal names are taken to be variable labels. Character or factor variables containing dates can be converted to date variables. cleanup.import is invoked to finish the job.
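
The sequence of steps attributed to csv.get can be imitated in base R, as in the sketch below. This is not the Hmisc implementation; the inline CSV text and its column names are invented for the example, and the date format is written out as "%Y-%m-%d", which is what "%F" denotes.

```r
# Illustration of the csv.get steps in base R; not the Hmisc implementation.
csv_text <- "Patient ID,Visit Date,SBP\n1,2004-01-15,120\n2,2004-02-20,135"

raw <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)
orig <- names(raw)                        # original, possibly non-legal names
names(raw) <- tolower(make.names(orig))   # make names valid, then lower case

# Convert the character column holding dates to a Date variable
raw$visit.date <- as.Date(raw$visit.date, format = "%Y-%m-%d")

# Keep the original names as variable labels
for (i in seq_along(raw)) attr(raw[[i]], "label") <- orig[i]
str(raw)
```

The dates are converted before the labels are attached because as.Date does not carry a "label" attribute over from its input.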

Usage

cleanup.import(obj, labels, lowernames=FALSE, 
               force.single=TRUE, force.numeric=TRUE, rmnames=TRUE,
               big=1e20, sasdict, pr, datevars=NULL, dateformat='
upData(object, ..., 
       rename, drop, labels, units, levels,
       force.single=TRUE, lowernames=FALSE, moveUnits=FALSE)

exportDataStripped(data, ...)

csv.get(file, lowernames=FALSE, datevars=NULL, dateformat='%F', allow=NULL, ...)

Arguments

obj
a data frame or list
object
a data frame or list
data
a data frame
force.single
By default, double precision variables are converted to single precision (in S-Plus only) unless force.single=FALSE. force.single=TRUE will also convert vectors having only integer values to have a storage mode of integer, in R o
force.numeric
Sometimes importing will cause a numeric variable to be changed to a factor vector. By default, cleanup.import will check each factor variable to see if the levels contain only numeric values and "". In that case, the variable
rmnames
set to F to prevent cleanup.import from removing names or .Names attributes from variables
labels
a character vector the same length as the number of variables in obj. These character values are taken to be variable labels in the same order of variables in obj. For upData, labels is a named list or
lowernames
set this to TRUE to change variable names to lower case. upData does this before applying any other changes, so variable names given inside arguments to upData need to be lower case if lowernames==TRUE.
big
a value such that values larger than this in absolute value are set to missing by cleanup.import
sasdict
the name of a data frame containing a raw imported SAS PROC CONTENTS CNTLOUT= dataset. This is used to define variable names and to add attributes to the new data frame specifying the original SAS dataset name and label.
pr
set to TRUE or FALSE to force or prevent printing of the current variable number being processed. By default, such messages are printed if the product of the number of variables and number of observations in obj exc
datevars
character vector of names (after lowernames is applied) of variables to consider as a factor or character vector containing dates in a format matching dateformat. The default is "%F" which uses the yyyy-mm-dd fo
dateformat
for cleanup.import, the input format of the date variables (see strptime)
...
for upData, one or more expressions of the form variable=expression, to derive new variables or change old ones. For exportDataStripped, optional arguments that are passed to exportData. For csv.g
rename
list or named vector specifying old and new names for variables. Variables are renamed before any other operations are done. For example, to rename variables age and sex to respectively Age and gender,
drop
a vector of variable names to remove from the data frame
units
a named vector or list defining "units" attributes of variables, in no specific order
levels
a named list defining "levels" attributes for factor variables, in no specific order. The values in this list may be character vectors redefining levels (in order) or another list (see merge.levels if using S-Plus).
moveUnits
set to TRUE to look for units of measurements in variable labels and move them to a "units" attribute. If an expression in a label is enclosed in parentheses or brackets it is assumed to be units if moveUnits=TRUE
file
a file name to import
allow
a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9.
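
The idea behind moveUnits can be sketched with a regular expression that pulls a trailing parenthesized or bracketed expression out of a label. The helper split_units is hypothetical, and the pattern upData actually uses may differ.

```r
# Hypothetical helper showing the idea behind moveUnits=TRUE;
# the pattern actually used by upData may differ.
split_units <- function(label) {
  pat <- "^(.*?)\\s*[(\\[]([^)\\]]+)[)\\]]\\s*$"
  m <- regmatches(label, regexec(pat, label, perl = TRUE))[[1]]
  if (length(m) == 0) return(list(label = label, units = NULL))
  list(label = m[2], units = m[3])
}

x <- c(120, 135)
attr(x, "label") <- "Systolic BP (mmHg)"

parts <- split_units(attr(x, "label"))
attr(x, "label") <- parts$label   # label without the units
attr(x, "units") <- parts$units   # units moved to their own attribute
attributes(x)
```

A label with no parenthesized or bracketed expression is returned unchanged with units set to NULL.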

Value

  • a new data frame

See Also

sas.get, data.frame, describe, label, read.csv, strptime, POSIXct, Date

Examples

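# Import a text file ('myfile.asc' is a placeholder for your own file)
# and clean up the resulting data frame
dat <- read.table('myfile.asc')
dat <- cleanup.import(dat)

# Build a small data frame and update it with upData.
# Renaming is done before any other operation, so a is already
# called x when the expressions x=x^2, x=x-5 and m=x/10 are evaluated.
dat <- data.frame(a=(1:3)/7, y=c('a','b1','b2'), z=1:3)
dat2 <- upData(dat, x=x^2, x=x-5, m=x/10, 
               rename=c(a='x'), drop='z',
               labels=c(x='X', y='test'),
               levels=list(y=list(a='a',b=c('b1','b2'))))
dat2
describe(dat2)
dat <- dat2    # copy to original name and delete dat2 if OK
rm(dat2)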

# If you import a SAS dataset created by PROC CONTENTS CNTLOUT=x.datadict,
# the LABELs from this dataset can be added to the data.  Let's also
# convert names to lower case for the main data file
mydata2 <- cleanup.import(mydata2, lowernames=TRUE, sasdict=datadict)
