Learn R Programming

Ecfun (version 0.3-2)

asNumericDF: Coerce to numeric dropping commas and info after a blank

Description

For asNumericChar, delete leading blanks and a leading dollar sign plus commas (thousand separators) and drop information after a blank (other than leading blanks), then coerce to numeric or to factors, Dates, or POSIXct as desired.

For a data.frame, apply asNumericChar to all columns and drop columns except those in keep, ignore, factors, Dates, POSIX and MSdates.

Then order the rows by the orderBy column. Some Excel imports include commas as thousand separators; this replaces any commas with char(0), '', before trying to convert to numeric.

Similarly, if "%" is found as the last character in any field, drop the percent sign and divide the resulting numeric conversion by 100 to convert to proportion.

Also, some character data includes footnote references following the year.

For example Table F-1 from the US Census Bureau needs all three of these numeric conversion features: It needs orderBy, because the most recent year appears first, just the opposite of most other data sets where the most recent year appears last. It has footnote references following a character string indicating the year. And it includes commas as thousand separators.

Usage

asNumericChar(x, leadingChar='^\\$', 
    suppressChar=',', pctChar='%$', 
    class.=NULL, format.=NULL)
asNumericDF(x, keep=
    function(x)any(!is.na(x)), 
  orderBy=NA, ignore=NULL, factors=NULL, 
  Dates=NULL, POSIX=NULL, MSdates=NULL, 
  format.=NULL, leadingChar='^\\$', 
  suppressChar=',', pctChar='%$')

Value

a data.frame

Arguments

x

For asNumericChar, this is a character vector to be converted to numeric after gsub(',', '', x).

For asNumericDF, this is a data.frame with all character columns to be converted to numerics.

keep

something to indicate which columns to keep, in addition to columns specified in ignore, factors, Dates, and POSIX.

orderBy

Which columns to order the rows of x[, keep] by. Default is to keep the input order.

ignore

vector identifying columns of x to ignore, i.e., to keep and not attempt to convert to another data type.

factors

vector indicating columns of x to convert to factor

Dates

vector indicating columns of x to convert using as.Date(, format).

POSIX

vector indicating columns of x to convert using as.POSIXct(, format).

class.

Desired class of output. Default is numeric.

format.

Character vector of length 1 to pass as argument format to as.Date and / or as.POSIXct for conversion from character.

For Dates, as.Date is first tried with format = '%Y-%m-%d', then with '%Y/%m/%d', '%m-%d-%Y', and '%m/%d/%Y'. The conversion with the fewest NAs is kept. If two match for numbers of NAs, the one with the minimum absolute deviations from as.Date1970(0) is used.

MSdates

The names or numbers identifying columns of x identifying dates as integer numbers of days since 1899-12-31. In Microsoft Excel, dates are stored in that format.

leadingChar

A regular expression passed to grep and sub to replace something like an initial dollar sign with character(0).

suppressChar

a regular expression passed to gsub to replace all occurrences of something like "," (a thousands separator in the U.S.) with character(0).

pctChar

A regular expression passed to grep to identify percent columns. pctChar is then passed to sub to replace pctChar with character(0), and the converted numbers are then divided by 100 to convert them to proportions.

Author

Spencer Graves

Details

For asNumericChar:

1. Replace commas by nothing

2. strsplit on ' ' and take only the first part, thereby eliminating the footnote references.

3. Replace any blanks with NAs

4. as.numeric

for asNumericDF:

1. Copy x to X.

2. Confirm that ignore, factors, Dates, and POSIX all refer to columns of x and do not overlap. [*** NOTE: as of 2016-07-21, these checks have only been implemented for ignore.]

3. Convert factors, Dates, and POSIX.

4. Apply asNumericChar to all columns not in ignore, factors, Dates, or POSIX.

5. Keep columns specified by keep.

6. return the result.

References

"Add (sum) or subtract dates; Applies To: Excel 2013". Microsoft. (accessed 2016-08-11)

See Also

Examples

Run this code
##
## 1.  an example 
##
(xDate <- as.Date('1970-01-01')+c(0, 365))
(xPOSIX <- as.POSIXct(xDate)+c(1, 99))
xMSdate <- as.Date(1, 
    as.Date('1899-12-31'))+1:2
(fakeF1 <- data.frame(yr=c('1948', 
            '1947 (1)'),
      q1=c(' 1,234 ', ''), duh=rep(NA, 2), 
      dol=c('$1,234', ''), 
      pct=c('1%', '2%'), 
      xDate=as.character(xDate, 
        format='%Y-%m-%d'), 
      xPOSIX=as.character(xPOSIX, 
        format='%Y-%m-%d %H:%M:%S'), 
      xMSdate=2:3, junk=c('this is',
          'junk')))
            
# This converts the last 3 columns to NAs and drops them:

str(nF1.1 <- asNumericChar(fakeF1$yr))
str(nF1.2 <- asNumericChar(fakeF1$q1))
str(nF1.3 <- asNumericChar(fakeF1$duh))
(nF1.4 <- asNumericChar('1969-12-31 18:00:01',
                        class.='POSIXct'))

(nF1 <- asNumericDF(fakeF1))
(nF2 <- asNumericDF(fakeF1, Dates=6, 
    MSdate='xMSdate', 
    ignore=c('junk', 'xPOSIX'), 
    format.='%Y-%m-%d'))
# check 
nF1. <- data.frame(yr=
        asNumericChar(fakeF1$yr),
      q1=asNumericChar(fakeF1$q1), 
      dol=asNumericChar(fakeF1$dol), 
      pct=c(.01, .02), xMSdate=2:3)

nF1c <- data.frame(yr=1948:1947, 
      q1=c(1234, NA), dol=c(1234, NA), 
      pct=c(.01, .02), xMSdate=2:3)

stopifnot(
all.equal(nF1, nF1.)
)
stopifnot(
all.equal(nF1., nF1c)
)

## 
## 2.  as.Date default example
##
  xD <- asNumericChar(
    as.character(xDate), class.='Date')
stopifnot(
  all.equal(xDate, xD)
)

##
## 3.  as.POSIXct default example 
##
xPOSIX
(xPOSIXch <- as.character(xPOSIX))
(xP <- asNumericChar(xPOSIXch, class.='POSIXct'))
attr(xPOSIX, 'tzone')
attr(xP, 'tzone')
# R-Devel after 4.2.1 breaks earlier code; fix
if(is.null(attr(xPOSIX, 'tzone')))
    attr(xPOSIX, 'tzone') <- attr(xP, 'tzone')
(dP <- difftime(xPOSIX, xP, units='secs'))
(madP <- max(abs(as.numeric(dP))))
stopifnot(
{
#all.equal(xPOSIX, xP)
# As of 2022-10-06 I don't know how to write code
# that will get a consistent answer with 
# different version R-devel with differences
# less than an hour
if(madP>3600){
  stop('Discrepancy betw fn and manual comp ', 
       'too large.')
}
TRUE
}
)

##
## 4.  orderBy=1:2
##
nF. <- asNumericDF(fakeF1, orderBy=1:2)

stopifnot(
all.equal(nF., nF1c[2:1,])
)

##
## 5.  Will it work for a tibble?  
##
if(require(tibble)){
  nF1t <- asNumericDF(as_tibble(fakeF1))
stopifnot(
  all.equal(nF1, nF1t)
)
}

Run the code above in your browser using DataLab