match.data.frame: Identify the row of `y` best matching each row of `x`

Description

For each row of x[, by.x], find the best matching row of y[, by.y], with the best match defined by grep. and split.

grep. and split must either be missing or have the same length as by.x and by.y. If grep.[i] and split[i] are NA, do a complete match of x[, by.x[i]] and y[, by.y[i]]. Otherwise, for each row j, look for a match for strsplit(x[j, by.x[i]], split[i])[[1]][1] among strsplit(y[, by.y[i]], split[i]). See details.
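The partial-match rule can be illustrated with base R alone. The following is only a sketch of the idea, not the function's internal code; the values of xj, ycol and split_i are hypothetical and chosen to show how grep and agrep can differ on the same input.

```r
# Hypothetical inputs (not taken from the package)
xj      <- "Rodgers (AL)"                       # one value of x[j, by.x[i]]
ycol    <- c("Smith", "Rogers (AL)", "Jones")   # the column y[, by.y[i]]
split_i <- " "                                  # split[i]

# First segment of the x value
x1 <- strsplit(xj, split_i)[[1]][1]

# All segments of each y value
ySegs <- strsplit(ycol, split_i)

# Rows of y in which any segment contains x1 exactly (grep, fixed = TRUE)
exact <- which(vapply(ySegs,
    function(s) length(grep(x1, s, fixed = TRUE)) > 0, logical(1)))

# Rows of y in which any segment matches x1 approximately (agrep, fixed = TRUE)
approx <- which(vapply(ySegs,
    function(s) length(agrep(x1, s, fixed = TRUE)) > 0, logical(1)))
```

Here grep finds no row because "Rodgers" is not a literal substring of any segment, while agrep, with its default max.distance, tolerates the one-letter difference from "Rogers".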

Usage

match.data.frame(x, y, by, by.x=by, by.y=by, 
        grep., split, sep=':')

Value

An integer vector of length nrow(x) containing the index of the best matching row of y, or NA if no adequate match was found.

Arguments

x, y

data.frames

by, by.x, by.y

names of columns of x and y to match.

grep.

a character vector of the type of match for each element of by.x and by.y. If NA, require a perfect match.

Alternatives are grep and agrep to find a match for the first segment in strsplit(x, split=split[i]) among any of the segments of strsplit(y, split=split[i]). Use fixed=TRUE with the calls to these functions.

NOTE: These alternatives are not examined if a unique match is found between x[, by.x[is.na(grep.) & is.na(split)]] and the corresponding columns of y.

split

A character vector of split characters to pass to strsplit; strsplit is not called if is.na(split).

sep

A sep argument passed to paste to build a matching key from the columns of x and y for which perfect matches are required. If (missing(sep) && !missing(grep.)), then sep <- ' ', except where grep. is NA.
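As a small illustration of why a separator matters when columns are pasted into a key, the sketch below uses made-up strings. Without a separator, different column values can collapse to the same key. The variable names collide and distinct are hypothetical.

```r
# Without a separator, different column values can yield the same key
collide  <- paste("ab", "c", sep = "")  == paste("a", "bc", sep = "")

# With a separator that does not occur in the data, the keys stay distinct
distinct <- paste("ab", "c", sep = ":") == paste("a", "bc", sep = ":")

collide    # TRUE
distinct   # FALSE
```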

Author

Spencer Graves

Details

1. Check by.x, by.y, grep. and split. If((missing(by.x) | missing(by.y)) && missing(by)) by <- names(x)

2. fullMatch <- (is.na(grep.) & is.na(split)). Create keyfx and keyfy by pasting columns of x[, by.x[fullMatch]] and y[, by.y[fullMatch]]. Also create x. and y. = strsplit of x[, by.x[!fullMatch]].

3. Iterate over rows of x looking for the best match. This includes an inner loop over columns of x[, by.x[!fullMatch]], stopping on the first unique match. Return (-1) if no unique match is found.
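Steps 2 and 3 might be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the package source: the data are hypothetical, only grep is handled (agrep is omitted), the key is used to narrow the candidate rows before the partial matches are tried, and NA is used to mark rows with no unique match, following the Value section.

```r
# Hypothetical inputs, for illustration only
x <- data.frame(state = c("AL", "MI"), surname = c("Rogers", "Rogers"),
                stringsAsFactors = FALSE)
y <- data.frame(state = c("MI", "AL", "AL"),
                surname = c("Rogers (MI)", "Rogers (AL)", "Smith"),
                stringsAsFactors = FALSE)
by.x  <- by.y <- c("state", "surname")
grep. <- c(NA, "grep")
split <- c(NA, " ")

# Step 2: columns requiring a complete match, and their pasted keys
fullMatch <- is.na(grep.) & is.na(split)
keyfx <- do.call(paste, c(x[by.x[fullMatch]], sep = ":"))
keyfy <- do.call(paste, c(y[by.y[fullMatch]], sep = ":"))

# Step 3 (simplified): loop over rows of x, then over partial-match columns
best <- rep(NA_integer_, nrow(x))
for (j in seq_len(nrow(x))) {
  cand <- which(keyfy == keyfx[j])        # rows of y agreeing on the key
  for (i in which(!fullMatch)) {
    if (length(cand) == 1) break          # stop on the first unique match
    x1    <- strsplit(x[j, by.x[i]], split[i])[[1]][1]
    ySegs <- strsplit(y[cand, by.y[i]], split[i])
    keep  <- vapply(ySegs,
        function(s) length(grep(x1, s, fixed = TRUE)) > 0, logical(1))
    cand  <- cand[keep]
  }
  if (length(cand) == 1) best[j] <- cand  # otherwise leave NA
}
best
```

In this sketch the first row of x has two candidates in y after the key comparison, and the surname column then resolves it to a single row. The second row is already unique on the key alone, so the partial match is never examined.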

Examples


newdata <- data.frame(state=c("AL", "MI","NY"),
                      surname=c("Rogers", "Rogers", "Smith"),
                      givenName=c("Mike R.", "Mike K.", "Al"),
                      stringsAsFactors=FALSE)
reference <- data.frame(state=c("NY", "NY", "MI", "AL", "NY", "MI"),
                      surname=c("Smith", "Rogers", "Rogers (MI)",
                                "Rogers (AL)", "Smith", 'Jones'),
                      givenName=c("John", "Mike", "Mike", "Mike",
                                "T. Albert", 'Al Thomas'),
                      stringsAsFactors=FALSE)
newInRef <- match.data.frame(newdata, reference,
       grep.=c(NA, 'agrep', 'agrep'))

stopifnot(
all.equal(newInRef, c(4, 3, 5))
)
