parseName: Parse surname and given name

Description

Identify the presumed surname in a character string assumed to represent a name and return the result in a character matrix with "surname" followed by "givenName". If only one name is provided (without punctuation), it is assumed to be the givenName; see Wikipedia, "Given name" and "Surname".

Usage

parseName(x, surnameFirst=(median(regexpr(',', x))>0),
          suffix=c('Jr.', 'I', 'II', 'III', 'IV', 
              'Sr.', 'Dr.', 'Jr', 'Sr'),
          fixNonStandard=subNonStandardNames, 
          removeSecondLine=TRUE, 
          namesNotFound="attr.replacement", ...)

Arguments

x

a character vector

surnameFirst

logical: If TRUE, the surname comes first followed by a comma (","), then the given name. If FALSE, parse the surname from a standard Western "John Smith, Jr." format. If surnameFirst is missing, the default is TRUE when median(regexpr(',', x)) is positive, i.e., when roughly half or more of the elements of x contain a comma.
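The default for surnameFirst can be checked directly in base R. The following is a minimal sketch with made-up input vectors (mostlyCommas and fewCommas are illustrative names, not part of the package):

```r
# Illustrative inputs only; not taken from the package.
mostlyCommas <- c('Smith, Joe', 'Brown, John', 'Sting')
fewCommas    <- c('Joe Smith', 'John Brown', 'Smith, Al')

# regexpr() gives the position of the first comma, or -1 if there is none
posMostly <- as.integer(regexpr(',', mostlyCommas))
posFew    <- as.integer(regexpr(',', fewCommas))

# Same expression as the default value of surnameFirst
defaultMostly <- median(regexpr(',', mostlyCommas)) > 0
defaultFew    <- median(regexpr(',', fewCommas)) > 0
```

When commas are in the minority, as in fewCommas, the default is FALSE, so surnameFirst=TRUE must be supplied explicitly to get the "surname, given name" parsing.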

suffix

character vector of strings that are NOT a surname but might appear at the end without a comma that would otherwise identify it as a suffix.

fixNonStandard

function to look for and repair nonstandard names such as names containing characters with accent marks that are sometimes mangled by different software. Use identity if this is not desired.

removeSecondLine

logical: If TRUE, delete anything following "\n" and return it as an attribute "secondLine".
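As a rough illustration of what removing a second line means, the base R sketch below splits each string at its first "\n". It is an assumption about the behavior described above, not the code used inside parseName; firstLine and secondLine are illustrative names.

```r
x <- c('Adam\n2nd line', 'Ed  \n --Vacancy', 'Frank')

# Position of the first newline in each element, -1 if there is none
nl <- as.integer(regexpr('\n', x, fixed = TRUE))
hasNL <- nl > 0

# Text before the newline (whitespace trimmed) and text after it
firstLine  <- trimws(ifelse(hasNL, substring(x, 1, nl - 1), x))
secondLine <- ifelse(hasNL, substring(x, nl + 1), NA_character_)
```

Elements without a newline get NA as their second line, in line with the "secondLine" attribute checked in the Examples.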

namesNotFound

character vector passed to subNonStandardNames and used to compute any "namesNotFound" attribute of the object returned by parseName.

…

optional arguments passed to fixNonStandard

Value

a character matrix with two columns: surname and givenName.

This matrix also has a "namesNotFound" attribute if one is returned by subNonStandardNames.

Details

If surnameFirst is FALSE:

1. If the last character is ")" and the matching "(" is 3 characters earlier, drop the parentheses and the text between them. Thus, "John Smith (AL)" becomes "John Smith".

2. Look for commas to identify a suffix like Jr. or III; remove and call the rest x2.

3. split <- strsplit(x2, " ")

4. Take the last as the surname.

5. If the "surname" found in step 4 is in suffix, save it to append to the givenName and recurse to get the actual surname.

NOTE: This gives the wrong answer for double surnames written without a hyphen in the Spanish tradition. For example, in "Anastasio Somoza Debayle", "Somoza" and "Debayle" are the (first) surnames of Anastasio's father and mother, respectively, so the surname is "Somoza Debayle". The current algorithm returns "Debayle" as the surname, which is incorrect.

6. Recompose the rest with any suffix as the givenName.
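The steps above can be sketched in base R for a single name. The function sketchParse below is a hypothetical helper written for illustration only. It is not the parseName implementation, and it ignores nonstandard characters, second lines, empty input, and single names followed by a comma.

```r
# Hypothetical helper, NOT the package code: steps 1-6 for one name
sketchParse <- function(x, suffix = c('Jr.', 'I', 'II', 'III', 'IV',
                                      'Sr.', 'Dr.', 'Jr', 'Sr')) {
  # Step 1: drop a trailing 2-character parenthetical such as " (AL)"
  x <- sub(' *\\(..\\)$', '', x)
  # Step 2: text after a comma is treated as a suffix; the rest is x2
  parts <- trimws(strsplit(x, ',', fixed = TRUE)[[1]])
  x2 <- parts[1]
  suf <- parts[-1]
  suf <- suf[nzchar(suf)]
  # Step 3: split x2 on blanks
  words <- strsplit(x2, ' +')[[1]]
  # Step 5: a trailing word listed in 'suffix' is not the surname;
  # move it to the suffix and look again
  while (length(words) > 1 && words[length(words)] %in% suffix) {
    suf <- c(words[length(words)], suf)
    words <- words[-length(words)]
  }
  # Step 4: the last word is the surname; a lone word is the given name
  n <- length(words)
  surname <- if (n > 1) words[n] else ''
  given <- if (n > 1) paste(words[-n], collapse = ' ') else words
  # Step 6: recompose the given name with any suffix
  if (length(suf) > 0) given <- paste(c(given, suf), collapse = ', ')
  c(surname = surname, givenName = given)
}

tst <- c('Joe Smith (AL)', 'John Brown, Jr.', 'John W. Brown III',
         'Anastasio Somoza Debayle', 'Sting')
out <- t(vapply(tst, sketchParse, character(2), USE.NAMES = FALSE))
```

Like the algorithm described above, this sketch returns "Debayle" rather than "Somoza Debayle" for the Spanish-tradition name, which illustrates the limitation stated in the NOTE.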

Examples


# NOT RUN {
##
## 1.  Parse standard first-last name format
##
tstParse <- c('Joe Smith (AL)', 'Teresa Angelica Sanchez de Gomez',
         'John Brown, Jr.', 'John Brown Jr.',
         'John W. Brown III', 'John Q. Brown,I',
         'Linda Rosa Smith-Johnson', 'Anastasio Somoza Debayle',
         'Ra_l Vel_zquez', 'Sting', 'Colette, ')
# }
# NOT RUN {
parsed <- parseName(tstParse)

tstParse2 <- matrix(c('Smith', 'Joe', 'Gomez', 'Teresa Angelica Sanchez de',
  'Brown', 'John, Jr.', 'Brown', 'John, Jr.',
  'Brown', 'John W., III', 'Brown', 'John Q., I',
  'Smith-Johnson', 'Linda Rosa', 'Debayle', 'Anastasio Somoza',
  'Velazquez', 'Raul', '', 'Sting', 'Colette', ''),
  ncol=2, byrow=TRUE)
# NOTE:  The 'Anastasio Somoza Debayle' is in the Spanish tradition
# and is handled incorrectly by the current algorithm.
# The correct answer should be "Somoza Debayle", "Anastasio".
# However, fixing that would complicate the algorithm excessively for now.
colnames(tstParse2) <- c("surname", 'givenName')

# }
# NOT RUN {
all.equal(parsed, tstParse2)
# }
# NOT RUN {
##
## 2.  Parse "surname, given name" format
##
tst3 <- c('Smith (AL),Joe', 'Sanchez de Gomez, Teresa Angelica',
     'Brown, John, Jr.', 'Brown, John W., III', 'Brown, John Q., I',
     'Smith-Johnson, Linda Rosa', 'Somoza Debayle, Anastasio',
     'Vel_zquez, Ra_l', ', Sting', 'Colette,')
tst4 <- parseName(tst3)

tst5 <- matrix(c('Smith', 'Joe', 'Sanchez de Gomez', 'Teresa Angelica',
  'Brown', 'John, Jr.', 'Brown', 'John W., III', 'Brown', 'John Q., I',
  'Smith-Johnson', 'Linda Rosa', 'Somoza Debayle', 'Anastasio',
  'Velazquez', 'Raul', '','Sting', 'Colette',''),
  ncol=2, byrow=TRUE)
colnames(tst5) <- c("surname", 'givenName')

# }
# NOT RUN {
all.equal(tst4, tst5)
# }
# NOT RUN {
##
## 3.  secondLine 
##
L2 <- parseName(c('Adam\n2nd line', 'Ed  \n --Vacancy', 'Frank'))

# check 
L2. <- matrix(c('', 'Adam', '', 'Ed', '', 'Frank'), 
              ncol=2, byrow=TRUE)
colnames(L2.) <- c('surname', 'givenName')
attr(L2., 'secondLine') <- c('2nd line', ' --Vacancy', NA)
# }
# NOT RUN {
all.equal(L2, L2.)
# }
# NOT RUN {
##
## 4.  Force surnameFirst when in a minority 
##
snf <- c('Sting', 'Madonna', 'Smith, Al')
SNF <- parseName(snf, surnameFirst=TRUE)

# check 
SNF2 <- matrix(c('', 'Sting', '', 'Madonna', 'Smith', 'Al'), 
               ncol=2, byrow=TRUE)
colnames(SNF2) <- c('surname', 'givenName')               
# }
# NOT RUN {
all.equal(SNF, SNF2)
# }
# NOT RUN {
##
## 5.  namesNotFound
##
noSub <- parseName('xx_x')

# check 
noSub. <- matrix(c('', 'xx_x'), 1)
colnames(noSub.) <- c('surname', 'givenName')               
attr(noSub., 'namesNotFound') <- 'xx_x'
# }
# NOT RUN {
all.equal(noSub, noSub.)
# }
