
sjmisc (version 1.2)

group_str: Group near elements of string vectors

Description

This function groups elements of a string vector (character or string variable) according to the distance ('similarity') between them. The more similar two string elements are, the more likely they are to be combined into one group.

Usage

group_str(strings, maxdist = 2, method = "lv", strict = FALSE,
  trim.whitespace = TRUE, remove.empty = TRUE, showProgressBar = FALSE)

Arguments

strings
Character vector with string elements.
maxdist
Maximum distance between two string elements at which they are still treated as similar or equal.
method
Method for distance calculation. The default is "lv" (Levenshtein distance). See stringdist for details.
strict
Logical; if TRUE, value matching is stricter. See 'Examples'.
trim.whitespace
Logical; if TRUE (default), leading and trailing white spaces will be removed from string values.
remove.empty
Logical; if TRUE (default), empty string values will be removed from the character vector strings.
showProgressBar
Logical; if TRUE, a progress bar is displayed while the distance matrix is computed. Default is FALSE, hence the bar is hidden.
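
The role of maxdist can be illustrated with base R's utils::adist(), which computes a generalized Levenshtein distance. This is only a sketch: it assumes that method = "lv" corresponds to plain Levenshtein distance, and it does not reproduce how group_str() forms its groups. The names x, d and similar are illustrative.

```r
# Illustration only: base R's adist() stands in for the "lv" distance
x <- c("Hello", "Helo", "System", "Systemic")

# Pairwise edit distances between all elements
d <- utils::adist(x)
dimnames(d) <- list(x, x)
d

# Pairs whose distance does not exceed a threshold of 2
# (upper triangle only, so each pair is listed once)
similar <- d <= 2 & upper.tri(d)
which(similar, arr.ind = TRUE)
```

Raising the threshold admits more distant pairs, which is why a larger maxdist tends to produce larger groups.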

Value

  • A character vector where similar string elements (values) are recoded into a new, single value. The return value has the same length as strings, i.e. grouped elements appear multiple times, so the count for each grouped string is still available (see 'Examples').
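
To make the shape of the return value concrete, here is a minimal base-R sketch. The helper toy_group() is hypothetical and is not the sjmisc implementation: it uses a simple greedy pass over utils::adist() distances, and the comma-separated group label is an assumption made for this example.

```r
# Hypothetical helper, NOT the sjmisc algorithm: greedy single-pass grouping
toy_group <- function(strings, maxdist = 2) {
  d <- utils::adist(strings)            # pairwise edit distances
  out <- strings
  assigned <- rep(FALSE, length(strings))
  for (i in seq_along(strings)) {
    if (assigned[i]) next
    # all not-yet-grouped elements within maxdist of element i (including i)
    members <- which(!assigned & d[i, ] <= maxdist)
    # assumed label format: group members joined by ", "
    out[members] <- paste(unique(strings[members]), collapse = ", ")
    assigned[members] <- TRUE
  }
  out
}

grouped <- toy_group(c("Hello", "Helo", "New", "System", "Systemic"))
length(grouped)  # same length as the input vector
table(grouped)   # each group label occurs once per member
```

Because every input element keeps its position and only its value is recoded, table() on the result gives the size of each group.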

See Also

str_pos

Examples

library(sjmisc)

oldstring <- c("Hello", "Helo", "Hole", "Apple",
               "Ape", "New", "Old", "System", "Systemic")
newstring <- group_str(oldstring)

# see result
newstring

# count for each group
table(newstring)

library(sjPlot)
# print table to compare original and grouped string
sjt.frq(data.frame(oldstring, newstring),
        removeStringVectors = FALSE,
        autoGroupStrings = FALSE)

# larger groups
newstring <- group_str(oldstring, maxdist = 3)
sjt.frq(data.frame(oldstring, newstring),
        removeStringVectors = FALSE,
        autoGroupStrings = FALSE)

# be more strict with matching pairs
newstring <- group_str(oldstring, maxdist = 3, strict = TRUE)
sjt.frq(data.frame(oldstring, newstring),
        removeStringVectors = FALSE,
        autoGroupStrings = FALSE)
