Creates properly sized clusters for matching, using either
alphabetical or word embedding clustering. With word embedding,
the function first builds a word embedding from the provided
vectors and runs PCA on the resulting matrix. It then takes the first
k dimensions (where k is provided by the user) and runs k-means on
that reduced matrix to obtain the clusters.
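The word embedding path described above (embedding, then PCA, then k-means) can be sketched in base R. This is only an illustration under stated assumptions: the letter-count embedding, the toy names, and the hand-picked values of k and the number of centers are all hypothetical and are not taken from the clusterMatch implementation.

```r
# Illustrative sketch only. The letter-count "embedding" is an assumption
# made for this example, not the embedding clusterMatch() builds internally.
set.seed(1)
names_all <- c("anna", "anne", "bob", "bobby", "carl", "carla", "dave", "david")

# One row per name, one column per letter a-z; entries are letter counts
embed <- t(sapply(names_all, function(x) {
  chars <- strsplit(x, "")[[1]]
  tabulate(match(chars, letters), nbins = 26)
}))
colnames(embed) <- letters

# PCA on the embedding matrix
pca <- prcomp(embed, center = TRUE, scale. = FALSE)

# Keep the first k principal components (k is chosen by hand here)
k <- 2
scores <- pca$x[, seq_len(k), drop = FALSE]

# k-means on the reduced matrix yields one cluster label per name
km <- kmeans(scores, centers = 3, iter.max = 100, nstart = 5)
clusters <- km$cluster
```

Here `clusters` holds one integer label per input name. In the same spirit, names that share a cluster would be the only ones compared against each other in a later matching step.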
clusterMatch(vecA, vecB, nclusters, max.n, word.embed, min.var, iter.max)
clusterMatch returns a list containing:
The cluster assignments for dataset A
The cluster assignments for dataset B
The number of clusters created
The k-means object output.
The PCA object output.
The number of dimensions from PCA used for the k-means clustering.
vecA: The character vector from dataset A.
vecB: The character vector from dataset B.
nclusters: The number of clusters to create from the provided data. Specify either nclusters or max.n and leave the other as NULL.
max.n: The maximum number of observations from either dataset A or dataset B allowed in the largest cluster. Specify either nclusters or max.n and leave the other as NULL.
word.embed: Whether to use word embedding clustering. Default is FALSE.
min.var: The minimum share of explained variance (maximum = 1) that a PCA dimension must provide to be included in k-means clustering when using word embedding. Default is 0.20.
iter.max: Maximum number of iterations for the k-means algorithm.
Ben Fifield <benfifield@gmail.com>
data(samplematch)
cl <- clusterMatch(dfA$firstname, dfB$firstname, nclusters = 3)