gower.dist(data.x, data.y=data.x, rngs=NULL, KR.corr=TRUE)
Columns of mode numeric will be considered as interval scaled variables; columns of mode character or class factor will be considered as categorical nominal variables.
Dissimilarities between rows of data.x and rows of data.y will be computed. If data.y is not provided, by default it is assumed equal to data.x.
In rngs, in correspondence of nonnumeric variables, just put 1 or NA. When rngs=NULL (default) the range of a numeric variable is computed on the available data.
When KR.corr=TRUE (default) the extension of Gower's dissimilarity measure proposed by Kaufman and Rousseeuw (1990) is used. Otherwise, when KR.corr=FALSE, Gower's (1971) formula is considered.
The function returns a matrix object with distances among rows of data.x and those of data.y.
With the default setting (KR.corr=TRUE) the Kaufman and Rousseeuw (1990) extension of Gower's dissimilarity coefficient is used. The final dissimilarity between the ith and jth unit is obtained as a weighted average of the dissimilarities for each variable: $$d(i,j) = \frac{\sum_k{\delta_{ijk} d_{ijk}}}{\sum_k{\delta_{ijk}}}$$
In particular, $d_{ijk}$ represents the distance between the ith and jth unit computed considering the kth variable. It depends on the nature of the variable:
logical columns are considered as asymmetric binary variables; in this case $d_{ijk}=0$ if $x_{ik} = x_{jk} = \text{TRUE}$, 1 otherwise;
factor or character columns are considered as categorical nominal variables and $d_{ijk}=0$ if $x_{ik}=x_{jk}$, 1 otherwise;
numeric columns are considered as interval-scaled variables and $$d_{ijk}=\frac{\left|x_{ik}-x_{jk}\right|}{R_k}$$ where $R_k$ is the range of the kth variable. The range is the one supplied with the argument rngs (rngs[k]) or the one computed on available data (when rngs=NULL);
ordered columns are considered as categorical ordinal variables and the values are substituted with the corresponding position index, $r_{ik}$, in the factor levels. When KR.corr=FALSE these position indexes (that are different from the output of the R function rank) are transformed in the following manner $$z_{ik}=\frac{r_{ik}-1}{\max\left(r_{ik}\right) - 1}$$ These new values, $z_{ik}$, are treated as observations of an interval scaled variable.
As far as the weight $\delta_{ijk}$ is concerned:
In practice, NAs and pairs of cases with $x_{ik}=x_{jk}=\text{FALSE}$ do not contribute to the distance computation.
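The rules above can be illustrated with a minimal base R sketch that computes the dissimilarity for a single pair of rows. This is not the package's implementation: the helper name gower_pair and the toy data are hypothetical, the ordinal branch always applies the $z_{ik}$ rescaling (it ignores the KR.corr switch), and every numeric or ordered column is assumed to take at least two distinct values.

```r
# Hypothetical helper: Gower dissimilarity between rows i and j of df.
gower_pair <- function(df, i, j, rngs = NULL) {
  num <- 0  # running sum of delta_ijk * d_ijk
  den <- 0  # running sum of delta_ijk
  for (k in seq_along(df)) {
    col <- df[[k]]
    xi <- col[i]
    xj <- col[j]
    # delta_ijk = 0 when either value is missing
    if (is.na(xi) || is.na(xj)) next
    if (is.logical(col)) {
      # asymmetric binary: FALSE/FALSE pairs do not contribute
      if (!xi && !xj) next
      d <- if (xi && xj) 0 else 1
    } else if (is.ordered(col)) {
      # ordinal: position index rescaled to z, then treated as interval scaled
      r <- as.integer(col)
      z <- (r - 1) / (max(r, na.rm = TRUE) - 1)
      d <- abs(z[i] - z[j]) / diff(range(z, na.rm = TRUE))
    } else if (is.factor(col) || is.character(col)) {
      # nominal: simple matching
      d <- if (xi == xj) 0 else 1
    } else {
      # interval scaled: absolute difference divided by the range R_k
      R_k <- if (is.null(rngs)) diff(range(col, na.rm = TRUE)) else rngs[k]
      d <- abs(xi - xj) / R_k
    }
    num <- num + d
    den <- den + 1
  }
  num / den
}

# Toy data (made up for illustration)
toy <- data.frame(
  flag  = c(TRUE, FALSE, FALSE),
  group = c("a", "b", "a"),
  size  = c(1, 3, 5),
  grade = factor(c("low", "high", "mid"),
                 levels = c("low", "mid", "high"), ordered = TRUE),
  stringsAsFactors = FALSE
)

gower_pair(toy, 1, 2)  # all four variables contribute: (1 + 1 + 0.5 + 1) / 4
gower_pair(toy, 2, 3)  # flag is FALSE/FALSE and is dropped: (1 + 0.5 + 0.5) / 3
```

In the second call the logical column has zero weight, so the sum is divided by 3 rather than 4; this is the effect of $\delta_{ijk}$ on the denominator.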
Kaufman, L. and Rousseeuw, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
daisy, dist
x1 <- as.logical(rbinom(10,1,0.5))
x2 <- sample(letters, 10, replace=TRUE)
x3 <- rnorm(10)
x4 <- ordered(cut(x3, -4:4, include.lowest=TRUE))
xx <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)
# matrix of distances among observations in xx
gower.dist(xx)
# matrix of distances between the first three obs. in xx
# and the remaining ones
gower.dist(data.x=xx[1:3,], data.y=xx[4:10,])
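The remaining two arguments can be exercised in the same way. The sketch below assumes that gower.dist is provided by the StatMatch package and that the package is installed (otherwise the calls are skipped); the range of 8 supplied for x3 is an arbitrary illustrative value, with NA in the positions of the nonnumeric columns as described above.

```r
set.seed(1)
x1 <- as.logical(rbinom(10, 1, 0.5))
x2 <- sample(letters, 10, replace = TRUE)
x3 <- rnorm(10)
x4 <- ordered(cut(x3, -4:4, include.lowest = TRUE))
xx <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

d_fixed <- NULL
d_gower <- NULL
# Assumption: gower.dist lives in the StatMatch package
if (requireNamespace("StatMatch", quietly = TRUE)) {
  # user-supplied range for the numeric column x3; NA for nonnumeric columns
  d_fixed <- StatMatch::gower.dist(xx[1:3, ], xx[4:10, ],
                                   rngs = c(NA, NA, 8, NA))
  # Gower's (1971) formula instead of the Kaufman and Rousseeuw extension
  d_gower <- StatMatch::gower.dist(xx[1:3, ], xx[4:10, ], KR.corr = FALSE)
}
```

Both results, when computed, have one row per row of data.x and one column per row of data.y.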