duplicated() determines which elements of a vector or data frame are
duplicates of elements with smaller subscripts, and returns a logical
vector indicating which elements (rows) are duplicates.
anyDuplicated(.) is a generalized, more efficient shortcut for
any(duplicated(.)).
duplicated(x, incomparables = FALSE, ...)
## Default S3 method:
duplicated(x, incomparables = FALSE, fromLast = FALSE, nmax = NA, ...)
## S3 method for class 'array'
duplicated(x, incomparables = FALSE, MARGIN = 1, fromLast = FALSE, ...)
anyDuplicated(x, incomparables = FALSE, ...)
## Default S3 method:
anyDuplicated(x, incomparables = FALSE, fromLast = FALSE, ...)
## S3 method for class 'array'
anyDuplicated(x, incomparables = FALSE, MARGIN = 1, fromLast = FALSE, ...)
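x may be a vector, a data frame, an array, or NULL.
For incomparables, FALSE is a special value, meaning that all values can
be compared, and may be the only value accepted for methods other than
the default. It will be coerced internally to the same type as x.
With fromLast = TRUE, the last (or rightmost) of identical elements
corresponds to duplicated = FALSE.
For MARGIN, see apply, and note that MARGIN = 0 may be useful.
Using this for lists is potentially slow, especially if the elements are
not atomic vectors (see vector) or differ only in their attributes. In
the worst case it is $O(n^2)$.
For the default methods, and whenever there are equivalent method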
definitions for duplicated and anyDuplicated, anyDuplicated(x, ...) is a
generalized shortcut for any(duplicated(x, ...)), in the sense that it
returns the index i of the first duplicated entry x[i] if there is one,
and 0 otherwise. Their behaviours may be different when at least one of
duplicated and anyDuplicated has a relevant method.
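As a minimal sketch of the relationship just described (the vector y is a made-up example, not part of the original page), anyDuplicated() reports the position of the first duplicate, whereas any(duplicated(.)) only reports whether one exists:

```r
# Hypothetical example vector: the value 3 first repeats at position 4.
y <- c(3, 1, 4, 3, 1)

first_dup <- anyDuplicated(y)    # index of the first duplicated entry
has_dup   <- any(duplicated(y))  # logical: is there any duplicate at all?

first_dup                        # 4
has_dup                          # TRUE

# With no duplicates, anyDuplicated() returns 0.
no_dup <- anyDuplicated(c(1, 2, 3))
no_dup                           # 0
```

Because 0 is the only "no duplicate" value, the result can be tested with first_dup > 0 wherever a logical answer is wanted.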
duplicated(x, fromLast = TRUE) is equivalent to but faster than
rev(duplicated(rev(x))).
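The equivalence can be checked on a small input. This is only an illustrative sketch; the vector z is an assumption introduced here, and the speed claim is not measured:

```r
# Hypothetical character vector with repeated "a" and "b".
z <- c("a", "b", "a", "c", "b")

# Scanning from the end: every occurrence except the last one is a duplicate.
from_last <- duplicated(z, fromLast = TRUE)

# The slower formulation: reverse, scan forwards, reverse the result.
via_rev <- rev(duplicated(rev(z)))

from_last                      # TRUE TRUE FALSE FALSE FALSE
identical(from_last, via_rev)  # TRUE
```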
The data frame method works by pasting together a character
representation of the rows separated by \r, so may be imperfect
if the data frame has characters with embedded carriage returns or
columns which do not reliably map to characters.
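For orientation, a small sketch of the data frame method on well-behaved columns. The data frame df and its column names are hypothetical, and the example does not exercise the carriage-return caveat:

```r
# Hypothetical data frame: row 3 repeats row 1; row 4 differs in 'grp'.
df <- data.frame(id  = c(1, 2, 1, 2),
                 grp = c("a", "b", "a", "c"),
                 stringsAsFactors = FALSE)

dup_rows <- duplicated(df)   # one logical value per row
dup_rows                     # FALSE FALSE TRUE FALSE

# Keep only the first occurrence of each distinct row.
df_unique <- df[!dup_rows, ]
nrow(df_unique)              # 3
```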
The array method calculates for each element of the sub-array specified
by MARGIN if the remaining dimensions are identical to those for an
earlier (or later, when fromLast = TRUE) element (in row-major order).
This would most commonly be used to find duplicated rows (the default)
or columns (with MARGIN = 2).
Note that MARGIN = 0 returns an array of the same dimensionality
attributes as x.
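The three MARGIN settings mentioned above can be compared on a small matrix. This is a sketch with a made-up matrix m; only the documented arguments are used:

```r
# Hypothetical 3 x 3 matrix: row 3 repeats row 1, column 3 repeats column 1.
m <- rbind(c(1, 2, 1),
           c(3, 4, 3),
           c(1, 2, 1))

dup_rows <- duplicated(m)               # MARGIN = 1 (default): compare rows
dup_cols <- duplicated(m, MARGIN = 2)   # compare columns
dup_elts <- duplicated(m, MARGIN = 0)   # compare individual elements

dup_rows        # FALSE FALSE TRUE
dup_cols        # FALSE FALSE TRUE
dim(dup_elts)   # 3 3, the same dimensions as m
```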
Missing values are regarded as equal, but NaN is not equal to NA_real_.
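A short sketch of this rule on a numeric vector (the vector v is an assumption made for illustration):

```r
# Hypothetical numeric vector mixing NA and NaN.
v <- c(NA, 1, NA, NaN, NaN)

dup_v <- duplicated(v)
dup_v   # FALSE FALSE TRUE FALSE TRUE

# NA and NaN are kept apart: neither is a duplicate of the other.
na_vs_nan <- duplicated(c(NA_real_, NaN))
na_vs_nan   # FALSE FALSE
```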
Values in incomparables will never be marked as duplicated.
This is intended to be used for a fairly small set of values and will
not be efficient for a very large set.
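To illustrate, a minimal sketch with the default method, where the vector w and the choice of 1 as the incomparable value are assumptions made for this example:

```r
# Hypothetical vector with two repeated values.
w <- c(1, 1, 2, 2)

plain   <- duplicated(w)                     # ordinary behaviour
skip_1  <- duplicated(w, incomparables = 1)  # 1 is never marked as duplicated

plain    # FALSE TRUE FALSE TRUE
skip_1   # FALSE FALSE FALSE TRUE
```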
When used on a data frame with more than one column, or an array or matrix when comparing dimensions of length greater than one, this tests for identity of character representations. This will catch people who unwisely rely on exact equality of floating-point numbers!
Character strings will be compared as byte sequences if any input is
marked as "bytes" (see Encoding).
Except for factors, logical and raw vectors the default nmax = NA is
equivalent to nmax = length(x). Since a hash table of size 8*nmax bytes
is allocated, setting nmax suitably can save large amounts of memory.
For factors it is automatically set to the smaller of length(x) and the
number of levels plus one (for NA). If nmax is set too small there is
liable to be an error: nmax = 1 is silently ignored.
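As a hedged sketch of supplying nmax: the vector big and the bound 16 are assumptions chosen for illustration (the data have 10 distinct values, so 16 is a comfortable upper bound), and no memory saving is measured here:

```r
# Hypothetical long-ish integer vector with only 10 distinct values.
big <- rep(1:10, times = 1000)

d_default <- duplicated(big)             # nmax = NA, i.e. length(big)
d_bounded <- duplicated(big, nmax = 16)  # assumed upper bound on distinct values

# The result should not depend on nmax as long as the bound is large enough.
identical(d_default, d_bounded)
n_distinct <- sum(!d_bounded)            # number of first occurrences
```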
Long vectors are supported for the default method of duplicated, but may
only be usable if nmax is supplied.
See also unique.
x <- c(9:20, 1:5, 3:7, 0:8)
## extract unique elements
(xu <- x[!duplicated(x)])
## similar, same elements but different order:
(xu2 <- x[!duplicated(x, fromLast = TRUE)])
## xu == unique(x) but unique(x) is more efficient
stopifnot(identical(xu, unique(x)),
identical(xu2, unique(x, fromLast = TRUE)))
duplicated(iris)[140:143]
duplicated(iris3, MARGIN = c(1, 3))
anyDuplicated(iris) ## 143
anyDuplicated(x)
anyDuplicated(x, fromLast = TRUE)