unf: universal numeric fingerprint

Description

A universal numeric fingerprint (UNF) is used to guarantee that a defined subset of data is substantively identical to a comparison subset. Two fingerprints will match if and only if the subsets of data generating them are identical when represented using a given number of significant digits.

Usage

unf(data, digits = NULL,
    ndigits = { if (is.null(digits)) { 8 } else (digits) },
    cdigits = { if (is.null(digits)) { 128 } else (digits) },
    version = 4.1, rowIndexVar = NULL,
    rowOrder = { if (is.null(rowIndexVar)) { NULL }
                 else { order(rowIndexVar) } })
unf2base64(x)
as.character.unf(x)
as.unf(char)

Arguments

data

A numeric or character vector, or a data frame. Other types will be computed.

digits

number of digits to use for both ndigits and cdigits when they are not given explicitly; see cdigits and ndigits

ndigits

number of significant digits to which numeric values are rounded before the cryptographic hash is applied

cdigits

number of characters to which character values are truncated before the cryptographic hash is applied

version

version of the UNF algorithm. Always use the same version of the algorithm to check a signature.

rowIndexVar

a vector of row IDs. The data will be sorted by this vector before the UNFs are computed, which affects the UNF for each vector. This is equivalent to unf(df[order(rowIndexVar), ]).

rowOrder

explicit sort ordering, an alternative to using rowIndexVar

x

a UNF object, as returned by unf

char

a character vector of UNF character strings

Value

The unf function returns a UNF object, which can be converted to a signature string using as.character. For example:

UNF:3:10,128:ZNQRI14053UZq389x0Bffg==

This representation identifies the signature as a fingerprint, using version 3 of the algorithm, computed to 10 significant digits for numbers and 128 characters for strings. The segment following the final colon is the actual fingerprint in base64-encoded format.

Note: to compare two UNFs, or sets of UNFs, one often wants to compare only the base64 portion. Use unf2base64 for this, which extracts the base64 portion. Use summary to produce a single UNF from a set of vectors by computing a new UNF across the base64 strings. The order of the vectors in the set is important.
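The layout of the signature string described above can be illustrated with base R alone. This is a minimal sketch that splits the example string on its colons; it does not use the package, and the variable names (sig, parts, parsed) are chosen here for illustration only.

```r
# Split the example signature from the text into its colon-separated fields.
# Assumption: the layout is UNF:<version>:<ndigits>,<cdigits>:<base64>,
# as in the example string above.
sig <- "UNF:3:10,128:ZNQRI14053UZq389x0Bffg=="

parts  <- strsplit(sig, ":", fixed = TRUE)[[1]]
digits <- as.integer(strsplit(parts[3], ",", fixed = TRUE)[[1]])

parsed <- list(
  version = parts[2],   # algorithm version
  ndigits = digits[1],  # significant digits used for numbers
  cdigits = digits[2],  # characters kept for strings
  base64  = parts[4]    # the fingerprint itself
)
str(parsed)
```

For real UNF objects, unf2base64 is the supported way to obtain the base64 portion; the sketch only shows what each field of the printed form means.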

Details

A UNF is created by rounding data values (or truncating strings) to a known number of digits (characters), representing those values in standard form (as 32-bit Unicode-formatted strings), and applying a fingerprinting method (such as a cryptographic hashing function) to this representation. UNFs are computed from data values provided by the statistical package, so they directly reflect the internal representation of the data, that is, the data as the statistical package interprets it.

A UNF differs from an ordinary file checksum in several important ways:

1. UNFs are format independent. The UNF for the data will be the same regardless of whether the data is saved in an R binary format, a SAS formatted file, a Stata formatted file, etc., but file checksums will differ.

2. UNFs are robust to insignificant rounding error. A UNF will be the same if the data differs only in non-significant digits; a file checksum will not.

3. UNFs detect misinterpretation of the data by the statistical software. If the statistical software misreads the file, the resulting UNF will not match the original, but the file checksums may match.

4. UNFs are strongly tamper resistant. Any accidental or intentional change to the data values will change the resulting UNF. Most file checksums and descriptive statistics detect only certain types of changes.

UNF libraries are available for standalone use, for use in C++, and for use with other packages.
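The round, normalize, and hash pipeline described above can be sketched in base R. This is only a toy illustration of the idea and not the UNF algorithm: the helper toy_fingerprint is hypothetical, it uses an MD5 file checksum from the tools package as a stand-in hash, and its string format is an assumption made here, so its output will not match any real UNF.

```r
# Hypothetical helper (not part of the unf package): round, write the values
# in a fixed textual form, then hash that text.
toy_fingerprint <- function(x, ndigits = 8) {
  rounded <- signif(x, ndigits)
  # Fixed-width exponential notation serves as the "standard form" here
  canon <- sprintf(paste0("%+.", ndigits - 1, "e"), rounded)
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(canon, tmp)
  # tools::md5sum hashes a file; a stand-in for the real fingerprinting method
  unname(tools::md5sum(tmp))
}

v  <- 1:100 / 10 + 0.0111
vr <- signif(v, digits = 2)

# The same data at 2 significant digits gives the same fingerprint
toy_fingerprint(v, 2) == toy_fingerprint(vr, 2)
# At 8 significant digits the two vectors differ, so the fingerprints differ
toy_fingerprint(v, 8) == toy_fingerprint(vr, 8)
```

The first comparison returns TRUE and the second FALSE, which mirrors the property that differences in non-significant digits do not change the fingerprint while differences in significant digits do.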

References

Altman, M., J. Gill and M. P. McDonald. 2003. Numerical Issues in Statistical Computing for the Social Scientist. John Wiley & Sons. http://www.hmdc.harvard.edu/numerical_issues/

Examples

# simple example
v <- 1:100 / 10 + 0.0111
vr <- signif(v, digits = 2)

# print.unf shows the standard format, including version and digits
print(unf(v))

# as.character will return base64 section only for comparisons
as.character(unf(v))

# this is FALSE, since the computed base64 values of the UNFs differ
unf2base64(unf(v)) == unf2base64(unf(vr))

# this is TRUE, since the base64 values of the UNFs are the same
# at 2 significant digits
unf2base64(unf(v, digits = 2)) == unf2base64(unf(vr))

# WARNING: this is FALSE. The base64 values are the same, but the
# number of calculated digits differs, so this is probably not the
# comparison you intend
identical(unf(v, digits = 2), unf(vr))

# compute a fingerprint of longley at 10 significant digits of accuracy
# for numeric values; this fingerprint can be stored and verified when
# reading the dataset later
data(longley)
mf10 <- unf(longley, ndigits = 10)

# this produces the same results as using signifz(), but not signif()
mf11 <- unf(signifz(longley, digits = 10))

unf2base64(mf11) == unf2base64(mf10)

# printable representation: prints seven UNFs, one for each vector
print(mf10)

# summarizes the base64 portion of the UNF for each vector into a
# single base64 UNF representing the entire dataset
summary(mf10)

# self test
unfTest <- get("unfTest", envir = environment(unf))
if (!unfTest(silent = FALSE)) {
  stop("failed self tests")
}