Learn R Programming

pvrank (version 1.1.2)

comprank: Computes various rank correlation coefficients

Description

Computes various rank order correlations with and withou ties present.

Usage

comprank(p, q = NULL, indr, tiex = "woodbury", sizer = 100000,
	 repgin = 1000, print = TRUE)

Arguments

p

a numeric vector, matrix or data frame.

q

NULL (default) or a vector with compatible dimensions to \(p\).

indr

a character string that specifies the rank correlation that is used in the test statistic. Acceptable values are: "spearman", "kendall", "gini","r4", "fy1" (Fisher-Yates - means), "fy2" (Fisher-Yates - medians), "sbz" (Symmetrical Borroni-Zenga). Only enough of the string to be unique is required.

tiex

a character string that specifies the method for breaking ties. Possible options are: "woodbury", "gh","wgh", "midrank","dubois". Only enough of the string to be unique is required. This option is ignored when ties are absent. The options: "midrank" and "dubois" are also ignored in the calculation of \(r_4\), Fisher-Yates means and Fisher-Yates medians.

sizer

number of replications for resolving ties by randomization. Default value: \(10^5\).

repgin

number of sampled pairs of permutations needed to apply the weighted max-min method (wgh). Default value: \(10^3\).

print

if FALSE suppresses some of the output.

Value

a list with the following components :

r

The value of the required rank correlation.

ities

Method for dealing with ties.

Details

Consider a fixed set of \(n\) distinct items ordered according to the different degree in which they possess two common attributes represented by \(p\) and \(q\). Let us suppose that each attribute consist of a host of intangibles that can be ranked but not measured and that the evaluations are expressed in terms of an ordinal scale of \(n\) ranks: \((p_i, q_i), i=1,2, ..., n\) where \(p\) is re-arranged in the natural order.

Gideon & Hollister (1987) pointed out that reasonable coefficients of rank correlation need to possess certain properties and gave a list of postulates for nonparametric measures of dependence based on works of R\(\acute{e}\)nyi (1959) and Schweizer & Wolff (1981). On this specific issue, see also Scarisini, (1984), King & Chinchilli (2001), and Genest & Plante (2003). In short, the properties that any index of rank correlation should satisfy are the following

  • \(r(p,q)\) is defined for any pair of permutations \(p,q\).

  • Comparability. \(-1 \le r(p,q) \le 1\) with \(r(p,p) = r(q,q) = 1\) and \(r(p,p^*) = r(q,q^*) = -1\) where \(p^* = n+1- p\) and \(q^* = n+1- q\). It is also required that increasing levels (decreasing levels) of agreement are reflected by increasing positive (negative) values.

  • Symmetry. \(r(p,q) = r(q,p)\).

  • Right-invariance. \(r(p[\theta],q[\theta]) = r(p,q)\) for all \(\theta, p, q \in S_n\), where \(S_n\) is the set of all \(n!\) permutations of integers \(1, 2,\cdots, n\).

  • Antisymmetry under reversal. \(r(p,q^*) = r(p^*,q) = -r(p,q^*)\).

  • Zero expected value under independence. \(E_{p,q \in S_n}[r(q,p)]=0\) when all rankings are equally probable with probability \(1/n!\)

pvrank includes six admissible rank correlations (in the sense that they have the desirable properties described above).

$$Spearman: r_1=1-(6/(n^3-n)S_1;\quad S_1=\sum_{i=1}^n (p_i-q_i)^2$$ $$Kendall: r_2=(2/(n^2-n))S_2;\quad S_2=\sum_{i=1}^{n-1}\sum_{j=i+1}^n sign(p_i-q_j)sign(p_i-q_j)$$ $$Gini: r_3=(2/(n^2-k_n))S_3;\quad S_3=\sum_{i=1}^n (|n+1-p_i-q_i|-|p_i-q_i|)$$ $$ r_4=\Big[\sum_{i=1}^ng_i(p,q^*)\Big]\Big[\sum_{i=1}^ng_j(p^*,q)\Big]-\Big[\sum_{i=1}^ng_i(p^*,q^*)\Big]\Big[\sum_{i=1}^ng_j(p,q)\Big]/M_n$$ where \(k_n\) is zero if \(n\) is even and one if \(n\) is odd. The function \(g\left(.\right)\) is defined as \(g_h(p_i,q_j)=max(p_i/q_j,q_j/p_i)\). The quantity \(M_n\) is given by $$M_n=\Big[k_n+2\sum_{i=1}^{n/2}(n+1-i)/i\Big]^2 -n^2$$

Three other admissible coefficients are:

$$FY - means: r_5=\frac{\sum_{i=1}^n \xi(p_i|n)\xi(q_i|n)}{\sum_{i=1}^n \xi(p_i|n)^2}$$ $$FY - medians: r_6=\frac{\sum_{i=1}^n \zeta(p_i|n)\zeta(q_i|n)}{\sum_{i=1}^n \zeta(p_i|n)^2}$$ $$Symmetrical BZ: r_7=1.5\frac{\sum_{i=1}^n\sum_{j=1}^n|(i+p_i)-(j+p_j|-|(i-p_i)-|(j-p_j)|)}{n(n^2-1)}$$

The expression \(\xi(p_i|n)\) denotes the expected values of the i-th largest standardized deviate in a sample of size \(n\) from a standad Gaussian population. See, for example, function evNormOrdStats in package EnvStats, Millard (2013). An implementation of the Gaussian rank correlation coefficient is given in the rococo package. See Bodenhofer et al. (2013). Finally, \(\zeta(p_i|n)\) is the i-th order statistic median from a standard Gaussian distribution. These medians \(\zeta(p_i|n)\) are exactly related to the order statistic medians \(m_i\) from a uniform distribution on \([0,1]\), that is, \(\zeta(p_i|n)=\Phi^{-1}(m_i)\) where \(\Phi^{-1}(x)\) is the quantile function of the standard Gaussian cumulative distribution function. The medians of the order statistics from a unit uniform distribution are given by \(m_i=B^{-1}(0.5,i,n+1-i)\) where \(B^{-1}(x,a,b)\) denotes the inverse Beta distribution with parametes \(a,b\) evaluated at the percentage \(x\). Fortunately, \(B^{-1}(,)\) is a standard function: qbeta in package stats. See Tarsitano & Amerise (2016) and Amerise & Tarsitano (2016).

In the absence of ties \(r_1, r_2, r_3\) can assume \((n^3-n)/6+1\), \((n^2-n)/2+1\), \((n^2-k_n)/2+1\) distinct values, respectively. The coefficient \(r_4\) can assume a number of different values of the order \(0.25n!\) more or less uniformly spaced from each other. When \(n>3\), \(r_1\) can be zero if, and only if, \(n\) is not of the form \(n=4m+2\) where \(m\) is a positive integer. The quantity \(S_2\) is even if, and only if, \(n=4m\) or \(n=4m+1\); \(S_2\) only takes on an odd value if \(n\) is not in that form. For \(n>3\), zero is always a value of \(r_3\). If \(n\ge5\), coefficient \(r_4\) fails to be zero for \(n=7\). The number of distinct values of \(r_5\) and \(r_6\) is only limited by the number of bits used in the representation of float-point values in a computer.

Equal values are common when rank methods are applied to rounded data or data consisting solely of small integers. A popular technique for resolving ties in rank correlation is the mid-rank method: the mean of the rankings remain unaltered, but the variance is reduced and changes according to the number and location of ties.

Woodbury (1940) notes that the quadratic mean has been proposed to preserve the sum of squares of the ranks: \(n(n+1)(2n+1)/6\). In this case the condition on the total variance is satisfied, but not that on the total mean, which turns out to be greater than \((n+1)/2\). The Woodbury method yields a corrected values of the rank correlation which could be regarded as the average of the values of the conventional coefficient obtained by all possible rankings of tied observations. In essence, the methods "gh" (see Gideon and Hollister, 1987) and "wgh" (see Gini, 1939) consider two special orderings with the objective to determine the pair of permutations that maximizes the direct association or positive correlation and the pair of permutations that maximizes the inverse association or, in absolute terms, the negative correlation. The method "gh" yields the simple average of the bounds; the method "wgh" yields a weighted average of the bounds. See Amerise and Tarsitano (2014).

References

Amerise, I. L. and Tarsitano, A. (2014). Correction methods for ties in rank correlations. Journal of Applied statistics, 42, 1--13.

Amerise, I. L. and Tarsitano, A. (2016). A new symmetrical test of bivariate independence. Submitted.

Borroni, G. C. (2013). "A new rank correlation measure". Statistical Papers, 54, 255--270.

Bodenhofer, U. et al. (2013). "Testing noisy numerical data for monotonic association". information Sciences, 245, 21--37.

Genest, C. and Plante, J.-F. (2003). "On Blest's measure of rank correlation". The Canadian Journal of Statistics, 31, 35--52.

Gideon, R. and Hollister, A. (1987). "A rank correlation coefficient resistant to outliers". Journal of the American Statistical Association, 82, 656--666.

Gini, C. (1939). "Sulla determinazione dell'indice di cograduazione". Metron, 13, 41--48.

King, T.S. and Chinchilli, V.M. (2001). "Robust estimators of the concordance correlation coefficient". Journal of Biopharmaceutical Statistics, 11, 83--105.

Millard, S.P. (2013). EnvStats: An R Package for Environmental Statistics. Springer, New York.

R\(\acute{e}\)nyi, A. (1959). "On measures of dependence". Acta Mathematica Hungarica, 10, 441--451.

Scarsini, M. (1984). "On measures of dependence". Stochastica, 8, 201--218.

Schweizer, B. and Wolff, E.F. (1981). "On nonparametric measures of dependence for random variables". Annals of Statistics, 9, 879--885.

Tarsitano, A. and Lombardo, R. (2013). "A coefficient of correlation based on ratios of ranks and anti-ranks". Jahrbucher fur Nationalokonomie und Statistik, 233, 206--224.

Tarsitano, A. and Amerise, I. L. (2016). "Effectiveness of rank correlation statistics in non-linear relationships". Submitted.

Woodbury, M. A. (1940). "Rank correlation when there are equal variates". The Annals of Mathematical Statistics 11, 358--362.

Examples

Run this code
# NOT RUN {
data(Zoutus);attach(Zoutus);print(Zoutus)
a<-comprank(LogDose,LogTime,"spearman","woodbury")
cat(a$r,a$ities,"\n")
a<-comprank(LogDose,LogTime,"kendall","woodbury")
cat(a$r,a$ities,"\n")
detach(Zoutus)
#####
#
# Yotopoulos, P. A.  Nugent, J. B. (1973). A balanced-growth version of the
# linkage hypothesis: a test. The Quarterly Journal of Economics, 87, 157-171.
x<-1:18
y<-c(7,1,4,8,3,9,2,5,10,6,17,13,14,12,11,16,15,18)
a<-comprank(x,y,"gini");cat(a$r,a$ities,"\n")
#####
#
data(Franzen);attach(Franzen)
op<-par(mfrow=c(1,1))
plot(MECISSP,PPP, main="Environmental Attitudes in International Comparison",
xlab="Mean Environmental concern ISSP", ylab="Purchasing power parity",
pch=19, cex=0.8,col="salmon3")
abline(h=mean(PPP),col="darkred",lty=2,lwd=1)
abline(v=mean(MECISSP),col="darkred",lty=2,lwd=1)
par(op)
a<-comprank(MECISSP,PPP,"kendall","gh");cat(a$r,a$ities,"\n")
a<-comprank(MECISSP,PPP,"gini","wgh");cat(a$r,a$ities,"\n")
detach(Franzen)
#####
#
data(Viscoh);attach(Viscoh)
Viscoh<-as.matrix(Viscoh)
a<-comprank(Viscoh,"spearman","gh",print=FALSE)
print(a$r);cat(" method:", a$ities,"\n")
b<-comprank(Viscoh,"r4","gh",print=FALSE)
print(b$r);cat(" method:", b$ities,"\n")
c<-comprank(Viscoh,"fy1","wgh",print=FALSE)
print(c$r);cat("method:", c$ities,"\n")
d<-comprank(Viscoh,"fy2","wgh",print=FALSE)
print(d$r);cat(" method:", d$ities,"\n")
d<-comprank(Viscoh,"sbz","wgh",print=FALSE)
print(d$r);cat(" method:", d$ities,"\n")
detach(Viscoh)
#####
#
data(Laudaher);attach(Laudaher)
a1<-comprank(Duration,Infiltration,"gini","midrank")
a2<-comprank(Duration,Infiltration,"gini","dubois")
a3<-comprank(Duration,Infiltration,"spearman","midrank")
a4<-comprank(Duration,Infiltration,"spearman","dubois")
cat("Coefficient","method","\n",a1$r,a1$ities,"\n",a2$r,a2$ities,"\n",a3$r,a3$ities,
"\n",a4$r,a4$ities,"\n")
detach(Laudaher)
#####
# Asymptotic confidence intervals.
r.cofint <- function(r, n, index, Level=.95) {
	asd<-rep(0,4)
	asd[1]<-1/sqrt(n-1) # Spearman
	asd[2]<-sqrt((4*n+10)/(9*n*(n-1))) # Kendall
	asd[3]<-1/sqrt(1.5*n) # Gini
	asd[4]<-1/sqrt(1.00762*(n-1)) # r4
	# Fisher-Yates-means, Fisher-Yates-medians, symmetrical BZ
	if (index<=4) {zse<-r*asd[index]} else {zse<-atan(r*(1-0.6/(n+8)))/sqrt(n-3)} 
	rlow <- r - zse * qnorm((1-Level)/2,lower.tail=FALSE);rlow<-max(-1,rlow)
 	rupp <- r + zse * qnorm((1-Level)/2,lower.tail=FALSE);rupp<-min(1,rupp)
 	out<-list(Lower.r=rlow, Upper.r=rupp)
 	return(out)
}
#
# Rajalingam S. and Zeya O. "Summative assessments by Spearman`s correlation
# coefficient: a case study. in enhancing learning: teaching & learning international 
# conference. Nov 24-26 2011. Miri, Sarawak: Curtin University

FA=c(39.09,39.77,35.62,34.69,34.42,35.57,35.94,38.68,38.41,36.4,37.95,37.03,38.03,
35.5,38.55,30.68,26.3,36.28)
SA=c(27.5,34,24,24,17,17.5,16.5,26,25.5,28.5,26.5,13,12.5,9.5,28.5,23,24.5,22)
n<-length(FA)
op<-par(mfrow=c(1,1))
plot(FA,SA, main="Academic progresses of students",
xlab="Formative assessment", ylab="summative assessment",
pch=19,col="steelblue4")
text(FA,SA,label=1:n,cex=0.8,pos=2)
abline(h=mean(SA),col="darkblue",lty=2,lwd=1)
abline(v=mean(FA),col="darkblue",lty=2,lwd=1)
par(op)
rct<-c("spearman","kendall","gini","r4", "fy1","fy2","sbz")
for (index in 1:7){
	r<-comprank(FA,SA,rct[index])$r
	cir<-r.cofint(r,n,index,Level=0.99)
	cat(rct[index],"Low:",cir$Lower,"Value:",r," Upp: ",cir$Upper.r,"\n")
}
#####
#
# Daniel, C.  Wood, F. S. Fitting Equations to Data. New York: John Wiley,
# 1971, p. 45. Pilot-plant data
# The response variable (y) corresponds to the acid content determined by
# titration and the explanatory variable (x) is the organic acid content determined 
# by extraction and weighting
y<-c(76, 70, 55, 71, 55, 48, 50, 66, 41, 43, 82, 68, 88, 58, 64, 88, 89, 88,
84, 88)
x<-c(123, 109, 62, 104, 57, 37, 44, 100, 16, 28, 138, 105, 159, 75, 88, 164,
169, 167, 149, 167)
out1<-comprank(x,y,"sp","woodbury")
out2<-comprank(x,y,"ke","woodbury")
out3<-comprank(x,y,"gi","woodbury")
out4<-comprank(x,y,"r4","woodbury")
out5<-comprank(x,y,"fy1","woodbury")
out6<-comprank(x,y,"fy2","woodbury")
out7<-comprank(x,y,"sbz","woodbury")
ind<-c(out1$r,out2$r,out3$r,out4$r,out5$r,out6$r,out7$r)
cat(out1$ities,"\n")
print(round(ind,5))
#####
#
data(DietFish);attach(DietFish)
op<-par(mfrow=c(1,1))
plot(occ_D,occ_H, main="Comparison of the diets of banded killfish",xlab="occ F_diaphanus",
ylab="occ F_heteroclitus", pch=19, cex=0.8,col="darkcyan")
text(occ_D,occ_H, labels = rownames(DietFish), cex=0.5, pos=3)
abline(h=mean(occ_D),col="darkred",lty=2,lwd=1)
abline(v=mean(occ_H),col="darkred",lty=2,lwd=1)
par(op)
a<-comprank(occ_D,occ_H,"sp","woodbury");cat(a$r,a$ities,"\n")
a<-comprank(occ_D,occ_H,"sp","gh");cat(a$r,a$ities,"\n")
a<-comprank(occ_D,occ_H,"sp","wgh");cat(a$r,a$ities,"\n")
a<-comprank(occ_D,occ_H,"sp","midrank");cat(a$r,a$ities,"\n")
a<-comprank(occ_D,occ_H,"sp","dubois");cat(a$r,a$ities,"\n")
detach(DietFish)
#####
#
data(Radiation);attach(Radiation);Radiation<-as.matrix(Radiation)
r1<-comprank(Radiation,"spearman","midrank",print=FALSE);eigen(r1$r)
r2<-comprank(Radiation,"kendall","midrank",print=FALSE);eigen(r2$r)
r3<-comprank(Radiation,"gini","midrank",print=FALSE);eigen(r3$r)
r4<-comprank(Radiation,"r4","gh",print=FALSE);eigen(r4$r)
r5<-comprank(Radiation,"fy1","gh",print=FALSE);eigen(r5$r)
r6<-comprank(Radiation,"fy2","gh",print=FALSE);eigen(r6$r)
r7<-comprank(Radiation,"sbz","gh",print=FALSE);eigen(r7$r)
detach(Radiation)
#####
#
# Correlation matrix
data(Marozzi);attach(Marozzi)
Marozzi<-as.matrix(Marozzi)
cor1<-comprank(Marozzi,"spearman","midrank",print=FALSE)
rownames(cor1$r)<-colnames(Marozzi);rownames(cor1$r)<-colnames(Marozzi)
print(round(cor1$r,3))
cor1 <-comprank(Marozzi,"kendall","midrank",print=FALSE)
print(round(cor1$r,3))
cor1<-comprank(Marozzi,"gini","midrank",print=FALSE)
print(round(cor1$r,3))
cor1<-comprank(Marozzi,"r4","wgh",print=FALSE)
print(round(cor1$r,3))
cor1<-comprank(Marozzi,"fy1","wgh",print=FALSE)
print(round(cor1$r,3))
cor1<-comprank(Marozzi,"fy2","wgh",print=FALSE)
print(round(cor1$r,3))
cor1<-comprank(Marozzi,"sbz","wgh",print=FALSE)
print(round(cor1$r,3))
detach(Marozzi)
# }

Run the code above in your browser using DataLab