Correlation Analysis
1. Pearson (or Bravais–Pearson) correlation coefficient
1.1 Formulation
Assumption:
- Two variables \(X\) and \(Y\) are measured on a continuous scale
- They are linearly related
The correlation coefficient \(r(X,Y)=r\) measures the degree of linear relationship between \(X\) and \(Y\) using
\[ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}} \]
where \(\bar{x}\) and \(\bar{y}\) are the sample means of \(X\) and \(Y\).
1.2 Properties
- The correlation coefficient is independent of the units of measurement of \(X\) and \(Y\).
- e.g., if one person measures height and weight in metres and kilograms and another measures them in centimetres and grams, the correlation coefficients computed from the two data sets will be the same.
- The correlation coefficient is symmetric, i.e., \(r(X,Y)=r(Y,X)\).
- The limits of \(r\) are \(-1 \leq r \leq 1\).
- If the relationship between \(X\) and \(Y\) is perfectly linear and increasing, then \(r=+1\); if it is perfectly linear and decreasing, then \(r=-1\). The sign of \(r\) thus determines the direction of the association.
- If \(r\) is close to zero, the variables are uncorrelated (which does not imply independence) or the relationship is not linear.
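The unit-invariance property can be checked numerically. The following sketch (the height/weight values are made up for illustration) computes \(r\) once in metres and kilograms and once in centimetres and grams:

```python
# Check: r is unchanged by linear rescaling of the measurement units.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

height_m  = [1.62, 1.75, 1.68, 1.80, 1.71]    # metres
weight_kg = [58.0, 74.5, 63.2, 81.0, 68.4]    # kilograms
height_cm = [h * 100  for h in height_m]      # same heights in centimetres
weight_g  = [w * 1000 for w in weight_kg]     # same weights in grams

r1 = pearson_r(height_m, weight_kg)
r2 = pearson_r(height_cm, weight_g)
# r1 and r2 agree up to floating-point rounding
```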
1.3 \(t\)-test of a sample (Pearson) correlation coefficient
Assumption:
- The relationship is linear.
- Random variables \(X\) and \(Y\) originate from a bivariate normal distribution.
- Under the null hypothesis, the population correlation coefficient is zero.
Null hypothesis \(H_0\): the population correlation coefficient is zero, i.e., \(\rho = 0\)
The test statistic
\[ t = \frac{|r|\sqrt{n-2}}{\sqrt{1-r^2}} \]
is calculated and follows Student's \(t\)-distribution with \(n - 2\) degrees of freedom.
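As a minimal numerical sketch, the statistic can be computed directly from the Heumann et al. example data that also appears in the implementation section below:

```python
# t-statistic for H0: rho = 0, t = |r| * sqrt(n-2) / sqrt(1 - r^2), df = n - 2
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x1 = [10.85, 10.44, 10.50, 10.89, 10.62]
x2 = [7.84, 7.96, 7.81, 7.47, 7.74]
n = len(x1)
r = pearson_r(x1, x2)
t = abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
# t is then compared against the t-distribution with n - 2 = 3 degrees of freedom
```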
1.4 \(Z\)-test of a population (Pearson) correlation coefficient
Assumption:
- The relationship is linear.
- Random variables \(X\) and \(Y\) originate from a bivariate normal distribution
- The variance of \(Y\) is independent of \(X\).
- The sample correlation coefficient is observed as \(r\)
- Correlation coefficient of population has a specified value \(\rho_0\).
Null hypothesis \(H_0\): population correlation coefficient is equal to \(\rho_0\), i.e., \(\rho=\rho_0\)
Using the Fisher \(Z\)-transformation we have
\[ Z_1 = \frac{1}{2}\ln\frac{1+r}{1-r} = \operatorname{arctanh}(r) \]
where \(r\) is the sample correlation coefficient.
The distribution of \(Z_1\) is approximately normal with mean \(\mu_{Z_1}\) and standard deviation \(\sigma_{Z_1}\), where
\[ \mu_{Z_1} = \frac{1}{2}\ln\frac{1+\rho_0}{1-\rho_0}, \qquad \sigma_{Z_1} = \frac{1}{\sqrt{n-3}} \]
The test statistic is
\[ Z = \frac{Z_1 - \mu_{Z_1}}{\sigma_{Z_1}} = (Z_1 - \mu_{Z_1})\sqrt{n-3} \]
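The transformation and test statistic can be sketched as follows; the numerical check reuses \(r = -0.6932787\) and \(n = 5\) from the Heumann et al. example data used later:

```python
# Fisher Z-test: Z1 = arctanh(r), mu = arctanh(rho0), sigma = 1/sqrt(n-3)
import math

def fisher_z_stat(r, rho0, n):
    z1 = math.atanh(r)      # Fisher transformation of the sample coefficient
    mu = math.atanh(rho0)   # transformed hypothesised value rho_0
    sigma = 1 / math.sqrt(n - 3)
    return (z1 - mu) / sigma

# matches the self-coded R result cor.pearson.test.z(x1, x2, rho=0.2)
z = fisher_z_stat(-0.6932787, 0.2, 5)
```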
1.5 Implementation
(1) R
(a) Pearson correlation coefficient
cor(x1, x2, method='pearson')
# x1, x2 are two vectors
cor(data, method='pearson')
# data is a data frame
Self-written code
# self-coding
cor.pearson = function(x, y) {
x.mean = mean(x)
y.mean = mean(y)
r = sum((x - x.mean) * (y - y.mean)) / sqrt(sum((x - x.mean)^2) * sum((y - y.mean)^2))
return (r)
}
Example. The following example is selected from Heumann et al.'s book (2016, p. 83, Table 4.7).
x1 = c(10.85, 10.44, 10.50, 10.89, 10.62)
x2 = c( 7.84, 7.96, 7.81, 7.47, 7.74)
cor(x1, x2, method='pearson')
cor.pearson(x1, x2)
# Output: [1] -0.6932787 [2] -0.6932787
(b) t-test
# self-coding
cor.pearson.test.t = function(x, y) {
n = length(x)
r = cor.pearson(x, y)
Sr = sqrt((1 - r^2) / (n-2))
t = abs(r) / Sr
return (t)
}
The cor.test() function performs the t-test:
cor.test(x1, x2, method="pearson", conf.level=0.95, alternative="two.sided")$statistic
cor.pearson.test.t(x1, x2)
# Output [1] -1.66622 [2] 1.66622
Note that Student's t is a symmetric distribution.
The corr.test() function (requires the psych package) performs the t-test for multiple pairs of variables:
states <- state.x77
states = as.data.frame(states)
states = na.omit(states)
library(psych)
corr.test(states, method="pearson", use="pairwise", alpha=0.05)
(c) Z-test
# self-coding
cor.pearson.test.z = function(x, y, rho=0) {
n = length(x)
r = cor.pearson(x, y)
z1 = atanh(r)
mu1 = atanh(rho)
sigma1 = 1 / sqrt(n-3)
z = (z1 - mu1) / sigma1
return (z)
}
cor.pearson.test.z(x1, x2, rho=0.2)
cor.pearson.test.z(x1, x2, rho=0.7)
cor.pearson.test.z(x1, x2, rho=-0.7)
# Output: [1] -1.494787 [2] -2.434628 [3] 0.01846855
(2) Python
Pearson correlation coefficient
numpy.corrcoef(x, y=None)
- Return: the correlation coefficient matrix; the diagonal entries are ones.
scipy.stats.pearsonr(x, y)
- Return:
- r: Pearson's correlation coefficient
- p-value: two-tailed p-value
s1.corr(s2, method='pearson')
# s1, s2 : pandas.Series
df1.corr(method='pearson')
# df1 : pandas.DataFrame
Example
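The calls above can be exercised on the same data as the R example (Heumann et al. 2016, p. 83); a sketch assuming numpy and scipy are installed:

```python
import numpy as np
from scipy import stats

x1 = np.array([10.85, 10.44, 10.50, 10.89, 10.62])
x2 = np.array([7.84, 7.96, 7.81, 7.47, 7.74])

# numpy: returns the 2x2 correlation matrix; off-diagonal entries hold r
r_np = np.corrcoef(x1, x2)[0, 1]

# scipy: returns the coefficient together with a two-tailed p-value
r_sp, p = stats.pearsonr(x1, x2)
# r_np == r_sp, agreeing with the R output above
```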
2. Spearman's Rank Correlation Coefficient
2.1 Formulation
Assumption:
- two ranked variables \(X\) and \(Y\)
- the lowest score \(x_i\) is assigned rank 1
This measure is suitable for both ordinal and continuous variables.
Denotations:
- \(\text{Rank}(x_i)\) denotes the rank of the \(i\)th observation on \(X\)
- \(\text{Rank}(x_i)\) is the position of \(x_i\) among the ordered values of \(X\)
- Tied observations (i.e., \(x_i = x_j\)) receive the same rank, namely the average of the ranks they occupy.
- \(\text{Rank}(y_i)\) denotes the rank of the \(i\)th observation on \(Y\)
- \(d_i = \text{Rank}(x_i) - \text{Rank}(y_i)\) is the difference between the two rank values
Spearman's rank correlation coefficient is defined as the Pearson product-moment sample correlation of the \(\text{Rank}(x_i)\) and the \(\text{Rank}(y_i)\). That is
\[ R = \frac{\sum_{i=1}^{n}\bigl(\text{Rank}(x_i)-\overline{\text{Rank}(x)}\bigr)\bigl(\text{Rank}(y_i)-\overline{\text{Rank}(y)}\bigr)}{\sqrt{\sum_{i=1}^{n}\bigl(\text{Rank}(x_i)-\overline{\text{Rank}(x)}\bigr)^2\sum_{i=1}^{n}\bigl(\text{Rank}(y_i)-\overline{\text{Rank}(y)}\bigr)^2}} \]
When no ties are present within a sample, this is equivalent to the computationally efficient formula (Hollander et al. 2014, p. 428; Heumann et al. 2016, p. 84):
\[ R = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} \]
If there are tied \(X\)'s and/or tied \(Y\)'s, the above formula should be revised; see Equation 8.77 on p. 429 of Hollander et al. (2014).
2.2 Properties
- The values of \(R\) lie between \(-1\) and \(+1\) and measure the degree of correlation between the ranks of \(X\) and \(Y\).
- The value of \(R\) remains the same whether we choose an ascending or descending order of the ranks.
- \(R=1\) indicates that all observations receive exactly the same ranks on both variables.
- \(R=-1\) indicates that the observations receive exactly opposite ranks on the two variables.
2.3 The Spearman rank correlation test (paired observations)
Assumption:
- Paired observations of \(X\) and \(Y\) that can be ranked.
For large samples (\(n > 10\)), \(R\) asymptotically follows the normal distribution with mean zero and variance \(\frac{1}{n-1}\) (Hollander et al. 2014, p. 427). Thus, the test statistic is
\[ Z = \frac{R}{\sqrt{1/(n-1)}} = R\sqrt{n-1} \]
Note that Kanji (2006, p. 109) also gives a formulation, but it is the additive inverse of the one above.
For small samples, the test statistic (Kanji 2006, p. 109)
\[ R = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} \]
must be compared with critical values obtained from Table 26 in Kanji's book (2006, p. 226).
2.4 Implementation
(1) R
(a) Spearman's Rank Correlation Coefficient
# compute by definition, used for tied and non-tied sample
cor.spearman.1 = function(x, y) {
x.rank = rank(x)
y.rank = rank(y)
r = cor.pearson(x.rank, y.rank)
return (r)
}
# computationally efficient formula, only used for non-tied sample
cor.spearman.2 = function(x, y) {
n = length(x)
x.rank = rank(x)
y.rank = rank(y)
r = 1 - 6 * sum((x.rank - y.rank)^2) / (n * (n^2 - 1))
return (r)
}
Examples
# non-tied samples
x1 = c(10.85, 10.44, 10.50, 10.89, 10.62)
x2 = c( 7.84, 7.96, 7.81, 7.47, 7.74)
cor(x1, x2, method='spearman') # used for both tied and non-tied sample
cor.spearman.1(x1, x2) # used for both tied and non-tied sample
cor.spearman.2(x1, x2) # only used for non-tied sample
# output
# [1] -0.7 [2] -0.7 [3] -0.7
# tied samples
x1 = c(7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4)
x2 = c(2.8, 2.9, 2.8, 2.6, 3.5, 4.6, 5.0)
cor(x1, x2, method='spearman') # used for both tied and non-tied sample
cor.spearman.1(x1, x2) # used for both tied and non-tied sample
cor.spearman.2(x1, x2) # only used for non-tied sample
# output
# [1] 0.7 [2] 0.7 [3] 0.7053571
(b) Hypothesis testing for non-tied samples
cor.spearman.test.z = function(x, y) {
n = length(x)
S = sum((rank(x) - rank(y))^2)
r = cor.spearman.1(x, y)
z = r / sqrt(1 / (n-1))
res = list(z=z, S=S)
return (res)
}
Using the self-written cor.spearman.test.z() function:
# non-tied samples
x1 = c(10.85, 10.44, 10.50, 10.89, 10.62)
x2 = c( 7.84, 7.96, 7.81, 7.47, 7.74)
cor.spearman.test.z(x1, x2)
# Output: z = -1.4 S = 34
Using the cor.test(x, y, method="spearman") function:
cor.test(x1, x2, method="spearman", conf.level=0.95, alternative="two.sided")
# Output: S = 34, p-value = 0.2333
Note that the statistic S of cor.test() is \(\sum_{i=1}^{n} d_i^2\), i.e., sum((rank(x1) - rank(y1))^2).
Use the pSpearman() function (requires the SuppDists package)
library(SuppDists)
pSpearman(q, r, lower.tail=TRUE, log.p=FALSE)
- Parameters:
- q: quantile, here Spearman's rank correlation coefficient
- r: number of observations
- Return: probability, i.e., the p-value
r = cor.spearman.1(x1, x2)
library(SuppDists)
pSpearman(q=-abs(r), r=length(x1), lower.tail=T)*2 # two-sided test
# Output: p-value = 0.2333333
(c) Hypothesis testing for tied samples
Example: the following tied example is selected from Hollander et al.'s book (2014, Example 8.5, p. 430).
# tied samples
x1 = c(7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4)
x2 = c(2.8, 2.9, 2.8, 2.6, 3.5, 4.6, 5.0)
r = cor(x1, x2, method="spearman")
- Using the self-written cor.spearman.test.z():
cor.spearman.test.z(x1, x2)
# Output: S = 16.5; z = 1.714643
z = cor.spearman.test.z(x1, x2)$z
pnorm(-abs(z), lower.tail=T)*2
# Output: p-value = 0.08641073
Note that the statistic z agrees with the result in Hollander et al.'s book (p. 431, \(r_s^*\)).
- Using cor.test(x1, x2, method="spearman") for tied samples will raise a warning.
- Using pSpearman():
r = cor(x1, x2, method="spearman")
pSpearman(q=-abs(r), r=length(x1))*2
# Output: [1] 0.08809524
(2) Python
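A sketch assuming scipy is installed; scipy.stats.spearmanr returns the coefficient and a p-value, handling ties via average ranks like R's cor(..., method='spearman'):

```python
# Tied example reused from the R section above (Hollander et al. 2014, Example 8.5)
import numpy as np
from scipy import stats

x1 = np.array([7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4])
x2 = np.array([2.8, 2.9, 2.8, 2.6, 3.5, 4.6, 5.0])

# coefficient and two-sided p-value; ties receive average ranks
rho, p = stats.spearmanr(x1, x2)
# rho matches cor(x1, x2, method='spearman') in R
```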
3. Kendall Rank Correlation Coefficient
3.1 Formulation
Denotations:
- \((x_{1},y_{1}),...,(x_{n},y_{n})\): an ordered set of observations of the joint random variables \(X\) and \(Y\), such that all the values of \(x_{i}\) and all the values of \(y_{i}\) are unique (ties are neglected for simplicity).
Definitions:
- For any pair of observations \((x_{i},y_{i})\) and \((x_{j},y_{j})\), where \(i<j\):
- concordant: if \(x_{i}>x_{j}\) and \(y_{i}>y_{j}\) (or \(x_{i}<x_{j}\) and \(y_{i}<y_{j}\))
- discordant: if \(x_{i}>x_{j}\) and \(y_{i}<y_{j}\) (or \(x_{i}<x_{j}\) and \(y_{i}>y_{j}\))
- tied: if \(x_{i}=x_{j}\) or \(y_{i}=y_{j}\)
The Kendall \(\tau\) coefficient is defined as:
\[ \tau = \frac{n_c - n_d}{\binom{n}{2}} \]
where \(n_c\) is the number of concordant pairs, \(n_d\) is the number of discordant pairs, and \({n \choose 2}={n(n-1) \over 2}\) is the binomial coefficient for the number of ways to choose two items from \(n\) items.
Further denotations:
- When \(x_i − x_j\) and \(y_i − y_j\) have the same sign a score of \(+1\) is obtained.
- When they have opposite signs a score of \(−1\) is obtained.
- When there is a difference of zero, no score is obtained.
- These scores are summed together and this sum is denoted \(S\).
The explicit expression:
\[ \tau = \frac{2}{n(n-1)} \sum_{i<j} \operatorname{sgn}(x_i - x_j)\,\operatorname{sgn}(y_i - y_j) \]
where:
- \(\operatorname{sgn}(\cdot)\) is the sign function
3.2 Properties
- The denominator is the total number of pair combinations, thus \(-1 \leq \tau \leq 1\).
- If the agreement between the two rankings is perfect (i.e., the two rankings are the same), the coefficient has value 1.
- If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other), the coefficient has value \(-1\).
- If \(X\) and \(Y\) are independent, then we would expect the coefficient to be approximately zero.
3.3 The Kendall rank correlation test (paired observations)
Hypothesis test for \(\tau\) (Kanji, 2006 & Wikipedia 2022)
Null Hypothesis \(H_0\): \(\tau = 0\)
For large sample size \(n\) (\(n > 10\)), \(\tau\) approximately follows a normal distribution with mean zero and variance \(\frac{2(2n+5)}{9n(n-1)}\) (Hollander et al. 2014, pp. 393-399; Wikipedia 2022). Thus, the test statistic is
\[ Z = \frac{\tau}{\sqrt{\frac{2(2n+5)}{9n(n-1)}}} \]
For small samples, critical values of \(S\) may be obtained from tables (e.g., in Kanji 2006).
3.4 Accounting for ties
- A pair \(\{(x_i, y_i), (x_j, y_j)\}\) is said to be tied if \(x_i = x_j\) or \(y_i = y_j\).
- A tied pair is neither concordant nor discordant.
- When tied pairs arise in the data, the coefficient may be modified in a number of ways to keep it in the range \([-1, 1]\).
Tau-a \(\tau_A\)
- The Tau-a statistic tests the strength of association of the cross tabulations.
- Both variables have to be ordinal.
- Tau-a makes no adjustment for ties.
\[ \tau_A = \frac{n_c - n_d}{n_0} \]
where:
- \(n_0 = n(n-1)/2\)
- \(n_{c} = \text{number of concordant pairs}\)
- \(n_{d} = \text{number of discordant pairs}\)
Tau-b \(\tau_B\)
- Tau-b adjusts for ties; a value of zero indicates the absence of association.
\[ \tau_B = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}} \]
where:
- \(n_0 = n(n-1)/2\)
- \(n_1 = \sum_i t_i (t_i-1)/2\)
- \(n_2 = \sum_j u_j (u_j-1)/2\)
- \(t_{i} = \text{number of tied values in the } i^{\text{th}} \text{ group of ties for the first quantity}\)
- \(u_{j} = \text{number of tied values in the } j^{\text{th}} \text{ group of ties for the second quantity}\)
Tau-c \(\tau_C\) (also called Stuart-Kendall Tau-c)
- Tau-c is more suitable than Tau-b for data based on non-square (i.e., rectangular) contingency tables, where the numbers of rows and columns differ.
- Use Tau-b if the underlying scale of both variables has the same number of possible values (before ranking) and Tau-c if they differ.
\[ \tau_C = \frac{2(n_c - n_d)}{n^2\,\frac{m-1}{m}} \]
where:
- \(r = \text{number of rows of the contingency table}\)
- \(c = \text{number of columns of the contingency table}\)
- \(m = \min(r, c)\)
Hypothesis Test
The test statistics for \(\tau_A\) and \(\tau_B\) are described on Wikipedia.
3.5 Implementation
(1) R
(a) Kendall Rank Correlation Coefficient
# tau_A, only used for non-tied samples
cor.kendall.1 = function(x, y) {
n = length(x)
S = 0
for (i in 1:(n-1)){ for (j in (i+1):n){
S = S + sign(x[i]-x[j]) * sign(y[i]-y[j])
} }
r = 2 / (n * (n-1)) * S
return (r)
}
# tau_B, used for tied and non-tied samples. When the sample has no ties, tau_B = tau_A
cor.kendall.2 = function(x, y) {
n = length(x)
S = c(0, 0, 0, 0) # S = c(nc, nd, n1, n2)
for (i in 1:(n-1)){ for (j in (i+1):n){
sign1 = sign(x[i]-x[j])
sign2 = sign(y[i]-y[j])
if (sign1 * sign2 > 0) { S[1] = S[1] + 1 }      # concordant pair
else if (sign1 * sign2 < 0) { S[2] = S[2] + 1 } # discordant pair
else {                                          # at least one tie
if (sign1 == 0) { S[3] = S[3] + 1 }             # pair tied in x
if (sign2 == 0) { S[4] = S[4] + 1 }             # pair tied in y
} } }
r = (S[1] - S[2]) / sqrt((n*(n-1) / 2 - S[3]) * (n*(n-1) / 2 - S[4]))
return (r)
}
The cor(x1, x2, method='kendall') function computes Tau-b:
Example
# Example: non-tied samples
x1 = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
x2 = c( 2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)
cor(x1, x2, method='kendall') # used for tied and non-tied samples
cor.kendall.2(x1, x2) # tau_B, used for tied and non-tied samples
cor.kendall.1(x1, x2) # tau_A, only used for non-tied samples
# Output: [1] 0.4444444 [2] 0.4444444 [3] 0.4444444
# Example: tied samples
x1 = c(7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4)
x2 = c(2.8, 2.9, 2.8, 2.6, 3.5, 4.6, 5.0)
cor(x1, x2, method='kendall') # used for tied and non-tied samples
cor.kendall.2(x1, x2) # tau_B, used for tied and non-tied samples
cor.kendall.1(x1, x2) # tau_A, only used for non-tied samples
# Output: [1] 0.55 [2] 0.55 [3] 0.5238095
(b) Hypothesis Testing for non-tied samples
# Self-coding
cor.kendall.test.z = function(x, y){
n = length(x)
r = cor.kendall.1(x, y)
z = r / sqrt( 2 * (2*n+5) / (9 * n * (n-1)))
S = 0 # the number of concordant pairs
for (i in 1:(n-1)) { for (j in (i+1):n) {
s1 = sign(x[i]-x[j]) * sign(y[i]-y[j])
if (s1 > 0) { S = S + s1}
} }
res = list("S"=S, "z"=z)
return (res)
}
Samples with no ties. The following example with non-tied samples is selected from Hollander et al.'s book (2014, Example 8.1, p. 397).
x1 = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
x2 = c( 2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)
# Kendall Correlation Coefficient
r = cor(x1, x2, method="kendall")
# r = 0.4444444444
Using the self-written code:
cor.kendall.test.z(x1, x2)
# Output: S = 26; z = 1.668115
z = cor.kendall.test.z(x1, x2)$z
pnorm(-abs(z), lower.tail=T)*2 # two-sided test
# p-value = 0.09852102
Note that the statistic z agrees with the result in Hollander et al.'s book (p. 399, \(K^*\)).
Using cor.test(x1, x2, method='kendall') function
# H0: r = 0
cor.test(x1, x2, method='kendall', conf.level=0.95, alternative='two.sided')
# Output: T = 26; p-value = 0.1194
Note that the statistic T of cor.test(method='kendall') is the number of concordant pairs
Use the pKendall() function (requires the SuppDists package)
library(SuppDists)
# p-value of two-sided test
pKendall(-abs(r), N=length(x1), lower.tail=T)*2 # two-sided test, H0 : r=0
# Output: 0.1194389
(c) Hypothesis Testing for tied samples
Using cor.test(x1, x2, method="kendall") for tied samples will raise a warning
Using pKendall()
# Example: tied samples
x1 = c(7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4)
x2 = c(2.8, 2.9, 2.8, 2.6, 3.5, 4.6, 5.0)
r = cor(x1, x2, method='kendall')
library(SuppDists)
pKendall(-abs(r), N=length(x1), lower.tail=T)*2 # two-sided test, H0 : r=0
# Output : p-value = 0.1361111
(2) Python
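A sketch assuming SciPy ≥ 1.7 (for the variant parameter); scipy.stats.kendalltau computes Tau-b by default and Stuart-Kendall Tau-c with variant='c':

```python
# Tied example reused from the R section above
import numpy as np
from scipy import stats

x1 = np.array([7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4])
x2 = np.array([2.8, 2.9, 2.8, 2.6, 3.5, 4.6, 5.0])

tau_b, p_b = stats.kendalltau(x1, x2)               # Tau-b (default), adjusts for ties
tau_c, p_c = stats.kendalltau(x1, x2, variant='c')  # Stuart-Kendall Tau-c
# tau_b matches cor(x1, x2, method='kendall') in R
```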
4. Goodman and Kruskal's Gamma
4.1 Formulation
Denotation
- Consider a contingency table with cell frequencies \(n_{ij}\), with rows \(i = 1,\dots,r\) and columns \(j = 1,\dots,c\).
Goodman and Kruskal's \(\gamma\) is
\[ \gamma = \frac{K - D}{K + D} \]
where \(K\) and \(D\) are computed as:
\[ K = \sum_{i<i'}\sum_{j<j'} n_{ij}\,n_{i'j'}, \qquad D = \sum_{i<i'}\sum_{j>j'} n_{ij}\,n_{i'j'} \]
where:
- \(K=n_c\) is the number of concordant observation pairs
- \(D=n_d\) is the number of discordant observation pairs
4.2 Properties
- \(-1 \leq \gamma \leq 1\)
- Larger absolute values indicate a stronger association, and the sign indicates the direction of the association.
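A brute-force computation of \(\gamma\) from raw paired observations (the function name is my own; the data are reused from the tied Kendall example above):

```python
# Goodman-Kruskal gamma over all pairs: gamma = (K - D) / (K + D);
# tied pairs contribute to neither count.
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    K = D = 0  # concordant / discordant pair counts
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            K += 1
        elif s < 0:
            D += 1
    return (K - D) / (K + D)

# Data reused from the tied Kendall example above
x1 = [7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4]
x2 = [2.8, 2.9, 2.8, 2.6, 3.5, 4.6, 5.0]
g = goodman_kruskal_gamma(x1, x2)
# here K = 15 and D = 4, so g = 11/19
```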
Reference
Pearson Correlation Analysis: Theoretical Introduction (in Chinese), website
Wikipedia, Kendall rank correlation coefficient, website
Heumann, C., Schomaker, M., & Shalabh. (2016). Introduction to Statistics and Data Analysis. Springer International Publishing.
Hollander, M., Wolfe, D. A., & Chicken, E. (2014). Nonparametric Statistical Methods (3rd ed.). John Wiley & Sons, Inc.
Kanji, G. K. (2006). 100 Statistical Tests (3rd ed.). Sage Publications.
- Test 12: t-test of a correlation coefficient, p. 39
- Test 13: Z-test of a correlation coefficient, p. 39
- Test 58: The Spearman rank correlation test (paired observations), p. 109
- Test 59: The Kendall rank correlation test (paired observations), p. 110
