两个熊猫数据帧之间的快速矛兵关联



我想将矛兵相关性应用于具有相同列数(每对行的相关性(的两个熊猫数据帧。

我的目标是计算每对行 (r, s( 之间的矛兵相关性分布,其中 r 是第一个数据帧中的一行,s 是第二个数据帧中的一行。

我知道以前已经回答过类似的问题(见此(。但是,这个问题有所不同,因为我想将第一个数据帧的每一行与第二个数据帧中的所有行进行比较。此外,这是计算密集型的,由于我的数据大小,需要数小时。我想并行化它,并可能重写它以加快速度。

我尝试使用 numba,但不幸的是它失败了(与此类似的问题(,因为它似乎无法识别 scipy spearmanr。我的代码如下:

def corr(a, b):
dist = []
for i in range(a.shape[0]):
for j in range(b.shape[0]):
dist += [spearmanr(a.iloc[i, :], b.iloc[j, :])[0]]
return dist

新答案

from numba import njit
import pandas as pd
import numpy as np
@njit
def mean1(a):
n = len(a)
b = np.empty(n)
for i in range(n):
b[i] = a[i].mean()
return b
@njit
def std1(a):
n = len(a)
b = np.empty(n)
for i in range(n):
b[i] = a[i].std()
return b

@njit
def c(a, b):
''' Correlation '''
n, k = a.shape
m, k = b.shape
mu_a = mean1(a)
mu_b = mean1(b)
sig_a = std1(a)
sig_b = std1(b)
out = np.empty((n, m))
for i in range(n):
for j in range(m):
out[i, j] = (a[i] - mu_a[i]) @ (b[j] - mu_b[j]) / k / sig_a[i] / sig_b[j]
return out

r = df_test.rank(1).values
df_test.T.corr('spearman') == c(r, r)

旧答案

做一个斯皮尔曼等级关联只是做一个等级的相关性。

我们可以利用argsort来获得排名。 虽然argsortargsort确实让我们获得了排名,但我们可以通过切片分配将自己限制为一种。

def rank(a):
i, j = np.meshgrid(*map(np.arange, a.shape), indexing='ij')
s = a.argsort(1)
out = np.empty_like(s)
out[i, s] = j
return out

相关

在相关秩的情况下,均值和标准差都由数组第二维的大小预先确定。

你可以在没有numba的情况下完成同样的事情,但我假设你想要它。

from numba import njit
@njit
def c(a, b):
n, k = a.shape
m, k = b.shape
mu = (k - 1) / 2
sig = ((k - 1) * (k + 1) / 12) ** .5
out = np.empty((n, m))
a = a - mu
b = b - mu
for i in range(n):
for j in range(m):
out[i, j] = a[i] @ b[j] / k / sig ** 2
return out

对于后代,我们可以完全避免内部循环,但这可能会有内存问题。

@njit
def c1(a, b):
n, k = a.shape
m, k = b.shape
mu = (k - 1) / 2
sig = ((k - 1) * (k + 1) / 12) ** .5
a = a - mu
b = b - mu
return a @ b.T / k / sig ** 2

示范

np.random.seed([3, 1415])
a = np.random.randn(2, 10)
b = np.random.randn(2, 10)
rank_a = rank(a)
rank_b = rank(b)
c(rank_a, rank_b)
array([[0.32121212, 0.01818182],
[0.13939394, 0.55151515]])

如果您正在使用DataFrame

da = pd.DataFrame(a)
db = pd.DataFrame(b)
pd.DataFrame(c(rank(da.values), rank(db.values)), da.index, db.index)

0         1
0  0.321212  0.018182
1  0.139394  0.551515

验证

我们可以使用pandas.DataFrame.corr进行快速验证

pd.DataFrame(a.T).corr('spearman') == c(rank_a, rank_a)
0     1
0  True  True
1  True  True

下面是一个基于行的、未编译的scipy.stats.spearmanr版本,它在大型数据集上使用 ~ 5% 的时间,并有一个示例显示它会产生相同的结果:

import numpy as np
import pandas as pd

def spearman_row(x, y):
x = np.asarray(x)
y = np.asarray(y)
rx = rankdata_average(x)
ry = rankdata_average(y)
# print(rx)
# print(ry)
return compute_corr(rx, ry)
def compute_corr(x, y):
# Thanks to https://github.com/dengemann
def ss(a, axis):
return np.sum(a * a, axis=axis)
x = np.asarray(x)
y = np.asarray(y)
mx = x.mean(axis=-1)
my = y.mean(axis=-1)
xm, ym = x - mx[..., None], y - my[..., None]
r_num = np.add.reduce(xm * ym, axis=-1)
r_den = np.sqrt(ss(xm, axis=-1) * ss(ym, axis=-1))
with np.errstate(divide='ignore', invalid="ignore"):
r = r_num / r_den
return r

def rankdata_average(data):
"""Row-based rankdata using method=mean"""
dc = np.asarray(data).copy()
sorter = np.apply_along_axis(np.argsort, 1, data)
inv = np.empty(data.shape, np.intp)
ranks = np.tile(np.arange(data.shape[1]), (len(data), 1))
np.put_along_axis(inv, sorter, ranks, axis=1)
dc = np.take_along_axis(dc, sorter, 1)
res = np.apply_along_axis(lambda r: r[1:] != r[:-1], 1, dc)
obs = np.column_stack([np.ones(len(res), dtype=bool), res])
dense = np.take_along_axis(np.apply_along_axis(np.cumsum, 1, obs), inv, 1)
len_r = obs.shape[1]
nonzero = np.count_nonzero(obs, axis=1)
obs = pd.DataFrame(obs)
nonzero = pd.Series(nonzero)
dense = pd.DataFrame(dense)
ranks = []
for _nonzero, nzdf in obs.groupby(nonzero, sort=False):
nz = np.apply_along_axis(lambda r: np.nonzero(r)[0], 1, nzdf)
_count = np.column_stack([nz, np.ones(len(nz)) * len_r])
_dense = dense.reindex(nzdf.index).values
_result = 0.5 * (np.take_along_axis(_count, _dense, 1) + np.take_along_axis(_count, _dense - 1, 1) + 1)
result = pd.DataFrame(_result, index=nzdf.index)
ranks.append(result)
final = pd.concat(ranks).sort_index()
return final

if __name__ == "__main__":
from scipy.stats import rankdata, spearmanr
from time import time
np.random.seed(0)
size = int(1e5), 5
d1 = np.random.randint(5, size=size)
d2 = np.random.randint(5, size=size)
start = time()
actual = spearman_row(d1, d2)
end = time()
print("actual", actual)
print("rowbased took", end - start)
start = time()
expected = []
for i in range(len(d1)):
expected.append(spearmanr(d1[i], d2[i]).correlation)
end = time()
print("scipy took", end - start)
expected = np.array(expected)
print("largest diff", pd.Series(expected - actual).abs().max())

它打印:

rowbased took 3.6308434009552
scipy took 53.552557945251465
largest diff 2.220446049250313e-16 

Pandas 在spearman的支持下具有corr函数。它适用于列,因此我们可以转置数据帧。

我们将 df1 附加到 df2 并通过迭代每一行来计算相关性

len_df1 = df1.shape[0]
df2_index = df2.index.values.tolist()

df = df2.append(df1).reset_index(drop=True).T
values = {i: [df.iloc[:,df2_index+[i]].corr(method='spearman').values] for i in range(len_df1)}

最新更新