Efficiently computing a difference matrix from a Pandas Series

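I have a DataFrame, and I am trying to find the numerical difference between every pair of rows in one column (a Series), which gives a square matrix whose two dimensions both equal the length of the DataFrame.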

import pandas as pd
import numpy as np
df = pd.DataFrame([[200, 2], [100, 2], [1000, 10], [600, 5], [50, 1]],
                  columns=['Sales', 'Total prods'])
print(df['Sales'])

0     200
1     100
2    1000
3     600
4      50
Name: Sales, dtype: int64

I wrote this function:

def numerical_scoring(col_name, result_ints):
    matrix_df = np.zeros(shape=(len(result_ints), len(result_ints)))
    for index_x, row_x in result_ints[[col_name]].iterrows():
        for index_y, row_y in result_ints[[col_name]].iterrows():
            if index_x == index_y:
                matrix_df[index_x, index_y] = 1
            else:
                matrix_df[index_x, index_y] = abs(row_x[col_name] - row_y[col_name])

    return matrix_df

print(numerical_scoring('Sales', df))

[[  1. 100. 800. 400. 150.]
 [100.   1. 900. 500.  50.]
 [800. 900.   1. 400. 950.]
 [400. 500. 400.   1. 550.]
 [150.  50. 950. 550.   1.]]

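This code works for small DataFrames, but when the DataFrame grows to millions of records it takes a very long time to finish. Is there a more efficient way to transform the data?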

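You can use pdist to compute the pairwise distances, then use squareform to convert the vector of pairwise distances into a square matrix, and finally set the diagonal of that matrix to 1: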

from scipy.spatial.distance import pdist, squareform
arr = squareform(pdist(df[['Sales']]))
arr[np.diag_indices(len(arr))] = 1

Result:

array([[  1., 100., 800., 400., 150.],
       [100.,   1., 900., 500.,  50.],
       [800., 900.,   1., 400., 950.],
       [400., 500., 400.,   1., 550.],
       [150.,  50., 950., 550.,   1.]])
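For a single numeric column, the same matrix can also be produced with plain NumPy broadcasting instead of SciPy. The following is only a minimal sketch, not part of the answer above; the helper name abs_diff_matrix is made up for illustration. Unlike pdist, it keeps the integer dtype of the input.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[200, 2], [100, 2], [1000, 10], [600, 5], [50, 1]],
                  columns=['Sales', 'Total prods'])

def abs_diff_matrix(series):
    """Hypothetical helper: pairwise absolute differences with 1 on the diagonal."""
    values = series.to_numpy()
    # Broadcasting a column vector against a row vector yields every pair (i, j).
    matrix = np.abs(values[:, None] - values[None, :])
    # Overwrite the zero diagonal in place.
    np.fill_diagonal(matrix, 1)
    return matrix

result = abs_diff_matrix(df['Sales'])
print(result)
```

Keep in mind that any dense result, whether built this way or with squareform, holds n × n entries. For one million rows of 8-byte values that is roughly 8 TB, so at that scale the full matrix cannot be held in ordinary memory regardless of how quickly it is computed.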

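EDIT: Shubham Sharma's answer is more elegant and more efficient, and I recommend it. I'm leaving this answer here anyway, because you will certainly run into other situations where you have to implement your own algorithm, and it's good to know how to do that more efficiently.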

There are a few things that could be improved here.

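Your function builds a single-column DataFrame and then iterates over its rows with result_ints[[col_name]].iterrows(). It would be simpler to work with the column (Series) itself and iterate over its items with result_ints[col_name].items() (named iteritems() in older pandas; that name was removed in pandas 2.0).
But your function only needs to operate on one column, which is a Pandas Series. So we can pass that in as the argument and not worry about the DataFrame at all.
Your data are all integers, so the result matrix doesn't need to use floating-point values.
Most important: you're duplicating a lot of work. Your result matrix is symmetric about the diagonal, so you are currently doing every calculation twice.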

Here's a version that skips some of the unnecessary work:

import pandas as pd
import numpy as np

def numerical_scoring(series):
    length = len(series)
    matrix = np.zeros(shape=(length, length), dtype=int)

    # Set the diagonal cells to 1
    for i in range(length):
        matrix[i, i] = 1

    # Iterate through the items...
    # (labels double as matrix positions, so this assumes the default 0..n-1 index)
    for i, i_val in series.items():
        # For each item, iterate only through the items after it.
        for j, j_val in series.iloc[i + 1:].items():
            # Set both identical cells simultaneously.
            matrix[i, j] = matrix[j, i] = abs(i_val - j_val)

    return matrix

df = pd.DataFrame([[200, 2], [100, 2], [1000, 10], [600, 5], [50, 1]],
                  columns=['Sales', 'Total prods'])
print(numerical_scoring(df['Sales']))

[[  1 100 800 400 150]
 [100   1 900 500  50]
 [800 900   1 400 950]
 [400 500 400   1 550]
 [150  50 950 550   1]]
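The symmetry idea can also be expressed without Python-level loops. The sketch below is my own illustration rather than the approach of either answer: it uses np.triu_indices to compute each pair above the diagonal once and then mirrors the values. The function name symmetric_abs_diff is hypothetical.

```python
import numpy as np

def symmetric_abs_diff(values):
    """Hypothetical helper: compute each off-diagonal pair once, then mirror it."""
    values = np.asarray(values)
    n = len(values)
    # Index pairs (i, j) with i < j, i.e. strictly above the diagonal.
    rows, cols = np.triu_indices(n, k=1)
    diffs = np.abs(values[rows] - values[cols])
    # Start from all ones so the diagonal is already 1;
    # every off-diagonal cell is overwritten below.
    matrix = np.ones((n, n), dtype=diffs.dtype)
    matrix[rows, cols] = diffs
    matrix[cols, rows] = diffs
    return matrix

matrix = symmetric_abs_diff([200, 100, 1000, 600, 50])
print(matrix)
```

This halves the number of subtractions, but the index arrays themselves take extra memory, so whether it beats simple broadcasting is something to measure on your own data.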
