A faster way to sum all row combinations in a DataFrame

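I have a DataFrame with 10,000 rows, and I am trying to sum every possible combination of those rows. By my calculation, there are about 50 million combinations. Here is a small example that simplifies my data: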

df = Ratio     Count     Score
1         6         11
2         7         12
3         8         13
4         9         14
5         10        15

This is the desired result:

results = Min Ratio     Max Ratio     Total Count     Total Score
1             2             13              23
1             3             21              36
1             4             30              50
1             5             40              65
2             3             15              25
2             4             24              39
2             5             34              54
3             4             17              27
3             5             27              42
4             5             19              29

This is the code I came up with to do the calculation:

for i in range(len(df)):
    j = i + 1
    while j <= len(df):
        range_to_calc = df.iloc[i:j]
        total_count = range_to_calc['Count'].sum()
        total_score = range_to_calc['Score'].sum()
        new_row = {'Min Ratio': range_to_calc.at[range_to_calc.first_valid_index(),'Ratio'],
                   'Max Ratio': range_to_calc.at[range_to_calc.last_valid_index(),'Ratio'],
                   'Total Count': total_count,
                   'Total Score': total_score}
        results = results.append(new_row, ignore_index=True)
        j = j + 1

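This code works, but judging by a few minutes of running it, I estimate it would take 200 hours to finish. I know that numpy would be faster, but I can't work out how to build the multiple arrays to add together. (I think it would be easy if I only needed 1+2, 2+3, 3+4, etc., but it is much harder because I need 1+2, 1+2+3, 1+2+3+4, etc.) Is there a more efficient way to do this calculation so that it runs in a reasonable amount of time? Thank you very much.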

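P.S.: If you are wondering what I want to do with a 50-million-row DataFrame, I don't actually need all of it in my final result. Ultimately I want to divide the Total Score of each row in the results by its Total Count to get a Total Score Per Total Count value, then display the 1,000 highest Total Score Per Total Count values, along with each associated Min Ratio, Max Ratio, Total Count and Total Score.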

With the following improvements, 10k rows take about 2 minutes to run.

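For the sums, precompute the cumulative sum (cumsum) and keep it. sum(i to j) equals sum(0 to j) - sum(0 to i-1). Since sum(0 to j) is cumsum[j] and sum(0 to i-1) is cumsum[i-1], we get sum(i to j) = cumsum[j] - cumsum[i-1]. Across all the combinations, this is a significant improvement over recomputing each sum from scratch.
Operations on numpy arrays are faster than operations on pandas Series, so convert each column to a numpy array and do the calculations on those.
(Taken from the other answers): instead of appending to a list, initialize an empty numpy array of shape ((n*(n+1)//2) - n, 4) and use it to hold the results.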
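The prefix-sum identity from the first point can be checked on the Count column of the sample data. This is only a minimal sketch: the helper name `range_sum` is made up for illustration, and it prepends a 0 to the cumulative sums so that `i = 0` needs no special case, which differs slightly from the code in this answer.

```python
import numpy as np

count = np.array([6, 7, 8, 9, 10])  # the Count column from the sample data

# Prefix sums with a leading 0, so prefix[k] is the sum of the first k values.
prefix = np.concatenate(([0], np.cumsum(count)))

def range_sum(i, j):
    """Sum of count[i..j] (inclusive, 0-based) in constant time."""
    return prefix[j + 1] - prefix[i]

# Check the identity against a direct sum for every i <= j.
for i in range(len(count)):
    for j in range(i, len(count)):
        assert range_sum(i, j) == count[i:j + 1].sum()

print(range_sum(0, 1), range_sum(1, 4))  # 13 34
```

Each lookup costs two array reads and one subtraction, regardless of how long the range is.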

Usage:

import numpy as np
import pandas as pd

# Prefix sums: count_cumsum[j] is the sum of Count over rows 0..j
count_cumsum = np.cumsum(df.Count.values)
score_cumsum = np.cumsum(df.Score.values)
ratios = df.Ratio.values
n = len(df)
# Number of (i, j) pairs with i < j
rowInCombination = (n * (n + 1) // 2) - n
arr = np.empty(shape=(rowInCombination, 4), dtype=int)
k = 0
for i in range(n):
    for j in range(i + 1, n):
        # sum(i..j) = cumsum[j] - cumsum[i-1]; for i == 0 it is just cumsum[j]
        arr[k, :] = [
            count_cumsum[j] - count_cumsum[i - 1] if i > 0 else count_cumsum[j],
            score_cumsum[j] - score_cumsum[i - 1] if i > 0 else score_cumsum[j],
            ratios[i],
            ratios[j]]
        k = k + 1
out = pd.DataFrame(arr, columns=['Total_Count', 'Total_Score',
                                 'Min_Ratio', 'Max_Ratio'])

Input:

df = pd.DataFrame({'Ratio': [1, 2, 3, 4, 5],
                   'Count': [6, 7, 8, 9, 10],
                   'Score': [11, 12, 13, 14, 15]})

Output:

>>> out
   Total_Count  Total_Score  Min_Ratio  Max_Ratio
0           13           23          1          2
1           21           36          1          3
2           30           50          1          4
3           40           65          1          5
4           15           25          2          3
5           24           39          2          4
6           34           54          2          5
7           17           27          3          4
8           27           42          3          5
9           19           29          4          5

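First, you can improve the algorithm. Then, you can use Numpy vectorization/broadcasting to speed up the computation.

Here are the key points for improving the performance of the algorithm:

Pandas' append is slow because it creates a new DataFrame on every call. You should never use it in an expensive loop. Instead, you can append the rows to a Python list, or even write the items directly into preallocated Numpy vectors.

Computing each partial sum takes O(n) time, whereas you can precompute the cumulative sums and then get any partial sum in constant time.

CPython loops are very slow, but the inner loop can be vectorized with Numpy thanks to broadcasting.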
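To illustrate only the first of these points, here is a minimal sketch that collects rows in a plain Python list and builds the DataFrame once at the end. The three-row input and the name `window` are made up for illustration; note also that `DataFrame.append` was removed in pandas 2.0, so the pattern from the question no longer runs on current versions.

```python
import pandas as pd

# A shortened version of the sample data, for illustration only.
df = pd.DataFrame({'Ratio': [1, 2, 3],
                   'Count': [6, 7, 8],
                   'Score': [11, 12, 13]})

rows = []  # appending to a Python list is cheap
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        window = df.iloc[i:j + 1]  # rows i..j inclusive
        rows.append({
            'Min Ratio': window['Ratio'].iloc[0],
            'Max Ratio': window['Ratio'].iloc[-1],
            'Total Count': window['Count'].sum(),
            'Total Score': window['Score'].sum(),
        })

results = pd.DataFrame(rows)  # build the DataFrame once, after the loop
print(results)
```

This still recomputes every sum from scratch, so it addresses the append cost only; the other two points are what bring the run time down to seconds.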

Here is the resulting code:

import numpy as np
import pandas as pd

def fastImpl(df):
    n = len(df)
    resRowCount = (n * (n+1)) // 2
    k = 0
    # Cumulative sums with a leading 0, so sum(i..j) = cum[j+1] - cum[i]
    cumCounts = np.concatenate(([0], df['Count'].astype(int).cumsum()))
    cumScores = np.concatenate(([0], df['Score'].astype(int).cumsum()))
    # Plain Numpy array, so indexing is positional whatever the DataFrame index is
    ratios = df['Ratio'].astype(int).to_numpy()
    minRatio = np.empty(resRowCount, dtype=int)
    maxRatio = np.empty(resRowCount, dtype=int)
    count = np.empty(resRowCount, dtype=int)
    score = np.empty(resRowCount, dtype=int)
    for i in range(n):
        # Fill all results starting at row i in one vectorized step
        kStart, kEnd = k, k+(n-i)
        jStart, jEnd = i+1, n+1
        minRatio[kStart:kEnd] = ratios[i]
        maxRatio[kStart:kEnd] = ratios[i:n]
        count[kStart:kEnd] = cumCounts[jStart:jEnd] - cumCounts[i]
        score[kStart:kEnd] = cumScores[jStart:jEnd] - cumScores[i]
        k = kEnd
    assert k == resRowCount
    return pd.DataFrame({
        'Min Ratio': minRatio,
        'Max Ratio': maxRatio,
        'Total Count': count,
        'Total Score': score
    })

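Note that this code gives the same results as the code in the question, but the original code does not give the expected results stated in the question. Note also that since the inputs are integers, I forced Numpy to use integers for the sake of performance (although the algorithm should work with floating-point numbers too).

This code is hundreds of thousands of times faster than the original code on big dataframes, and it computes a dataframe of 10,000 rows in 0.7 seconds.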
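As a variation that is not part of this answer, the remaining Python loop can also be removed with `np.triu_indices`, which enumerates every (i, j) pair of the upper triangle in the same row-major order. This is a sketch under assumptions: the function name `all_ranges` and the `include_single_rows` flag are invented here. With the flag off, only pairs with i < j are kept, which matches the table in the question; with it on, single rows are included as in the code above.

```python
import numpy as np
import pandas as pd

def all_ranges(df, include_single_rows=False):
    """Sum Count and Score over every contiguous row range i..j."""
    n = len(df)
    # Cumulative sums with a leading 0, so sum(i..j) = cum[j + 1] - cum[i].
    cum_counts = np.concatenate(([0], df['Count'].to_numpy().cumsum()))
    cum_scores = np.concatenate(([0], df['Score'].to_numpy().cumsum()))
    ratios = df['Ratio'].to_numpy()
    # k=0 keeps pairs with i == j (single rows); k=1 keeps only i < j.
    i, j = np.triu_indices(n, k=0 if include_single_rows else 1)
    return pd.DataFrame({
        'Min Ratio': ratios[i],
        'Max Ratio': ratios[j],
        'Total Count': cum_counts[j + 1] - cum_counts[i],
        'Total Score': cum_scores[j + 1] - cum_scores[i],
    })

df = pd.DataFrame({'Ratio': [1, 2, 3, 4, 5],
                   'Count': [6, 7, 8, 9, 10],
                   'Score': [11, 12, 13, 14, 15]})
print(all_ranges(df))
```

The trade-off is memory: for 10,000 rows the two index arrays alone hold about 50 million integers each, so the loop-per-row version may be the better choice when memory is tight.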

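Others have already explained why your algorithm is so slow, so I will dive straight into a solution.

Let's approach your problem differently. In particular, look at how the Total Count and Total Score columns are computed:

Calculate the cumulative sum for every row from 1 to n
Calculate the cumulative sum for every row from 2 to n
...
Calculate the cumulative sum for every row from n to n

Since cumulative sums are, well, cumulative, we only need to compute them once, for rows 1 to n:

The cumsum for (2 to n) is the cumsum for (1 to n) - (row 1)
The cumsum for (3 to n) is the cumsum for (2 to n) - (row 2)
And so on

In other words, the current cumsum is the previous cumsum minus its first row, with that first row then dropped.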
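A minimal sketch of that recurrence on the Count column alone may make it concrete. The names `current` and `blocks` are made up for illustration; the arithmetic is the rule just described.

```python
import numpy as np

count = np.array([6, 7, 8, 9, 10])  # the Count column from the sample data

current = np.cumsum(count)  # sums of rows 1..1, 1..2, ..., 1..n
blocks = [current]
for _ in range(1, len(count)):
    # Subtract the first entry (the row being dropped), then drop it.
    # The subtraction allocates a new array, so earlier blocks stay intact.
    current = current[1:] - current[0]
    blocks.append(current)

for block in blocks:
    print(block)
# [ 6 13 21 30 40]
# [ 7 15 24 34]
# [ 8 17 27]
# [ 9 19]
# [10]
```

Each block starts one row later than the previous one, and every entry costs a single subtraction.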

As you surmised, pandas is a lot slower than numpy, so we will convert everything to numpy for speed:

import numpy as np
import pandas as pd

arr = df[['Ratio', 'Count', 'Score']].to_numpy() # Convert to numpy array
tmp = np.cumsum(arr[:, 1:3], axis=0)       # calculate cumsum for row 1 to n
tmp = np.insert(tmp, 0, arr[0, 0], axis=1) # create the Min Ratio column
tmp = np.insert(tmp, 1, arr[:, 0], axis=1) # create the Max Ratio column
results2 = [tmp]
for i in range(1, len(arr)):
    # current cumsum is the previous cumsum without the first row;
    # copy it, because a plain slice is a view and the in-place updates
    # below would otherwise overwrite the previous block
    tmp = results2[-1][1:].copy()
    diff = results2[-1][0] # the previous cumsum's first row
    tmp -= diff            # adjust the current cumsum
    tmp[:, 0] = arr[i, 0]  # new Min Ratio
    tmp[:, 1] = arr[i:, 0] # new Max Ratio
    results2.append(tmp)
# Assemble the result
results2 = np.concatenate(results2).reshape(-1,4)
results2 = pd.DataFrame(results2, columns=['Min Ratio', 'Max Ratio', 'Total Count', 'Total Score'])

In my tests, this produces the result for a 10k-row dataframe in about 2 seconds.
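For the final goal mentioned in the question's P.S. (the 1,000 rows with the highest Total Score per Total Count), a full sort of 50 million rows is not needed. The following is a sketch under assumptions: the function name `top_k_by_score_per_count` and the output column `Score Per Count` are invented here, the input is assumed to be non-empty with the column names used above, and every Total Count is assumed to be non-zero.

```python
import numpy as np
import pandas as pd

def top_k_by_score_per_count(results, k=1000):
    """Return the k rows with the highest Total Score / Total Count."""
    ratio = results['Total Score'].to_numpy() / results['Total Count'].to_numpy()
    k = min(k, len(ratio))
    # argpartition moves the k largest values into the last k slots
    # (in no particular order) without sorting the whole array.
    top = np.argpartition(ratio, len(ratio) - k)[-k:]
    # Sort only those k entries, highest first.
    top = top[np.argsort(ratio[top])[::-1]]
    out = results.iloc[top].copy()
    out['Score Per Count'] = ratio[top]
    return out

sample = pd.DataFrame({'Min Ratio': [1, 1, 2],
                       'Max Ratio': [2, 3, 3],
                       'Total Count': [13, 21, 15],
                       'Total Score': [23, 36, 25]})
print(top_k_by_score_per_count(sample, k=2))
```

The partition step is linear in the number of rows, so only the 1,000 selected rows ever get sorted.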

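Sorry for writing so late on this topic, but I was just looking for a solution to a similar problem. The solution to this one is simple, because the combinations are only pairs. It can be solved by loading the dataframe into any database and running the following query, which takes less than 10 seconds: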

SELECT f1.*, f2.*, f1.score + f2.score
FROM table_with_data_source f1, table_with_data_source f2
WHERE f1.ratio <> f2.ratio;

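Even with 100,000 records or more, the database will do this quickly.

However, none of the algorithms I see in the answers really performs a combination of the values. They only work on pairs. When it is a true combination, the problem gets really complicated, for example:

Given a, b, c, d and e as records: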

a
b
c
d
e

The true combinations are:

a+b
a+c
a+d
a+e
a+b+c
a+b+d
a+b+e
a+c+d
a+c+e
a+d+e
a+b+c+d
a+b+c+e
a+b+d+e
a+c+d+e
a+b+c+d+e
b+c
b+d
b+e
b+c+d
b+c+e
b+d+e
b+c+d+e
c+d
c+e
c+d+e
d+e

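This is a true combination, one that covers every possible combination. For this case I have not been able to find a suitable solution, because it really strains the performance of any hardware. Does anyone know how to perform true combinations using python? At the database level, it affects the overall performance of the database.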
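In answer to that last question, Python's standard library can enumerate such combinations with `itertools.combinations`. This is a minimal sketch: the values assigned to a through e and the function name `subset_sums` are made up for illustration.

```python
from itertools import combinations

# Hypothetical values for the five records, for illustration only.
records = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

def subset_sums(records, min_size=2):
    """Yield (label, total) for every combination of at least min_size records."""
    for size in range(min_size, len(records) + 1):
        for names in combinations(records, size):
            yield '+'.join(names), sum(records[name] for name in names)

sums = list(subset_sums(records))
print(len(sums))   # 26
print(sums[0])     # ('a+b', 3)
print(sums[-1])    # ('a+b+c+d+e', 15)
```

The number of such combinations for n records is 2^n - n - 1, so enumeration is only feasible for small n. At the 10,000 rows in the question it is out of reach for any hardware, which is why the answers above restrict themselves to contiguous ranges of rows.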
