Grouping a large sparse pandas DataFrame with groupby.sum() is very slow

I have a pandas DataFrame of shape (607875, 12294). The data is sparse and looks like this:

    ID   BB  CC  DD  ...
0   abc  0   0   1   ...
1   bcd  0   0   0   ...
2   abc  0   0   1   ...
...

I convert it to sparse form by calling

dataframe = dataframe.to_sparse()

Afterwards, I group it by ID and sum the row values within each group using

dataframe = dataframe.groupby("ID").sum()

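For smaller DataFrames this works fine, but at this size it ran for an hour without finishing. Is there a way to speed it up or work around it? Are there other sparse methods I could use, given that to_sparse is deprecated?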

The output DataFrame will have shape (2000, 12294) and look like this (assuming the abc rows contain no other 1s):

    ID   BB  CC  DD  ...
0   abc  0   0   2   ...
1   bcd  0   0   0   ...
...

My machine has 32 GB of RAM, so that should be enough.

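Inspired by https://stackoverflow.com/a/50991732/8035867, here is a solution that relies on scikit-learn to build a sparse one-hot encoding of the group labels and then uses SciPy to take the dot product of two sparse row matrices.

Edit: switched to OneHotEncoder to handle the case where the group column has only two classes.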

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def sparse_groupby_sum(df, groupby):
    # sparse_output requires scikit-learn >= 1.2; older releases call this parameter sparse
    ohe = OneHotEncoder(sparse_output=True)
    # Get all other columns we are not grouping by
    other_columns = [col for col in df.columns if col != groupby]
    # Get a 607875 x nDistinctIDs matrix in sparse row format with exactly
    # 1 nonzero entry per row
    onehot = ohe.fit_transform(df[groupby].values.reshape(-1, 1))
    # Transpose it, then convert from sparse column back to sparse row, as
    # dot products of two sparse row matrices are faster than sparse col with
    # sparse row
    onehot = onehot.T.tocsr()
    # Dot the transposed matrix with the other columns of the df, converted to
    # sparse row format, then convert the resulting matrix back into a sparse
    # dataframe with the same column names
    out = pd.DataFrame.sparse.from_spmatrix(
        onehot.dot(df[other_columns].sparse.to_coo().tocsr()),
        columns=other_columns)
    # Add the groupby column to the resulting dataframe with the proper class labels
    out[groupby] = ohe.categories_[0]
    # This final groupby sum simply ensures the result is in the format you would
    # expect from a regular pandas groupby and sum, but you can just return out if
    # this is going to be a performance penalty. Note that in that case the groupby
    # column is an ordinary column rather than the index
    return out.groupby(groupby).sum()

dataframe = sparse_groupby_sum(dataframe, "ID")

Note that for performance you can inline the definition of the onehot variable into the out = line; I only separated it out for teaching purposes.
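Since to_sparse no longer exists in pandas 1.0 and later, here is a minimal sketch of the same idea on toy data. It assumes the value columns can be cast with astype(pd.SparseDtype("int64", 0)), and it swaps OneHotEncoder for pd.factorize plus a hand-built SciPy indicator matrix, so it is a variant of the function above rather than the same code. The names toy, indicator, and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy data standing in for the real frame (hypothetical values).
toy = pd.DataFrame({
    "ID": ["abc", "bcd", "abc"],
    "BB": [0, 0, 0],
    "CC": [0, 0, 0],
    "DD": [1, 0, 1],
})
value_cols = [c for c in toy.columns if c != "ID"]

# Stand-in for the removed to_sparse(): cast the value columns to a sparse dtype.
sparse_values = toy[value_cols].astype(pd.SparseDtype("int64", 0))

# One integer code per row, plus the sorted distinct labels.
codes, labels = pd.factorize(toy["ID"], sort=True)

# Indicator matrix of shape (n_groups, n_rows):
# entry (g, r) is 1 when row r belongs to group g.
n_rows = len(toy)
indicator = sparse.csr_matrix(
    (np.ones(n_rows, dtype=np.int64), (codes, np.arange(n_rows))),
    shape=(len(labels), n_rows),
)

# Sparse matrix product: each output row is the sum of the input rows of one group.
summed = indicator @ sparse_values.sparse.to_coo().tocsr()

result = pd.DataFrame.sparse.from_spmatrix(summed, index=labels, columns=value_cols)
result.index.name = "ID"
print(result)
```

Because the labels go straight into the index, this variant does not need the final groupby that the function above uses to restore the usual output shape.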

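I'm afraid pandas has its limits and is most efficient on relatively small datasets of around 100 MB to 1 GB. If you want to stick with pandas, one workaround is to read the data from the source in chunks, which keeps each DataFrame small. Alternatively, if possible, you can filter out the columns the transformation does not need.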
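The chunked approach could look roughly like the sketch below. It assumes the source is a CSV file; the in-memory csv_source, its values, and the chunk size of 2 are placeholders chosen only to keep the example small.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for the real source file (hypothetical data).
csv_source = io.StringIO(
    "ID,BB,CC,DD\n"
    "abc,0,0,1\n"
    "bcd,0,0,0\n"
    "abc,0,0,1\n"
    "bcd,1,0,0\n"
)

partial_sums = []
# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(csv_source, chunksize=2):
    partial_sums.append(chunk.groupby("ID").sum())

# The same ID can appear in several chunks, so the partial sums are summed once more.
total = pd.concat(partial_sums).groupby(level=0).sum()
print(total)
```

Only one chunk plus the per-chunk sums are held in memory at a time, at the cost of the second aggregation pass over the partial results.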

Otherwise, you should look at frameworks such as PySpark or Hadoop, which are better suited to transformations on larger datasets.

I know it's counterintuitive, but looping over the columns without converting to sparse is faster. Try the code below.

# assumes 'id' is the first column of df and 'BB' the second
df1 = df[['id', 'BB']].groupby(by='id').sum()
for i in df.columns[2:]:
    df1[i] = df[['id', i]].groupby(by='id').sum()
    # if you want to save space you can drop df columns after they are added to df1
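As a small self-contained check of the column-by-column idea, the sketch below runs it on toy data and compares it with a direct groupby. It is a variation, not the loop above: it collects the per-column sums in a list and joins them with a single pd.concat, on the assumption that one concatenation is cheaper than inserting thousands of columns one at a time. The frame toy and its values are made up.

```python
import pandas as pd

# Toy dense frame (hypothetical values) with the key in the 'id' column.
toy = pd.DataFrame({
    "id": ["abc", "bcd", "abc"],
    "BB": [0, 0, 0],
    "CC": [0, 0, 0],
    "DD": [1, 0, 1],
})

keys = toy["id"]
# Sum one dense column at a time, grouping each by the key Series.
per_column = [toy[col].groupby(keys).sum() for col in toy.columns if col != "id"]

# Assemble all per-column results in one step instead of inserting columns one by one.
looped = pd.concat(per_column, axis=1)

# Cross-check against the direct groupby on this small frame.
direct = toy.groupby("id").sum()
print(looped.equals(direct))
```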
