Pandas: sum a data column until a value is met, build a subset, rinse and repeat for all rows



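Newb here, but hoping someone can help me write code that breaks up a large dataframe. I need to do this over a lot of rows (possibly hundreds of thousands), so I want to use Pandas to put all the data into a dataframe. I am trying to work out the logic on a small subset of the data before trying it on the larger dataset, which I will bring in using dask or Pandas with chunksize... It needs to be as memory efficient as possible.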

Suppose I have the following dataframe:

    a  b
0  10  random_data_that_I_need
1  23  random_data_that_I_need
2  45  random_data_that_I_need
3  32  random_data_that_I_need
4  15  random_data_that_I_need
5  10  random_data_that_I_need
6  34  random_data_that_I_need
7  65  random_data_that_I_need
8  20  random_data_that_I_need
9  45  random_data_that_I_need
10 11  random_data_that_I_need
11 12  random_data_that_I_need
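
For anyone who wants to follow along, here is a minimal sketch that rebuilds the sample frame. The values of "a" are copied from the table above; the string in "b" is just the placeholder from the question.

```python
import pandas as pd

# Column "a" holds the numbers to be summed; column "b" stands in for
# whatever payload needs to travel with each row.
df = pd.DataFrame({
    "a": [10, 23, 45, 32, 15, 10, 34, 65, 20, 45, 11, 12],
    "b": ["random_data_that_I_need"] * 12,
})
print(df)
```

The frame gets the default 0..11 integer index, which is what the row numbers in the table correspond to.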

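What I want to do is sum column "a" until a value is reached; say my target threshold is 50. Once the threshold is reached, I want to take all the rows that got me there as one subset. If adding the next row puts me over, that's fine: because the sum of the previous rows is still below the threshold of 50, the next row should be added, and then the process restarts. If there are any rows left over at the end that don't reach the threshold, put them together as a final subset.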

So the end result would look like

result_df1:
0  10  random_data_that_I_need
1  23  random_data_that_I_need
2  45  random_data_that_I_need
result_df2:
3  32  random_data_that_I_need
4  15  random_data_that_I_need
5  10  random_data_that_I_need
result_df3:
6  34  random_data_that_I_need
7  65  random_data_that_I_need
result_df4:
8  20  random_data_that_I_need
9  45  random_data_that_I_need
result_df5:
10 11  random_data_that_I_need
11 12  random_data_that_I_need

The results don't necessarily have to be dataframes... but if they are...

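One way:

df_list = []
old_index = 0
while True:
    # Running total from the current start row; True where it exceeds 50
    m = df.iloc[old_index:, :].a.cumsum().sub(50).gt(0)
    if m.any():
        # Label of the first row that pushes the total over the threshold
        # (used as a position below, so this assumes the default 0..n-1 index)
        index = m.idxmax()
    else:
        break
    df1 = df.iloc[old_index:index+1]
    df_list.append(df1)
    old_index = index + 1
# Leftover rows that never reached the threshold
df_list.append(df.iloc[old_index:, :])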
Output:
[    a                        b
0  10  random_data_that_I_need
1  23  random_data_that_I_need
2  45  random_data_that_I_need,
    a                        b
3  32  random_data_that_I_need
4  15  random_data_that_I_need
5  10  random_data_that_I_need,
    a                        b
6  34  random_data_that_I_need
7  65  random_data_that_I_need,
    a                        b
8  20  random_data_that_I_need
9  45  random_data_that_I_need,
     a                        b
10  11  random_data_that_I_need
11  12  random_data_that_I_need]
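Alternative:

sums = 0
df_list = []
old_index = 0
for index, i in enumerate(df.a):
    sums += i
    if sums > 50:
        # Threshold exceeded: cut a subset ending at this row and start over
        df_list.append(df[old_index:index+1])
        old_index = index + 1
        sums = 0
# Leftover rows that never reached the threshold
df_list.append(df[old_index:])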
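
A variation on the loop above, offered as a sketch rather than a definitive solution: instead of slicing inside the loop, label every row with a group number in one pass and let `groupby` do the splitting. The helper name `threshold_group_ids` is made up for this example; the threshold of 50 and the "strictly greater than" rule are taken from the loop above.

```python
import numpy as np
import pandas as pd

def threshold_group_ids(values, threshold=50):
    """Give each value a group number; a group closes once its running sum exceeds threshold."""
    ids = np.empty(len(values), dtype=np.int64)
    running, group = 0, 0
    for pos, value in enumerate(values):
        ids[pos] = group          # the row that crosses the threshold stays in the group
        running += value
        if running > threshold:
            group += 1            # next row starts a new group
            running = 0
    return ids

# Sample data from the question
df = pd.DataFrame({
    "a": [10, 23, 45, 32, 15, 10, 34, 65, 20, 45, 11, 12],
    "b": ["random_data_that_I_need"] * 12,
})

ids = threshold_group_ids(df["a"].to_numpy(), threshold=50)
subsets = [part for _, part in df.groupby(ids)]

for part in subsets:
    print(part)
```

Because the grouping is done on a label array rather than on index values, this does not depend on the frame having a default integer index, and it never produces an empty trailing frame.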
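
Another option, growing each subset with pd.concat:

import pandas as pd

list_of_df = []
current_df = df.iloc[0:1]
for idx in range(1, df.shape[0]):
    # 'a' is the column being summed in the question's dataframe
    if current_df['a'].sum() < 50:
        # Still below the threshold: add the next row to the current subset
        current_df = pd.concat([current_df, df.iloc[idx:idx+1]])
    else:
        # Threshold reached: store the subset and start a new one
        list_of_df.append(current_df)
        current_df = df.iloc[idx:idx+1]
    if idx == df.shape[0]-1:
        # Last row: store whatever is left
        list_of_df.append(current_df)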

To get a dataframe, just pull it from the list, like so:

# get the first dataframe
list_of_df[0]
# or, if you want to print all dataframes to the console like in your example:
for dataframe in list_of_df:
    print(dataframe)
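
Since the question mentions reading the big dataset in pieces, here is a hedged sketch of the same threshold logic written as a generator that accepts any iterable of dataframes (for instance, what `pd.read_csv(..., chunksize=n)` returns). The function name `iter_threshold_subsets` and its parameters are invented for this example; rows left over at the end of one chunk are carried into the next, so a subset may span chunk boundaries.

```python
import pandas as pd

def iter_threshold_subsets(chunks, column="a", threshold=50):
    """Yield dataframes whose running sum of `column` has just exceeded `threshold`."""
    pending = []   # pieces of the subset currently being built
    running = 0
    for chunk in chunks:
        start = 0
        for pos, value in enumerate(chunk[column].to_numpy()):
            running += value
            if running > threshold:
                # Close the subset at this row and hand it to the caller
                pending.append(chunk.iloc[start:pos + 1])
                yield pd.concat(pending)
                pending, running, start = [], 0, pos + 1
        if start < len(chunk):
            # Unfinished rows carry over into the next chunk
            pending.append(chunk.iloc[start:])
    if pending:
        # Leftover rows that never reached the threshold
        yield pd.concat(pending)

# Sample data from the question, cut into chunks of 5 rows to mimic chunked reading
df = pd.DataFrame({
    "a": [10, 23, 45, 32, 15, 10, 34, 65, 20, 45, 11, 12],
    "b": ["random_data_that_I_need"] * 12,
})
chunks = [df.iloc[i:i + 5] for i in range(0, len(df), 5)]

subsets = list(iter_threshold_subsets(chunks, column="a", threshold=50))
for part in subsets:
    print(part)
```

Only the current chunk and the unfinished subset are held at any time, which is the point of using a generator here; collecting everything into a list, as in the demo, gives that benefit up again.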
