Split rows where one column's value changed sign but crossed zero - Python pandas

I have a DataFrame like this:

symbol      Id      Volume    cumVolume   ...                                                 
00001   93050000    100         100     ...
00001   93156340    100         200     ...    
00001   94040000   -200           0     ...    
00001   94041040   -100        -100     ...    
...       ...       ...         ...                      
00002   93050000   -100        -100     ...
00002   93156340   -100        -200     ...    
00002   94040000    100        -100     ...    
00002   94041040    400         300     ...

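Ideally, a symbol's cumVolume changes sign by first reaching zero, as for 00001 (from 200 to 0, then to -100). For some symbols such as 00002, however, cumVolume changes sign without reaching zero and crosses it instead (from -100 to 300).

I want to split those rows and get a DataFrame like this: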

symbol      Id      Volume    cumVolume   ...                                                 
00001   93050000    100         100     ...
00001   93156340    100         200     ...    
00001   94040000   -200           0     ...    
00001   94041040   -100        -100     ...    
...       ...       ...         ...                      
00002   93050000   -100        -100     ...
00002   93156340   -100        -200     ...    
00002   94040000    100        -100     ...    
00002   94041040    100           0     ...   
00002   94041040    300         300     ...

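Note that I split the last row in two by dividing the original Volume of 400 into 100 and 300, so that cumVolume now shows the zero. The information in the other columns should stay unchanged.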

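I am struggling a bit. I tried inserting rows by index, but my dataset is large, with many columns and symbols. It is hard to get the index of the rows that should be split, and also hard to change the values of two columns. Any hints or solutions would be great.

"It is hard to get the index of the rows that should be split and to change the values of two columns"

This is actually fairly easy: you need to track the sign and use the difference between consecutive values to identify the changes.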

import numpy as np
s = np.sign(df['cumVolume'])

Output:

0    1
1    1
2    0
3   -1
4   -1
5   -1
6   -1
7    1
Name: cumVolume, dtype: int64

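Now check the difference between consecutive rows within each group. If it is 2, we switched from negative to positive without stopping at zero; if it is -2, we switched from positive to negative.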

mask = s.groupby(df['symbol']).diff().abs().eq(2)
idx = s[mask].index
# Int64Index([7], dtype='int64')
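The mask can be checked end to end on the sample from the question. This is only a sketch: the DataFrame below is retyped from the four visible columns, and the intermediate name step is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Sample data retyped from the question (only the four visible columns)
df = pd.DataFrame({
    'symbol': ['00001'] * 4 + ['00002'] * 4,
    'Id': [93050000, 93156340, 94040000, 94041040] * 2,
    'Volume': [100, 100, -200, -100, -100, -100, 100, 400],
    'cumVolume': [100, 200, 0, -100, -100, -200, -100, 300],
})

s = np.sign(df['cumVolume'])
# diff() is taken per symbol, so the first row of each symbol is NaN
# and a sign change between two different symbols is never flagged
step = s.groupby(df['symbol']).diff()
mask = step.abs().eq(2)
idx = list(df.index[mask])
print(idx)
```

Grouping by symbol before taking the difference is what keeps the last row of one symbol from being compared with the first row of the next.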

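You now have a boolean mask and the list of indices where a hidden switch occurred.

To generate the final output, simply concat the modified slice with the original DataFrame: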

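import pandas as pd

df2 = (pd.concat([df[mask].assign(cumVolume=0),  # copy of the flagged rows, set to zero
                  df])
         .sort_index(kind='mergesort')  # stable sort keeps the zero row first
       )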

Output:

symbol        Id  Volume  cumVolume
0       1  93050000     100        100
1       1  93156340     100        200
2       1  94040000    -200          0
3       1  94041040    -100       -100
4       2  93050000    -100       -100
5       2  93156340    -100       -200
6       2  94040000     100       -100
7       2  94041040     400          0
7       2  94041040     400        300
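The output above zeroes cumVolume on the duplicated row but leaves Volume at 400 on both copies, whereas the question asks for 100 and 300. The following is a minimal sketch of one way to split Volume as well, assuming cumVolume is the per-symbol running sum of Volume (as it is in the sample data). The names prev, first, second and split are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Sample data retyped from the question
df = pd.DataFrame({
    'symbol': ['00001'] * 4 + ['00002'] * 4,
    'Id': [93050000, 93156340, 94040000, 94041040] * 2,
    'Volume': [100, 100, -200, -100, -100, -100, 100, 400],
})
# Assumption: cumVolume is the running sum of Volume within each symbol
df['cumVolume'] = df.groupby('symbol')['Volume'].cumsum()

s = np.sign(df['cumVolume'])
mask = s.groupby(df['symbol']).diff().abs().eq(2)

# cumVolume of the previous row within the same symbol
prev = df.groupby('symbol')['cumVolume'].shift()

# First half: the part of Volume that brings cumVolume to exactly zero
first = df[mask].assign(
    Volume=(-prev[mask]).astype(df['Volume'].dtype),
    cumVolume=0,
)
# Second half: the remainder, which equals the original cumVolume
second = df[mask].assign(Volume=df.loc[mask, 'cumVolume'])

# Stable sort so that, for each split index, the zero row comes first
split = pd.concat([first, df[~mask], second]).sort_index(kind='mergesort')
print(split)
```

All other columns are carried over unchanged because both halves start as copies of the flagged rows.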

To get the indices of the rows to split, you can do the following.

def should_split_row(current_cumVolume, previous_cumVolume):
    # True when cumVolume flipped sign without landing on zero
    if current_cumVolume < 0 and previous_cumVolume > 0:
        return True
    elif current_cumVolume > 0 and previous_cumVolume < 0:
        return True
    return False

previous_symbol = df['symbol'].iloc[0]
# Assumes df has the default 0..n-1 integer index, so index - 1 is the previous row
for index, row in df.iterrows():
    current_symbol = row['symbol']
    if (current_symbol == previous_symbol and index != 0
            and should_split_row(row['cumVolume'], df['cumVolume'].iloc[index - 1])):
        print('This row should be split: index ->', index)
    previous_symbol = current_symbol
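Since the question mentions a large dataset, a row-by-row loop may be slow. A possible vectorized equivalent of the same check is sketched below; it is an alternative formulation rather than the approach shown above, and the names prev and to_split are introduced here for illustration.

```python
import pandas as pd

# Sample data retyped from the question
df = pd.DataFrame({
    'symbol': ['00001'] * 4 + ['00002'] * 4,
    'cumVolume': [100, 200, 0, -100, -100, -200, -100, 300],
})

# Previous cumVolume within the same symbol (NaN on each symbol's first row)
prev = df.groupby('symbol')['cumVolume'].shift()

# The product is negative only when both values are non-zero with opposite
# signs; comparisons against NaN are False, so first rows are never flagged
to_split = df.index[(prev * df['cumVolume']) < 0].tolist()
print(to_split)
```

Unlike the loop, this does not depend on the DataFrame having a default integer index.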

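Also, to insert the new rows, you can refer to this question: Is it possible to insert a row at an arbitrary position in a dataframe using pandas?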
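One common way to insert a row at a given position is to slice the DataFrame and concatenate the pieces. The sketch below uses a hypothetical helper, insert_row, which is not part of pandas, and a small made-up excerpt of the 00002 rows.

```python
import pandas as pd

def insert_row(frame, position, row):
    """Hypothetical helper: return a copy of frame with row (a dict)
    inserted before the given positional index."""
    new = pd.DataFrame([row], columns=frame.columns)
    return pd.concat([frame.iloc[:position], new, frame.iloc[position:]],
                     ignore_index=True)

# Two rows of symbol 00002 from the question, with the last Volume already reduced to 300
df = pd.DataFrame({
    'symbol': ['00002', '00002'],
    'Volume': [100, 300],
    'cumVolume': [-100, 300],
})

out = insert_row(df, 1, {'symbol': '00002', 'Volume': 100, 'cumVolume': 0})
print(out)
```

Each call rebuilds the whole DataFrame, so for many split rows a single concat followed by a sort is likely to scale better.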
