如何使用熊猫应用程序创建自定义序列



我需要向包含数字序列的python-pandas数据帧添加一列。数字大多是按一个数字递增的,但有时模式会中断,并设置为固定数字……其他时候则会按更大的数字递增。最重要的是,我显然不能使用groupbyrank。我想创建一个自定义函数并使用apply来执行它,但如何确保每一行的输出都在下一行中使用而不循环?示例:

10
15
20
200
205
210
215

如果我必须使用itertuples来循环,那么最快的方法是创建一个series,然后在循环结束后将series转换为列吗?df的行计数大约为160k

下面是一个示例脚本,说明如何在行之间循环。我想做的是使用apply方法(或矢量化(。有可能吗?

import pandas as pd
import re
# Sample (fake) function
def create_sequence(id, sub_id, previous_sequence):
# EDIT:  Added reset to 5 if sub_id = 1 after first two answers were already given
if sub_id == 1:
sequence = 5
elif re.match('^4.*', id) and sub_id == 3:
sequence = previous_sequence + 200
elif re.match('^T.*', id) and sub_id == 6:
sequence = previous_sequence + 60
else:
sequence = previous_sequence + 5
return sequence
df = pd.DataFrame([
{'main_id': '4G5KW5', 'irrelevant_field': 'WT', 'sub_id': 1},
{'main_id': '4G5KW5', 'irrelevant_field': 'WR', 'sub_id': 2},
{'main_id': '4G5KW5', 'irrelevant_field': 'WR', 'sub_id': 3},
{'main_id': '4G5KW5', 'irrelevant_field': 'WG', 'sub_id': 4},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 1},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 2},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WG', 'sub_id': 3},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 4},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WS', 'sub_id': 5},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WS', 'sub_id': 6},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WS', 'sub_id': 7},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 8},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WR', 'sub_id': 9}
])
# Ideally I'd like to do something like this, although I know create_sequence
# would have to be modified to accept only row as an argument
# df[sequence] = df.apply(create_sequence, axis=1)
# Loop approach:
seq_df = pd.DataFrame(columns=['main_id', 'sub_id', 'sequence'])
prev_sequence = 0
for row in df.itertuples():
main_id = row[1]
sub_id = row[3]
sequence = create_sequence(main_id, sub_id, prev_sequence)
print(main_id, sub_id, sequence)
seq_df.loc[len(seq_df.index)] = [main_id, sub_id, sequence]
prev_sequence = sequence

final_df = pd.merge(df.astype(str, skipna=False), seq_df.astype(str, skipna=False), how='left', on=['main_id', 'sub_id'])
print(final_df)

最终结果:

main_id irrelevant_field sub_id sequence
0   4G5KW5               WT      1        5
1   4G5KW5               WR      2       10
2   4G5KW5               WR      3      210
3   4G5KW5               WG      4      215
4   T3OFB2               WT      1        5
5   T3OFB2               WT      2       10
6   T3OFB2               WG      3       15
7   T3OFB2               WT      4       20
8   T3OFB2               WS      5       25
9   T3OFB2               WS      6       85
10  T3OFB2               WS      7       90
11  T3OFB2               WT      8       95
12  T3OFB2               WR      9      100

要获得sub_id的上一个值,请使用shift–它将整列下移一行:

>>> df['prev_sub_id'] = df.sub_id.shift(fill_value=0)
>>> df.head()
main_id irrelevant_field  sub_id  prev_sub_id
0   4G5KW5               WT       1            0
1   4G5KW5               WR       2            1
2   4G5KW5               WR       3            2
3   4G5KW5               WG       4            3
4   T3OFB2               WT       1            4

但是你似乎并不真的需要这个。

相反,您可以使用累积和。将5设置为每行要添加的默认值。然后,更新要增量的值不同的行:

>>> df['sequence'] = 5
>>> df.loc[df.main_id.str.startswith('4') & df.sub_id.eq(3), 'sequence'] = 200
>>> df.loc[df.main_id.str.startswith('T') & df.sub_id.eq(6), 'sequence'] = 60
>>> df.head()
main_id irrelevant_field  sub_id  prev_sub_id  sequence
0   4G5KW5               WT       1            0         5
1   4G5KW5               WR       2            1         5
2   4G5KW5               WR       3            2       200
3   4G5KW5               WG       4            3         5
4   T3OFB2               WT       1            4         5

最后,将它们与cumsum相加。

>>> df['sequence'] = df.sequence.cumsum()
>>> df
main_id irrelevant_field  sub_id  sequence
0   4G5KW5               WT       1         5
1   4G5KW5               WR       2        10
2   4G5KW5               WR       3       210
3   4G5KW5               WG       4       215
4   T3OFB2               WT       1       220
5   T3OFB2               WT       2       225
6   T3OFB2               WG       3       230
7   T3OFB2               WT       4       235
8   T3OFB2               WS       5       240
9   T3OFB2               WS       6       300
10  T3OFB2               WS       7       305
11  T3OFB2               WT       8       310
12  T3OFB2               WR       9       315

可以对组执行操作,例如

def set_sequence(frame):
frame['sequence'] = 5
frame.loc[frame.main_id.str.startswith('4') & frame.sub_id.eq(3), 'sequence'] = 200
frame.loc[frame.main_id.str.startswith('T') & frame.sub_id.eq(6), 'sequence'] = 60
frame['sequence'] = frame.sequence.cumsum()

return frame
>>> df.groupby('main_id').apply(set_sequence)
main_id irrelevant_field  sub_id  sequence
0   4G5KW5               WT       1         5
1   4G5KW5               WR       2        10
2   4G5KW5               WR       3       210
3   4G5KW5               WG       4       215
4   T3OFB2               WT       1         5
5   T3OFB2               WT       2        10
6   T3OFB2               WG       3        15
7   T3OFB2               WT       4        20
8   T3OFB2               WS       5        25
9   T3OFB2               WS       6        85
10  T3OFB2               WS       7        90
11  T3OFB2               WT       8        95
12  T3OFB2               WR       9       100

最新更新