我需要向包含数字序列的python-pandas数据帧添加一列。数字大多是按一个数字递增的,但有时模式会中断,并设置为固定数字……其他时候则会按更大的数字递增。最重要的是,我显然不能使用groupby
和rank
。我想创建一个自定义函数并使用apply
来执行它,但如何确保每一行的输出都在下一行中使用而不循环?示例:
10
15
20
200
205
210
215
如果我必须使用itertuples
来循环,那么最快的方法是创建一个series
,然后在循环结束后将series
转换为列吗?df
的行计数大约为160k
。
下面是一个示例脚本,说明如何在行之间循环。我想做的是使用apply方法(或矢量化(。有可能吗?
import pandas as pd
import re
# Sample (fake) function
def create_sequence(id, sub_id, previous_sequence):
# EDIT: Added reset to 5 if sub_id = 1 after first two answers were already given
if sub_id == 1:
sequence = 5
elif re.match('^4.*', id) and sub_id == 3:
sequence = previous_sequence + 200
elif re.match('^T.*', id) and sub_id == 6:
sequence = previous_sequence + 60
else:
sequence = previous_sequence + 5
return sequence
df = pd.DataFrame([
{'main_id': '4G5KW5', 'irrelevant_field': 'WT', 'sub_id': 1},
{'main_id': '4G5KW5', 'irrelevant_field': 'WR', 'sub_id': 2},
{'main_id': '4G5KW5', 'irrelevant_field': 'WR', 'sub_id': 3},
{'main_id': '4G5KW5', 'irrelevant_field': 'WG', 'sub_id': 4},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 1},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 2},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WG', 'sub_id': 3},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 4},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WS', 'sub_id': 5},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WS', 'sub_id': 6},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WS', 'sub_id': 7},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WT', 'sub_id': 8},
{'main_id': 'T3OFB2', 'irrelevant_field': 'WR', 'sub_id': 9}
])
# Ideally I'd like to do something like this, although I know create_sequence
# would have to be modified to accept only row as an argument
# df[sequence] = df.apply(create_sequence, axis=1)
# Loop approach:
seq_df = pd.DataFrame(columns=['main_id', 'sub_id', 'sequence'])
prev_sequence = 0
for row in df.itertuples():
main_id = row[1]
sub_id = row[3]
sequence = create_sequence(main_id, sub_id, prev_sequence)
print(main_id, sub_id, sequence)
seq_df.loc[len(seq_df.index)] = [main_id, sub_id, sequence]
prev_sequence = sequence
final_df = pd.merge(df.astype(str, skipna=False), seq_df.astype(str, skipna=False), how='left', on=['main_id', 'sub_id'])
print(final_df)
最终结果:
main_id irrelevant_field sub_id sequence
0 4G5KW5 WT 1 5
1 4G5KW5 WR 2 10
2 4G5KW5 WR 3 210
3 4G5KW5 WG 4 215
4 T3OFB2 WT 1 5
5 T3OFB2 WT 2 10
6 T3OFB2 WG 3 15
7 T3OFB2 WT 4 20
8 T3OFB2 WS 5 25
9 T3OFB2 WS 6 85
10 T3OFB2 WS 7 90
11 T3OFB2 WT 8 95
12 T3OFB2 WR 9 100
要获得sub_id
的上一个值,请使用shift
–它将整列下移一行:
>>> df['prev_sub_id'] = df.sub_id.shift(fill_value=0)
>>> df.head()
main_id irrelevant_field sub_id prev_sub_id
0 4G5KW5 WT 1 0
1 4G5KW5 WR 2 1
2 4G5KW5 WR 3 2
3 4G5KW5 WG 4 3
4 T3OFB2 WT 1 4
但是你似乎并不真的需要这个。
相反,您可以使用累积和。将5
设置为每行要添加的默认值。然后,更新要增量的值不同的行:
>>> df['sequence'] = 5
>>> df.loc[df.main_id.str.startswith('4') & df.sub_id.eq(3), 'sequence'] = 200
>>> df.loc[df.main_id.str.startswith('T') & df.sub_id.eq(6), 'sequence'] = 60
>>> df.head()
main_id irrelevant_field sub_id prev_sub_id sequence
0 4G5KW5 WT 1 0 5
1 4G5KW5 WR 2 1 5
2 4G5KW5 WR 3 2 200
3 4G5KW5 WG 4 3 5
4 T3OFB2 WT 1 4 5
最后,将它们与cumsum
相加。
>>> df['sequence'] = df.sequence.cumsum()
>>> df
main_id irrelevant_field sub_id sequence
0 4G5KW5 WT 1 5
1 4G5KW5 WR 2 10
2 4G5KW5 WR 3 210
3 4G5KW5 WG 4 215
4 T3OFB2 WT 1 220
5 T3OFB2 WT 2 225
6 T3OFB2 WG 3 230
7 T3OFB2 WT 4 235
8 T3OFB2 WS 5 240
9 T3OFB2 WS 6 300
10 T3OFB2 WS 7 305
11 T3OFB2 WT 8 310
12 T3OFB2 WR 9 315
可以对组执行操作,例如
def set_sequence(frame):
frame['sequence'] = 5
frame.loc[frame.main_id.str.startswith('4') & frame.sub_id.eq(3), 'sequence'] = 200
frame.loc[frame.main_id.str.startswith('T') & frame.sub_id.eq(6), 'sequence'] = 60
frame['sequence'] = frame.sequence.cumsum()
return frame
>>> df.groupby('main_id').apply(set_sequence)
main_id irrelevant_field sub_id sequence
0 4G5KW5 WT 1 5
1 4G5KW5 WR 2 10
2 4G5KW5 WR 3 210
3 4G5KW5 WG 4 215
4 T3OFB2 WT 1 5
5 T3OFB2 WT 2 10
6 T3OFB2 WG 3 15
7 T3OFB2 WT 4 20
8 T3OFB2 WS 5 25
9 T3OFB2 WS 6 85
10 T3OFB2 WS 7 90
11 T3OFB2 WT 8 95
12 T3OFB2 WR 9 100