Vectorized lookup and update of a pandas table



I have a df and a corresponding dictionary that I created from a group and attached to a key. The df has a lot of columns, but here is the important part.

df:
key          change_x  x
0   2012_1_23_1  0         1
...
22  2012_1_23_1  0         1
23  2012_1_23_0  1         0
24  2012_1_23_0  0         0
...
46  2012_1_23_0  0         0
47  2012_1_23_1  1         0
48  2012_1_23_1  0         0
...
70  2012_1_23_1  0         0
71  2012_1_23_1  1         0

dict_df:

key            x
0    2012_1_23_0    1
1    2012_1_23_1    0

I convert the dictionary dataframe into a dict using to_dict('records').

The dict:

[{'key': '2012_1_23_0', 'x': 1},
{'key': '2012_1_23_1', 'x': 0}]
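
For reference, a minimal sketch of how the two-row dict_df and the records list above can be reproduced (values typed in by hand from the tables; the real df has far more rows and columns):

import pandas as pd

# dict_df as shown above; the actual df is much larger
dict_df = pd.DataFrame({'key': ['2012_1_23_0', '2012_1_23_1'],
                        'x':   [1, 0]})
records = dict_df.to_dict('records')
# -> [{'key': '2012_1_23_0', 'x': 1}, {'key': '2012_1_23_1', 'x': 0}]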

Both the dict and the df share this key in 'key'. I wrote a loop that takes the change_x value and uses it either to increment the key's x in the dict (if it is 1) or to set the row's x to the key's current x value (if it is 0). It takes about 2.5 s for 20k rows, and for a larger df with 400k rows it takes over 3 minutes. Assuming the same data, this is what the df and the dict would look like after the loop has run.

Loop code:

def search_key_in_dicts(key, dict):
    # linear scan over the list of records for a matching 'key'
    for d in dict:
        if d['key'] == key:
            return d
    return None

def update_value_in_dicts(key, dict, col, value):
    dict_key = search_key_in_dicts(key, dict)
    dict_key.update({col: value})

def increment_x_value(key, dict):
    update_value_in_dicts(key, dict, 'x', search_key_in_dicts(key, dict).get('x') + 1)
    return search_key_in_dicts(key, dict).get('x')

for i in range(0, len(data)):
    row = data.iloc[i]
    if row.change_x == 1:
        increment_x_value(row.key, dict)
    # write the key's current x from the dict back into the row
    data.at[row.name, 'x'] = search_key_in_dicts(row.key, dict).get('x')
df:
key          change_x  x
0   2012_1_23_1  0         1
...
22  2012_1_23_1  0         1
23  2012_1_23_0  1         1
24  2012_1_23_0  0         1
...
46  2012_1_23_0  0         1
47  2012_1_23_1  1         2
48  2012_1_23_1  0         2
...
70  2012_1_23_1  0         2
71  2012_1_23_1  1         3

dict:

key            x
0    2012_1_23_0    3
1    2012_1_23_1    1

I know the loop works, and if I had to run this again I could live with the 3-minute runtime, but I want to make it faster with np.where or pd.apply. Neither worked. This is what I tried before:

np.where(df['change_x'] == 1, increment_x_value(df['key'], dict), search_key_in_dicts(df['key'], dict))

But I got this error: ValueError: The truth value of a Series is ambiguous. My best guess is that it happens because df['key'] can map to many values of change_x.
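
As a side note, a small sketch of where that error likely comes from, assuming the functions above: np.where evaluates both branches eagerly, so search_key_in_dicts is handed the whole 'key' Series, and the d['key'] == key comparison inside it then yields a boolean Series that cannot be used in an if test:

import pandas as pd

# hypothetical illustration with a tiny Series of keys
keys = pd.Series(['2012_1_23_0', '2012_1_23_1', '2012_1_23_1'])
comparison = ('2012_1_23_0' == keys)   # boolean Series, not a single True/False
try:
    if comparison:                     # what `if d['key'] == key:` ends up doing
        pass
except ValueError as err:
    print(err)                         # The truth value of a Series is ambiguous...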

The same happens with the apply function:

def change_x_apply(key, change_x):
    if change_x == 1:
        increment_x_value(key, dict)
    return search_key_in_dicts(key, dict).get('x')

df.apply(lambda x: change_x_apply(key=df['key'], change_x=df['change_x']), axis=1)

I don't know what else to try to cut down the runtime. Any suggestions?

You need to rewrite your code completely:

#create default index
df = df.reset_index(drop=True)
#counter column per key
df['g'] = df.groupby('key').cumcount()
#merge df1 (the frame used to generate the dict) on the first match per key; fill 0 where there is no match
df['new'] = (df.merge(df1.assign(g=0)
                         .rename(columns={'x':'new'}), on=['key', 'g'], how='left')['new']
               .fillna(0)
               .astype(int))
#sum both columns and take the cumulative sum per key group
df['x'] = (df['new'] + df['change_x']).groupby(df['key']).cumsum()
#drop helper columns, commented out here for debugging
# df = df.drop(['g','new'], axis=1)
print (df)
           key  change_x  x  g  new
0  2012_1_23_1         0  0  0    0
1  2012_1_23_1         0  0  1    0
2  2012_1_23_0         1  2  0    1
3  2012_1_23_0         0  2  1    0
4  2012_1_23_0         0  2  2    0
5  2012_1_23_1         1  1  2    0
6  2012_1_23_1         0  1  3    0
7  2012_1_23_1         0  1  4    0
8  2012_1_23_1         1  2  5    0
df1 = df.drop_duplicates('key', keep='last')[['key','x']]
print (df1)
           key  x
4  2012_1_23_0  2
8  2012_1_23_1  2
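
For anyone who wants to run it end to end, here is a self-contained sketch of the steps above; the nine-row frame is made up to mirror the example output, and df1 plays the role of dict_df:

import pandas as pd

# toy data mirroring the example output above
df = pd.DataFrame({
    'key': ['2012_1_23_1', '2012_1_23_1', '2012_1_23_0', '2012_1_23_0',
            '2012_1_23_0', '2012_1_23_1', '2012_1_23_1', '2012_1_23_1', '2012_1_23_1'],
    'change_x': [0, 0, 1, 0, 0, 1, 0, 0, 1],
})
# starting x per key (what dict_df holds)
df1 = pd.DataFrame({'key': ['2012_1_23_0', '2012_1_23_1'], 'x': [1, 0]})

df = df.reset_index(drop=True)
df['g'] = df.groupby('key').cumcount()
df['new'] = (df.merge(df1.assign(g=0).rename(columns={'x': 'new'}),
                      on=['key', 'g'], how='left')['new']
               .fillna(0)
               .astype(int))
df['x'] = (df['new'] + df['change_x']).groupby(df['key']).cumsum()
df = df.drop(['g', 'new'], axis=1)

print(df)    # x per row: 0 0 2 2 2 1 1 1 2
print(df.drop_duplicates('key', keep='last')[['key', 'x']].to_dict('records'))
# -> [{'key': '2012_1_23_0', 'x': 2}, {'key': '2012_1_23_1', 'x': 2}]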
