使用apply()处理Pandas行的速度非常慢



我有两个数据帧,df1(500万行(&df2(仅约150行(。我想查找df2并根据df2中的数据更新df1。基本上,我有几个条件需要以DataFrame df1的形式应用于df1。如果df2中的条件在df1中满足,则设置一个标志。

data1 = {'type':['1','2','3'],
'code':['A','B','C'],
'type_2':['A1', 'B1', 'C1'],
'num1': ['101','102', '103'],
'num2': ['','',''],
'p_cd':['AA', 'BB', 'CC']
}
df1 = pd.DataFrame(data1)
data2 = {'type':['1','2'],
'code':['A','B'],
'type_2':['', 'B1'],
'num1': ['','102'],
'num2': ['',''],
'custom_expression':["p_cd=='AA'", ''],
'delete_flag':['Y', 'Y']
}
df2 = pd.DataFrame(data2)
df_cols = df1.columns.tolist()
def delete_flag(row_from_outlier):
for index, row_from_delete in  df2.iterrows():
if row_from_delete['custom_expression'].strip():
for col in df_cols:
if row_from_delete['custom_expression'].find(col) != -1:
print(row_from_delete['custom_expression'])
cust_exp = row_from_delete['custom_expression'].replace(col, "row_from_outlier['"+col+"']" )
else:
cust_exp = '1==1'
print(cust_exp)
if (
( (not row_from_delete['type'].strip()  ) or (row_from_delete['type'] ==row_from_outlier['type'] )  )
and ( (not row_from_delete['code'].strip()  ) or (row_from_delete['code'] == row_from_outlier['code'] )               )
and ( (not row_from_delete['type_2'].strip()) or (row_from_delete['type_2'] == row_from_outlier['type_2'] ) )
and ( (not row_from_delete['num1'].strip()  ) or (row_from_delete['num1'] == row_from_outlier['num1'] ) )
and ( (not row_from_delete['num2'].strip()  ) or (row_from_delete['num2'] == row_from_outlier['num2'] ) )
and eval(cust_exp)
) :
del_flg = row_from_delete['delete_flag']
if not 'del_flg' in locals():
del_flg = 'N/A'
return del_flg
df1['delete_flag'] = df1.apply(lambda row:delete_flag(row), axis=1)

当对一小部分数据进行测试时,上面的代码运行良好,但速度不够快,无法处理500万条记录。

发现这个Cython、Numba和pandas.eval((推荐可能有效,但我对Python还很陌生,不知道如何将上面的代码转换为使用Cython

一些可以帮助您加快速度的观察结果。

你重复做很多事情:

剥离delete_flag:内部

只需执行一次,在函数之外,在开始时执行:

for col in df1.columns:
df1[col] = df1[col].str.strip()
for col in df2.columns:
df2[col] = df2[col].str.strip()

建筑cust_exp

的这部分功能

...
for index, row_from_delete in df2.iterrows():
if row_from_delete['custom_expression'].strip():
for col in df_cols:
if row_from_delete['custom_expression'].find(col) != -1:
cust_exp = row_from_delete['custom_expression'].replace(col, "row_from_outlier['"+col+"']" )
else:
cust_exp = '1==1'
...

实际上独立于函数的参数row_from_outlier!因此,您正在一次又一次地构建相同的cust_exp。这意味着你应该只做一次,在函数之外。要做到这一点,你可以尝试

df2['cust_exp'] = df2.custom_expression.str.replace(
'|'.join(str(col) for col in df1.columns),
lambda m: 'row_from_outlier["' + m.group(0) + '"]',
regex=True
)
df2.cust_exp[df2.cust_exp == ''] = 'True'

其将结果附加到df2(print(df2.cust_exp)(:

0    row_from_outlier["p_cd"]=='AA'
1                              True

为了进一步";去循环";您可以执行以下功能:

from functools import reduce
from operator import and_
def delete_flag(row_from_outlier):
scope = locals()
ser = reduce(and_, [df2[col].isin(['', row_from_outlier[col]])
for col in df2.columns[:5]]
+ [df2.cust_exp.apply(lambda s: eval(s, scope))])
if ser.any():
return df2.delete_flag[ser].to_list()[-1]
return 'N/A'

在这里我已经使用了这个

( (not row_from_delete['type'].strip()  )
or (row_from_delete['type'] == row_from_outlier['type'] ) )

相当于

row_from_delete['type'].isin(['', row_from_outlier['type']])

并且该操作可以按列进行

df2['type'].isin(['', row_from_outlier['type']])

柱的CCD_ 7-聚集是由CCD_。

最后但同样重要的一个问题是:在我看来,您的代码中有可能del_flg被分配多次,而只有最后一个被使用?这似乎很奇怪。这是有意的吗?

最新更新