str.split(',', expand=True) is also slow; how can I improve performance?



I have the following DataFrame:

SFDC,SID,MID,ACT
DC02,SID1,GOAL,"view_goal_list"
DC02,SID1,GOAL,"view_goal_card,expand_people_selector_panel"
DC02,SID1,GOAL,"view_goal_list,select_user,click_add_activity"

and I want to convert the ACT column into the following format:

SFDC,SID,MID,step1,step2,step3
DC02,SID1,GOAL,view_goal_list,na,na
DC02,SID1,GOAL,view_goal_card,expand_people_selector_panel,na
DC02,SID1,GOAL,view_goal_list,select_user,click_add_activity

Here is the code I'm using. Functionally it works, but the performance is terrible when it processes about 5,000k records (it takes several hours).

df.set_index(['SFDC','SID', 'MID'])['ACT'].astype(str).str.split(',', expand = True).rename(columns=lambda x: f"step{x+1}")

Can any expert help with a faster solution?

You can perhaps bring it down a bit…

import pandas as pd
df = pd.read_csv('split.txt')
# 'split.txt' is the example data given in the question copied over and over
print(df.shape)
print(df.head())
(50000, 4)
SFDC   SID   MID                                            ACT
0  DC02  SID1  GOAL                                 view_goal_list
1  DC02  SID1  GOAL    view_goal_card,expand_people_selector_panel
2  DC02  SID1  GOAL  view_goal_list,select_user,click_add_activity
3  DC02  SID1  GOAL                                 view_goal_list
4  DC02  SID1  GOAL    view_goal_card,expand_people_selector_panel

Timings for me:

[172 ms] Current method:

%%timeit
df = pd.read_csv('split.txt')
df = df.set_index(['SFDC','SID', 'MID'])['ACT'].astype(str).str.split(',', expand = True).rename(columns=lambda x: f"step{x+1}")
df = df.reset_index()
# 172 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

[152 ms] Separate split and join (slightly faster):

%%timeit
df = pd.read_csv('split.txt')
s = df['ACT'].str.split(',', expand=True)
s = s.add_prefix('step_')
df = df.join(s)
# 152 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

[93 ms] apply is quick, as overall it gets in and out of the function faster:

%%timeit
df = pd.read_csv('split.txt')
def splitCol(s):
    return s.split(',')
s = df['ACT'].apply(splitCol).to_list()
s = pd.DataFrame(s)
s = s.add_prefix('step_')
# if required, comment out the above line and instead rename columns 0,1,2,3 etc. to step_1, step_2, etc. rather than starting at zero
#s.columns = ['step_' + str(col+1) for col in s.columns]
df = df.join(s)
# 93 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

[90.3 ms] Straight str.split().to_list() and join

seems to be the fastest (given the ±3.64 ms). Slightly skewed, because this block uses s.columns = ['step_' + str(col+1) for col in s.columns] rather than s = s.add_prefix('step_').

%%timeit
df = pd.read_csv('split.txt')
def splitCol(x):
    return pd.Series(x.split(','))
s = pd.DataFrame()
s = df['ACT'].str.split(',').to_list()
s = pd.DataFrame(s)
# this seems quicker than s = s.add_prefix('step_')
s.columns = ['step_' + str(col+1) for col in s.columns]
df = df.join(s)
# 90.3 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Sample output:

print(df.head())
SFDC   SID   MID                                            ACT  
0  DC02  SID1  GOAL                                 view_goal_list   
1  DC02  SID1  GOAL    view_goal_card,expand_people_selector_panel   
2  DC02  SID1  GOAL  view_goal_list,select_user,click_add_activity   
3  DC02  SID1  GOAL                                 view_goal_list   
4  DC02  SID1  GOAL    view_goal_card,expand_people_selector_panel   
step_0                        step_1              step_2  
0  view_goal_list                          None                None  
1  view_goal_card  expand_people_selector_panel                None  
2  view_goal_list                   select_user  click_add_activity  
3  view_goal_list                          None                None  
4  view_goal_card  expand_people_selector_panel                None 

If you need the new columns to start at step_1 rather than step_0, instead of:

s = s.add_prefix('step_')

use:

# rename columns 0,1,2,3 etc. to step_1, step_2, etc.
s.columns = ['step_' + str(col+1) for col in s.columns]

This code was written for good performance with Pandas; no loops or lambdas are involved.

For testing, I repeated the same three records to build a 5-million-row DataFrame. The timing for this operation on 5M records is about 10 seconds on my laptop.
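The answer above doesn't show its code. A minimal sketch of what a loop-free split along those lines might look like (the test-data construction, scale factor, and column names here are my assumptions, not the answer's actual code):

```python
import pandas as pd

# Hypothetical test setup: repeat the three example rows to simulate a larger frame
base = pd.DataFrame({
    "SFDC": ["DC02"] * 3,
    "SID": ["SID1"] * 3,
    "MID": ["GOAL"] * 3,
    "ACT": [
        "view_goal_list",
        "view_goal_card,expand_people_selector_panel",
        "view_goal_list,select_user,click_add_activity",
    ],
})
df = pd.concat([base] * 5, ignore_index=True)  # scale up by repetition

# Loop-free split: let the DataFrame constructor expand the lists into columns;
# short rows are padded with None automatically
steps = pd.DataFrame(df["ACT"].str.split(",").to_list(), index=df.index)
steps.columns = [f"step{i + 1}" for i in steps.columns]
out = df.join(steps)
print(out.head())
```

The key point is that the split lists go straight into the `pd.DataFrame` constructor, avoiding both `apply` and `str.split(expand=True)`.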

Since you write that it took you hours, I guess your rows don't have just 3 steps but many more, possibly hundreds or thousands.

If you have a few rows with very many steps, I suspect you mainly run into memory-copying problems. Then the solution is not to try to squeeze more performance out of Pandas (you are already close to the limit), but to think about how you really want to represent your data. Do a few rows have many steps while many rows have few, so that in the end your table is mostly empty?
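One such alternative representation is a long format, where each step becomes its own row, so a mostly-empty wide table never materializes. A minimal sketch using `Series.str.split` plus `DataFrame.explode` (the column names `step` and `step_no` are my own choices for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "SFDC": ["DC02", "DC02"],
    "SID": ["SID1", "SID1"],
    "MID": ["GOAL", "GOAL"],
    "ACT": [
        "view_goal_list",
        "view_goal_list,select_user,click_add_activity",
    ],
})

# Long format: one row per step instead of one column per step
long_df = (
    df.assign(ACT=df["ACT"].str.split(","))
      .explode("ACT")
      .rename(columns={"ACT": "step"})
)
# explode keeps the original index, so a per-row counter gives the step number
long_df["step_no"] = long_df.groupby(level=0).cumcount() + 1
print(long_df)
```

In this layout, rows with few steps cost nothing extra, regardless of how many steps the longest row has.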

Perhaps run df.ACT.str.count(",").describe() to look at the distribution of step counts, and then decide what to do, e.g. split the DataFrame into groups according to the number of steps.
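The suggested distribution check might look like this (the data and the bucketing threshold are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"ACT": [
    "view_goal_list",
    "view_goal_card,expand_people_selector_panel",
    "view_goal_list,select_user,click_add_activity",
]})

# Number of separators per row = number of steps minus one
counts = df["ACT"].str.count(",")
print(counts.describe())

# Example follow-up: bucket rows by step count before splitting,
# so short and long rows can be handled separately
few = df[counts < 2]
many = df[counts >= 2]
```

If `describe()` shows a long tail (a high max with a low median), splitting the frame into such groups keeps the wide expansion confined to the rows that actually need it.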
