I have the following DataFrame:
SFDC,SID,MID,ACT
DC02,SID1,GOAL,"view_goal_list"
DC02,SID1,GOAL,"view_goal_card,expand_people_selector_panel"
DC02,SID1,GOAL,"view_goal_list,select_user,click_add_activity"
and I want to convert the ACT column to the following format:
SFDC,SID,MID,step1,step2,step3
DC02,SID1,GOAL,view_goal_list,NA,NA
DC02,SID1,GOAL,view_goal_card,expand_people_selector_panel,NA
DC02,SID1,GOAL,view_goal_list,select_user,click_add_activity
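Here is the code I am using. It works functionally, but performance is terrible when it processes about 5,000k (5 million) records: it takes several hours.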
df.set_index(['SFDC','SID', 'MID'])['ACT'].astype(str).str.split(',', expand = True).rename(columns=lambda x: f"step{x+1}")
Can any expert suggest a faster solution?
You might be able to bring the time down a little…
import pandas as pd
df = pd.read_csv('split.txt')
# 'split.txt' is the example data given in the question copied over and over
print(df.shape)
print(df.head())
(50000, 4)
SFDC SID MID ACT
0 DC02 SID1 GOAL view_goal_list
1 DC02 SID1 GOAL view_goal_card,expand_people_selector_panel
2 DC02 SID1 GOAL view_goal_list,select_user,click_add_activity
3 DC02 SID1 GOAL view_goal_list
4 DC02 SID1 GOAL view_goal_card,expand_people_selector_panel
Timings for me:
[172 ms] Current method:
%%timeit
df = pd.read_csv('split.txt')
df = df.set_index(['SFDC','SID', 'MID'])['ACT'].astype(str).str.split(',', expand = True).rename(columns=lambda x: f"step{x+1}")
df = df.reset_index()
# 172 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[152 ms] Separate split and join (slightly quicker):
%%timeit
df = pd.read_csv('split.txt')
s = df['ACT'].str.split(',', expand=True)
s = s.add_prefix('step_')
df = df.join(s)
# 152 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[93 ms] apply is quicker because, overall, it gets in and out of the function faster:
%%timeit
df = pd.read_csv('split.txt')
def splitCol(s):
    return s.split(',')
s = df['ACT'].apply(splitCol).to_list()
s = pd.DataFrame(s)
s = s.add_prefix('step_')
# to number the columns from step_1 rather than step_0, replace the line above with:
#s.columns = ['step_' + str(col+1) for col in s.columns]
df = df.join(s)
# 93 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
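[90.3 ms] Straight str.split().to_list() and join
This seems to be the quickest (given ± 3.64 ms). The comparison is slightly skewed, though, because this code block uses s.columns = ['step_' + str(col+1) for col in s.columns], which is quicker than s = s.add_prefix('step_').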
%%timeit
df = pd.read_csv('split.txt')
# splitCol and the empty DataFrame are not used below
def splitCol(x):
    return pd.Series(x.split(','))
s = pd.DataFrame()
s = df['ACT'].str.split(',').to_list()
s = pd.DataFrame(s)
# this seems quicker than s = s.add_prefix('step_')
s.columns = ['step_' + str(col+1) for col in s.columns]
df = df.join(s)
# 90.3 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Example output:
print(df.head())
SFDC SID MID ACT
0 DC02 SID1 GOAL view_goal_list
1 DC02 SID1 GOAL view_goal_card,expand_people_selector_panel
2 DC02 SID1 GOAL view_goal_list,select_user,click_add_activity
3 DC02 SID1 GOAL view_goal_list
4 DC02 SID1 GOAL view_goal_card,expand_people_selector_panel
step_0 step_1 step_2
0 view_goal_list None None
1 view_goal_card expand_people_selector_panel None
2 view_goal_list select_user click_add_activity
3 view_goal_list None None
4 view_goal_card expand_people_selector_panel None
If you need the new columns to start from step_1 rather than step_0, then instead of:
s = s.add_prefix('step_')
use:
# rename columns 0,1,2,3 etc. to step_1, step_2, etc.
s.columns = ['step_' + str(col+1) for col in s.columns]
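Putting the fastest variant above into one reusable piece might look like the sketch below. The function name split_steps and its parameters are made up for illustration, and it assumes the ACT column contains no missing values (the question's code works around that with astype(str)).

```python
import pandas as pd

def split_steps(df, col="ACT", sep=",", prefix="step_"):
    """Return df joined with one new column per separated value in `col`.

    Hypothetical helper; assumes `col` holds strings with no missing values.
    """
    # Split each string into a Python list, then build the new frame in one go.
    # Shorter rows are padded with missing values.
    parts = pd.DataFrame(df[col].str.split(sep).to_list(), index=df.index)
    # The new columns are labelled 0, 1, 2, ...; rename to step_1, step_2, ...
    parts.columns = [f"{prefix}{i + 1}" for i in parts.columns]
    return df.join(parts)

# The three example rows from the question.
df = pd.DataFrame({
    "SFDC": ["DC02", "DC02", "DC02"],
    "SID": ["SID1", "SID1", "SID1"],
    "MID": ["GOAL", "GOAL", "GOAL"],
    "ACT": [
        "view_goal_list",
        "view_goal_card,expand_people_selector_panel",
        "view_goal_list,select_user,click_add_activity",
    ],
})

out = split_steps(df)
print(out.drop(columns="ACT"))
```

Passing index=df.index keeps the join aligned even when the DataFrame does not have the default 0..n-1 index.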
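This code is already written for good performance with pandas: there is no looping or lambda involved.
To test it, I repeated the same three records to build a DataFrame with 5 million rows. On a laptop, timeit reports about 10 seconds for this operation on the 5M records.
Since you write that it takes you hours, I am guessing that your rows do not have just 3 steps but many more, perhaps hundreds or thousands.
If you have a few rows with very many steps, I think you will mostly run into memory-copy problems. The way to solve that is not to try to squeeze more performance out of pandas (it is already close to the limit), but to think about how you really want to represent your data. Do you have a few rows with many steps and many rows with few steps, so that your table ends up mostly empty?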
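One possible alternative representation, offered only as an illustration and not something the question asked for, is a long format with one row per step, which stores no empty cells. The sketch below uses Series.str.split and DataFrame.explode; the column names step, step_no and row are my own choices.

```python
import pandas as pd

# The three example rows from the question.
df = pd.DataFrame({
    "SFDC": ["DC02", "DC02", "DC02"],
    "SID": ["SID1", "SID1", "SID1"],
    "MID": ["GOAL", "GOAL", "GOAL"],
    "ACT": [
        "view_goal_list",
        "view_goal_card,expand_people_selector_panel",
        "view_goal_list,select_user,click_add_activity",
    ],
})

# Turn each ACT string into a list, then emit one row per list element.
# explode() repeats the original index label for every element of a row.
long = (
    df.assign(ACT=df["ACT"].str.split(","))
      .explode("ACT")
      .rename(columns={"ACT": "step"})
)

# Number the steps within each original row: 1, 2, 3, ...
long["step_no"] = long.groupby(level=0).cumcount() + 1

# Keep the original row label as an ordinary column called "row".
long = long.rename_axis("row").reset_index()
print(long)
```

The long table grows with the total number of steps rather than with rows times the longest row, so a handful of very long rows no longer forces padding onto every other row.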
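Maybe look at the distribution of the number of steps by running df.ACT.str.count(",").describe() (this counts the separators, so the number of steps is one more), and then decide what to do, for example splitting the DataFrame into groups by number of steps.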
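A minimal sketch of that inspection, followed by one way to split by step count, is shown below. The variable names n_steps, groups and wide are illustrative, and widening each group separately is just one option, not the only way to proceed.

```python
import pandas as pd

# The three example rows from the question.
df = pd.DataFrame({
    "SFDC": ["DC02", "DC02", "DC02"],
    "SID": ["SID1", "SID1", "SID1"],
    "MID": ["GOAL", "GOAL", "GOAL"],
    "ACT": [
        "view_goal_list",
        "view_goal_card,expand_people_selector_panel",
        "view_goal_list,select_user,click_add_activity",
    ],
})

# Number of steps per row = number of commas + 1.
n_steps = df["ACT"].str.count(",") + 1
print(n_steps.describe())

# One sub-DataFrame per distinct step count.
groups = {int(n): sub for n, sub in df.groupby(n_steps)}

# Within a group every row has the same number of steps,
# so widening it produces no padded (empty) cells.
wide = {}
for n, sub in groups.items():
    parts = pd.DataFrame(
        sub["ACT"].str.split(",").to_list(),
        index=sub.index,
        columns=[f"step_{i + 1}" for i in range(n)],
    )
    wide[n] = sub.join(parts)

print(wide[3])
```

If describe() shows a maximum far above the median, a few very long rows are probably what makes the single wide table expensive.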