我有三列:[whiteResult, blackResult, poiColor]
。我的数据本质是国际象棋比赛结果
在我的情况下,poi是个人兴趣
poiColor取两个值[white, black]
(分类列(。
whiteResult和blackResult是具有7/8个不同类别[win, resigned, checkmated, abandoned,.....]
的分类列
我想创建一个新的列poiResult:如果poiColor是白色的,它会复制whiteResult的结果,反之亦然。
使用df.loc,itertuples的速度非常慢。关于如何使用numpy矢量化、df.apply…(任何更快的替代方案(来实现这一点,有什么建议吗?
样本数据:
blackResult whiteResult poiColor
0 resigned win white
1 resigned win white
2 win resigned black
3 win checkmated black
4 checkmated win white
5 resigned win white
6 resigned win white
7 checkmated win black
8 timeout win black
9 win resigned black
我当前的解决方案(非常慢(
df.loc[df['poiColor']=='white', 'poiResult'] = df['whiteResult']
df.loc[df['poiColor']=='black', 'poiResult'] = df['blackResult']
我希望进行一些优化,以增强此代码的性能。
我有1000万行,使用这个解决方案很痛苦。即使使用Dask
使用np.where
import numpy as np
df['poiResult'] = np.where(df['poiColor'] == "white",
df['whiteResult'], df['blackResult'])
>>> df
blackResult whiteResult poiColor poiResult
0 resigned win white win
1 resigned win white win
2 win resigned black win
3 win checkmated black win
4 checkmated win white win
5 resigned win white win
6 resigned win white win
7 checkmated win black checkmated
8 timeout win black timeout
9 win resigned black win
性能:对于10000000行
cats = ["win", "resigned", "checkmated", "abandoned"]
df1 = pd.DataFrame({"blackResult": np.random.choice(cats, 10000000),
"whiteResult": np.random.choice(cats, 10000000),
"poiColor": np.random.choice(["white", "black"], 10000000)})
%timeit np.where(df1['poiColor'] == "white", df1['whiteResult'], df1['blackResult'])
1.21 s ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)