如何利用numpy矢量化根据中间列中的值将一列的值映射到另一列



我有三列:[whiteResult, blackResult, poiColor]。我的数据本质是国际象棋比赛结果

在我的情况下,poi是个人兴趣

poiColor取两个值[white, black](分类列(。

whiteResult和blackResult是具有7/8个不同类别[win, resigned, checkmated, abandoned,.....]的分类列

我想创建一个新的列poiResult:如果poiColor是白色的,它会复制whiteResult的结果,反之亦然。

使用df.loc,itertuples的速度非常慢。关于如何使用numpy矢量化、df.apply…(任何更快的替代方案(来实现这一点,有什么建议吗?

样本数据:


blackResult whiteResult poiColor
0   resigned    win white
1   resigned    win white
2   win resigned    black
3   win checkmated  black
4   checkmated  win white
5   resigned    win white
6   resigned    win white
7   checkmated  win black
8   timeout win black
9   win resigned    black

我当前的解决方案(非常慢(

df.loc[df['poiColor']=='white', 'poiResult'] = df['whiteResult']
df.loc[df['poiColor']=='black', 'poiResult'] = df['blackResult']

我希望进行一些优化,以增强此代码的性能。

我有1000万行,使用这个解决方案很痛苦。即使使用Dask

使用np.where

import numpy as np
df['poiResult'] = np.where(df['poiColor'] == "white", 
df['whiteResult'], df['blackResult'])
>>> df
blackResult whiteResult poiColor   poiResult
0    resigned         win    white         win
1    resigned         win    white         win
2         win    resigned    black         win
3         win  checkmated    black         win
4  checkmated         win    white         win
5    resigned         win    white         win
6    resigned         win    white         win
7  checkmated         win    black  checkmated
8     timeout         win    black     timeout
9         win    resigned    black         win

性能:对于10000000行

cats = ["win", "resigned", "checkmated", "abandoned"]
df1 = pd.DataFrame({"blackResult": np.random.choice(cats, 10000000),
"whiteResult": np.random.choice(cats, 10000000),
"poiColor": np.random.choice(["white", "black"], 10000000)})
%timeit np.where(df1['poiColor'] == "white", df1['whiteResult'], df1['blackResult'])
1.21 s ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

最新更新