Efficiently computing time features with pandas



I have the following .csv file:

Match_idx,Date,Player_1,Player_2,Player_1_wins
0,2020-01-01,p1,p2,1
1,2020-01-02,p2,p3,0
2,2020-01-03,p3,p1,1
3,2020-01-04,p4,p1,1

I want to compute additional columns to obtain the following output .csv file:

Match_idx,Date,Player_1,Player_2,Player_1_wins,Player_1_winrate,Player_2_winrate,Player_1_matches,Player_2_matches,Head_to_head
0,2020-01-01,p1,p2,1,0,0,0,0,0,''
1,2020-01-02,p2,p3,0,0,0,1,0,0,''
2,2020-01-03,p3,p1,1,1,1,1,1,0,''
3,2020-01-04,p4,p1,1,0,1/2,0,2,0,''
4,2020-01-05,p1,p3,0,1/2,2/2,3,2,'0'
5,2020-01-06,p3,p1,1,1/3,3/3,4,3,'11'

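Semantics of each column:

  • Match_idx, Date, Player_1, Player_2: straightforward
  • Player_1_wins: did Player_1 win the match? 1 : 0

These columns will be kept, and I want to add the following columns:

  • Player_1_winrate: number_of_wins_for_player_1_before_this_one / number_of_matches_played_by_player_1_before_this_one

  • Player_2_winrate: same for Player_2

  • Player_1_matches: number of matches played by Player_1 before this one

  • Player_2_matches: same as above for Player_2

  • Head_to_head: results of the previous matches between Player_1 and Player_2, encoded as a string over {"0", "1"}, with "1" if Player_1 won the match and "0" otherwise.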
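As an illustration of the Head_to_head definition, here is a minimal plain-Python sketch. It assumes the matches are already sorted by date and that each earlier meeting is encoded from the point of view of the current Player_1, which is one reading of the definition that agrees with the '0' and '11' values in the expected output. The function name `head_to_head_strings` and the tuple layout are hypothetical.

```python
def head_to_head_strings(matches):
    """Return one Head_to_head string per match, for matches sorted by date.

    `matches` is a list of (player_1, player_2, player_1_wins) tuples;
    this layout is an assumption made for the example.
    """
    past_winners = {}  # unordered pair of players -> winners of earlier meetings
    encoded = []
    for p1, p2, p1_wins in matches:
        winners = past_winners.setdefault(frozenset((p1, p2)), [])
        # Encode earlier meetings from the current Player_1's point of view.
        encoded.append("".join("1" if w == p1 else "0" for w in winners))
        # Record the result only after the string for this match is built.
        winners.append(p1 if p1_wins == 1 else p2)
    return encoded


# The three p1/p3 meetings from the expected output (Match_idx 2, 4 and 5).
sample = [("p3", "p1", 1), ("p1", "p3", 0), ("p3", "p1", 1)]
h2h = head_to_head_strings(sample)
print(h2h)  # ['', '0', '11']
```

Keying the history on the unordered pair means both orderings of the same two players share one record, and the perspective is only decided when the string is built.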

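What I have done

I am using the pandas library to manipulate this file. The naive approach I have been thinking of is: select every match played by a player, won or lost, sorted by date. Then, for the win rate features, apply the following two functions to the matches.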

import pandas as pd

def get_matches_won_before_by_player(df: pd.DataFrame, player: str, before: str):
    # Matches won by `player`, either as Player_1 or as Player_2
    mask_player_won = (
        ((df['Player_1_wins'] == 1) & (df['Player_1'] == player)) |
        ((df['Player_1_wins'] == 0) & (df['Player_2'] == player))
    )
    # Keep only matches strictly before `before`, sorted by date
    return df[(df['Date'] < before) & mask_player_won].sort_values(by='Date')

def get_matches_played_before_by_player(df: pd.DataFrame, player: str, before: str):
    # Matches in which `player` took part, on either side
    mask_player_played = (
        (df['Player_1'] == player) |
        (df['Player_2'] == player)
    )
    # Keep only matches strictly before `before`, sorted by date
    return df[(df['Date'] < before) & mask_player_played].sort_values(by='Date')

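I could apply this logic to each match, but that would mean running these functions once per match, which is very inefficient.

What I want to do

How can I compute my features efficiently, using only each player's last match before the given match? For example, updating each player's win rate could be done with the following logic: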

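  1. Initialize every column to 0
  2. Update the win rate as (M * N + W) / (N + 1), where M is the current win rate, N is the current number of matches, and W = 1 if the player won, 0 otherwise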
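The incremental idea above can be sketched as a single pass over the matches in date order, keeping only running totals per player. This is a minimal plain-Python illustration, not a definitive implementation: it stores wins and match counts and derives the win rate from them, which is equivalent to updating a running mean. The function name `running_features` and the tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def running_features(matches):
    """Yield pre-match features for each match, given matches sorted by date.

    `matches` is an iterable of (player_1, player_2, player_1_wins) tuples;
    this layout is an assumption made for the example.
    """
    played = defaultdict(int)  # matches played so far, per player
    won = defaultdict(int)     # matches won so far, per player

    for p1, p2, p1_wins in matches:
        # Features use only matches strictly before the current one.
        yield {
            "Player_1_matches": played[p1],
            "Player_2_matches": played[p2],
            "Player_1_winrate": won[p1] / played[p1] if played[p1] else 0.0,
            "Player_2_winrate": won[p2] / played[p2] if played[p2] else 0.0,
        }
        # Update the running totals only after the features are emitted.
        played[p1] += 1
        played[p2] += 1
        won[p1] += p1_wins
        won[p2] += 1 - p1_wins


# The four matches from the input file, in date order.
sample = [("p1", "p2", 1), ("p2", "p3", 0), ("p3", "p1", 1), ("p4", "p1", 1)]
features = list(running_features(sample))
print(features[3])
```

Each match is visited once, so the cost grows linearly with the number of matches instead of filtering the whole table for every row.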

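Any help or ideas on how to organize such a process would be greatly appreciated.

I tried to operate on Series so that the solution runs fast. I explain it through the comments in the code.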

import numpy as np
import pandas as pd

# df is assumed to be the DataFrame loaded from the question's .csv file

# Accumulates the Player_1_wins values seen so far into one string
strp1gw = ""

def get_head_to_head(s):
    global strp1gw
    strp1gw += str(s)
    return strp1gw

df = (
    df
    .assign(
        # Cumulative wins of Player_1, stored under the name Player_1_winrate to avoid an extra column; replaced by the rate below
        Player_1_winrate=lambda x: x['Player_1_wins'].cumsum(),
        # 1 if p1 took part in the match, else 0
        Player_1_matches=lambda x: np.where((x['Player_1'] == 'p1') | (x['Player_2'] == 'p1'), 1, 0),
    )
    # Running count of matches involving p1, up to and including this one
    .assign(Player_1_matches=lambda x: x['Player_1_matches'].cumsum())
    # The Player_1 win rate
    .assign(Player_1_winrate=lambda x: x['Player_1_winrate'] / x['Player_1_matches'])
    # The question does not say how to compute Player_2_wins; assumption: Player_2 wins whenever Player_1 does not
    .assign(Player_2_wins=lambda x: 1 - x['Player_1_wins'])
    # Same for Player_2
    .assign(
        Player_2_winrate=lambda x: x['Player_2_wins'].cumsum(),
        Player_2_matches=lambda x: np.where((x['Player_1'] == 'p2') | (x['Player_2'] == 'p2'), 1, 0),
    )
    .assign(Player_2_matches=lambda x: x['Player_2_matches'].cumsum())
    .assign(Player_2_winrate=lambda x: x['Player_2_winrate'] / x['Player_2_matches'])
    # Apply the function to get the head-to-head value
    .assign(Head_to_head=lambda x: x['Player_1_wins'].apply(get_head_to_head))
)
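For the per-player counts and win rates, one possible vectorized alternative is sketched below, under stated assumptions: reshape the matches into one row per (match, player), then use grouped cumulative operations so that each player's totals only include their own earlier matches. It assumes the rows are already sorted by date and that the column names match the question's file; the names `add_player_features`, `long`, `slot` and `won` are hypothetical. Head_to_head is left out of this sketch.

```python
import numpy as np
import pandas as pd


def add_player_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add pre-match match counts and win rates for both players.

    Assumes `df` is sorted by Date (an assumption of this sketch).
    """
    out = df.copy()
    pos = np.arange(len(out))  # chronological position of each match
    p1_wins = out["Player_1_wins"].to_numpy()

    # One row per (match, player): stack the Player_1 and Player_2 views.
    side1 = pd.DataFrame({"match": pos, "slot": 1,
                          "player": out["Player_1"].to_numpy(), "won": p1_wins})
    side2 = pd.DataFrame({"match": pos, "slot": 2,
                          "player": out["Player_2"].to_numpy(), "won": 1 - p1_wins})
    long = (pd.concat([side1, side2], ignore_index=True)
              .sort_values(["match", "slot"])
              .reset_index(drop=True))

    grouped = long.groupby("player")
    # cumcount() numbers each player's rows 0, 1, 2, ..., which equals the
    # number of matches that player played before the current one.
    matches_before = grouped.cumcount()
    # Cumulative wins including the current match, minus the current result,
    # leaves the wins strictly before the current match.
    wins_before = grouped["won"].cumsum() - long["won"]
    long["matches"] = matches_before
    # 0 / 0 yields NaN for a player's first match; report it as 0.
    long["winrate"] = (wins_before / matches_before).fillna(0.0)

    # Write the per-player values back into the wide, one-row-per-match table.
    for slot in (1, 2):
        part = long[long["slot"] == slot]
        out[f"Player_{slot}_winrate"] = part["winrate"].to_numpy()
        out[f"Player_{slot}_matches"] = part["matches"].to_numpy()
    return out


matches = pd.DataFrame({
    "Match_idx": [0, 1, 2, 3],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
    "Player_1": ["p1", "p2", "p3", "p4"],
    "Player_2": ["p2", "p3", "p1", "p1"],
    "Player_1_wins": [1, 0, 1, 1],
})
result = add_player_features(matches.sort_values("Date"))
print(result)
```

Unlike a cumulative sum over the whole column, the grouped version never mixes results from different players, and it avoids calling a filtering function once per match.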
