我有两个包含NBA球队统计数据的数据集,一个按第一队或第二队随机组织,另一个按一队在主场和二队在客场组织。
一个数据集包含统计数据,另一个包含相关的投注数据,并且形状和顺序不同。
这是第一个数据集的结构,每个游戏中的每支球队都有相关的资金线。
Date_team1 ML_team1 Team.id_team1 Final_team1 Date_team2 ML_team2 Team.id_team2 Final_team2
0 2007-11-03 130 23.0 88 2007-11-03 -150 3.0 93
1 2007-11-04 -125 28.0 95 2007-11-04 105 2.0 98
2 2007-11-04 -300 13.0 115 2007-11-04 250 21.0 101
3 2007-11-04 -425 20.0 97 2007-11-04 355 18.0 93
4 2007-11-04 -160 16.0 88 2007-11-04 140 4.0 90
5 2007-11-04 -350 9.0 92 2007-11-04 290 1.0 91
6 2007-11-04 -215 8.0 88 2007-11-04 185 19.0 93
7 2007-11-04 -350 24.0 103 2007-11-04 290 6.0 92
8 2007-11-04 -115 14.0 119 2007-11-04 -105 29.0 109
这是第二个数据集,包含了更多实际比赛的结果。
game.id team.id_team1 game.date_team1 pts_team1 team.id_team2 game.date_team2 pts_team2
41 21980 3 2007-11-03 65.0 23 2007-11-03 88.0
42 23273 2 2007-11-04 98.0 28 2007-11-04 95.0
43 22599 16 2007-11-04 88.0 4 2007-11-04 90.0
44 23409 13 2007-11-04 115.0 21 2007-11-04 101.0
45 21981 6 2007-11-04 92.0 24 2007-11-04 103.0
46 23138 14 2007-11-04 119.0 29 2007-11-04 109.0
47 22733 18 2007-11-04 93.0 20 2007-11-04 97.0
48 22872 1 2007-11-04 91.0 9 2007-11-04 92.0
49 24009 8 2007-11-04 88.0 19 2007-11-04 93.0
我希望将资金线(ML_team1和ML_team2列(与第二个数据集的统计数据合并,但鉴于team1团队不一致,我一直在寻找一种方法,首先比较日期,然后在该日期内进行比较,是团队。id用于货币线第一团队或第二团队币线。
编辑:
更清楚的是,在发布的示例数据集中,第一个数据集中的team1是team23,其ML值为130,而在第二个数据集中,team23被标记为team2。我想添加一个名为"ML_team2'的列,在本例中其值为130。
数据的收集方式使得每个数据集中的team1和team2不一致。
到目前为止,我所尝试的是这个混乱的条件:
if full_df['Team.id_team1'] == stat_data['team.id_team1']:
stat_data['ML_team1'] = full_df['ML_team1']
else if full_df['Team.id_team1'] == stat_data['team.id_team2']:
stat_data['ML_team2'] = full_df['ML_team1']
此代码返回此错误
ValueError: Can only compare identically-labeled Series objects
还有一个与pts不一致的地方。在1场比赛中,球队id 3(2007-11-03(得了93分,在另一场比赛中同一球队id 3,同一日期得了65分。
所以我不包括那个专栏,你可以弄清楚那个部分。
但基本上我会做的是将数据帧堆叠起来,这样每一行都是针对每个团队的,然后你可以在team.id和日期上合并:
import pandas as pd
df1 = pd.DataFrame([['2007-11-03', 130, 23.0, 88, '2007-11-03', -150, 3.0 , 93],
['2007-11-04', -125 , 28.0, 95 , '2007-11-04', 105, 2.0 , 98],
['2007-11-04', -300 , 13.0, 115, '2007-11-04', 250, 21.0, 101],
['2007-11-04', -425 , 20.0, 97 , '2007-11-04', 355, 18.0, 93],
['2007-11-04', -160 , 16.0, 88 , '2007-11-04', 140, 4.0 , 90],
['2007-11-04', -350 , 9.0 , 92 , '2007-11-04', 290, 1.0 , 91],
['2007-11-04', -215 , 8.0 , 88 , '2007-11-04', 185, 19.0, 93],
['2007-11-04', -350 , 24.0, 103, '2007-11-04', 290, 6.0 , 92],
['2007-11-04', -115 , 14.0, 119, '2007-11-04', -105, 29.0, 109]],columns=['Date_team1', 'ML_team1', 'Team.id_team1', 'Final_team1', 'Date_team2', 'ML_team2', 'Team.id_team2', 'Final_team2'])
df2 = pd.DataFrame([[21980, 3, '2007-11-03', 65.0, 23, '2007-11-03', 88.0],
[23273, 2 , '2007-11-04', 98.0 , 28, '2007-11-04', 95.0],
[22599, 16, '2007-11-04', 88.0 , 4 , '2007-11-04', 90.0],
[23409, 13, '2007-11-04', 115.0 , 21, '2007-11-04', 101.0],
[21981, 6 , '2007-11-04', 92.0 , 24, '2007-11-04', 103.0],
[23138, 14, '2007-11-04', 119.0 , 29, '2007-11-04', 109.0],
[22733, 18, '2007-11-04', 93.0 , 20, '2007-11-04', 97.0],
[22872, 1 , '2007-11-04', 91.0 , 9 , '2007-11-04', 92.0],
[24009, 8 , '2007-11-04', 88.0 , 19, '2007-11-04', 93.0]],columns=['game.id', 'team.id_team1', 'game.date_team1', 'pts_team1', 'team.id_team2', 'game.date_team2', 'pts_team2'])
rows = []
for x in ['1','2']:
rows += df1[[col for col in df1.columns if x in col]].values.tolist()
df1 = pd.DataFrame(rows,columns=['game.date','ML_team','team.id','pts_team'])
df1 = df1.drop('pts_team',axis=1)
rows = []
for x in ['1','2']:
rows += df2[['game.id'] + [col for col in df2.columns if x in col]].values.tolist()
df2 = pd.DataFrame(rows,columns=['game.id','team.id','game.date','pts_team'])
df2 = df2.drop('pts_team',axis=1)
df = df1.merge(df2, how='left', on=['team.id','game.date'])
输出:
print (df)
game.date ML_team team.id game.id
0 2007-11-03 130 23.0 21980
1 2007-11-04 -125 28.0 23273
2 2007-11-04 -300 13.0 23409
3 2007-11-04 -425 20.0 22733
4 2007-11-04 -160 16.0 22599
5 2007-11-04 -350 9.0 22872
6 2007-11-04 -215 8.0 24009
7 2007-11-04 -350 24.0 21981
8 2007-11-04 -115 14.0 23138
9 2007-11-03 -150 3.0 21980
10 2007-11-04 105 2.0 23273
11 2007-11-04 250 21.0 23409
12 2007-11-04 355 18.0 22733
13 2007-11-04 140 4.0 22599
14 2007-11-04 290 1.0 22872
15 2007-11-04 185 19.0 24009
16 2007-11-04 290 6.0 21981
17 2007-11-04 -105 29.0 23138