Pandas融合了许多条件句和不一致的数据



我有两个包含NBA球队统计数据的数据集,一个按第一队或第二队随机组织,另一个按一队在主场和二队在客场组织。

一个数据集包含统计数据,另一个包含相关的投注数据,并且形状和顺序不同。

这是第一个数据集的结构,每个游戏中的每支球队都有相关的资金线。

Date_team1  ML_team1    Team.id_team1   Final_team1 Date_team2  ML_team2    Team.id_team2   Final_team2
0   2007-11-03   130        23.0            88          2007-11-03  -150        3.0           93
1   2007-11-04  -125        28.0            95          2007-11-04   105        2.0           98
2   2007-11-04  -300        13.0            115         2007-11-04   250        21.0         101
3   2007-11-04  -425        20.0            97          2007-11-04   355        18.0          93
4   2007-11-04  -160        16.0            88          2007-11-04   140        4.0           90
5   2007-11-04  -350        9.0             92          2007-11-04   290        1.0           91
6   2007-11-04  -215        8.0             88          2007-11-04   185        19.0          93
7   2007-11-04  -350        24.0            103         2007-11-04   290        6.0           92
8   2007-11-04  -115        14.0            119         2007-11-04  -105        29.0         109

这是第二个数据集,包含了更多实际比赛的结果。

game.id team.id_team1   game.date_team1 pts_team1   team.id_team2   game.date_team2 pts_team2
41  21980   3               2007-11-03      65.0        23              2007-11-03      88.0
42  23273   2               2007-11-04      98.0        28              2007-11-04      95.0
43  22599   16              2007-11-04      88.0        4               2007-11-04      90.0
44  23409   13              2007-11-04      115.0       21              2007-11-04      101.0
45  21981   6               2007-11-04      92.0        24              2007-11-04      103.0
46  23138   14              2007-11-04      119.0       29              2007-11-04      109.0
47  22733   18              2007-11-04      93.0        20              2007-11-04      97.0
48  22872   1               2007-11-04      91.0        9               2007-11-04      92.0
49  24009   8               2007-11-04      88.0        19              2007-11-04      93.0

我希望将资金线(ML_team1和ML_team2列(与第二个数据集的统计数据合并,但鉴于team1团队不一致,我一直在寻找一种方法,首先比较日期,然后在该日期内进行比较,是团队。id用于货币线第一团队第二团队币线

编辑:

更清楚的是,在发布的示例数据集中,第一个数据集中的team1是team23,其ML值为130,而在第二个数据集中,team23被标记为team2。我想添加一个名为"ML_team2'的列,在本例中其值为130。

数据的收集方式使得每个数据集中的team1和team2不一致。

到目前为止,我所尝试的是这个混乱的条件:

if full_df['Team.id_team1'] == stat_data['team.id_team1']:
stat_data['ML_team1'] = full_df['ML_team1']
else if full_df['Team.id_team1'] == stat_data['team.id_team2']:
stat_data['ML_team2'] = full_df['ML_team1']

此代码返回此错误

ValueError: Can only compare identically-labeled Series objects

还有一个与pts不一致的地方。在1场比赛中,球队id 3(2007-11-03(得了93分,在另一场比赛中同一球队id 3,同一日期得了65分。

所以我不包括那个专栏,你可以弄清楚那个部分。

但基本上我会做的是将数据帧堆叠起来,这样每一行都是针对每个团队的,然后你可以在team.id和日期上合并:

import pandas as pd 

df1 = pd.DataFrame([['2007-11-03',   130,        23.0,            88,         '2007-11-03',  -150,        3.0 ,          93],
['2007-11-04',  -125 ,       28.0,            95 ,         '2007-11-04',   105,        2.0 ,          98],
['2007-11-04',  -300 ,       13.0,            115,         '2007-11-04',   250,        21.0,         101],
['2007-11-04',  -425 ,       20.0,            97 ,         '2007-11-04',   355,        18.0,          93],
['2007-11-04',  -160 ,       16.0,            88 ,         '2007-11-04',   140,        4.0 ,          90],
['2007-11-04',  -350 ,       9.0 ,            92 ,         '2007-11-04',   290,        1.0 ,          91],
['2007-11-04',  -215 ,       8.0 ,            88 ,         '2007-11-04',   185,        19.0,          93],
['2007-11-04',  -350 ,       24.0,            103,         '2007-11-04',   290,        6.0 ,          92],
['2007-11-04',  -115 ,       14.0,            119,         '2007-11-04',  -105,        29.0,         109]],columns=['Date_team1',  'ML_team1',    'Team.id_team1',   'Final_team1', 'Date_team2',  'ML_team2',    'Team.id_team2',   'Final_team2'])


df2 = pd.DataFrame([[21980,   3,               '2007-11-03',      65.0,        23,              '2007-11-03',      88.0],
[23273,   2 ,              '2007-11-04',      98.0 ,       28,              '2007-11-04',      95.0],
[22599,   16,              '2007-11-04',      88.0  ,      4 ,              '2007-11-04',      90.0],
[23409,   13,              '2007-11-04',      115.0 ,      21,              '2007-11-04',      101.0],
[21981,   6 ,              '2007-11-04',      92.0  ,      24,              '2007-11-04',      103.0],
[23138,   14,              '2007-11-04',      119.0 ,      29,              '2007-11-04',      109.0],
[22733,   18,              '2007-11-04',      93.0  ,      20,              '2007-11-04',      97.0],
[22872,   1 ,              '2007-11-04',      91.0  ,      9 ,              '2007-11-04',      92.0],
[24009,   8 ,              '2007-11-04',      88.0  ,      19,              '2007-11-04',      93.0]],columns=['game.id', 'team.id_team1',   'game.date_team1', 'pts_team1',   'team.id_team2',   'game.date_team2', 'pts_team2'])

rows = []
for x in ['1','2']:
rows += df1[[col for col in df1.columns if x in col]].values.tolist()
df1 = pd.DataFrame(rows,columns=['game.date','ML_team','team.id','pts_team'])
df1 = df1.drop('pts_team',axis=1)

rows = []
for x in ['1','2']:
rows += df2[['game.id'] + [col for col in df2.columns if x in col]].values.tolist()
df2 = pd.DataFrame(rows,columns=['game.id','team.id','game.date','pts_team'])
df2 = df2.drop('pts_team',axis=1)
df = df1.merge(df2, how='left', on=['team.id','game.date'])

输出:

print (df)
game.date  ML_team  team.id  game.id
0   2007-11-03      130     23.0    21980
1   2007-11-04     -125     28.0    23273
2   2007-11-04     -300     13.0    23409
3   2007-11-04     -425     20.0    22733
4   2007-11-04     -160     16.0    22599
5   2007-11-04     -350      9.0    22872
6   2007-11-04     -215      8.0    24009
7   2007-11-04     -350     24.0    21981
8   2007-11-04     -115     14.0    23138
9   2007-11-03     -150      3.0    21980
10  2007-11-04      105      2.0    23273
11  2007-11-04      250     21.0    23409
12  2007-11-04      355     18.0    22733
13  2007-11-04      140      4.0    22599
14  2007-11-04      290      1.0    22872
15  2007-11-04      185     19.0    24009
16  2007-11-04      290      6.0    21981
17  2007-11-04     -105     29.0    23138

最新更新