我有两个数据帧,它们有不同的行数和不同的列数。
row_List1:
date team_home team_away goals_home goals_away shootout_win competition
1 2018-06-04 India Kenya 3 0 NaN Friendly 2018
2 2018-06-06 Armenia Moldova 0 0 NaN Friendly 2018
3 2018-06-09 Italy Netherlands 1 1 NaN Friendly 2018
row_List2:
date team_home team_away goals_home goals_away shootout_win competition venue
1 2018-06-04 India Kenya 3 0 NaN Friendly 2018 Home
2 2018-06-05 USA Pakistan 8 5 NaN Friendly 2018 Nuetral
3 2018-06-06 Moldova Armenia 0 0 NaN Friendly 2018 Away
4 2018-06-07 India Srilanka 2 0 NaN Friendly 2018 Home
3 2018-06-09 Italy Netherlands 1 1 NaN Friendly 2018 Away
6 2018-06-04 India Kenya 3 0 NaN Friendly 2018 Home
所以row_List2比row_List1有更多的列和行。
row_List2包含所有比赛的场地。我需要从row_List2中删除所有游戏,这些游戏不在row_List1中。
所以基本上我需要比较两个数据框的日期,比较row_List1的team_home与row_List2的team_home和team_away,以及row_List1的team_away。
我尝试了下面的代码:
# row_list1['venue'] = np.where(((row_list1['date'] == row_list2['date']) and (row_list1['team_home'] == row_list2['team_home'] or row_list1['team_home'] == row_list2['team_away']) and (row_list1['team_away'] == row_list2['team_away'] or row_list1['team_away'] == row_list2['team_home']) and (row_list1['goals_home'] == row_list2['goals_home'] or row_list1['goals_home'] == row_list2['goals_away']) and (row_list1['goals_away'] == row_list2['goals_away'] or row_list1['goals_away'] == row_list2['goals_home'])), row_list2['venue'], np.NaN)
这些是我需要的条件,但是上面的代码给了我一个错误:
ValueError: Can only compare identically-labeled Series objects
另一个问题是,team_home和team_away可能在row_List2中被交换。所以我需要检查:
如果row_list1 [' team_home ') = = row_list2 [' team_home ']或row_list1 [' team_home '] = = row_list2 [' team_away '])和(row_list1 [' team_away '] = = row_list2 [' team_away ']或row_list1 [' team_away '] = = row_list2 [' team_home '])和(row_list1 [' goals_home '] = = row_list2 [' goals_home ']或row_list1 [' goals_home '] = = row_list2 [' goals_away '])和(row_list1 [' goals_away '] = = row_list2 [' goals_away ']或row_list1 [' goals_away '] = = row_list2 [' goals_home '])
我想要的输出是:
row_List2:
date team_home team_away goals_home goals_away shootout_win competition venue
1 2018-06-04 India Kenya 3 0 NaN Friendly 2018 Home
2 2018-06-06 Moldova Armenia 0 0 NaN Friendly 2018 Away
3 2018-06-09 Italy Netherlands 1 1 NaN Friendly 2018 Away
有人能帮忙吗?
这是一种hack,但它的工作。请注意,亚美尼亚-摩尔多瓦的比赛实际上并不匹配你的数据帧(他们是主客场翻转)。因为np.nan
不==np.nan
,所以在进行比较之前,我不得不使用.fillna()
。
>>> for df in [df1, df2]:
... df.fillna(0, inplace=True)
>>> df1[[df2.drop('venue', axis=1).eq(r).all(axis=1).any() for r in df1.itertuples(index=False)]]
date team_home team_away goals_home goals_away shootout_win competition year
0 2018-06-04 India Kenya 3 0 0.0 Friendly 2018
2 2018-06-09 Italy Netherlands 1 1 0.0 Friendly 2018
这是你要找的吗?
df = pd.merge(row_List1,row_List2.drop_duplicates(),how = 'left')
输出:
date team_home team_away goals_home goals_away shootout_win
0 6/4/2018 India Kenya 3 0 NaN
1 6/6/2018 Armenia Moldova 0 0 NaN
2 6/9/2018 Italy Netherlands 1 1 NaN
competition year venue
0 Friendly 2018 Home
1 Friendly 2018 NaN
2 Friendly 2018 Away