比较两个不同大小的数据框的各种(但不是全部)列,并且只从一个数据框中选择条件为真的行



我有两个数据帧,它们有不同的行数和不同的列数。

row_List1:

date   team_home  team_away   goals_home   goals_away   shootout_win   competition

1 2018-06-04 India Kenya 3 0 NaN Friendly 2018
2 2018-06-06 Armenia Moldova 0 0 NaN Friendly 2018
3 2018-06-09 Italy Netherlands 1 1 NaN Friendly 2018

row_List2:

date team_home team_away goals_home goals_away shootout_win competition venue

1 2018-06-04 India Kenya 3 0 NaN Friendly 2018 Home
2 2018-06-05 USA Pakistan 8 5 NaN Friendly 2018 Nuetral
3 2018-06-06 Moldova Armenia 0 0 NaN Friendly 2018 Away
4 2018-06-07 India Srilanka 2 0 NaN Friendly 2018 Home
3 2018-06-09 Italy Netherlands 1 1 NaN Friendly 2018 Away
6 2018-06-04 India Kenya 3 0 NaN Friendly 2018 Home

所以row_List2比row_List1有更多的列和行。

row_List2包含所有比赛的场地。我需要从row_List2中删除所有游戏,这些游戏不在row_List1中。

所以基本上我需要比较两个数据框的日期,比较row_List1的team_home与row_List2的team_home和team_away,以及row_List1的team_away。

我尝试了下面的代码:

# row_list1['venue'] = np.where(((row_list1['date'] == row_list2['date']) and (row_list1['team_home'] == row_list2['team_home'] or row_list1['team_home'] == row_list2['team_away']) and (row_list1['team_away'] == row_list2['team_away'] or row_list1['team_away'] == row_list2['team_home']) and (row_list1['goals_home'] == row_list2['goals_home'] or row_list1['goals_home'] == row_list2['goals_away']) and (row_list1['goals_away'] == row_list2['goals_away'] or row_list1['goals_away'] == row_list2['goals_home'])), row_list2['venue'], np.NaN)

这些是我需要的条件,但是上面的代码给了我一个错误:

ValueError: Can only compare identically-labeled Series objects

另一个问题是,team_home和team_away可能在row_List2中被交换。所以我需要检查:

如果row_list1 [' team_home ') = = row_list2 [' team_home ']或row_list1 [' team_home '] = = row_list2 [' team_away '])和(row_list1 [' team_away '] = = row_list2 [' team_away ']或row_list1 [' team_away '] = = row_list2 [' team_home '])和(row_list1 [' goals_home '] = = row_list2 [' goals_home ']或row_list1 [' goals_home '] = = row_list2 [' goals_away '])和(row_list1 [' goals_away '] = = row_list2 [' goals_away ']或row_list1 [' goals_away '] = = row_list2 [' goals_home '])

我想要的输出是:

row_List2:

date   team_home  team_away   goals_home   goals_away   shootout_win   competition     venue

1 2018-06-04 India Kenya 3 0 NaN Friendly 2018 Home
2 2018-06-06 Moldova Armenia 0 0 NaN Friendly 2018 Away
3 2018-06-09 Italy Netherlands 1 1 NaN Friendly 2018 Away

有人能帮忙吗?

这是一种hack,但它的工作。请注意,亚美尼亚-摩尔多瓦的比赛实际上并不匹配你的数据帧(他们是主客场翻转)。因为np.nan不==np.nan,所以在进行比较之前,我不得不使用.fillna()

>>> for df in [df1, df2]:
...    df.fillna(0, inplace=True)
>>> df1[[df2.drop('venue', axis=1).eq(r).all(axis=1).any() for r in df1.itertuples(index=False)]]
date    team_home   team_away   goals_home  goals_away  shootout_win    competition year
0   2018-06-04  India   Kenya   3   0   0.0 Friendly    2018
2   2018-06-09  Italy   Netherlands 1   1   0.0 Friendly    2018

这是你要找的吗?

df = pd.merge(row_List1,row_List2.drop_duplicates(),how = 'left')

输出:

date team_home    team_away  goals_home  goals_away  shootout_win  
0  6/4/2018     India        Kenya           3           0           NaN   
1  6/6/2018   Armenia      Moldova           0           0           NaN   
2  6/9/2018     Italy  Netherlands           1           1           NaN   
competition  year venue  
0    Friendly  2018  Home  
1    Friendly  2018   NaN  
2    Friendly  2018  Away

相关内容

最新更新