How to merge DataFrames of different shapes in pandas



I am trying to merge two DataFrames in pandas with large datasets, but it is causing me some problems. I will try to illustrate with a smaller example.

df1 has a list of equipment and several columns relating to that equipment:

Item ID Equipment     Owner Status   Location
1       Jackhammer    James Active   London
2       Cement Mixer  Tim   Active   New York
3       Drill         Sarah Active   Paris
4       Ladder        Luke  Inactive Hong Kong
5       Winch         Kojo  Inactive Sydney
6       Circular Saw  Alex  Active   Moscow

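df2 has a list of instances in which equipment was used. It has some columns similar to df1, but some fields are NaN, and it also records instances of equipment that are not in df1: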

Item ID Equipment     Owner Date       Location
1       Jackhammer    James 08/09/2020 London
1       Jackhammer    James 08/10/2020 London
2       Cement Mixer  NaN   29/02/2020 New York
3       Drill         Sarah 11/02/2020 NaN
3       Drill         Sarah 30/11/2020 NaN
3       Drill         Sarah 21/12/2020 NaN
6       Circular Saw  Alex  19/06/2020 Moscow
7       Hammer        Ken   21/12/2020 Toronto
8       Sander        Ezra  19/06/2020 Frankfurt

The DataFrame I would like to end up with is:

Item ID Equipment     Owner Status   Date       Location
1       Jackhammer    James Active   08/09/2020 London
1       Jackhammer    James Active   08/10/2020 London
2       Cement Mixer  Tim   Active   29/02/2020 New York
3       Drill         Sarah Active   11/02/2020 Paris
3       Drill         Sarah Active   30/11/2020 Paris
3       Drill         Sarah Active   21/12/2020 Paris
4       Ladder        Luke  Inactive NaN        Hong Kong
5       Winch         Kojo  Inactive NaN        Sydney
6       Circular Saw  Alex  Active   19/06/2020 Moscow
7       Hammer        Ken   NaN      21/12/2020 Toronto
8       Sander        Ezra  NaN      19/06/2020 Frankfurt

Instead, with the following code I get duplicated rows, which I think is because of the NaN values:

data = pd.merge(df1, df2, how='outer', on=['Item ID'])
Item ID Equipment_x  Equipment_y Owner_x Owner_y Status   Date       Location_x  Location_y
1       Jackhammer   NaN         James   James   Active   08/09/2020 London      London
1       Jackhammer   NaN         James   James   Active   08/10/2020 London      London
2       Cement Mixer NaN         Tim     NaN     Active   29/02/2020 New York    New York
3       Drill        NaN         Sarah   Sarah   Active   11/02/2020 Paris       NaN
3       Drill        NaN         Sarah   Sarah   Active   30/11/2020 Paris       NaN
3       Drill        NaN         Sarah   Sarah   Active   21/12/2020 Paris       NaN
4       Ladder       NaN         Luke    NaN     Inactive NaN        Hong Kong   Hong Kong
5       Winch        NaN         Kojo    NaN     Inactive NaN        Sydney      Sydney
6       Circular Saw NaN         Alex    NaN     Active   19/06/2020 Moscow      Moscow
7       NaN          Hammer      NaN     Ken     NaN      21/12/2020 NaN         Toronto
8       NaN          Sander      NaN     Ezra    NaN      19/06/2020 NaN         Frankfurt

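Ideally I could just drop the _y columns, but the data in the bottom rows means I would lose important information. Instead, the only thing I can think of is merging the columns and forcing pandas to compare the values in each column and always favour the non-NaN value. I'm not sure whether this is possible?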
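A minimal sketch of where the _x and _y columns seem to come from. The two tiny frames here are made up for illustration and contain no NaN at all, which suggests the suffixed columns are caused by column names shared between the frames that are not part of the join key, rather than by missing values.

```python
import pandas as pd

# Illustrative frames (not the question's data); neither contains NaN.
left = pd.DataFrame({"Item ID": [1, 2], "Owner": ["James", "Tim"]})
right = pd.DataFrame({"Item ID": [1, 2], "Owner": ["James", "Tim"]})

# "Owner" exists in both frames but is not a join key, so pandas keeps
# both copies and renames them with the default suffixes "_x" and "_y".
merged = pd.merge(left, right, how="outer", on=["Item ID"])
print(list(merged.columns))
```

Adding the shared column to `on`, or coalescing the two copies after the merge, are the usual ways to get back to a single column.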

merging the columns and forcing pandas to compare the values in each column and always favour the non-NaN value.

Is this what you mean?

In [45]: data = pd.merge(df1, df2, how='outer', on=['Item ID', 'Equipment'])                         
In [46]: data['Location'] = data['Location_y'].fillna(data['Location_x'])                            
In [47]: data['Owner'] = data['Owner_y'].fillna(data['Owner_x'])                                     
In [48]: data = data.drop(['Location_x', 'Location_y', 'Owner_x', 'Owner_y'], axis=1)                
In [49]: data                                                                                        
Out[49]: 
Item ID     Equipment    Status        Date   Location  Owner
0         1    Jackhammer    Active  08/09/2020     London  James
1         1    Jackhammer    Active  08/10/2020     London  James
2         2  Cement Mixer    Active  29/02/2020   New York    Tim
3         3         Drill    Active  11/02/2020      Paris  Sarah
4         3         Drill    Active  30/11/2020      Paris  Sarah
5         3         Drill    Active  21/12/2020      Paris  Sarah
6         4        Ladder  Inactive         NaN  Hong Kong   Luke
7         5         Winch  Inactive         NaN     Sydney   Kojo
8         6  Circular Saw    Active  19/06/2020     Moscow   Alex
9         7        Hammer       NaN  21/12/2020    Toronto    Ken
10        8        Sander       NaN  19/06/2020  Frankfurt   Ezra

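As far as I know, you can't really merge on null columns. However, you can use fillna to take a value and replace it with another value wherever it is NaN. It's not a very elegant solution, but it seems to solve at least your example.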
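To make the fillna step concrete, here is a small sketch on a made-up three-row frame (the values are illustrative, not the question's data). It also shows `Series.combine_first`, which for this purpose appears to give the same result and may read a little more clearly.

```python
import pandas as pd

# Illustrative stand-in for the suffixed columns produced by the merge.
data = pd.DataFrame({
    "Owner_x": ["James", "Tim", None],
    "Owner_y": ["James", None, "Ken"],
})

# Prefer the value from Owner_y and fall back to Owner_x where it is missing.
via_fillna = data["Owner_y"].fillna(data["Owner_x"])

# combine_first does the same kind of coalescing: keep the caller's
# non-null values and fill the gaps from the other Series.
via_combine_first = data["Owner_y"].combine_first(data["Owner_x"])

print(via_fillna.tolist())
print(via_combine_first.tolist())
```

Swapping which column calls fillna decides which frame wins when both hold a value.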

See also: pandas combine two columns with null values

In the general case, you can do it as follows:

# Merge the two dataframes using a suffix that ideally does
# not appear in your data.
suffix_string = '_DF2'
data = pd.merge(df1, df2, how='outer', on=['Item_ID'], suffixes=('', suffix_string))
# Now remove the duplicate columns by merging their content:
# use the value of column + suffix_string if column is empty.
columns_to_remove = list()
for col in df1.columns:
    second_col = f'{col}{suffix_string}'
    if second_col in data.columns:
        data[col] = data[second_col].where(data[col].isna(), data[col])
        columns_to_remove.append(second_col)
if columns_to_remove:
    data.drop(columns=columns_to_remove, inplace=True)
data

The result is:

Item_ID     Equipment  Owner    Status   Location        Date
0         1    Jackhammer  James    Active     London  08/09/2020
1         1    Jackhammer  James    Active     London  08/10/2020
2         2  Cement_Mixer    Tim    Active   New_York  29/02/2020
3         3         Drill  Sarah    Active      Paris  11/02/2020
4         3         Drill  Sarah    Active      Paris  30/11/2020
5         3         Drill  Sarah    Active      Paris  21/12/2020
6         4        Ladder   Luke  Inactive  Hong_Kong         NaN
7         5         Winch   Kojo  Inactive     Sydney         NaN
8         6  Circular_Saw   Alex    Active     Moscow  19/06/2020
9         7        Hammer    Ken       NaN    Toronto  21/12/2020
10        8        Sander   Ezra       NaN  Frankfurt  19/06/2020

With the following test data:

import io
import pandas as pd

# sep=r'\s+' splits on runs of whitespace
df1 = pd.read_csv(io.StringIO("""Item_ID Equipment     Owner Status   Location
1       Jackhammer    James Active   London
2       Cement_Mixer  Tim   Active   New_York
3       Drill         Sarah Active   Paris
4       Ladder        Luke  Inactive Hong_Kong
5       Winch         Kojo  Inactive Sydney
6       Circular_Saw  Alex  Active   Moscow"""), sep=r'\s+')

df2= pd.read_csv(io.StringIO("""Item_ID Equipment     Owner Date       Location
1       Jackhammer    James 08/09/2020 London
1       Jackhammer    James 08/10/2020 London
2       Cement_Mixer  NaN   29/02/2020 New_York
3       Drill         Sarah 11/02/2020 NaN
3       Drill         Sarah 30/11/2020 NaN
3       Drill         Sarah 21/12/2020 NaN
6       Circular_Saw  Alex  19/06/2020 Moscow
7       Hammer        Ken   21/12/2020 Toronto
8       Sander        Ezra  19/06/2020 Frankfurt"""), sep=r'\s+')
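The loop above can also be wrapped in a function so it is reusable. The following is only a sketch: `merge_coalesce` is a hypothetical name, the two small frames are a cut-down stand-in for the test data, and it assumes values from the left frame should win, with the right frame filling only the gaps.

```python
import pandas as pd


def merge_coalesce(left, right, on, suffix="_DF2"):
    """Outer-merge two frames and fold overlapping columns together.

    Hypothetical helper: values from `left` are kept, and `right` only
    fills positions where the `left` value is missing.
    """
    merged = pd.merge(left, right, how="outer", on=on, suffixes=("", suffix))
    for col in left.columns:
        dup = f"{col}{suffix}"
        if dup in merged.columns:
            # Fill gaps in the left-hand column from its suffixed twin,
            # then drop the twin.
            merged[col] = merged[col].fillna(merged[dup])
            merged = merged.drop(columns=dup)
    return merged


# Cut-down illustrative data in the same shape as the test data.
df1 = pd.DataFrame({
    "Item_ID": [1, 2],
    "Owner": ["James", "Tim"],
    "Status": ["Active", "Active"],
})
df2 = pd.DataFrame({
    "Item_ID": [2, 7],
    "Owner": [None, "Ken"],
    "Date": ["29/02/2020", "21/12/2020"],
})

result = merge_coalesce(df1, df2, on=["Item_ID"])
print(result)
```

The suffix should be a string that does not already end any column name in either frame; otherwise the helper could mistake an unrelated column for a duplicate.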
