使用列组合查找panda行中的数据不匹配



根据列值的组合获取所有单元格值的最佳方法是什么?

示例数据帧一:

Stock                         Name  Price
0    AMD       Advanced Micro Devices    100
1     GE     General Electric Company    200
2    BAC  Bank of America Corporation    300
3   AAPL                   Apple Inc.    500
4   MSFT        Microsoft Corporation   1000
5  GOOGL                Alphabet Inc.   2000

示例数据帧二:

Stock                         Name  Price
0    AMD       Advanced Micro Devices    100
1     GE     General Electric Company    200
2    BAC  Branch of America Corporation  300
3   AAPL                   Apple Inc.    500
4   MSFT        Microsoft Corporation   1000
5  GOOGL                Alphabet Inc.   2000

例如:我想使用(Stock和Name(作为关键列,然后比较数据集。目标是打印两个数据集之间不匹配的条目,并将Stock+Name列用作组合键。

我正在使用Pandas/Python3.7

样本输出:

BAC美国银行股份有限公司300-BAC美国分行Corporation 300

也许,使用merge+query的完整INNER JOIN?

df1.merge(df2, on='Stock').query('Name_x != Name_y')
Stock                       Name_x  Price_x                         Name_y  Price_y
2   BAC  Bank of America Corporation      300  Branch of America Corporation      300

或者,map的解决方案略有不同,您可以使用它来获得股票符号:

m = df1.Stock.map(df2.set_index('Stock').Name).ne(df1.Name)
symbols = df1.loc[m, 'Stock']
print(symbols)
2    BAC
Name: Stock, dtype: object

然后通过股票符号访问每个DataFrame行:

df1[df1.Stock.isin(symbols)]
Stock                         Name  Price
2   BAC  Bank of America Corporation    300
df2[df2.Stock.isin(symbols)]
Stock                           Name  Price
2   BAC  Branch of America Corporation    300

如果它们在两个数据帧中,那么使用.concat可以非常简单地无条件地合并它们。一旦他们加入,这里有一种方法可以得到不匹配:

import pandas as pd
df1 = pd.DataFrame({
"Ticker_y": list("qwerty"),
"Name_y": list("asdfgh"),
"Ticker_x": list("qw3r7y"),
"Name_x": list("as6f8h")
})
mismatch = df1[(df1["Ticker_y"] != df1["Ticker_x"]) & (df1["Name_y"] != df1["Name_x"])]

最后一行只是说"df只有在满足这些条件的情况下。">

我们可以使用isin使用值序列进行测试,因为它确保DataFrame中的每个元素都包含在值中

第一个数据帧

>>> df1
Stock                         Name  Price
0    AMD       Advanced Micro Devices    100
1     GE     General Electric Company    200
2    BAC  Bank of America Corporation    300
3   APPL                   Apple Inc.    500
4   MSFT        Microsoft Corporation   1000
5  GOOGL                Alphabet Inc.   2000

第二个数据帧

>>> df2
Stock                           Name  Price
0    AMD         Advanced Micro Devices    100
1     GE       General Electric Company    200
2    BAC  Branch of America Corporation    300
3   APPL                     Apple Inc.    500
4   MSFT          Microsoft Corporation   1000
5  GOOGL                  Alphabet Inc.   2000

给你。。

>>> df2[~df2.Name.isin(df1.Name.values)]
Stock                           Name  Price
2   BAC  Branch of America Corporation    300

>>> df1[~df1.Name.isin(df2.Name.values)]
Stock                         Name  Price
2   BAC  Bank of America Corporation    300

最新更新