How do you compare two pandas Series based on specific key-value pairs?

I have two pandas Series of dictionaries, as follows:

series_1 = [{'id': 'testProd_1', 'q1':'Foo1', 'q2': 'Bar1'},
{'id': 'testProd_2', 'q1':'Foo2', 'q2': 'Bar2'},
{'id': 'testProd_3', 'q1':'Foo3', 'q2': 'Bar3'},
{'id': 'testProd_5', 'q1':'Foo5', 'q2': 'Bar5'}
]
series_2 = [{'q1':'Foo1', 'q2': 'Bar1'},
{'q1':'Foo2', 'q2': 'Bar2'}, 
{'q1':'Foo3', 'q2': 'Bar3'}, 
{'q1':'Foo4', 'q2': 'Bar4'}, 
{'q1':'Foo5', 'q2': 'Bar{5}'}]

I am trying to compare the two Series and attach the id from series_1 to every matching dictionary in series_2.

expected_result = [{'id': 'testProd_1', 'q1':'Foo1', 'q2': 'Bar1'},
{'id': 'testProd_2', 'q1':'Foo2', 'q2': 'Bar2'},
{'id': 'testProd_3', 'q1':'Foo3', 'q2': 'Bar3'},
{'id': 'testProd_5', 'q1':'Foo5', 'q2': 'Bar{5}'}]

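Series equality does not work, because one Series has an extra key-value pair ('id') in every dictionary. Do I have to loop over each individual entry? What is the most efficient way to get expected_result?

I am working with two large datasets, and I am trying to link the ids from one Series to the other. The data is essentially the same, but sometimes the values in some key-value pairs contain stray characters (for example: {5}, (5), {ex.5}).

Any suggestions?

Thanks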

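It looks like what you want is merge. As I understand it, you want the inner join of the two DataFrames on the "q1" key. If so, merge is definitely the right function for you. Assuming series_1 and series_2 have already been converted to DataFrames (for example with pd.DataFrame(series_1)), it is used like this: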

series_join = series_1.merge(series_2, on='q1')

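This finds the intersection on q1 and keeps only the rows that match. If you really want to join on both q1 and q2, you can simply pass a list here (although this will not give the output you want, because unfortunately Bar5 does not compare equal to Bar{5}):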

series_join = series_1.merge(series_2, on=['q1', 'q2'])

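As for scrubbing the bad values out of the data so they can be compared this way, I would suggest running a cleanup step first, because the merge step itself offers little customization over how values are compared.

The output of the merge on q1 will contain a duplicated q2 column (q2_x and q2_y), but you can simply ignore the one you do not need: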

           id    q1  q2_x    q2_y
0  testProd_1  Foo1  Bar1    Bar1
1  testProd_2  Foo2  Bar2    Bar2
2  testProd_3  Foo3  Bar3    Bar3
3  testProd_5  Foo5  Bar5  Bar{5}

Here is where it runs.
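To illustrate the cleanup step suggested above, here is a minimal sketch that strips stray characters from q2 before joining on both keys. The regular expression (keep only letters and digits) is an assumption made for this sample data, not a rule taken from the question; real data may need a different pattern.

```python
import pandas as pd

series_1 = [{'id': 'testProd_1', 'q1': 'Foo1', 'q2': 'Bar1'},
            {'id': 'testProd_2', 'q1': 'Foo2', 'q2': 'Bar2'},
            {'id': 'testProd_3', 'q1': 'Foo3', 'q2': 'Bar3'},
            {'id': 'testProd_5', 'q1': 'Foo5', 'q2': 'Bar5'}]
series_2 = [{'q1': 'Foo1', 'q2': 'Bar1'},
            {'q1': 'Foo2', 'q2': 'Bar2'},
            {'q1': 'Foo3', 'q2': 'Bar3'},
            {'q1': 'Foo4', 'q2': 'Bar4'},
            {'q1': 'Foo5', 'q2': 'Bar{5}'}]

df1 = pd.DataFrame(series_1)
df2 = pd.DataFrame(series_2)

# Assumed cleanup rule: drop everything that is not a letter or a digit.
df2_clean = df2.assign(
    q2=df2['q2'].str.replace(r'[^0-9A-Za-z]', '', regex=True)
)

# With q2 cleaned, 'Bar{5}' becomes 'Bar5' and the two-key join can match it.
joined = df1.merge(df2_clean, on=['q1', 'q2'])
print(joined)
```

Note that the joined q2 column holds the cleaned value; if the original spelling has to survive, write the cleaned value to a separate column and merge on that instead.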

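Edit: keeping duplicates

By default, merge keeps all duplicate keys from both tables. The problem with duplicates here is that pandas does not know which row is the intended lookup row, so it simply creates one pair for every combination, as the following example shows (series 1, series 2, then the join):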

           id    q1    q2
0  testProd_1  Foo1  Bar1
1  testProd_2  Foo2  Bar2
2  testProd_3  Foo3  Bar3
3  testProd_5  Foo5  Bar5
4  testProd_6  Foo5  Bar6

     q1      q2
0  Foo1    Bar1
1  Foo2    Bar2
2  Foo3    Bar3
3  Foo4    Bar4
4  Foo5  Bar{5}
5  Foo5  Bar{6}

           id    q1    q2_y
0  testProd_1  Foo1    Bar1
1  testProd_2  Foo2    Bar2
2  testProd_3  Foo3    Bar3
3  testProd_5  Foo5  Bar{5} <<< [3  testProd_5  Foo5  Bar5] + [4  Foo5  Bar{5}]
4  testProd_5  Foo5  Bar{6} <<< [3  testProd_5  Foo5  Bar5] + [5  Foo5  Bar{6}]
5  testProd_6  Foo5  Bar{5} <<< [4  testProd_6  Foo5  Bar6] + [4  Foo5  Bar{5}]
6  testProd_6  Foo5  Bar{6} <<< [4  testProd_6  Foo5  Bar6] + [5  Foo5  Bar{6}]

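So there is no simple way to say "pick the first row of the second table", but what you can do is remove the duplicates from the second table beforehand with a function such as drop_duplicates.

You can use pandas like this: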
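A minimal sketch of that drop_duplicates approach, using the example tables above. The names df1, df2 and df2_unique are illustrative, and keeping the first row per q1 is an assumption about which lookup row is wanted.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['testProd_1', 'testProd_2', 'testProd_3', 'testProd_5', 'testProd_6'],
    'q1': ['Foo1', 'Foo2', 'Foo3', 'Foo5', 'Foo5'],
    'q2': ['Bar1', 'Bar2', 'Bar3', 'Bar5', 'Bar6'],
})
df2 = pd.DataFrame({
    'q1': ['Foo1', 'Foo2', 'Foo3', 'Foo4', 'Foo5', 'Foo5'],
    'q2': ['Bar1', 'Bar2', 'Bar3', 'Bar4', 'Bar{5}', 'Bar{6}'],
})

# Keep only the first row for each q1 value in the lookup table
# (assumption: the first occurrence is the one to match against).
df2_unique = df2.drop_duplicates(subset='q1', keep='first')

# Each row of df1 now matches at most one row of df2_unique.
joined = df1.merge(df2_unique, on='q1')
print(joined)
```

Both testProd_5 and testProd_6 end up paired with Bar{5} here, because Bar{6} was dropped from the second table before the join.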

pd.DataFrame(series_1)[['id','q1']].merge(pd.DataFrame(series_2), on=['q1']).to_dict('records')

Output:

[{'id': 'testProd_1', 'q1': 'Foo1', 'q2': 'Bar1'},
{'id': 'testProd_2', 'q1': 'Foo2', 'q2': 'Bar2'},
{'id': 'testProd_3', 'q1': 'Foo3', 'q2': 'Bar3'},
{'id': 'testProd_5', 'q1': 'Foo5', 'q2': 'Bar{5}'}]

Update with the new data in the question

pandas creates a Cartesian product for one-to-many or many-to-many joins, so you will get every combination.

df1.merge(df2, on=['q1'])

Output:

           id    q1  q2_x    q2_y
0  testProd_1  Foo1  Bar1    Bar1
1  testProd_2  Foo2  Bar2    Bar2
2  testProd_3  Foo3  Bar3    Bar3
3  testProd_5  Foo5  Bar5  Bar{5}
4  testProd_5  Foo5  Bar5  Bar{6}
5  testProd_6  Foo5  Bar6  Bar{5}
6  testProd_6  Foo5  Bar6  Bar{6}
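If you would rather be told about duplicate keys than silently receive the extra combinations, merge accepts a validate argument. The sketch below rebuilds df1 and df2 from the tables shown earlier (the variable keys_are_unique is only illustrative) and checks whether q1 is unique on both sides.

```python
import pandas as pd
from pandas.errors import MergeError

df1 = pd.DataFrame({
    'id': ['testProd_1', 'testProd_2', 'testProd_3', 'testProd_5', 'testProd_6'],
    'q1': ['Foo1', 'Foo2', 'Foo3', 'Foo5', 'Foo5'],
    'q2': ['Bar1', 'Bar2', 'Bar3', 'Bar5', 'Bar6'],
})
df2 = pd.DataFrame({
    'q1': ['Foo1', 'Foo2', 'Foo3', 'Foo4', 'Foo5', 'Foo5'],
    'q2': ['Bar1', 'Bar2', 'Bar3', 'Bar4', 'Bar{5}', 'Bar{6}'],
})

try:
    # validate='one_to_one' makes merge raise if q1 repeats in either frame.
    df1.merge(df2, on='q1', validate='one_to_one')
    keys_are_unique = True
except MergeError as exc:
    keys_are_unique = False
    print(f'merge rejected: {exc}')
```

With this data the merge is rejected, because Foo5 appears twice in each table.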


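No duplicates

If you do not want the duplicated combinations, you can create a cumcount within each q1 group, so that the first row for a given q1 in df1 joins to the first row for that q1 in df2, the second to the second, and so on: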

df1m = df1.assign(mergekey=df1.groupby('q1').cumcount())
df2m = df2.assign(mergekey=df2.groupby('q1').cumcount())
df1m.merge(df2m, on=['q1','mergekey'])

Output:

           id    q1  q2_x  mergekey    q2_y
0  testProd_1  Foo1  Bar1         0    Bar1
1  testProd_2  Foo2  Bar2         0    Bar2
2  testProd_3  Foo3  Bar3         0    Bar3
3  testProd_5  Foo5  Bar5         0  Bar{5}
4  testProd_6  Foo5  Bar6         1  Bar{6}

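Thanks for all the feedback.

I used a combination of the answers above to arrive at a solution that works for me.

series_2 had too many q1 and q2 values containing stray characters (for example "{", ".", "}", and so on), as well as a mix of upper and lower case.

I first wrote a function, to be used with apply, that lowercases each value and uses replace to strip the special characters.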

# Creates a uniform value string
def getTrueString(valString):
    trueString = valString.lower()
    remove_specialChrs = [' ', '{', '}', 'ex.']

    # Remove each unwanted substring in turn
    for char in remove_specialChrs:
        trueString = trueString.replace(char, '')

    return trueString.strip()

From there, I applied it to my two Series (assuming I had converted them to DataFrames):

series_1['trueString'] = series_1['valString'].apply(getTrueString)
series_2['trueString'] = series_2['valString'].apply(getTrueString)

Now that trueString is clean (lowercased, with all special characters removed), I used the pandas merge as suggested by Scott Boston and Daneolog in the posts above.

joined_data = pd.merge(series_2, series_1, on='trueString', how='left' )

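The resulting DataFrame shows every match based on trueString; for rows without a match, the columns coming from series_1 are left empty (NaN). That is because I chose a left join (you could also use a right join and swap the two input frames) rather than an inner join, since I wanted to see all of the series_2 data whether or not an id was found.

Hope this helps.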
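To pick out the series_2 rows that found no id, one option is merge's indicator argument, which adds a _merge column recording where each row came from. The sketch below uses small made-up frames with the same column names as above (valString, trueString, id); the data itself is hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the two cleaned frames.
series_1 = pd.DataFrame({
    'id': ['testProd_1', 'testProd_5'],
    'trueString': ['bar1', 'bar5'],
})
series_2 = pd.DataFrame({
    'valString': ['Bar1', 'Bar4', 'Bar{5}'],
    'trueString': ['bar1', 'bar4', 'bar5'],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join.
joined_data = pd.merge(series_2, series_1, on='trueString',
                       how='left', indicator=True)

# Rows from series_2 for which no id was found.
unmatched = joined_data[joined_data['_merge'] == 'left_only']
print(unmatched)
```

Filtering on _merge avoids having to guess which NaN values come from a missing match and which were already missing in the source data.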
