Python: merging two DataFrames based on the text similarity of a column



I am working with two DataFrames that look like this:

df1

country_1                               column1
united states of america                   abcd
Ireland (Republic of Ireland)              efgh
Korea Rep Of                               fsdf
Switzerland (Swiss Confederation)          dsaa

df2

country_2                               column2
united states                              cdda
Ireland                                    ddgd
South Korea                                rewt
Switzerland                                tuut

Desired output:

country_1                               column1              country_2                column2
united states of america                   abcd              united states            cdda
Ireland (Republic of Ireland)              efgh              Ireland                  ddgd
Korea Rep Of                               fsdf              South Korea              rewt
Switzerland (Swiss Confederation)          dsaa              Switzerland              tuut

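I am not very familiar with text analysis, so I cannot work out how to approach this problem. I have tried exact string matching and regular expressions, but neither solves it.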

You can use difflib from the standard library.
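As a minimal sketch of what difflib does with these names (the variable names below are illustrative, not part of the answer), the snippet scores one pair with `SequenceMatcher.ratio()` and then calls `get_close_matches` with two different cutoffs. It suggests why the default cutoff of 0.6 is probably too strict for this data:

```python
import difflib

# Candidate names, copied from df1's country_1 column in the question.
candidates = [
    "united states of america",
    "Ireland (Republic of Ireland)",
    "Korea Rep Of",
    "Switzerland (Swiss Confederation)",
]

# ratio() is 2 * M / T, where M is the number of matched characters and
# T is the combined length of both strings: 2 * 7 / (7 + 29) here.
ireland_score = difflib.SequenceMatcher(
    None, "Ireland", "Ireland (Republic of Ireland)"
).ratio()

# With the default cutoff of 0.6 that score is too low, so nothing is returned.
strict = difflib.get_close_matches("Ireland", candidates)

# Lowering the cutoff to 0.3 lets the intended candidate through.
lenient = difflib.get_close_matches("Ireland", candidates, n=1, cutoff=0.3)

print(round(ireland_score, 3), strict, lenient)
```

The comparison is case-sensitive, so lowercasing both columns first may improve matches on messier data.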

Data:

import pandas as pd

data1 = {
    "country_1": ["united states of america", "Ireland (Republic of Ireland)", "Korea Rep Of", "Switzerland (Swiss Confederation)"],
    "column1": ["abcd", "efgh", "fsdf", "dsaa"],
}
df1 = pd.DataFrame(data1)
data2 = {
    "country_2": ["united states", "Ireland", "Korea", "Switzerland"],
    "column2": ["cdda", "ddgd", "rewt", "tuut"],
}
df2 = pd.DataFrame(data2)

Code:

import difflib
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class FuzzyMerge:
    """
    Works like pandas merge except also merges on approximate matches.
    """
    left: pd.DataFrame
    right: pd.DataFrame
    left_on: str
    right_on: str
    how: str = "inner"
    cutoff: float = 0.3

    def main(self) -> pd.DataFrame:
        temp = self.right.copy()
        # For each key on the right, look up the closest key on the left
        # and store it under the left key's column name.
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]
        # With a shared key column in place, an ordinary exact merge works.
        return self.left.merge(temp, on=self.left_on, how=self.how)

    def get_closest_match(self, word: str, possibilities: pd.Series) -> Optional[str]:
        # get_close_matches returns candidates sorted best first;
        # an empty list means nothing reached the cutoff.
        matches = difflib.get_close_matches(word, possibilities, cutoff=self.cutoff)
        return matches[0] if matches else None

Calling the class:

merged = FuzzyMerge(left=df1, right=df2, left_on="country_1", right_on="country_2").main()
print(merged)

Output:

                           country_1 column1      country_2 column2
0           united states of america    abcd  united states    cdda
1      Ireland (Republic of Ireland)    efgh        Ireland    ddgd
2                       Korea Rep Of    fsdf          Korea    rewt
3  Switzerland (Swiss Confederation)    dsaa    Switzerland    tuut
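Because the result depends heavily on the cutoff, it can help to look at the similarity scores before merging. The following is a rough sketch using only difflib; `best_match` and `report` are hypothetical names introduced here for illustration, not part of the class above:

```python
import difflib

# Keys from df1 (left) and df2 (right), as defined in the data above.
left_names = [
    "united states of america",
    "Ireland (Republic of Ireland)",
    "Korea Rep Of",
    "Switzerland (Swiss Confederation)",
]
right_names = ["united states", "Ireland", "Korea", "Switzerland"]


def best_match(word, possibilities):
    """Hypothetical helper: return (best candidate, similarity ratio) for word."""
    scored = [
        (difflib.SequenceMatcher(None, candidate, word).ratio(), candidate)
        for candidate in possibilities
    ]
    score, candidate = max(scored)  # highest ratio wins
    return candidate, score


# Map every right-hand key to its best left-hand candidate and score.
report = {name: best_match(name, left_names) for name in right_names}
for name, (candidate, score) in report.items():
    print(f"{name!r} -> {candidate!r} (ratio {score:.3f})")
```

For this data the weakest correct pair ("Ireland") scores roughly 0.39, which is consistent with the class using a cutoff of 0.3 rather than difflib's default of 0.6. On other data a cutoff this low may also admit wrong matches, so the scores are worth checking.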

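You can also approach this with the pandas combining operations: join, merge and concat. I suggest starting with concat, because it is the easiest one to begin with.

P.S. Make sure the data is a DataFrame; if it is not, convert it first: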

data1 = pd.DataFrame(data1)
data2 = pd.DataFrame(data2)

Using concat:

data = pd.concat([data1, data2], axis=1)
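One point worth checking before relying on this: `pd.concat` with `axis=1` lines rows up by index, not by the similarity of the country names. The small sketch below (a two-row subset of the data, with the second frame deliberately listed in the opposite order as an assumption for illustration) shows what happens when the row order differs:

```python
import pandas as pd

df1 = pd.DataFrame({
    "country_1": ["united states of america", "Korea Rep Of"],
    "column1": ["abcd", "fsdf"],
})
# Same countries as df1, but listed in the opposite order.
df2 = pd.DataFrame({
    "country_2": ["Korea", "united states"],
    "column2": ["rewt", "cdda"],
})

# Rows are paired purely by index label: 0 with 0, 1 with 1.
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
```

Row 0 ends up pairing "united states of america" with "Korea", so concat gives the desired output only when both frames already list the countries in the same order.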
