I am working with two DataFrames that look like this:
df1
country_1                          column1
united states of america           abcd
Ireland (Republic of Ireland)      efgh
Korea Rep Of                       fsdf
Switzerland (Swiss Confederation)  dsaa
df2
country_2      column2
united states  cdda
Ireland        ddgd
South Korea    rewt
Switzerland    tuut
Desired output:
country_1                          column1  country_2      column2
united states of america           abcd     united states  cdda
Ireland (Republic of Ireland)      efgh     Ireland        ddgd
Korea Rep Of                       fsdf     South Korea    rewt
Switzerland (Swiss Confederation)  dsaa     Switzerland    tuut
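I am not very familiar with text analysis, so I could not work out an approach to this problem. I have tried exact string matching and regular expressions, but neither solves it.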
You can use difflib from the standard library.
Data:
import pandas as pd

data1 = {
    "country_1": ["united states of america", "Ireland (Republic of Ireland)", "Korea Rep Of", "Switzerland (Swiss Confederation)"],
    "column1": ["abcd", "efgh", "fsdf", "dsaa"],
}
df1 = pd.DataFrame(data1)

# Note: this sample uses "Korea" where the question has "South Korea".
data2 = {
    "country_2": ["united states", "Ireland", "Korea", "Switzerland"],
    "column2": ["cdda", "ddgd", "rewt", "tuut"],
}
df2 = pd.DataFrame(data2)
Code:
import difflib
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class FuzzyMerge:
    """
    Works like pandas merge except also merges on approximate matches.
    """

    left: pd.DataFrame
    right: pd.DataFrame
    left_on: str
    right_on: str
    how: str = "inner"
    cutoff: float = 0.3  # minimum similarity ratio (0 to 1); difflib's default is 0.6

    def main(self) -> pd.DataFrame:
        temp = self.right.copy()
        # For each key on the right, look up the closest key on the left.
        # Right-hand rows with no match get None as their key.
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]
        return self.left.merge(temp, on=self.left_on, how=self.how)

    def get_closest_match(self, word: str, possibilities: pd.Series) -> Optional[str]:
        # get_close_matches returns candidates sorted best-first.
        matches = difflib.get_close_matches(word, possibilities, cutoff=self.cutoff)
        return matches[0] if matches else None
Calling the class:
merged = FuzzyMerge(left=df1, right=df2, left_on="country_1", right_on="country_2").main()
print(merged)
Output:
                           country_1 column1      country_2 column2
0           united states of america    abcd  united states    cdda
1      Ireland (Republic of Ireland)    efgh        Ireland    ddgd
2                       Korea Rep Of    fsdf          Korea    rewt
3  Switzerland (Swiss Confederation)    dsaa    Switzerland    tuut
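If you are wondering why the cutoff is set as low as 0.3, it can help to look at the similarity scores directly. The snippet below is only a sketch: `score_table` is a helper name made up for this illustration, and it uses `difflib.SequenceMatcher`, which is what `get_close_matches` relies on for its final score.

```python
import difflib

left_names = [
    "united states of america",
    "Ireland (Republic of Ireland)",
    "Korea Rep Of",
    "Switzerland (Swiss Confederation)",
]
right_names = ["united states", "Ireland", "Korea", "Switzerland"]


def score_table(queries, candidates):
    """Hypothetical helper: map each query to (candidate, ratio) pairs, best first."""
    table = {}
    for query in queries:
        scored = [
            # Same argument order that get_close_matches uses internally.
            (cand, difflib.SequenceMatcher(None, cand, query).ratio())
            for cand in candidates
        ]
        table[query] = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return table


scores = score_table(right_names, left_names)
for query, ranked in scores.items():
    best, ratio = ranked[0]
    print(f"{query!r} -> {best!r} (ratio {ratio:.2f})")

# With the default cutoff of 0.6, the short name "Ireland" finds nothing.
print(difflib.get_close_matches("Ireland", left_names))
# Lowering the cutoff lets the long official name through.
print(difflib.get_close_matches("Ireland", left_names, cutoff=0.3))
```

On this sample data only "united states" scores above the default cutoff of 0.6, because the long official names dilute the ratio. A low cutoff fixes that but also makes false matches more likely, so it is worth checking the scores on your real data before settling on a value.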
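You can also approach this with pandas operations such as join, merge, and concat. I suggest trying concat first, because it is the easiest one to start with.
P.S. Make sure the data is a DataFrame; if it is not, convert it first: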
data1 = pd.DataFrame(data1)
data2 = pd.DataFrame(data2)
Using concat:
data = pd.concat([data1, data2], axis=1)
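One thing to keep in mind: concat with axis=1 lines rows up by index, not by country name, so it only produces the desired output when both frames already list the countries in the same order. A small sketch with made-up row order (a two-row subset of the question's data, with the second frame reversed) shows what happens otherwise:

```python
import pandas as pd

df1 = pd.DataFrame({
    "country_1": ["Ireland (Republic of Ireland)", "Korea Rep Of"],
    "column1": ["efgh", "fsdf"],
})
# Same two countries as df1, but listed in the opposite order.
df2 = pd.DataFrame({
    "country_2": ["South Korea", "Ireland"],
    "column2": ["rewt", "ddgd"],
})

# Rows are paired by index label (0 with 0, 1 with 1), not by content.
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
```

Here row 0 ends up pairing Ireland with South Korea. If the order of your two frames is not guaranteed to agree, you need a matching step on the country names rather than a plain concat.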