逐行查找两个数据帧之间的相似性



我有两个列相同的数据帧,df1和df2。我想找出这两个数据集之间的相似性。我一直在遵循这两种方法中的一种。第一个是将两个数据帧中的一个附加到另一个数据帧上,并选择重复数据:

df=pd.concat([df1,df2],join='inner')
mask = df.Check.duplicated(keep=False)
df[mask] # it gives me duplicated rows

第二个是考虑一个阈值,对于df1中的每一行,该阈值在df2中的行中找到潜在的匹配。

数据示例:请注意,数据集具有不同长度的

对于df1

Check
how to join to first row
large data work flows
I have two dataframes
fix grammatical or spelling errors
indent code by 4 spaces
why are you posting here?
add language identifier
my dad loves watching football 

对于df2

Check
small data work flows
I have tried to puzze out an answer
mix grammatical or spelling errors
indent code by 2 spaces
indent code by 8 spaces
put returns between paragraphs
add curry on the chicken curry
mom!! mom!! mom!!
create code fences with backticks
are you crazy? 
Trump did not win the last presidential election

为了做到这一点,我使用了以下功能:

def check(df1, thres, col):
matches = df1.apply(lambda row: ((fuzz.ratio(row['Check'], col) / 100.0) >= thres), axis=1)
return [df1. Check[i] for i, x in enumerate(matches) if x]

这将允许我查找匹配的行。

第二种方法(我最感兴趣的方法(的问题是,它实际上没有考虑这两个数据帧。

我对第一个函数的期望值是两个数据帧,一个用于df1,另一个用于df2,具有一个额外的列,该列包括与其他数据帧中的数据帧相比每行发现的相似性;从第二段代码中,我应该只为它们分配一个相似性值(我应该有和行数一样多的列(。

如果您需要任何进一步的信息或需要更多的代码,请告诉我。也许有一种更容易的方法来确定这种相似性,但不幸的是,我还没有找到。

欢迎提出任何建议。

预期输出:

(这是一个例子;由于我正在设置阈值,输出可能会改变(

df1

Check                             sim
how to join to first row         []
large data work flows            [small data work flows]
I have two dataframes            []
fix grammatical or spelling errors [mix grammatical or spelling errors]
indent code by 4 spaces          [indent code by 2 spaces, indent code by 8 spaces]
why are you posting here?        []
add language identifier          []
my dad loves watching football   []

df2

Check                             sim
small data work flows                [large data work flows]
I have tried to puzze out an answer   []
mix grammatical or spelling errors    [fix grammatical or spelling errors]
indent code by 2 spaces               [indent code by 4 spaces]
indent code by 8 spaces               [indent code by 4 spaces]
put returns between paragraphs        []
add curry on the chicken curry        []
mom!! mom!! mom!!                     []
create code fences with backticks     []
are you crazy?                        []
Trump did not win the last presidential election    []

我认为你的fuzzywuzzy解决方案非常好。我已经实现了您在下面寻找的内容。这将以len(df1)*len(df2)的形式增长,这相当占用内存,但至少应该相当清楚。您可能会发现gen_scores的输出也很有用。

from fuzzywuzzy import fuzz 
from itertools import product
def gen_scores(df1, df2):
# generates a score for all row combinations between dfs
df_score = pd.DataFrame(product(df1.Check, df2.Check), columns=["c1", "c2"])
df_score["score"] = df_score.apply(lambda row: (fuzz.ratio(row["c1"], row["c2"]) / 100.0), axis=1)
return df_score
def get_matches(df1, df2, thresh=0.5):
# get all matches above a threshold, appended as list to each df
df = gen_scores(df1, df2)
df = df[df.score > thresh]
matches = df.groupby("c1").c2.apply(list)
df1 = pd.merge(df1, matches, how="left", left_on="Check", right_on="c1")
df1 = df1.rename(columns={"c2":"matches"})
df1.loc[df1.matches.isnull(), "matches"] = df1.loc[df1.matches.isnull(), "matches"].apply(lambda x: [])
matches = df.groupby("c2").c1.apply(list)
df2 = pd.merge(df2, matches, how="left", left_on="Check", right_on="c2")
df2 = df2.rename(columns={"c1":"matches"})
df2.loc[df2.matches.isnull(), "matches"] = df2.loc[df2.matches.isnull(), "matches"].apply(lambda x: [])
return (df1, df2)
# call code:
df1_match, df2_match = get_matches(df1, df2, thresh=0.5)

输出:

Check                                            matches
0                           how to join to first row                                                 []
1                              large data work flows                            [small data work flows]
2                              I have two dataframes                                                 []
3  fix grammatical or spelling errors [mix gramma...               [mix grammatical or spelling errors]
4                            indent code by 4 spaces  [indent code by 2 spaces, indent code by 8 spa...
5                          why are you posting here?                                   [are you crazy?]
6                            add language identifier                                                 []
7                     my dad loves watching football                                                 []

最新更新