Calculating the Jaccard similarity of a column in Python



I am trying to pick out the rows whose Jaccard similarity score is 80% or higher. I have an Excel file with thousands of rows, like this:

id  letters
1   A,B,G
2   B,G
21  C,D
14  A,B,C,D
46  F,E,G,J
35  X,D,F,A,K
3   X,D,F,K,A
5   G,E,F,K,J
56  G,F,E,J,K
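Assuming the file has already been loaded into a pandas DataFrame (e.g. with `pd.read_excel`; the file name would be yours), the comma-separated cells can be split into sets, which is the form the Jaccard computation needs. A minimal sketch with two of the sample rows:

```python
import pandas as pd

# Two of the sample rows; in practice the frame would come from pd.read_excel on the real file
df = pd.DataFrame({"id": [1, 2], "letters": ["A,B,G", "B,G"]})

# Split each comma-separated string into a set of letters
df["letter_sets"] = df["letters"].str.split(",").apply(set)
print([sorted(s) for s in df["letter_sets"]])  # [['A', 'B', 'G'], ['B', 'G']]
```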

This is what I have so far.

import pandas as pd

data = {
    'id': [1, 2, 21, 14, 46, 35, 3, 5, 56],
    'letters': ['A,B,G', 'B,G', 'C,D', 'A,B,C,D', 'F,E,G,J', 'X,D,F,A,K', 'X,D,F,K,A', 'G,E,F,K,J', 'G,F,E,J,K']
}
df = pd.DataFrame(data)

def jaccard_similarity(list1, list2):
    intersection = len(set(list1).intersection(list2))
    union = len(list1) + len(list2) - intersection
    return intersection / union

new_df = pd.DataFrame(columns=['id', 'letters', 'sub-id'])
for index, row in df.iterrows():
    # save all the similarity scores in a list
    similarity = []
    for index2, row2 in df.iterrows():
        similarity.append(jaccard_similarity(row['letters'].split(','), row2['letters'].split(',')))
    # find the indexes of the rows with a similarity score of at least 0.8
    indexes = [i for i, x in enumerate(similarity) if x >= 0.8]
    # if more than one row has a similarity score of at least 0.8
    if len(indexes) > 1:
        # add the row to the new data frame
        new_df = pd.concat([new_df, pd.DataFrame(
            [[row['id'], row['letters'], ','.join(str(df.iloc[i]['id']) for i in indexes)]],
            columns=['id', 'letters', 'sub-id'])], ignore_index=True)
# order the rows by id
new_df = new_df.sort_values(by=['id'])
# then order by sub-id length
new_df = new_df.sort_values(by=['sub-id'], key=lambda x: x.str.len(), ascending=True)
# keep only one row per sub-id group
new_df = new_df.drop_duplicates(subset=['sub-id'])

Output:

id    letters   sub-id
2  3  X,D,F,K,A     35,3
3  5  G,E,F,K,J  46,5,56

Right now my output is slightly different from yours, and I am still trying to figure out why!

I hope this helps.

Basically, the idea here is to use a nested loop to check each row's similarity score against every other row of the df.
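The nested-loop idea can be sketched on its own with a pairwise score matrix. The snippet below uses a plain set-based Jaccard score for illustration (the `jaccard` helper name is mine, not from the original code):

```python
def jaccard(a, b):
    # set-based Jaccard on comma-separated letter strings
    s1, s2 = set(a.split(',')), set(b.split(','))
    return len(s1 & s2) / len(s1 | s2)

letters = ['A,B,G', 'B,G', 'C,D']
# nested loop: score every row against every other row
scores = [[jaccard(x, y) for y in letters] for x in letters]
print(scores[0][1])  # jaccard('A,B,G', 'B,G') = 2/3
```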

To check the similarity, I used the thefuzz package.

fuzz.token_sort_ratio('G,E,F,K,J', 'F,E,G,J')    #Out 88
fuzz.token_sort_ratio('A,B,C,D', 'C,D')          #Out 60
fuzz.token_sort_ratio('B,G','A,B,G')             #Out 75
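Note that `token_sort_ratio` is a character-level edit-distance ratio, not a set-based Jaccard score, so the two metrics can disagree on the same pairs. For comparison (the `jaccard` helper name is mine):

```python
def jaccard(a, b):
    # set-based Jaccard on comma-separated letter strings
    s1, s2 = set(a.split(',')), set(b.split(','))
    return len(s1 & s2) / len(s1 | s2)

print(jaccard('G,E,F,K,J', 'F,E,G,J'))  # 0.8  (4 shared letters, 5 in the union)
print(jaccard('A,B,C,D', 'C,D'))        # 0.5
print(jaccard('B,G', 'A,B,G'))          # about 0.667
```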

Code:

from thefuzz import fuzz

data = []
for i in range(len(df)):
    # boolean mask: rows whose token_sort_ratio to row i is at least 80
    mask = [fuzz.token_sort_ratio(df['letters'][i], df['letters'][j]) >= 80 for j in range(len(df))]
    new = df[mask]
    sub_id = new['ID'].tolist()
    row = [int(sub_id[0]), max(new['letters'].tolist()), ','.join(str(int(x)) for x in sub_id)]
    # keep the group only if it has more than one member and is not already recorded
    if len(new) > 1 and row not in data:
        data.append(row)

pd.DataFrame(data, columns=['ID', 'Letters', 'Sub-id'])

Output:

   ID    Letters   Sub-id
0  46  G,E,F,K,J  46,5,56
1  35  X,D,F,A,K     35,3

Another possible way

is to rewrite the letters column first, and then group with apply.

Code:

import itertools

# for each row, collect every letters value whose similarity to it is at least 80
letters = [[df['letters'][j] for j, ok in enumerate(
    [fuzz.token_sort_ratio(s, other) >= 80 for other in df['letters']]) if ok]
    for s in df['letters'].tolist()]
# drop duplicate groups
letters.sort()
letters = [k for k, _ in itertools.groupby(letters)]
df['ID'] = df['ID'].astype("string")
# replace each letters value with a single representative (the max string) of its group
df['letters'] = df['letters'].apply(lambda x: max(next(g for g in letters if x in g)))
# join the IDs of the rows that ended up with the same representative
df[df.duplicated(['letters'], keep=False)].groupby('letters')['ID'].apply(','.join).reset_index()

Output:

letters     ID
0   G,E,F,K,J   46,5,56
1   X,D,F,A,K   35,3

Note: install python-Levenshtein to avoid the warning from thefuzz.
