I'm trying to extract the rows whose Jaccard similarity score is greater than or equal to 0.8 (80%). I have an Excel file with thousands of rows, like this:

id | letters |
---|---|
1 | A,B,G |
2 | B,G |
21 | C,D |
14 | A,B,C,D |
46 | F,E,G,J |
35 | X,D,F,A,K |
3 | X,D,F,K,A |
5 | G,E,F,K,J |
56 | G,F,E,J,K |

This is what I have so far:
import pandas as pd

data = {
    'id': [1, 2, 21, 14, 46, 35, 3, 5, 56],
    'letters': ['A,B,G', 'B,G', 'C,D', 'A,B,C,D', 'F,E,G,J', 'X,D,F,A,K', 'X,D,F,K,A', 'G,E,F,K,J', 'G,F,E,J,K']
}
df = pd.DataFrame(data)

def jaccard_similarity(list1, list2):
    intersection = len(set(list1).intersection(list2))
    union = (len(list1) + len(list2)) - intersection
    return float(intersection / union)

new_df = pd.DataFrame(columns=['id', 'letters', 'sub-id'])
for index, row in df.iterrows():
    similarity = []
    # save all the similarity scores in a list
    for index2, row2 in df.iterrows():
        similarity.append(jaccard_similarity(row['letters'].split(','), row2['letters'].split(',')))
    # find the indexes of the rows that have a similarity score of at least 0.8
    indexes = [i for i, x in enumerate(similarity) if x >= 0.8]
    # if there is more than one row with a similarity score of at least 0.8
    if len(indexes) > 1:
        # add the row to the new data frame
        new_df = pd.concat([new_df, pd.DataFrame(
            [[row['id'], row['letters'], ','.join(str(df.iloc[i]['id']) for i in indexes)]],
            columns=['id', 'letters', 'sub-id'])], ignore_index=True)

# order the rows by id
new_df = new_df.sort_values(by=['id'])
# order by sub-id size
new_df = new_df.sort_values(by=['sub-id'], key=lambda x: x.str.len(), ascending=True)
# keep only one row per sub-id group
new_df = new_df.drop_duplicates(subset=['sub-id'])
Output:

   id    letters   sub-id
2   3  X,D,F,K,A     35,3
3   5  G,E,F,K,J  46,5,56

Right now my output is slightly different from yours, and I'm still trying to work out why!
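As a quick sanity check of the jaccard_similarity helper, here is a standalone copy with a few hand-computed values on the sample rows:

```python
# standalone copy of the jaccard_similarity helper from the question
def jaccard_similarity(list1, list2):
    intersection = len(set(list1).intersection(list2))
    union = (len(list1) + len(list2)) - intersection
    return intersection / union

# identical letter sets -> 1.0
print(jaccard_similarity('X,D,F,A,K'.split(','), 'X,D,F,K,A'.split(',')))  # 1.0
# 4 shared letters out of 5 distinct -> exactly the 0.8 cutoff
print(jaccard_similarity('F,E,G,J'.split(','), 'G,E,F,K,J'.split(',')))    # 0.8
# 2 shared letters out of 3 distinct -> about 0.67, below the cutoff
print(jaccard_similarity('A,B,G'.split(','), 'B,G'.split(',')))
```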
I hope this helps.

Basically, the idea here is to use a nested loop to check the similarity score of each row against every other row of the df.

To check the similarity I used the thefuzz package, e.g.:
fuzz.token_sort_ratio('G,E,F,K,J', 'F,E,G,J')  # Out: 88
fuzz.token_sort_ratio('A,B,C,D', 'C,D')        # Out: 60
fuzz.token_sort_ratio('B,G', 'A,B,G')          # Out: 75
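For intuition about where these numbers come from: token_sort_ratio sorts each string's tokens, rejoins them, and scores the two normalized strings by the fraction of matched characters. A stdlib-only approximation with difflib (an illustration, not thefuzz's actual backend, though it happens to reproduce the three scores above):

```python
import re
from difflib import SequenceMatcher

def token_sort_approx(a, b):
    # sort the alphanumeric tokens, rejoin with spaces, then compare
    norm = lambda s: ' '.join(sorted(re.findall(r'\w+', s)))
    return round(SequenceMatcher(None, norm(a), norm(b)).ratio() * 100)

print(token_sort_approx('G,E,F,K,J', 'F,E,G,J'))  # 88
print(token_sort_approx('A,B,C,D', 'C,D'))        # 60
print(token_sort_approx('B,G', 'A,B,G'))          # 75
```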
Code:
from thefuzz import fuzz

data = []
for i in range(len(df)):
    # boolean mask: rows whose token_sort_ratio against row i is at least 80
    mask = [fuzz.token_sort_ratio(df['letters'][i], df['letters'][j]) >= 80 for j in range(len(df))]
    new = df[mask]
    sub_id = new['id'].tolist()
    # keep the longest letters string of the group as the representative
    row = [int(sub_id[0]), max(new['letters'].tolist(), key=len), ','.join(str(int(x)) for x in sub_id)]
    if len(new) > 1 and row not in data:
        data.append(row)
pd.DataFrame(data, columns=['id', 'letters', 'sub-id'])
Output:

   id    letters   sub-id
0  46  G,E,F,K,J  46,5,56
1  35  X,D,F,A,K     35,3
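If you would rather avoid the extra dependency, the two groups in this output also fall out of the plain Jaccard helper from the question with the 0.8 cutoff. A minimal stdlib-only sketch (a greedy single pass, so it assumes the >= 0.8 links already form clean groups, as they do in this sample):

```python
def jaccard(a, b):
    sa, sb = set(a.split(',')), set(b.split(','))
    return len(sa & sb) / len(sa | sb)

rows = [(1, 'A,B,G'), (2, 'B,G'), (21, 'C,D'), (14, 'A,B,C,D'),
        (46, 'F,E,G,J'), (35, 'X,D,F,A,K'), (3, 'X,D,F,K,A'),
        (5, 'G,E,F,K,J'), (56, 'G,F,E,J,K')]

groups, seen = [], set()
for rid, letters in rows:
    if rid in seen:
        continue
    # every row whose Jaccard score against this row reaches the cutoff
    group = [r for r, l in rows if jaccard(letters, l) >= 0.8]
    if len(group) > 1:
        groups.append(group)
        seen.update(group)

print(groups)  # [[46, 5, 56], [35, 3]]
```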
Another possible way: rewrite the letters column first, then group by it.

Code:
import itertools

# for each row, collect every letters string whose score against it is >= 80
letters = [[other for other in df['letters'] if fuzz.token_sort_ratio(s, other) >= 80]
           for s in df['letters'].tolist()]
letters.sort()
# drop consecutive duplicates so each group appears only once
letters = list(k for k, _ in itertools.groupby(letters))
df['id'] = df['id'].astype("string")
# replace each letters value with the longest member of its group
df['letters'] = df['letters'].apply(lambda x: max(next(g for g in letters if x in g), key=len))
df[df.duplicated(['letters'], keep=False)].groupby('letters')['id'].apply(','.join).reset_index()
Output:

     letters       id
0  G,E,F,K,J  46,5,56
1  X,D,F,A,K     35,3
Note: install python-Levenshtein to avoid the thefuzz warning.