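I have 4 columns: BusinessID, NAME, BusinessID_y, and NAME_y. I want to match NAME against NAME_y with a similarity threshold of 90%, and delete the rows whose similarity is below 90%. Sample input: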
df
BusinessID NAME BusinessID_y NAME_y
1013120869 MANOJ WANKHADE 1013404164 SLIMI
1013120869 MANOJ WANKHADE 1013831688 AMOL SHAHAKAR
1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL
1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL
1013120869 MANOJ WANKHADE 1013478922 AMBRISH PANDRIKAR
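I am new to Python and don't know how to do this. Also, I have 500,000 records, so any other approach, such as a faster fuzzy-matching method, would be great.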
>>> import pandas as pd
>>> import rapidfuzz
>>> df['matching_ratio'] = df.apply(lambda x: rapidfuzz.fuzz.ratio(x.NAME, x.NAME_y), axis=1)
>>> df
BusinessID NAME BusinessID_y NAME_y matching_ratio
0 1013120869 MANOJ WANKHADE 1013404164 SLIMI 10.526316
1 1013120869 MANOJ WANKHADE 1013831688 AMOL SHAHAKAR 44.444444
2 1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL 25.806452
3 1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL 25.806452
4 1013120869 MANOJ WANKHADE 1013478922 AMBRISH PANDRIKAR 38.709677
>>> df[df.matching_ratio > 26]  # 26 is only for demonstration; change it to 90 to match your requirement
BusinessID NAME BusinessID_y NAME_y matching_ratio
1 1013120869 MANOJ WANKHADE 1013831688 AMOL SHAHAKAR 44.444444
4 1013120869 MANOJ WANKHADE 1013478922 AMBRISH PANDRIKAR 38.709677