I want to use the fuzzywuzzy package on the following table:
x Reference amount
121 TOR1234 500
121 T0R1234 500
121 W7QWER 500
121 W1QWER 500
141 TRYCATC 700
141 TRYCATC 700
151 I678MKV 300
151 1678MKV 300
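- I want to group the table on rows where the columns "x" and "amount" match.
- For each reference in a group, compare it (fuzzily) with the other references in that group:
  a. If the match is 100%, remove it.
  b. If the match is 90-99.99%, keep them.
  c. For that particular row, remove anything below a 90% match.

Expected output: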
x Reference amount
151 I678MKV 300
151 1678MKV 300
121 TOR1234 500
121 T0R1234 500
121 W7QWER 500
121 W1QWER 500
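This is to detect fraudulent entries: as in the table, "1" is replaced by "I" and "0" is replaced by "O". If you have any alternative solution, please suggest it.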
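The percentage bands in the rules above can be sanity-checked before building anything around them. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for a ratio-style fuzzy score; the helper name `similarity_pct` is hypothetical, and the exact numbers from fuzzywuzzy may differ depending on the scorer chosen.

```python
from difflib import SequenceMatcher


def similarity_pct(a: str, b: str) -> float:
    """Ratio-style similarity in percent: 2 * matched chars / (len(a) + len(b))."""
    return 100 * SequenceMatcher(None, a, b).ratio()


pairs = [("TOR1234", "T0R1234"), ("W7QWER", "W1QWER"), ("TRYCATC", "TRYCATC")]
for a, b in pairs:
    print(a, b, round(similarity_pct(a, b), 1))
```

On this scale a single swapped character in a 6 or 7 character reference scores in the mid 80s, so a 90% lower bound would probably discard the very pairs the rules are meant to keep. A lower cut-off, or a cap on the number of edits, may fit short references better.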
From what I understand, you don't need the fuzzywuzzy package to remove the exact matches. Simply use drop_duplicates with keep=False:
import pandas as pd

df = pd.DataFrame(data={"x": [121, 121, 121, 121, 141, 141, 151, 151],
                        "Refrence": ["TOR1234", "T0R1234", "W7QWER", "W1QWER",
                                     "TRYCATC", "TRYCATC", "I678MKV", "1678MKV"],
                        "amount": [500, 500, 500, 500, 700, 700, 300, 300]})
# keep=False drops every copy of an exactly duplicated row, not just the repeats
res = df.drop_duplicates(['x', 'Refrence', 'amount'], keep=False).sort_values(['x'], ascending=[False])
print(res)
x Refrence amount
6 151 I678MKV 300
7 151 1678MKV 300
0 121 TOR1234 500
1 121 T0R1234 500
2 121 W7QWER 500
3 121 W1QWER 500
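To make the role of keep=False concrete, here is a minimal sketch on a reduced copy of the data (the variable names `first_kept` and `none_kept` are only for illustration). It contrasts the default behaviour of drop_duplicates with keep=False.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [141, 141, 151, 151],
    "Refrence": ["TRYCATC", "TRYCATC", "I678MKV", "1678MKV"],
    "amount": [700, 700, 300, 300],
})

cols = ["x", "Refrence", "amount"]
first_kept = df.drop_duplicates(cols)             # default keep="first": one TRYCATC row survives
none_kept = df.drop_duplicates(cols, keep=False)  # every copy of a duplicated row is dropped

print(first_kept["Refrence"].tolist())  # ['TRYCATC', 'I678MKV', '1678MKV']
print(none_kept["Refrence"].tolist())   # ['I678MKV', '1678MKV']
```

With the default, one of the 100% matches would remain in the result, which is not what the expected output in the question shows.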
Then apply an edit distance (Damerau-Levenshtein in the code below) to the references that share the same x:
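from itertools import combinations
# python-string-similarity; newer releases are, as far as I know, published as strsimpy
from similarity.damerau import Damerau

# Damerau-Levenshtein: a swap of two adjacent characters also counts as one edit
levenshtien = Damerau()
# every unordered pair of references that survived drop_duplicates
data = list(combinations(res['Refrence'], 2))
refrence_df = pd.DataFrame(data, columns=['Refrence', 'Refrence2'])
# attach the x value belonging to each side of the pair
refrence_df = pd.merge(refrence_df, df[['x', 'Refrence']], on=['Refrence'], how='left')
refrence_df = pd.merge(refrence_df, df[['x', 'Refrence']], left_on=['Refrence2'], right_on=['Refrence'], how='left')
refrence_df.rename(columns={'x_x': 'x_1', 'x_y': 'x_2', 'Refrence_x': 'Refrence'}, inplace=True)
refrence_df.drop(['Refrence_y'], axis=1, inplace=True)
# keep only pairs from the same x; .copy() avoids SettingWithCopyWarning below
refrence_df = refrence_df[refrence_df['x_1'] == refrence_df['x_2']].copy()
refrence_df['edit_required'] = refrence_df.apply(
    lambda x: levenshtien.distance(x['Refrence'], x['Refrence2']), axis=1)
# characters of Refrence that do not occur anywhere in Refrence2
refrence_df['characters_not_common'] = refrence_df.apply(
    lambda x: list(set(x['Refrence']) - set(x['Refrence2'])), axis=1)
print(refrence_df)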
Refrence Refrence2 x_1 x_2 edit_required characters_not_common
0 I678MKV 1678MKV 151 151 1 [I]
9 TOR1234 T0R1234 121 121 1 [O]
10 TOR1234 W7QWER 121 121 7 [O, T, 1, 3, 2, 4]
11 TOR1234 W1QWER 121 121 7 [O, T, 3, 2, 4]
12 T0R1234 W7QWER 121 121 7 [T, 1, 0, 3, 2, 4]
13 T0R1234 W1QWER 121 121 7 [T, 0, 3, 2, 4]
14 W7QWER W1QWER 121 121 1 [7]
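The table above still lists every pair within an x. One possible last step is to keep only the pairs that are a small number of edits apart. The sketch below does this with the standard library only: it groups on (x, amount), uses a plain Levenshtein implementation instead of the Damerau class, and applies a hypothetical cut-off `MAX_EDITS = 1`. These choices are assumptions for illustration, not part of the answer's method.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


rows = [
    (121, "TOR1234", 500), (121, "T0R1234", 500),
    (121, "W7QWER", 500), (121, "W1QWER", 500),
    (141, "TRYCATC", 700), (141, "TRYCATC", 700),
    (151, "I678MKV", 300), (151, "1678MKV", 300),
]

MAX_EDITS = 1  # hypothetical cut-off; tune it to the length of the references

# Group the references by (x, amount); a set collapses exact duplicates,
# so a 100% match never forms a pair.
groups = defaultdict(set)
for x, ref, amount in rows:
    groups[(x, amount)].add(ref)

suspicious = []
for (x, amount), refs in groups.items():
    for a, b in combinations(sorted(refs), 2):
        d = levenshtein(a, b)
        if 0 < d <= MAX_EDITS:
            suspicious.append((x, a, b, d))

for item in suspicious:
    print(item)
```

For the sample data this leaves the three look-alike pairs (T0R1234/TOR1234, W1QWER/W7QWER, 1678MKV/I678MKV), which cover the six rows in the expected output.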