I have a raw file and a master file, for example:
raw_file
{'resident', 'gulf corp', 'international', 'perl', 'mntain valley'}
master_file
{'mountain valley', 'gulf corp', 'president', 'national', 'perl'}
I want to find similar strings across the two files. I used fuzz.ratio in Python.
My output is as follows:
resident - president - 98,
gulf corp - gulf corp - 100,
international - national - 85,
perl - perl - 100,
mntain valley - mountain valley - 87
Required output:
resident
gulf corp - gulf corp - 100,
international
perl - perl - 100,
mntain valley - mountain valley - 87
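Requirement: when a name in the raw file is meaningful, i.e. correctly spelled with no typos, only a 100% match should be accepted; if no exact match is found, it should return empty.
Is there any way to do this?
I considered restricting the comparison to the first word, but that does not help in cases like the following: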
apple one - one appel
ratio = (fuzz.ratio(str1,str2))
In R, you can simply check for equality:
raw_file = c('resident', 'gulf corp', 'international', 'perl', 'mntain valley')
master_file = c('mountain valley', 'gulf corp', 'president', 'national', 'perl')
df = data.frame(raw=raw_file,master=master_file,
match=ifelse(raw_file==master_file,"100",""),stringsAsFactors = FALSE)
> df
            raw          master match
1      resident mountain valley
2     gulf corp       gulf corp   100
3 international       president
4          perl        national
5 mntain valley            perl
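Since the question uses Python, the same position-wise comparison could be sketched there as well. This is only a rough equivalent of the R snippet above, using plain lists instead of a data frame; the variable name match mirrors the R column and is otherwise arbitrary.

```python
raw_file = ['resident', 'gulf corp', 'international', 'perl', 'mntain valley']
master_file = ['mountain valley', 'gulf corp', 'president', 'national', 'perl']

# Compare the two lists element by element, like R's vectorised ==
match = ['100' if raw == master else ''
         for raw, master in zip(raw_file, master_file)]

# Print one row per pair, left-aligned in fixed-width columns
for raw, master, flag in zip(raw_file, master_file, match):
    print(f"{raw:<15}{master:<17}{flag}")
```

As in the R version, only names that sit at the same position in both lists are flagged.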
If the position of the matching word does not matter, change == to %in%:
> df = data.frame(raw=raw_file,master=master_file,
+ match=ifelse(raw_file%in%master_file,"100",""),stringsAsFactors = FALSE)
> df
            raw          master match
1      resident mountain valley
2     gulf corp       gulf corp   100
3 international       president
4          perl        national   100
5 mntain valley            perl
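To get closer to the required output in Python, one possible approach is to try an exact membership test first (the counterpart of %in%) and fall back to fuzzy matching only for names that look misspelled. The sketch below makes several assumptions that are not in the question: known_words is a hypothetical stand-in for a real dictionary or spell checker, the threshold of 80 is arbitrary, and similarity uses difflib.SequenceMatcher from the standard library rather than fuzz.ratio, so the scores will not be identical to the ones shown in the question.

```python
from difflib import SequenceMatcher

raw_file = ['resident', 'gulf corp', 'international', 'perl', 'mntain valley']
master_file = ['mountain valley', 'gulf corp', 'president', 'national', 'perl']

# Assumption: a hand-made word list standing in for a real dictionary.
# Names made up only of these words count as "correctly spelled".
known_words = {'resident', 'president', 'international', 'national',
               'gulf', 'corp', 'perl', 'mountain', 'valley'}


def similarity(a, b):
    """Return a 0-100 similarity score based on difflib (not fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(name, candidates, known, threshold=80):
    """Return (match, score), or None when no acceptable match exists."""
    # Exact match: the Python counterpart of R's %in%
    if name in candidates:
        return name, 100
    # A correctly spelled name without an exact match returns nothing
    if all(word in known for word in name.split()):
        return None
    # Otherwise assume a typo and take the closest candidate
    score, match = max((similarity(name, c), c) for c in candidates)
    return (match, score) if score >= threshold else None


master_set = set(master_file)
results = {name: best_match(name, master_set, known_words) for name in raw_file}

for name, result in results.items():
    print(name if result is None else f"{name} - {result[0]} - {result[1]}")
```

With this guard, resident and international are left unmatched because every word in them is known, while mntain valley still falls through to the fuzzy comparison. The quality of the result depends entirely on how good the word list is.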