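I am writing a Rails app that fetches the RSS feed of a news page, applies part-of-speech tagging to the headlines, and extracts the noun phrases from the headlines along with the number of times each one occurs. I need to filter out noun phrases that are part of other noun phrases, and I am using this code to do so: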
filtered_noun_phrases = sorted_noun_phrases.select{|a|
sorted_noun_phrases.keys.any?{|b| b != a and a.index(b) } }.to_h
So this:
{"troops retake main government office"=>2,
"retake main government office"=>2, "main government office"=>2}
should become just:
{"troops retake main government office"=>2}
However, a sorted hash of noun phrases such as this:
{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
"retake main government office"=>2, "mosul retake government base"=>2,
"toddler killer shot dead"=>2, "students fighting racism"=>2,
"retake government base"=>2, "main government office"=>2,
"white house tourists"=>2, "horn at french zoo"=>2, "government office"=>2,
"cia hacking tools"=>2, "killer shot dead"=>2, "government base"=>2,
"boko haram teen"=>2, "horn chainsawed"=>2, "fighting racism"=>2,
"silver surfers"=>2, "house tourists"=>2, "natural causes"=>2,
"george michael"=>2, "instagram fame"=>2, "hacking tools"=>2,
"iraqi forces"=>2, "mosul battle"=>2, "own wedding"=>2, "french zoo"=>2,
"haram teen"=>2, "hacked tvs"=>2, "shot dead"=>2}
is only partially filtered, and comes out as:
{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
"retake main government office"=>2, "mosul retake government base"=>2,
"toddler killer shot dead"=>2, "students fighting racism"=>2,
"retake government base"=>2, "main government office"=>2,
"white house tourists"=>2, "horn at french zoo"=>2,
"cia hacking tools"=>2, "killer shot dead"=>2,
"boko haram teen"=>2}
So how do I filter duplicate substrings out of a hash in a way that actually works?
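What you are currently doing is selecting every phrase for which some other phrase exists that is a substring of it.
For "troops retake main government office" this is true, because we find "retake main government office".
But for "retake main government office" we still find "main government office", so it is not filtered out either.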
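To see this concretely, here is a small sketch that runs the question's `select` on the three-phrase hash from the question. The names `phrases` and `kept_by_select` are illustrative and not part of the original code.

```ruby
# The three-phrase example from the question.
phrases = {
  "troops retake main government office" => 2,
  "retake main government office" => 2,
  "main government office" => 2
}

# Same logic as the question's select: keep a phrase if it CONTAINS
# some other phrase from the hash.
kept_by_select = phrases.select do |a, _count|
  phrases.keys.any? { |b| b != a && a.index(b) }
end

puts kept_by_select.keys
# Only "main government office" is dropped, because it is the only
# phrase that contains no other phrase.
```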
Instead, you can do something like this:
filtered_noun_phrases = sorted_noun_phrases.reject{|a| sorted_noun_phrases.keys.any?{|b| b != a and b.index(a) } }.to_h
This rejects every phrase for which some other phrase is found that contains it.
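As a self-contained sketch of the `reject` approach, the snippet below uses a handful of phrases taken from the question's data. The names `phrases` and `filtered` are illustrative, and `include?` is used here as a readable stand-in for the `index` check; both test for a plain substring.

```ruby
# A few phrases taken from the question's larger hash.
phrases = {
  "chinese students fighting racism" => 2,
  "students fighting racism" => 2,
  "fighting racism" => 2,
  "george michael" => 2
}

# Drop a phrase if some OTHER phrase contains it as a substring.
filtered = phrases.reject do |phrase, _count|
  phrases.keys.any? { |other| other != phrase && other.include?(phrase) }
end

puts filtered.keys
```

With this input, only "chinese students fighting racism" and "george michael" should remain, since neither appears inside any other phrase.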
- trueunlessfalse