Filtering duplicate substrings from a hash in Ruby



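I'm writing a Rails application that fetches the RSS feed of a news page, applies part-of-speech tagging to the headlines, extracts the noun phrases from the headlines, and counts how many times each one occurs. I need to filter out noun phrases that are part of other noun phrases, and I am using this code to do so: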

filtered_noun_phrases = sorted_noun_phrases.select{|a|
  sorted_noun_phrases.keys.any?{|b| b != a and a.index(b) } }.to_h

So this:

{"troops retake main government office"=>2,
 "retake main government office"=>2, "main government office"=>2}

should become just:

{"troops retake main government office"=>2}

However, a hash of noun phrases such as this:

{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
 "retake main government office"=>2, "mosul retake government base"=>2,
 "toddler killer shot dead"=>2, "students fighting racism"=>2,
 "retake government base"=>2, "main government office"=>2,
 "white house tourists"=>2, "horn at french zoo"=>2, "government office"=>2,
 "cia hacking tools"=>2, "killer shot dead"=>2, "government base"=>2,
 "boko haram teen"=>2, "horn chainsawed"=>2, "fighting racism"=>2,
 "silver surfers"=>2, "house tourists"=>2, "natural causes"=>2,
 "george michael"=>2, "instagram fame"=>2, "hacking tools"=>2,
 "iraqi forces"=>2, "mosul battle"=>2, "own wedding"=>2, "french zoo"=>2,
 "haram teen"=>2, "hacked tvs"=>2, "shot dead"=>2}

instead gets only partially filtered:

{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
 "retake main government office"=>2, "mosul retake government base"=>2,
 "toddler killer shot dead"=>2, "students fighting racism"=>2,
 "retake government base"=>2, "main government office"=>2,
 "white house tourists"=>2, "horn at french zoo"=>2,
 "cia hacking tools"=>2, "killer shot dead"=>2,
 "boko haram teen"=>2}

So, how do I filter duplicate substrings out of a hash in a way that actually works?

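What you are currently doing is selecting every phrase for which there exists some other phrase that is a substring of it.

For "troops retake main government office" this is true, because we find "retake main government office" inside it.

However, for "retake main government office" we still find "main government office" inside it, so that phrase is selected as well and is not filtered out.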
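To make the behaviour described above concrete, here is a minimal sketch that runs the question's `select` logic on the three-phrase example. The variable names `phrases` and `kept` are illustrative, and `include?` is used in place of `index` purely for readability.

```ruby
# The small example hash from the question.
phrases = {
  "troops retake main government office" => 2,
  "retake main government office" => 2,
  "main government office" => 2
}

# Question's approach: keep a phrase if some *other* phrase occurs inside it.
kept = phrases.select do |a, _count|
  phrases.keys.any? { |b| b != a && a.include?(b) }
end

# The middle phrase survives because it contains "main government office".
puts kept.keys
```

Only the shortest phrase is dropped here, since it is the only one that contains no other phrase.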

Instead, you could do something like this:

 filtered_noun_phrases = sorted_noun_phrases.reject{|a| sorted_noun_phrases.keys.any?{|b| b != a and b.index(a) } }.to_h

This rejects every phrase for which there exists another phrase that contains it.

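As a rough, self-contained illustration of the `reject` approach, the sketch below applies it to a few entries taken from the question's hash. The helper name `drop_contained_phrases` and the variable `sample` are hypothetical and only serve the example.

```ruby
# Hypothetical helper (not part of the original answer): drops every phrase
# that appears as a substring of a different phrase in the same hash.
def drop_contained_phrases(counts)
  keys = counts.keys
  counts.reject do |phrase, _count|
    keys.any? { |other| other != phrase && other.include?(phrase) }
  end
end

# A subset of the noun phrases from the question.
sample = {
  "troops retake main government office" => 2,
  "retake main government office" => 2,
  "main government office" => 2,
  "government office" => 2,
  "toddler killer shot dead" => 2,
  "killer shot dead" => 2,
  "shot dead" => 2,
  "silver surfers" => 2
}

filtered = drop_contained_phrases(sample)
puts filtered.keys
```

Note that `include?` compares raw substrings rather than whole words, and that every phrase is checked against every other phrase, so the cost grows quadratically with the number of phrases.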

- trueunlessfalse
