Creating a hash of matched words and their number of occurrences



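I'm working on a Ruby program that takes a string, compares it against a "dictionary" of words, and returns a hash showing which words matched and how many times each one matched. So far I can iterate over the string and the array, and it prints a string whenever it finds a match, but I can't figure out how to build a hash from the matched words and their counts. Here is the code: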

dictionary = ["below","down","go","going","horn","how","howdy","it","i","low","own","part","partner","sit"]
def substrings (string, dictionary)
  dictionary = dictionary
  words = string.split(/\s+/)
  puts words
  x = 0
  while x < words.length do
    y = 0
    while y < dictionary.length do
      if words[x] == dictionary[y]
        puts "it's working"
      end
      y += 1
    end
    x += 1
  end
end
substrings("let's go down below", dictionary)

Any ideas on how to build the hash would be greatly appreciated. Thanks!

Meditate on this:

'b c c d'.split # => ["b", "c", "c", "d"]
'b c c d'.split.group_by{ |w| w } # => {"b"=>["b"], "c"=>["c", "c"], "d"=>["d"]}
'b c c d'.split.group_by{ |w| w }.map{ |k, v| [k, v.count] } # => [["b", 1], ["c", 2], ["d", 1]]
'b c c d'.split.group_by{ |w| w }.map{ |k, v| [k, v.count] }.to_h # => {"b"=>1, "c"=>2, "d"=>1}

From that we can build:

dictionary = ['b', 'c']
word_count = 'b c c d'.split.group_by{ |w| w }.map{ |k, v| [k, v.count] }.to_h
word_count.values_at(*dictionary) # => [1, 2]

If you want only the key/value pairs that are in the dictionary, that's easy to do:

# Hash#slice is built in from Ruby 2.5 on; the require is only needed on older Rubies.
require 'active_support/core_ext/hash/slice'
word_count.slice(*dictionary) # => {"b"=>1, "c"=>2}

group_by is a very useful method that groups elements by whatever criterion you pass to it. values_at takes a list of keys and returns their corresponding values.
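
As a rough sketch of how those pieces could be wrapped into a single method, here is one possibility. The name dictionary_counts is made up for illustration and is not part of the answer above; it uses only core Ruby (Hash#select) rather than ActiveSupport.

```ruby
# Hypothetical helper, not from the original answer.
def dictionary_counts(string, dictionary)
  # Group identical words together, then replace each group with its size.
  counts = string.split.group_by { |w| w }.map { |k, v| [k, v.count] }.to_h
  # Keep only the pairs whose key appears in the dictionary.
  counts.select { |word, _| dictionary.include?(word) }
end

dictionary_counts('b c c d', ['b', 'c']) # => {"b"=>1, "c"=>2}
```

Words that are in the dictionary but not in the string are simply absent from the result, whereas values_at would return nil for them.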

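There is a potential problem when counting "words", because splitting text into its component substrings does not always yield what we would consider words. For example: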

'how now brown cow.'.split # => ["how", "now", "brown", "cow."]

Note that the trailing punctuation is included in the last word's string. Similarly, compound words and other punctuation cause problems:

'how-now brown, cow.'.split # => ["how-now", "brown,", "cow."]

The task then becomes how to keep those characters from being treated as part of a word. The simple thing is to just strip them out:

'how-now brown, cow.'.gsub(/[^a-z]+/, ' ').split # => ["how", "now", "brown", "cow"]

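However, in these crazy modern times we also see words that contain digits, especially in things like company and program names. You can modify the pattern in the gsub above to handle that, but working out how is left for you to figure out.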

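We also see mixed case, so your dictionary needs to be folded to upper or lower case, and the string under consideration needs to be folded the same way, unless you want separate counts that respect character case: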

word_count = 'b C c d'.downcase.split.group_by{ |w| w }.map{ |k, v| [k, v.count] }.to_h # => {"b"=>1, "c"=>2, "d"=>1}
word_count = 'b C c d'.split.group_by{ |w| w }.map{ |k, v| [k, v.count] }.to_h # => {"b"=>1, "C"=>1, "c"=>1, "d"=>1}
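
Putting the stripping and case-folding steps together might look something like the sketch below. The method name folded_counts is an assumption for illustration, and the pattern only keeps the letters a-z, so digits and non-ASCII letters are discarded.

```ruby
# Hypothetical helper, not from the original answer.
def folded_counts(string, dictionary)
  # Fold the dictionary and the string the same way so comparisons line up.
  folded_dictionary = dictionary.map(&:downcase)
  # Replace every run of non-letters with a space, then split on whitespace.
  words = string.downcase.gsub(/[^a-z]+/, ' ').split
  words.group_by { |w| w }
       .map { |k, v| [k, v.count] }
       .to_h
       .select { |word, _| folded_dictionary.include?(word) }
end

folded_counts('Go DOWN, down below.', ['Down', 'GO']) # => {"go"=>1, "down"=>2}
```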

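Analyzing page content often starts with this sort of code, but many rules have to be written to specify which words are useful and which are junk. And the rules often change from one source to another, because differing uses of words and numbers can quickly ruin the usefulness of your code: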

second
2nd

for instance. It gets "interesting".

Here is another approach:

def substrings (string, dictionary)
  dictionary.each.with_object({}) { |w, h| h[w] = string.scan(/\b#{w}\b/).length }
end
substrings("let's go down below", dictionary)

Output:

{
  "below"   => 1,
  "down"    => 1,
  "go"      => 1,
  "going"   => 0,
  "horn"    => 0,
  "how"     => 0,
  "howdy"   => 0,
  "it"      => 0,
  "i"       => 0,
  "low"     => 0,
  "own"     => 0,
  "part"    => 0,
  "partner" => 0,
  "sit"     => 0
}
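
The zeros for "low" and "own" in that output come from anchoring the pattern at word boundaries with \b; an unanchored scan would find them inside "below" and "down". A small sketch of the difference follows. The name boundary_counts and the Regexp.escape call are additions for illustration, the latter guarding against dictionary entries that contain regex metacharacters.

```ruby
# Hypothetical variant of the scan-based approach.
def boundary_counts(string, dictionary)
  dictionary.each_with_object({}) do |word, counts|
    # \b matches only at a word boundary, so "low" does not match inside "below".
    counts[word] = string.scan(/\b#{Regexp.escape(word)}\b/).length
  end
end

boundary_counts("let's go down below", %w[below low go])
# => {"below"=>1, "low"=>0, "go"=>1}

# Without the anchors, substrings are counted as well:
"let's go down below".scan(/low/).length # => 1
```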

Building on the description of a counting hash given by Cary, your code can be modified slightly, as follows.

dictionary = ["below","down","go","going","horn","how","howdy","it","i","low","own","part","partner","sit"]
def substrings (string, dictionary)
  words = string.split(/\s+/)
  count_hash = Hash.new(0)  # missing keys read as 0
  words.each do |sentence_word|
    dictionary.each do |dictionary_word|
      if sentence_word == dictionary_word
        count_hash[sentence_word] += 1
      end
    end
  end
  return count_hash
end
p substrings("let's go down below", dictionary)

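However, since there is an Array#count method, we can take advantage of it and simplify the code above to something like the following. In this version we don't need a counting hash.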

def substrings (string, dictionary)
  words = string.split(/\s+/)
  count_hash = Hash.new
  dictionary.each do |dictionary_word|
    # Array#count(obj) returns how many elements equal obj.
    if (count = words.count(dictionary_word)) > 0
      count_hash[dictionary_word] = count
    end
  end
  return count_hash
end

You can refer to the other answers for more idiomatic Ruby solutions. If I had to give it a try, here is my version:

def substrings (string, dictionary)
  words = string.split(/\s+/)
  dictionary.map { |d| [d, words.count(d)] }.to_h.reject { |_, v| v == 0 }
end
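
On newer Rubies the same result can be had more directly. This is a different technique from the one above, not a restatement of it: the sketch assumes Ruby 2.7 or later for Enumerable#tally (Hash#slice has been built in since 2.5), and the name tally_counts is made up for illustration.

```ruby
# Hypothetical helper; requires Ruby 2.7+ for Enumerable#tally.
def tally_counts(string, dictionary)
  # tally counts every word once; slice keeps only the dictionary's keys.
  string.split(/\s+/).tally.slice(*dictionary)
end

tally_counts("let's go down below", %w[below down go low])
# => {"below"=>1, "down"=>1, "go"=>1}
```

This counts the words in a single pass instead of calling words.count once per dictionary entry.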

One way is to create what is sometimes called a "counting hash":

h = Hash.new(0)

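The zero here is the "default value". All that means is that if h does not have a key k, h[k] returns zero (but the hash is not changed). You then have: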

h[k] += 1

which expands to:

h[k] = h[k] + 1

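If h has a key k, h[k] on the right-hand side has a value, so Bob's your uncle. If, however, h does not have a key k, h[k] on the right-hand side evaluates to the default value, so the expression becomes: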

h[k] = 0 + 1

Cool, eh?
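
A short demonstration of that behaviour, as a sketch with made-up keys: reading a missing key returns the default without storing anything, while += stores the new value.

```ruby
h = Hash.new(0)

h['missing']       # => 0
h.key?('missing')  # => false (reading the default did not add the key)

# Each += reads the current value (or the default 0) and stores the sum.
%w[a b a].each { |k| h[k] += 1 }
h                  # => {"a"=>2, "b"=>1}
```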

So for your problem, you could write:

dictionary = %w| below down go going horn how howdy it i low own part partner sit |
  #=> ["below", "down", "go", "going", "horn", "how", "howdy", "it", "i",
  #    "low", "own", "part", "partner", "sit"] 
string = "Periscope down, so we can go down, way down, below the surface."
string.delete(',.').downcase.split.each_with_object(Hash.new(0)) { |word,h|
  (h[word] += 1) if dictionary.include?(word) }
  #=> {"down"=>3, "go"=>1, "below"=>1}

You may also see this written as:

string.delete(',.').downcase.split.each_with_object({}) { |word,h|
  h[word] = (h[word] || 0) + 1 if dictionary.include?(word) }

So if h does not have a key word, h[word] will be nil, so the expression becomes:

h[word] = (h[word] || 0) + 1
  #=>   = (nil     || 0) + 1
  #=>   = 0 + 1  

Another way is to first count the number of instances of every word in string, then see which of them are in the dictionary:

h = string.delete(',.').downcase.split.group_by(&:itself)
  #=> {"periscope"=>["periscope"], "down"=>["down", "down", "down"], "so"=>["so"],
  #    "we"=>["we"], "can"=>["can"], "go"=>["go"], "way"=>["way"], "below"=>["below"],
  #    "the"=>["the"], "surface"=>["surface"]}
h.each_with_object({}) { |(k,v),g| g[k] = v.size if dictionary.include?(k) }
  #=> {"down"=>3, "go"=>1, "below"=>1}

(Edit: see @theTinMan's answer for a better way to use Enumerable#group_by.)
