使用功能样式在流中匹配任意数量的以下令牌



问题如下:

  1. 有一个带有令牌的文件 - 每个令牌都位于单独的行中,伴随着一些元数据(例如文档ID(,
  2. 应该计算令牌的某些序列,序列可能是一个或多个令牌,
  3. 序列保存在trie中,但这不是要求,
  4. 实现必须非常有效,因为要处理的文件具有千兆字节的数据。

我目前的实施(在Ruby中(如下:

def convert_tuple(tuple)
  document_id, token_index, space, token = *tuple
  token = token.chomp
  token.force_encoding("ascii-8bit")
  document_id = document_id.to_i
  [document_id, token_index, space, token]
end
def count_and_match_tokens(string, index, counts, document_id, first_token_index, last_token_index)
  token_id = index[string]
  if token_id
    STDERR.puts "%st%st%st%s" % [document_id, first_token_index, last_token_index, string]
    counts[string] += 1
  end
  index.search(string).size > 0
end
counts = Hash.new(0)
index = Melisa::IntTrie.new
index.load(index_path)
CSV.open(input_path, col_sep: "t") do |input|
  input.each do |tuple|
    document_id, first_token_index, space, token = convert_tuple(tuple)
    recoreded_pos = input.pos
    last_token_index = first_token_index
    string = token.dup
    while(count_and_match_tokens(string, index, counts, document_id, first_token_index, last_token_index)) do
      last_document_id, last_token_index, space, last_token = convert_tuple(input.shift)
      break if document_id != last_document_id
      string << " " if space == "1"
      string << last_token
    end
    input.pos = recoreded_pos
  end
end  
CSV.open(output_path,"w") do |output|
  counts.each do |tuple|
    output << tuple
  end
end

convert_tuple函数仅使数据基本转换(即将字符串转换为数字等(。

如果传递的字符串参数是另一个字符串的前缀,则count_and_match_tokens函数计算令牌并返回true。我使用Trie结构有效验证此情况。

我想知道如何使用功能样式编写的解决方案。我面临的问题是,匹配的序列可能跨越许多令牌。

在Ruby中(或通常以OO样式(,我可以记录我启动匹配(recorded_pos = input.pos(和"重置"流的位置,当子序列匹配结束时(input.pos = recorded_pos(。结果,随后对each的调用将返回流中的下一个令牌。因此,在已经识别的序列(while循环内处理的令牌(内部的令牌也可以在其他子序列中首先是匹配的令牌。

我将感谢在长生不老药方面的解决方案,但是任何其他功能语言也可以。

编辑

我提供了convert_tuplecount_and_match_tokens的定义以及示例输入和输出(文件被截断,因此计数并不直接与输入文件对应(。

代码中出现的索引数据结构是Maris Trie(Melisa Gem:https://github.com/wordtreefoundation/melisa/(

示例输入:

0   746 1   The
0   748 1   river
0   751 1   Bosna
0   754 1   (
0   763 0   )
0   765 1   (
0   766 0   Cyrillic
0   767 0   :
0   769 1   Босна
0   770 0   )
0   772 1   is
0   774 1   the
0   776 1   third
0   778 1   longest
0   781 1   river
0   784 1   in
0   787 1   Bosnia
0   789 1   and
0   791 1   Herzegovina
0   793 0   ,
0   795 1   and
0   797 1   is
0   799 1   considered
0   801 1   one
0   803 1   of
0   805 1   the
0   807 1   country
0   808 0   '
0   809 0   s
0   811 1   three
0   813 1   major
0   815 1   internal
0   817 1   rivers

要识别的令牌序列:

Bosnia
Bosnia and Herzegovina
river
Herzegovina

示例输出:

river,2
Bosnia,1
Bosnia and Herzegovina,1
Herzegovina,1

我希望这有助于理解我要解决的问题。

一个可运行的程序(count_sporences.rb(:

#!/usr/bin/env ruby
require 'set'
sequence_file, token_file = ARGV
sequences = Set.new
forest = File.readlines(sequence_file).each{|s| sequences << s.tap(&:chomp!)}.map!(&:split).each_with_object({}) do |words, root|
  words.reduce(root) do |parent, word|
    (parent[word] ||= [0, {}])[1]
  end
end
#=>  {
#      "Bosnia" => [0, {
#        "and" => [0, {
#          "Herzegovina" => [0, {}]
#        }]
#      }],
#      "river" => [0, {}]
#    }
File.open(token_file) do |f|
  current_node = forest
  f.each_line do |line|
    token = line.tap(&:chomp!).split[-1]
    spec = current_node[token] || forest[token]
    if spec
      spec[0] += 1
      current_node = spec[1]
    else
      current_node = forest
    end
  end
end
#=>  {
#      "Bosnia" => [1, {
#        "and" => [1, {
#          "Herzegovina" => [1, {}]
#        }]
#      }],
#      "river" => [2, {}]
#    }
def print_tree(node, sequences, parent = nil)
  node.each do |word, spec|
    sequence = [parent, word].compact.join(' ')
    puts "#{sequence},#{spec[0]}" if sequences.include? sequence
    print_tree(spec[1], sequences, sequence)
  end
end
print_tree(forest, sequences)

您可以使用

运行它
$ ruby count_sequences.rb /path/to/sequences.txt /path/to/tokens.txt

它输出

Bosnia,1
Bosnia and Herzegovina,1
river,2

相关内容

最新更新