How to get a node's line number with the Nokogiri Reader interface

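I'm trying to write a Nokogiri script that greps XML for text nodes containing ASCII double quotes ("). Since I want grep-like output, I need the line number and the content of each matching line. However, I can't figure out how to get the line number where an element starts. Here is my code: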

require 'rubygems'
require 'nokogiri'
ARGV.each do |filename|
    xml_stream = File.open(filename)
    reader = Nokogiri::XML::Reader(xml_stream)
    titles = []
    text = ''
    grab_text = false
    reader.each do |elem|
        if elem.node_type == Nokogiri::XML::Node::TEXT_NODE
            data = elem.value
            lines = data.split(/\n/, -1)
            lines.each_with_index do |line, idx|
                if (line =~ /"/) then
                    STDOUT.printf "%s:%d:%s\n", filename, elem.line()+idx, line
                end
            end
        end
    end
end

XML and XML parsers don't really have a concept of line numbers. What you're describing is the physical layout of the file.

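You can play a game with the parser, using accessors to look for text nodes that contain line feeds and/or carriage returns, but that can be thrown off because XML allows nested nodes.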

require 'nokogiri'
xml =<<EOT_XML
<atag>
  <btag>
    <ctag 
      id="another_node">
      other text
    </ctag>
  </btag>
  <btag>
    <ctag id="another_node2">yet
                             another
                             text</ctag>
    </btag>
  <btag>
    <ctag id="this_node">this text</ctag>
  </btag>
</atag>
EOT_XML
doc = Nokogiri::XML(xml)
# find a particular node via CSS accessor
doc.at('ctag#this_node').text # => "this text"
# count how many "lines" there are in the document
doc.search('*/text()').select{ |t| t.text[/[\r\n]/] }.size # => 12
# walk the nodes looking for a particular string, counting lines as you go
content_at = []
doc.search('*/text()').each do |n|
  content_at << [n.line, n.text] if (n.text['this text'])
end
content_at # => [[14, "this text"]]

This works because the parser can work out what is a text node and return it cleanly, without relying on regular expressions or text matching.

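EDIT: I went through some old code, snooped around in Nokogiri's docs a bit, and came up with the edited changes above. It works correctly, including with some pathological cases. Nokogiri FTW!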
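To get the grep-like output the question asks for, a multi-line text node still has to be split into lines, with each line's offset added to the node's line number. Here is a minimal sketch in plain Ruby. The helper name `grep_text`, the file name `sample.xml`, and the starting line 9 are made up for illustration. It assumes `start_line` is the line on which the text begins; whether `Node#line` reports the first or the last line of a multi-line text node is worth checking against your own Nokogiri/libxml2 build.

```ruby
# Hypothetical helper (not part of Nokogiri): returns grep-style
# "file:line:content" strings for each line of a text node that matches.
# Assumption: start_line is the line number on which the text begins.
def grep_text(filename, start_line, text, pattern = /"/)
  # The -1 limit keeps trailing empty lines so the offsets stay aligned.
  matches = text.split(/\n/, -1).each_with_index.map do |line, idx|
    format('%s:%d:%s', filename, start_line + idx, line) if line =~ pattern
  end
  matches.compact
end

# Made-up sample: a three-line text node that starts on line 9.
hits = grep_text('sample.xml', 9, %(say "yet"\nanother\n"text"))
puts hits
```

In a real script the `text` and `start_line` arguments would come from each text node, for example `n.text` and `n.line` as in the loop above.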

As of 1.2.0 (released 2009-02-22), Nokogiri supports Node#line, which returns the line number in the source where that node is defined.

It appears to use the libxml2 function xmlGetLineNo().

require 'nokogiri'
doc = Nokogiri::XML(File.read('tmpfile.xml'))
doc.xpath('//xmlns:package[@arch="x86_64"]').each do |node|
    puts '%4d %s' % [node.line, node['name']]
end

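Note: if you are working with large XML files (more than 65535 lines), make sure to use Nokogiri 1.13.0 or newer (released 2022-01-06); otherwise your Node#line results will be inaccurate.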
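A script that depends on line numbers could guard against this up front. The sketch below only encodes the version threshold from the note above. The helper name `line_numbers_reliable?` and the constant are hypothetical, and it does not inspect the underlying libxml2 build, which may also matter.

```ruby
require 'rubygems'

# Line count above which older Nokogiri versions report inaccurate lines.
BIG_LINE_LIMIT = 65_535

# Hypothetical helper: true when Node#line can be expected to be accurate
# for a file with line_count lines, judged only by the gem version string.
def line_numbers_reliable?(nokogiri_version, line_count)
  return true if line_count <= BIG_LINE_LIMIT

  Gem::Version.new(nokogiri_version) >= Gem::Version.new('1.13.0')
end

puts line_numbers_reliable?('1.12.5', 70_000)
puts line_numbers_reliable?('1.13.0', 70_000)
```

In real use you would pass `Nokogiri::VERSION` as the first argument and the file's actual line count as the second.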
