Parsing an XML file with Nokogiri to determine a path (Ruby)

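My code is supposed to "guess" the path that leads to the relevant text nodes in an XML file. Relevant here means: text nodes nested inside a recurring product/person/something tag, but not text nodes used outside that tag.

Code: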

    require 'nokogiri'

    @doc, items = Nokogiri.XML(@file), []
    path = []
    # traverse yields children before their parent, so names arrive deepest-first
    @doc.traverse do |node|
      next unless node.element?
      # an element is part of the path only if it contains other elements
      is_path_element = node.children.any?(&:element?)
      path.push(node.name) if is_path_element && !path.include?(node.name)
    end
    # reverse to get the order from the root down
    final_path = "/" + path.reverse.join("/")

This works for simple XML files, for example:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Some XML file title</title>
    <description>Some XML file description</description>
    <item>
      <title>Some product title</title>
      <brand>Some product brand</brand>
    </item>
    <item>
      <title>Some product title</title>
      <brand>Some product brand</brand>
    </item>
  </channel>
</rss>
puts final_path # => "/rss/channel/item"

But how should I handle more complex cases? For example:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Some XML file title</title>
    <description>Some XML file description</description>
    <item>
      <titles>
        <title>Some product title</title>
      </titles>
      <brands>
        <brand>Some product brand</brand>
      </brands>
    </item>
    <item>
      <titles>
        <title>Some product title</title>
      </titles>
      <brands>
        <brand>Some product brand</brand>
      </brands>
    </item>
  </channel>
</rss>

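If you are looking for a list of the deepest "parent" paths in the XML, there is more than one way to look at it.

While I think you could adapt your own code to produce the same output, I am sure the same result can be achieved with XPath. My motivation is to keep my XML skills from getting rusty (I have not used Nokogiri yet, but I will soon need to use it professionally). Here is how to use XPath to get all parent paths that have exactly one level of children below them: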

xml.xpath('//*[child::* and not(child::*/*)]').each { |node| puts node.path }

The output for the second sample file is:

/rss/channel/item[1]/titles
/rss/channel/item[1]/brands
/rss/channel/item[2]/titles
/rss/channel/item[2]/brands
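
As the output above suggests, the bracketed index shows up only on the element that repeats (`item`), while `rss` and `channel` carry none. Under that assumption, the path the question asks for can be read straight off these strings. The following is a minimal sketch in plain Ruby that works on path strings already collected with `node.path`; the helper name `repeated_element_paths` is hypothetical and not part of Nokogiri.

```ruby
# Hypothetical helper (not part of Nokogiri): given path strings such as
# those returned by node.path, cut each one after its deepest indexed
# segment and drop the indices, leaving the path of the repeating element.
def repeated_element_paths(paths)
  paths.filter_map do |path|
    segments = path.split('/').reject(&:empty?)
    # Position of the deepest segment with a positional index, e.g. "item[2]".
    last_indexed = segments.rindex { |segment| segment.match?(/\[\d+\]\z/) }
    next if last_indexed.nil? # no repeated element on this path

    '/' + segments[0..last_indexed].map { |s| s.sub(/\[\d+\]\z/, '') }.join('/')
  end.uniq
end

indexed_paths = [
  '/rss/channel/item[1]/titles',
  '/rss/channel/item[1]/brands',
  '/rss/channel/item[2]/titles',
  '/rss/channel/item[2]/brands'
]

p repeated_element_paths(indexed_paths) # => ["/rss/channel/item"]
```

Paths without any index are skipped, so a file with no repeated element yields an empty array rather than a wrong guess.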

…If you take this list, strip the indices, and make the array unique, it looks a lot like the output of your loop…

paths = xml.xpath('//*[child::* and not(child::*/*)]').map { |node| node.path }
paths.map! { |path| path.gsub(/\[[0-9]+\]/,'') }.uniq!
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]

Or as a one-liner:

paths = xml.xpath('//*[* and not(*/*)]').map { |node| node.path.gsub(/\[[0-9]+\]/,'') }.uniq
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
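
One detail separates the two variants above: `uniq!` returns `nil` when it removes nothing, so the two-step version would return `nil` for a file whose normalized paths are already unique, whereas `uniq` always returns an array. The sketch below is plain Ruby on hard-coded path strings, so it runs without Nokogiri; the helper name `normalize_paths` is hypothetical.

```ruby
# Hypothetical helper: strip positional indices such as "[2]" and deduplicate.
def normalize_paths(paths)
  paths.map { |path| path.gsub(/\[[0-9]+\]/, '') }.uniq
end

repeated = ['/rss/channel/item[1]/titles', '/rss/channel/item[2]/titles']
single   = ['/rss/channel/item/titles']

p normalize_paths(repeated) # => ["/rss/channel/item/titles"]
p normalize_paths(single)   # => ["/rss/channel/item/titles"]

# The bang variant signals "nothing changed" with nil instead of the array.
p single.dup.uniq!          # => nil
```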

I created a library for building XPath expressions.

xpath = Jini.new
        .add_path('parent')
        .add_path('child')
        .add_all('toys')
        .add_attr('name', 'plane')
        .to_s
puts xpath # -> /parent/child//toys[@name="plane"]
