How to parse XML to CSV when the data is only in attributes

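I have an XML file I'm trying to parse in which all of the data is contained in attributes. I've worked out how to build the string to insert into the text file.

I have this XML file: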

<ig:prescribed_item class_ref="0161-1#01-765557#1">
  <ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
    <dt:measure_number_type representation_ref="0161-1#04-000005#1">
      <dt:real_type>
        <dt:real_format pattern="d(1,).d(1,)"/>
      </dt:real_type>
      <dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
    </dt:measure_number_type>
  </ig:prescribed_property>
  <ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
    <dt:measure_number_type representation_ref="0161-1#04-000005#1">
      <dt:real_type>
        <dt:real_format pattern="d(1,).d(1,)"/>
      </dt:real_type>
      <dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
    </dt:measure_number_type>
  </ig:prescribed_property>
</ig:prescribed_item>
  </ig:identification_guide>

I want to parse it into a text file like this, with the class_ref repeated for each property:

class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1

This is the code I have so far:

require 'nokogiri'
doc = Nokogiri::XML(File.open("file.xml"), 'UTF-8') do |config|
  config.strict
end
content = doc.xpath("//ig:prescribed_item/@class_ref").map {|i|
  i.search("//ig:prescribed_item/ig:prescribed_property/@property_ref").map { |d| d.text }
}
puts content.inspect
content.each do |c|
  puts c.join('|')
end

I'd use CSS accessors to simplify it:

xml = <<EOT
<ig:prescribed_item class_ref="0161-1#01-765557#1">
    <ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
        <dt:measure_number_type representation_ref="0161-1#04-000005#1">
            <dt:real_type>
                <dt:real_format pattern="d(1,).d(1,)"/>
            </dt:real_type>
            <dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
        </dt:measure_number_type>
    </ig:prescribed_property>
    <ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
        <dt:measure_number_type representation_ref="0161-1#04-000005#1">
            <dt:real_type>
                <dt:real_format pattern="d(1,).d(1,)"/>
            </dt:real_type>
            <dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
        </dt:measure_number_type>
    </ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
EOT
require 'nokogiri'
doc = Nokogiri::XML(xml)
data = [ %w[ class_ref property_ref is_required UOM_ref] ]
doc.css('|prescribed_item').each do |pi|
  pi.css('|prescribed_property').each do |pp|
    data << [
      pi['class_ref'],
      pp['property_ref'],
      pp['is_required'],
      pp.at_css('|prescribed_unit_of_measure')['UOM_ref']
    ]
  end
end
puts data.map{ |row| row.join('|') }

Which outputs:

class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
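
If the rows should end up in a file rather than on stdout, Ruby's standard csv library can handle the delimiter and any quoting. This is only a sketch: the helper name rows_to_pipe_delimited and the output.txt path are made up for illustration, and the rows are typed in by hand rather than pulled from the XML.

```ruby
require 'csv'

# Hypothetical helper: turns an array of row arrays into pipe-delimited text.
# CSV takes care of quoting any field that itself contains a "|".
def rows_to_pipe_delimited(rows)
  CSV.generate(col_sep: '|') do |csv|
    rows.each { |row| csv << row }
  end
end

# Hand-written rows standing in for the data array built from the XML.
rows = [
  %w[class_ref property_ref is_required UOM_ref],
  %w[0161-1#01-765557#1 0161-1#02-016058#1 false 0161-1#05-003260#1]
]

puts rows_to_pipe_delimited(rows)

# To write a file instead of printing (assumed file name):
# File.write('output.txt', rows_to_pipe_delimited(rows))
```

Joining with row.join('|') is fine for data like this; the CSV route mainly helps if a value could ever contain the delimiter.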

Can you explain "pp.at_css('|prescribed_unit_of_measure')['UOM_ref']" in a bit more detail?

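In Nokogiri there are two types of "find a node" methods: the "search" methods return all nodes matching a particular accessor as a NodeSet, and the "at" methods return the first Node of that NodeSet, which is the first node encountered that matches the accessor.

The "search" methods are search, css, xpath and /. The "at" methods are at, at_css, at_xpath and %. Both search and at accept either XPath or CSS accessors.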

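Getting back to pp.at_css('|prescribed_unit_of_measure')['UOM_ref']: in the code, pp is a local variable holding a prescribed_property node. So I'm telling the code to find the first node under pp that matches the CSS accessor |prescribed_unit_of_measure, in other words the first <dt:prescribed_unit_of_measure> tag contained in the pp node. Once Nokogiri has found that node, the ['UOM_ref'] lookup returns the value of its UOM_ref attribute.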

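As an FYI, the / and % operators are aliases for search and at, respectively, in Nokogiri. They're part of its "Hpricot" compatibility; we used them a lot back when Hpricot was the XML/HTML parser of choice, but they aren't idiomatic for most Nokogiri developers. I suspect that's to avoid confusion with the regular use of those operators; at least it is in my case.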

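Also, Nokogiri's CSS accessors have one extra nicety: they support namespaces, like the XPath accessors do, only they use | as the separator. Nokogiri will let us ignore the namespace, which is what I did. You'll want to look through the Nokogiri docs on CSS and namespaces for more information.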

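There are definitely a lot of ways to parse based on attributes.

The Engine Yard article "Getting started with Nokogiri" has a full description.

But quickly, the examples they give are: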

To match "h3" tags that have a class attribute, we write:

h3[@class]

To match "h3" tags whose class attribute is equal to the string "r", we write:

 h3[@class = "r"]

Using the attribute-matching construct, we can modify our previous query to:

 //h3[@class = "r"]/a[@class = "l"]
