How to extract data from tags based on the value of another tag



I have the following sample document:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<n1:Form109495CTransmittalUpstream xmlns="urn:us:gov:treasury:irs:ext:aca:air:7.0" xmlns:irs="urn:us:gov:treasury:irs:common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage IRS-Form1094-1095CTransmitterUpstreamMessage.xsd" xmlns:n1="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage">
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>1</RecordId>
<CorrectedInd>0</CorrectedInd>
<irs:TaxYr>2015</irs:TaxYr>
<EmployeeInfoGrp>
<OtherCompletePersonName>
<PersonFirstNm>JOHN</PersonFirstNm>
<PersonMiddleNm>B</PersonMiddleNm>
<PersonLastNm>Doe</PersonLastNm>
</OtherCompletePersonName>
<PersonNameControlTxt/>
<irs:TINRequestTypeCd>INDIVIDUAL_TIN</irs:TINRequestTypeCd>
<irs:SSN>123456790</irs:SSN>
</Form1095CUpstreamDetail>
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>2</RecordId>
<CorrectedInd>0</CorrectedInd>
<irs:TaxYr>2015</irs:TaxYr>
<EmployeeInfoGrp>
<OtherCompletePersonName>
<PersonFirstNm>JANE</PersonFirstNm>
<PersonMiddleNm>B</PersonMiddleNm>
<PersonLastNm>DOE</PersonLastNm>
</OtherCompletePersonName>
<PersonNameControlTxt/>
<irs:TINRequestTypeCd>INDIVIDUAL_TIN</irs:TINRequestTypeCd>
<irs:SSN>222222222</irs:SSN>
</EmployeeInfoGrp>
</Form1095CUpstreamDetail>
</n1:Form109495CTransmittalUpstream>

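Using Nokogiri, I want to extract the values of <PersonFirstNm>, <PersonLastNm> and <irs:SSN> from each <Form1095CUpstreamDetail>, selected by its <RecordId>.

I have also tried removing the namespaces. I'm posting only a small snippet, but I have tried many ways of traversing the XML without success. This is my first time working with XML, so I realize I may be missing something simple.

When I set up the XPath: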

require 'nokogiri'
submission_doc = Nokogiri::XML(open('1094C_Request.xml'))
submissions = submission_doc.remove_namespaces!
nodes = submissions.xpath('//Form1095CUpstreamDetail')

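I don't seem to get any association between RecordId and the tags mentioned above, and I'm stuck on where to go next.

The fields are not children of RecordId, so I can't figure out how to get at their values. I included the complete file as an example to make sure I'm not leaving anything out.

I have an array of values, and if a RecordId is in that array of numbers, I want to pull out the three tags mentioned above.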

Nokogiri makes it easy to do what you want (assuming the XML is syntactically correct). I would do it like this:

require 'nokogiri'
require 'pp'
doc = Nokogiri::XML(<<EOT)
<n1:Form109495CTransmittalUpstream xmlns="urn:us:gov:treasury:irs:ext:aca:air:7.0" xmlns:irs="urn:us:gov:treasury:irs:common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage IRS-Form1094-1095CTransmitterUpstreamMessage.xsd" xmlns:n1="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage">
  <Form1095CUpstreamDetail RecordType="String" lineNum="1">
    <RecordId>1</RecordId>
    <PersonFirstNm>JOHN</PersonFirstNm>
    <PersonLastNm>Doe</PersonLastNm>
    <irs:SSN>123456790</irs:SSN>
  </Form1095CUpstreamDetail>
  <Form1095CUpstreamDetail RecordType="String" lineNum="1">
    <RecordId>2</RecordId>
    <PersonFirstNm>JANE</PersonFirstNm>
    <PersonLastNm>DOE</PersonLastNm>
    <irs:SSN>222222222</irs:SSN>
  </Form1095CUpstreamDetail>
</n1:Form109495CTransmittalUpstream>
EOT
info = doc.search('Form1095CUpstreamDetail').map{ |form|
  {
    record_id:       form.at('RecordId').text,
    person_first_nm: form.at('PersonFirstNm').text,
    person_last_nm:  form.at('PersonLastNm').text,
    ssn:             form.at('irs|SSN').text
  }
}
pp info
# >> [{:record_id=>"1",
# >>   :person_first_nm=>"JOHN",
# >>   :person_last_nm=>"Doe",
# >>   :ssn=>"123456790"},
# >>  {:record_id=>"2",
# >>   :person_first_nm=>"JANE",
# >>   :person_last_nm=>"DOE",
# >>   :ssn=>"222222222"}]

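This can also be done with XPath, but Nokogiri's CSS selector implementation tends to produce selectors that are easier to read, which makes them easier to maintain, and that is a very good thing.

Note the use of 'irs|SSN', which is how Nokogiri expresses a namespace prefix in a CSS selector.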
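The question also asks for only those records whose RecordId appears in a given array of numbers. A minimal sketch of that last step, assuming `info` has the shape printed above (it is hard-coded here so the snippet runs on its own) and that `wanted_ids` is a hypothetical array of integer ids:

```ruby
require 'set'

# Sample data shaped like the `info` array produced by the code above.
info = [
  { record_id: '1', person_first_nm: 'JOHN', person_last_nm: 'Doe', ssn: '123456790' },
  { record_id: '2', person_first_nm: 'JANE', person_last_nm: 'DOE', ssn: '222222222' }
]

# Hypothetical list of RecordId values to keep (not from the original post).
wanted_ids = [2, 5]

# Node text is always a String, so compare against string ids.
wanted = wanted_ids.map(&:to_s).to_set

selected = info.select { |row| wanted.include?(row[:record_id]) }
```

Here `selected` keeps only the hashes whose `:record_id` matches one of the wanted ids. Using a Set is optional; it just keeps the lookup cheap if the id list is long.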

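First, the XML validator reports an error:

The default (unprefixed) namespace URI for XPath queries is always '', and it cannot be redefined to 'urn:us:gov:treasury:irs:ext:aca:air:7.0'.

So you have to set this default xmlns to ''.

You can use this code: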

require 'nokogiri'
doc = Nokogiri::XML(open('1094C_Request.xml'))
doc.namespaces['xmlns'] = ''
details = doc.xpath("//:Form1095CUpstreamDetail")
elem_a = ["PersonFirstNm", "PersonLastNm", "irs:SSN"]
output = details.each_with_object({}) do |element, exp|
  exp[element.xpath("./:RecordId").text] = elem_a.each_with_object({}) do |elem_n, exp_h|
    exp_h[elem_n] = element.xpath(".//#{elem_n.include?(':') ? elem_n : ":#{elem_n}"}").text
  end
end

p output
# {
#   "1" => {"PersonFirstNm" => "JOHN", "PersonLastNm" => "Doe", "irs:SSN" => "123456790"},
#   "2" => {"PersonFirstNm" => "JANE", "PersonLastNm" => "DOE", "irs:SSN" => "222222222"}
# }

Hope this helps.
