I am trying to parse multiple XML files with Nokogiri. They have the following format:
<?xml version="1.0" encoding="UTF-8"?>
<CRDoc>[Congressional Record Volume<volume>141</volume>, Number<number>213</number>(<weekday>Sunday</weekday>,<month>December</month>
<day>31</day>,<year>1995</year>)]
[<chamber>Senate</chamber>]
[Page<pages>S19323</pages>]<congress>104</congress>
<session>1</session>
<document_title>UNANIMOUS-CONSENT REQUEST--HOUSE MESSAGE ON S. 1508</document_title>
<speaker name="Mr. DASCHLE">Mr. DASCHLE</speaker>.<speaking name="Mr. DASCHLE">Mr. President, I said this on the floor yesterday
afternoon, and I will repeat it this afternoon. I know that the
distinguished majority leader wants an agreement as much as I do, and I
do not hold him personally responsible for the fact that we are not
able to overcome this impasse. I commend him for his efforts at trying
to do so again today.</speaking>
<speaking name="Mr. DASCHLE">Let me try one other option. We have already been unable to agree to
a continuing resolution that would have put all Federal employees back
to work with pay. We have been unable to agree to something that we
agreed to last Friday, the 22d of December, which would have at least
sent them back to their offices without pay. Perhaps we can try this.</speaking>
<speaking name="Mr. DASCHLE">I ask unanimous consent that the Senate proceed to the message from
the House on S. 1508, that the Senate concur in the House amendment
with a substitute amendment that includes the text of Senator Dole's
back-to-work bill, and the House-passed expedited procedures shall take
effect only if the budget agreement does not cut Medicare more than
necessary to ensure the solvency of the Medicare part A trust fund and,
second, does not raise taxes on working Americans, does not cut funding
for education or environmental enforcement, and maintains the
individual health guarantee under Medicaid and, third, provides that
any tax reductions in the budget agreement go only to Americans making
under $100,000; that the motion to concur be agreed to, and the motion
to reconsider be laid upon the table.</speaking>
<speaker name="The ACTING PRESIDENT pro tempore">The ACTING PRESIDENT pro tempore</speaker>.<speaking name="The ACTING PRESIDENT pro tempore">Is there objection?</speaking>
<speaker name="Mr. DOLE">Mr. DOLE</speaker>.<speaking name="Mr. DOLE">Mr. President, I want to say a few words. But I will
object.</speaking>
<speaking name="Mr. DOLE">We are working on a lot of these things in our meetings at the White
House, where we have both been for a number of hours. I think we have
made some progress. We are a long way from any solution yet.</speaking>
<speaking name="Mr. DOLE">I think all of the things listed by the Democratic leader are areas
of concern in the meetings we have had. And the meetings will start
again on Tuesday. But it seems to me that it would not be appropriate
to proceed under those terms, and therefore I object.</speaking>
<speaker name="The ACTING PRESIDENT pro tempore">The ACTING PRESIDENT pro tempore</speaker>.<speaking name="The ACTING PRESIDENT pro tempore">Objection is heard.</speaking>
</CRDoc>
The code I am using came from earlier help and worked for a while. However, the format of the XML files has changed, and the code no longer works. My code is:
doc.xpath("//speech/speaking/@name").map(&:text).uniq.each do |name|
  speaker = Nokogiri::XML('<root/>')
  doc.xpath('//speech').each do |speech|
    speech_node = Nokogiri::XML('<speech/>')
    speech.xpath("*[@name='#{name}']").each do |speaking|
      speech_node.root.add_child(speaking)
    end
    speaker.root.add_child(speech_node.root) unless speech_node.root.children.empty?
  end
  File.open("test/" + name + "-" + year + ".xml", 'a+') do |f|
    f.write speaker.root.children
  end
end
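I want to create a new XML file for each speaker, containing everything that speaker said. The code needs to loop over the XML files in a directory and put each speech into the appropriate speaker's file. I imagine this could be done with a find -exec command.
- Create an XML file named after the speaker and the year, e.g. Mr. Boehner_2011.xml
- The XML file will hold all of his speeches for that year.
- The XML file will have a CRDoc root node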
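My suggestion is that, rather than continuing to use code you don't understand, you break it into small pieces that are easier to understand, or that at least make it easier to isolate the problem.
Imagine being able to do this: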
crdoc = CongressionalRecordDocument.new(filename)
crdoc.year
#=> 1995
crdoc.speakers
#=> ["Mr. DASCHLE", "The ACTING PRESIDENT pro tempore", "Mr. DOLE"]
crdoc.speakers.each do |speaker|
  speech = crdoc.speaking_parts(speaker)
  # save speech to file
end
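This hides the details and makes the code easier to read. Better still, it separates the pieces, so if, for example, the way you retrieve the list of speakers changes, you only need to change one small part, and that part is easy to test. Let's implement it: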
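require 'nokogiri'

class CongressionalRecordDocument
  def initialize(xml_file)
    # Parse the file's contents; passing the bare path to Nokogiri::XML
    # would parse the path string itself as XML.
    @doc = File.open(xml_file) { |f| Nokogiri::XML(f) }
  end

  def year
    # `at` returns a node, so take its text and convert it to an integer
    @year ||= @doc.at('//year')&.text&.to_i
  end

  def speakers
    @speakers ||= @doc.xpath('//speaker/@name').map(&:text).uniq
  end

  def speaking_parts(speaker)
    # Bind the name as an XPath variable so quotes in a name cannot break the query
    @doc.xpath('//speaking[@name = $name]', nil, name: speaker).map(&:text)
  end
end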
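Now that doesn't look so complicated, does it? You may also want to create a class for the new document in a similar way, so that creating the output is simple.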
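One possible shape for such an output class is sketched here. The class name `SpeakerYearDocument`, its methods, and the rule for cleaning up the filename are all assumptions of this sketch, not part of the original code. To stay free of dependencies it escapes text with Ruby's built-in `String#encode(xml: ...)`; a Nokogiri builder would work just as well.

```ruby
# Hypothetical output-side counterpart to CongressionalRecordDocument.
# Collects one speaker's remarks for one year and renders them
# under a single <CRDoc> root.
class SpeakerYearDocument
  attr_reader :speaker, :year

  def initialize(speaker, year)
    @speaker = speaker
    @year = year
    @parts = []
  end

  # Add one block of spoken text (a plain string); returns self for chaining.
  def add(text)
    @parts << text
    self
  end

  # e.g. "Mr. Boehner_2011.xml". Path separators are replaced so the name
  # cannot point outside the output directory (an assumption of this sketch).
  def filename
    "#{speaker.gsub(%r{[/\\]}, '-')}_#{year}.xml"
  end

  def to_xml
    name = speaker.encode(xml: :attr) # escaped and wrapped in double quotes
    body = @parts.map do |text|
      "  <speaking name=#{name}>#{text.encode(xml: :text)}</speaking>"
    end
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', '<CRDoc>'] + body + ['</CRDoc>']
    lines.join("\n") + "\n"
  end

  # Write the whole document in one go, replacing any earlier version.
  def save(dir)
    File.write(File.join(dir, filename), to_xml)
  end
end
```

With something like this, saving a speaker's remarks could be as short as `SpeakerYearDocument.new(speaker, crdoc.year).add(text).save("test")`.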
Also, you may want to find your files in Ruby rather than with find -exec:
Dir["/path/to/search/*.xml"].each do |file|
  crdoc = CongressionalRecordDocument.new(file)
  # etc
end
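Because each speaker's file is supposed to hold a whole year of remarks under one `CRDoc` root, it may be simpler to gather everything first and write each file once, rather than appending to files as you go. A minimal sketch of that grouping step follows. `group_by_speaker_and_year` is a hypothetical helper; it only assumes each record responds to `year`, `speakers`, and `speaking_parts`, as `CongressionalRecordDocument` does. The `FakeRecord` struct stands in for parsed files so the example runs without any XML.

```ruby
# Hypothetical helper: merge many parsed records into
# { [speaker, year] => [text, text, ...] }.
def group_by_speaker_and_year(records)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  records.each do |record|
    record.speakers.each do |speaker|
      grouped[[speaker, record.year]].concat(record.speaking_parts(speaker))
    end
  end
  grouped
end

# Stand-in for CongressionalRecordDocument, for illustration only.
FakeRecord = Struct.new(:year, :parts) do
  def speakers
    parts.keys
  end

  def speaking_parts(speaker)
    parts.fetch(speaker, [])
  end
end

records = [
  FakeRecord.new(1995, { 'Mr. DOLE' => ['I object.'] }),
  FakeRecord.new(1995, { 'Mr. DOLE' => ['We are a long way from any solution yet.'],
                         'Mr. DASCHLE' => ['Let me try one other option.'] })
]

grouped = group_by_speaker_and_year(records)
grouped.each do |(speaker, year), texts|
  puts "#{speaker}_#{year}.xml: #{texts.size} part(s)"
end
```

In real use, `records` would come from mapping the `Dir[...]` results through `CongressionalRecordDocument.new`, and each entry of `grouped` would be written to its own file.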
Since you no longer have a <speech> element, you need to remove it from your code:
doc.xpath("//speaking/@name").map(&:text).uniq.each do |name|
  speaker = Nokogiri::XML('<root/>')
  doc.xpath('//CRDoc').each do |speech|
    speech_node = Nokogiri::XML('<speech/>')
    speech.xpath("*[@name='#{name}']").each do |speaking|
      speech_node.root.add_child(speaking)
    end
    speaker.root.add_child(speech_node.root) unless speech_node.root.children.empty?
  end
  File.open("test/" + name + "-" + year + ".xml", 'a+') do |f|
    f.write speaker.root.children
  end
end