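I'm trying to process some XML that comes from a TeleForm application. TeleForm is form-scanning software that captures the data and puts it into XML. Here is a snippet of the XML: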
<?xml version="1.0" encoding="ISO-8859-1"?>
<Records>
<Record>
<Field id="ImageFilename" type="string" length="14"><Value>00000022000000</Value></Field>
<Field id="Criterion_1" type="number" length="2"><Value>3</Value></Field>
<Field id="Withdrew" type="string" length="1"></Field>
</Record>
<Record>
<Field id="ImageFilename" type="string" length="14"><Value>00000022000001</Value></Field>
<Field id="Criterion_1" type="number" length="2"><Value>3</Value></Field>
<Field id="Withdrew" type="string" length="1"></Field>
</Record>
</Records>
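I have dealt with this in another system, probably using a custom parser we wrote. I figured it would be no problem in Rails, but I was wrong.
Parsing it with Hash.from_xml or Nokogiri does not give me the results I expected. I get: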
{"Records"=>{"Record"=>[{"Field"=>["", {"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>"3"}, ""]},
{"Field"=>["", {"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>"3"}, ""]}]}}
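After spending far too much time on this, I discovered that if I remove the type and length attributes, I get what I expected (even if it is wrong! I only removed them from the first Record node).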
{"Records"=>{"Record"=>[{"Field"=>[{"id"=>"ImageFilename", "Value"=>"00000022000000"},
{"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>"3"}, {"id"=>"Withdrew"}]},
{"Field"=>["", {"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>"3"}, ""]}]}}
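Not being well versed in XML, I assume that this style of XML, with its type and length attributes, is trying to convert the values to data types. In that case I can understand why the "Withdrew" field shows up as empty, but I don't understand why "ImageFilename" is empty: it is a 14-character string.
I have worked around the problem with gsub, but is this invalid XML? Would adding a DTD (which TeleForm is supposed to provide) give me different results?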
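One quick way to check whether the XML itself is at fault is to run the snippet through a different parser. The following is a minimal sketch, assuming Ruby's bundled REXML library (which is not one of the parsers tried above) and a one-record copy of the snippet. REXML ignores the type and length hints, so if every field comes back the document is at least well-formed and the surprise lies in how the hash conversion interprets those attributes.

```ruby
require "rexml/document"

# One-record copy of the snippet from the question.
xml = <<~XML
  <?xml version="1.0" encoding="ISO-8859-1"?>
  <Records>
  <Record>
  <Field id="ImageFilename" type="string" length="14"><Value>00000022000000</Value></Field>
  <Field id="Criterion_1" type="number" length="2"><Value>3</Value></Field>
  <Field id="Withdrew" type="string" length="1"></Field>
  </Record>
  </Records>
XML

# REXML raises REXML::ParseException on malformed input, so getting
# past this line means the document is well-formed.
doc = REXML::Document.new(xml)

# Build one { id => value } hash per Record; a Field with no <Value>
# child is stored as nil.
records = doc.get_elements("//Record").map do |record|
  record.get_elements("Field").each_with_object({}) do |field, row|
    value = field.elements["Value"]
    row[field.attributes["id"]] = value ? value.text : nil
  end
end

p records
```

Note that well-formed is not the same as valid: validity can only be judged against the DTD, which this sketch does not use.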
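EDIT
I'm going to offer a possible answer to my own question and include some code as an edit. The code follows some of the points in an answer I received from Mark Thomas, but I decided not to use Nokogiri for the following reasons: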
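- The XML is consistent and always contains the same tags (/Records/Record/Field) and attributes
- There could be several hundred records in each XML file, and Nokogiri seemed a little slow with only 26 records
- I figured out how to get Hash.from_xml to give me what I expect (it does not like type="string"), and I only use the hash to populate a class
An expanded version of the XML, with one complete record: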
<?xml version="1.0" encoding="ISO-8859-1"?>
<Records>
<Record>
<Field id="ImageFilename" type="string" length="14"><Value>00000022000000</Value></Field>
<Field id="DocID" type="string" length="15"><Value>731192AIINSC</Value></Field>
<Field id="FormID" type="string" length="6"><Value>AIINSC</Value></Field>
<Field id="Availability" type="string" length="18"><Value>M T W H F S</Value></Field>
<Field id="Criterion_1" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_2" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_3" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_4" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_5" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_6" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_7" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_8" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_9" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_10" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_11" type="number" length="2"><Value>0</Value></Field>
<Field id="Criterion_12" type="number" length="2"><Value>0</Value></Field>
<Field id="Criterion_13" type="number" length="2"><Value>0</Value></Field>
<Field id="Criterion_14" type="number" length="2"><Value>0</Value></Field>
<Field id="Criterion_15" type="number" length="2"><Value>0</Value></Field>
<Field id="DayTraining" type="string" length="1"><Value>Y</Value></Field>
<Field id="SaturdayTraining" type="string" length="1"></Field>
<Field id="CitizenStageID" type="string" length="12"><Value>731192</Value></Field>
<Field id="NoShow" type="string" length="1"></Field>
<Field id="NightTraining" type="string" length="1"></Field>
<Field id="Withdrew" type="string" length="1"></Field>
<Field id="JobStageID" type="string" length="12"><Value>2292</Value></Field>
<Field id="DirectHire" type="string" length="1"></Field>
</Record>
</Records>
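I'm just prototyping a workflow to replace an aging system written in 4D and Active4D. This area, processing the TeleForm data, was implemented there as a batch operation, and it may still go back to that. I'm just trying to merge some of the old concepts that worked into a new Rails implementation. The XML files reside on a shared server and will probably have to be moved into the web root, with some trigger set up to process them.
I'm still in the definition stage, but my module/classes for handling the interview form look like this and are subject to change (there is almost no error trapping, I'm still trying to get into testing, and after about 5 years of using Rails my Ruby is not where it should be!):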
module Teleform::InterviewForm
  class Form < Prawn::Document
    # Not relevant to this question, but this class generates the forms from a fillable PDF template and
    # the relevant model data.
    # These forms, once completed, are what TeleForm processes to produce the XML.
  end

  class RateForms
    attr_accessor :records, :results

    def initialize(xml_path)
      fields = []
      xml = File.read(xml_path)
      # Hash.from_xml does not like type="string": it appears to treat it as a typecast hint and
      # returns "" for a Field whose text sits in a child <Value>, so rename the type before parsing.
      hash = Hash.from_xml(xml.gsub(/type="string"/, 'type="text"'))
      # Array.wrap keeps this working when the file holds a single Record (a Hash instead of an Array).
      Array.wrap(hash["Records"]["Record"]).each do |record|
        # extract the fields from each record
        fields << record["Field"]
      end
      @records = []
      fields.each do |field|
        # build the records for the form
        @records << Record.new(field)
      end
      @results = rate_records
    end

    def rate_records
      # Not relevant to the question, but this is where the data is processed and a bunch of stuff takes place.
      return "Any errors"
    end
  end

  class Record
    attr_accessor(*[:image_filename, :doc_id, :form_id, :availability, :criterion_1, :criterion_2,
      :criterion_3, :criterion_4, :criterion_5, :criterion_6, :criterion_7, :criterion_8,
      :criterion_9, :criterion_10, :criterion_11, :criterion_12, :criterion_13, :criterion_14, :criterion_15,
      :day_training, :saturday_training, :citizen_stage_id, :no_show, :night_training, :withdrew, :job_stage_id, :direct_hire])

    def initialize(fields)
      fields.each do |field|
        # "ImageFilename" becomes "image_filename="; try skips ids that have no matching accessor
        if field["type"] == "number"
          try("#{field["id"].underscore}=", field["Value"].to_i)
        else
          try("#{field["id"].underscore}=", field["Value"])
        end
      end
    end
  end
end
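The populate step in Record#initialize can be tried outside Rails. Below is a minimal sketch under a few assumptions: the underscore helper is a hypothetical stand-in for ActiveSupport's String#underscore, SketchRecord is an illustrative three-field struct rather than the full Record class, and the field hashes are hand-written in the shape the question shows for fields that parse correctly.

```ruby
# Hypothetical stand-in for ActiveSupport's String#underscore, so the sketch
# runs without Rails: "DocID" -> "doc_id", "Criterion_1" -> "criterion_1".
def underscore(name)
  name.gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2')
      .gsub(/([a-z\d])([A-Z])/, '\1_\2')
      .downcase
end

# Illustrative subset of the Record attributes.
SketchRecord = Struct.new(:image_filename, :criterion_1, :withdrew)

def build_record(fields)
  record = SketchRecord.new
  fields.each do |field|
    setter = "#{underscore(field["id"])}="
    next unless record.respond_to?(setter) # skip ids the struct does not know
    value = field["Value"]
    value = value.to_i if field["type"] == "number"
    record.public_send(setter, value)
  end
  record
end

# Hand-written field hashes (an assumption about the parsed shape).
fields = [
  { "id" => "ImageFilename", "type" => "text", "length" => "14", "Value" => "00000022000000" },
  { "id" => "Criterion_1", "type" => "number", "length" => "2", "Value" => "3" },
  { "id" => "Withdrew", "type" => "text", "length" => "1" }
]

record = build_record(fields)
p record
```

Using respond_to? with public_send does the same job as try in the Rails version: unknown field ids are ignored instead of raising NoMethodError.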
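Thanks for adding the extra information that this is a rating of interviewees. Using this domain information in your code will likely improve it. You haven't posted any code, but in general, using domain objects results in more concise and more readable code. I suggest creating a simple class that represents a Rating, rather than transforming the data from XML into a data structure.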
class Rating
  attr_accessor :image_filename, :criterion_1, :withdrew
end
Using the class above, here is one way to extract the fields from the XML with Nokogiri.
require 'nokogiri'

doc = Nokogiri::XML(xml)
ratings = []
doc.xpath('//Record').each do |record|
  rating = Rating.new
  # at_xpath returns nil when a Field has no <Value> child; nil.to_s is ""
  rating.image_filename = record.at_xpath('Field[@id="ImageFilename"]/Value/text()').to_s
  rating.criterion_1 = record.at_xpath('Field[@id="Criterion_1"]/Value/text()').to_s
  rating.withdrew = record.at_xpath('Field[@id="Withdrew"]/Value/text()').to_s
  ratings << rating
end
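Now ratings is a list of Rating objects, each with methods to retrieve the data. This is a lot cleaner than delving into a deep data structure. You could even improve the Rating class further, for example by creating a withdrew? method that returns true or false.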
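A withdrew? predicate might look like the sketch below. It assumes that any non-blank value means the box was marked on the form; the "Y" used here is borrowed from the DayTraining field in the sample data, and the encoding TeleForm actually uses for Withdrew is not shown in the question.

```ruby
class Rating
  attr_accessor :image_filename, :criterion_1, :withdrew

  # Assumption: any non-blank value (for example "Y") means the box was marked.
  def withdrew?
    !withdrew.to_s.strip.empty?
  end
end

marked = Rating.new
marked.withdrew = "Y"

blank = Rating.new
blank.withdrew = ""

puts marked.withdrew?
puts blank.withdrew?
```

Calling to_s first means the predicate also returns false for a Rating whose withdrew was never set.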
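Compared with the unreliable and inconsistent Hash.from_xml implementation, XmlSimple (by maik) seems better suited to this task.
A port of the tried and tested Perl module of the same name, it has several notable advantages.
- It is consistent whether a node occurs once or many times
- It does not choke and garble the results
- It can distinguish between attributes and node content
Running the same XML document from above through the parser: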
XmlSimple.xml_in xml
produces the following result.
{"Record"=>
[{"Field"=>
[{"id"=>"ImageFilename", "type"=>"string", "length"=>"14", "Value"=>["00000022000000"]},
{"id"=>"DocID", "type"=>"string", "length"=>"15", "Value"=>["731192AIINSC"]},
{"id"=>"FormID", "type"=>"string", "length"=>"6", "Value"=>["AIINSC"]},
{"id"=>"Availability", "type"=>"string", "length"=>"18", "Value"=>["M T W H F S"]},
{"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_2", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_3", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_4", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_5", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_6", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_7", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_8", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_9", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_10", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_11", "type"=>"number", "length"=>"2", "Value"=>["0"]},
{"id"=>"Criterion_12", "type"=>"number", "length"=>"2", "Value"=>["0"]},
{"id"=>"Criterion_13", "type"=>"number", "length"=>"2", "Value"=>["0"]},
{"id"=>"Criterion_14", "type"=>"number", "length"=>"2", "Value"=>["0"]},
{"id"=>"Criterion_15", "type"=>"number", "length"=>"2", "Value"=>["0"]},
{"id"=>"DayTraining", "type"=>"string", "length"=>"1", "Value"=>["Y"]},
{"id"=>"SaturdayTraining", "type"=>"string", "length"=>"1"},
{"id"=>"CitizenStageID", "type"=>"string", "length"=>"12", "Value"=>["731192"]},
{"id"=>"NoShow", "type"=>"string", "length"=>"1"},
{"id"=>"NightTraining", "type"=>"string", "length"=>"1"},
{"id"=>"Withdrew", "type"=>"string", "length"=>"1"},
{"id"=>"JobStageID", "type"=>"string", "length"=>"12", "Value"=>["2292"]},
{"id"=>"DirectHire", "type"=>"string", "length"=>"1"}]
}]
}
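Because the structure is this regular, flattening it into one hash per record takes only a few lines. The following is a sketch in plain Ruby, assuming a hand-copied three-field excerpt of the output above rather than a live XmlSimple call; the rows variable and the number conversion are illustrative choices.

```ruby
# Hand-copied excerpt of the XmlSimple output shown above.
parsed = {
  "Record" => [
    { "Field" => [
      { "id" => "ImageFilename", "type" => "string", "length" => "14", "Value" => ["00000022000000"] },
      { "id" => "Criterion_1", "type" => "number", "length" => "2", "Value" => ["3"] },
      { "id" => "Withdrew", "type" => "string", "length" => "1" }
    ] }
  ]
}

# One { id => value } hash per Record. "Value" is always an array when it is
# present, so take its first element; a Field with no Value becomes nil.
rows = parsed["Record"].map do |record|
  record["Field"].to_h do |field|
    value = field.fetch("Value", []).first
    value = value.to_i if value && field["type"] == "number"
    [field["id"], value]
  end
end

p rows
```

No special case is needed for a single Record or a single Field, since both arrive as arrays either way.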
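I'm considering tackling this problem and providing a working implementation of from_xml for Hash, and I would like to hear feedback from others who have come to the same conclusion. Surely we are not the only ones with these frustrations.
In the meantime, we can take comfort in knowing there is something lighter than Nokogiri, with its everything-and-the-kitchen-sink approach, for this task.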
nJoy!