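I just wrote my first Ruby program, a simple parser. I plan to use Ruby and Nokogiri to parse a set of about 200 local .htm files and write everything to a single .csv file.
The local files are organized as follows: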
rootregion_name1city_name1.htm
rootregion_name1city_name2.htm
rootregion_name1city_name3.htm
rootregion_name2city_name1.htm
...
The relevant HTML source in those .htm files looks like this:
<div class="media-body">
<h4 class="list-group-item-heading"><a ng-href="#/clubs/2001103" class="ng-binding" href="http://www.vereinssuche-nrw.de/#/clubs/2001103">DJK Arminia Eilendorf 1919 e. V.</a> <small ng-show="item.distance > 0" class="ng-binding" style="display: none;">0 km</small></h4>
<div class="row">
<div class="col-12 col-lg-6 ng-binding">
<span ng-show="item.geoadresse.strasse" class="ng-binding">Ulmenstraße 12<br></span>52080 Aachen<br>
<a ng-href="tel:0241 551424" ng-show="item.telefon" class="ng-binding" href="unsafe:tel:0241 551424">Tel.: 0241 551424<br></a>
<a ng-href="http://www.DJK-Arminia-Eilendorf.de" ng-show="item.webseite" target="_blank" class="ng-binding" href="http://www.djk-arminia-eilendorf.de/">http://www.DJK-Arminia-Eilendorf.de</a>
</div>
<div class="col-lg-6 col-12 visible-lg event-list">
<b>Veranstaltungen</b>
<!-- ngRepeat: event in item.veranstaltungen | limitTo:3 -->
<div ng-show="item.veranstaltungen.length == 0" class="text-muted">Keine Veranstaltungen angekündigt.</div>
<div>
</div>
</div>
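My Ruby code works for a single .htm file and extracts the data I need via XPath. Rather than parsing each file by hand and then merging the output .csv files of all 200 .htm files, I'd like to automate the whole process, but I can't really figure out how to do that.
Here is my Ruby code: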
require 'rubygems'
require 'nokogiri'
require 'csv'
# define arrays including a dummy array which is needed for reasons i do not yet know :P
# remember that you can easily adapt this parser to suit your needs by defining additional variables
# and by adding additional xpath steps (doc.xpath...) below
name = Array.new
strasse = Array.new
plzort = Array.new
tel = Array.new
website = Array.new
dummy = Array.new
doc = Nokogiri::HTML(open("aachen.htm"))
puts doc.class # => Nokogiri::HTML::Document
# search elements via xpath and collect contents in arrays
name = doc.xpath("//div/h4/a").collect {|node| node.text.strip}
strasse = doc.xpath("//div/span[contains(@ng-show,'item.geoadresse.strasse')]").collect {|node| node.text.strip}
plzort = doc.xpath("//div[@id='searchResults']/div/div/div/div/div[1]/text()").collect {|node| node.text.strip}
tel = doc.xpath("//div/a[contains(@ng-show,'item.telefon')]").collect {|node| node.text.strip}
website = doc.xpath("//div/a[contains(@ng-show,'item.webseite')]").collect {|node| node.text.strip}
dummy = doc.xpath("//*[@id='searchResults']/div[39]/div/div/div/div[1]/br").collect {|node| node.text.strip}
plzort.delete("")
# generate CSV file output.csv and force UTF-8
CSV.open("output.csv", "wb:UTF-8") do |csv|
# prepopulate CSV file with column headings
csv << ["name", "strasse", "plzort", "tel", "website", "dummy"]
# repeat extraction process until name array returns nothing i.e. no more elements on page
until name.empty?
# write everything to CSV file
csv << [name.shift, strasse.shift, plzort.shift, tel.shift, website.shift, dummy.shift]
end
end
I've read through the Ruby and Nokogiri documentation, but alas, I can't figure out how to proceed.
Here's how I'd write parts of your code. This:
name = Array.new
strasse = Array.new
plzort = Array.new
tel = Array.new
website = Array.new
dummy = Array.new
can be written more cleanly, for example:
name = []
strasse = []
plzort = []
tel = []
website = []
dummy = []
However, there's no need to initialize variables in Ruby. Instead, assign to them directly:
name = doc.xpath("//div/h4/a").collect {|node| node.text.strip}
strasse = doc.xpath("//div/span[contains(@ng-show,'item.geoadresse.strasse')]").collect {|node| node.text.strip}
plzort = doc.xpath("//div[@id='searchResults']/div/div/div/div/div[1]/text()").collect {|node| node.text.strip}
tel = doc.xpath("//div/a[contains(@ng-show,'item.telefon')]").collect {|node| node.text.strip}
website = doc.xpath("//div/a[contains(@ng-show,'item.webseite')]").collect {|node| node.text.strip}
dummy = doc.xpath("//*[@id='searchResults']/div[39]/div/div/div/div[1]/br").collect {|node| node.text.strip}
would do, but it's inelegant and wasteful. Instead, use something like this:
name, strasse, plzort, tel, website, dummy = [
"//div/h4/a",
"//div/span[contains(@ng-show,'item.geoadresse.strasse')]",
"//div[@id='searchResults']/div/div/div/div/div[1]/text()",
"//div/a[contains(@ng-show,'item.telefon')]",
"//div/a[contains(@ng-show,'item.webseite')]",
"//*[@id='searchResults']/div[39]/div/div/div/div[1]/br"
].map { |s|
doc.xpath(s).collect {|node| node.text.strip}
}
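The XPaths become data in an array that is iterated over, with the same operation applied to each one. This makes the code easier to understand and maintain.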
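With one array per column, the rows for the CSV can also be built with Array#zip rather than by shifting every array inside a loop. A minimal sketch, using made-up sample values (the `columns` data is hypothetical and only stands in for what the XPath queries would return):

```ruby
# Hypothetical sample data standing in for the XPath results: one array per column.
columns = [
  ["Club A", "Club B"],           # name
  ["Ulmenstr. 12", "Hauptweg 3"], # strasse
  ["52080 Aachen"]                # plzort (one value short on purpose)
]

# zip pairs up the n-th element of every column;
# columns shorter than the first one are padded with nil.
first, *rest = columns
rows = first.zip(*rest)

rows.each { |row| p row }
```

Like the shift loop, this pads missing values with nil, so columns of unequal length still end up misaligned without any warning. It may be worth comparing the column lengths before writing.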
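Be careful with plzort.delete(""). It works in your script only because collect has already turned the NodeSet into an Array of strings. If plzort is assigned the raw result of doc.xpath, it is a NodeSet, which doesn't know how to delete(""):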
plzort = doc.xpath('//bar')
plzort.delete("") # =>
# ~> -:9:in `delete': node must be a Nokogiri::XML::Node or Nokogiri::XML::Namespace (ArgumentError)
# ~> from -:9:in `<main>'
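On a plain Array of strings, by contrast, delete("") is well defined. A small sketch with made-up sample values, assuming they look like what collect { |node| node.text.strip } would return:

```ruby
# Made-up sample values standing in for stripped node texts.
plzort = ["52080 Aachen", "", "52062 Aachen", ""]

# Array#delete removes every element equal to "" in place
# and returns the deleted value (or nil if nothing matched).
removed = plzort.delete("")

# A non-mutating alternative: reject returns a new array
# and leaves the receiver untouched.
original = ["a", "", "b"]
cleaned = original.reject(&:empty?)
```

Which of the two to prefer is mostly a matter of taste; reject reads more clearly when the original array is still needed afterwards.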
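The simplest approach is probably to move all the files into one directory. You can then use Dir.foreach to loop over the entries in that directory and change your current script slightly so that it appends its results to the output file.
Assuming your script already works for one file, once you are looping over all the files in the directory, replace the hardcoded filename with the iterator variable and change the output file's mode from "wb" (write) to "ab" (append):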
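Dir.foreach('rootregion_name1') do |file|
next unless file.end_with?('.htm') # Dir.foreach also yields "." and "..", so skip anything that is not a .htm file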
name = Array.new
strasse = Array.new
plzort = Array.new
tel = Array.new
website = Array.new
dummy = Array.new
doc = Nokogiri::HTML(File.open(File.join('rootregion_name1', file))) # Dir.foreach yields bare names, so prepend the directory instead of hardcoding a filename
puts doc.class # => Nokogiri::HTML::Document
# search elements via xpath and collect contents in arrays
name = doc.xpath("//div/h4/a").collect {|node| node.text.strip}
strasse = doc.xpath("//div/span[contains(@ng-show,'item.geoadresse.strasse')]").collect {|node| node.text.strip}
plzort = doc.xpath("//div[@id='searchResults']/div/div/div/div/div[1]/text()").collect {|node| node.text.strip}
tel = doc.xpath("//div/a[contains(@ng-show,'item.telefon')]").collect {|node| node.text.strip}
website = doc.xpath("//div/a[contains(@ng-show,'item.webseite')]").collect {|node| node.text.strip}
dummy = doc.xpath("//*[@id='searchResults']/div[39]/div/div/div/div[1]/br").collect {|node| node.text.strip}
plzort.delete("")
# generate CSV file output.csv and force UTF-8
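# write the column headings only while output.csv is still missing or empty
write_header = File.size?("output.csv").nil?
CSV.open("output.csv", "ab:UTF-8") do |csv| # "ab" appends to the output file instead of overwriting it
# prepopulate CSV file with column headings (first file only)
csv << ["name", "strasse", "plzort", "tel", "website", "dummy"] if write_header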
# repeat extraction process until name array returns nothing i.e. no more elements on page
until name.empty?
# write everything to CSV file
csv << [name.shift, strasse.shift, plzort.shift, tel.shift, website.shift, dummy.shift]
end
end
end
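An alternative that avoids append mode altogether is to open the CSV once, write the header once, and loop over the input files inside that block. A minimal sketch, assuming a hypothetical extract_rows helper that stands in for the Nokogiri parsing step, plus a shortened, made-up header:

```ruby
require 'csv'

# Shortened, hypothetical header; the real script would list all of its columns.
HEADER = ["name", "source_file"].freeze

# Hypothetical stand-in for the Nokogiri parsing step:
# returns an array of rows for one input file.
def extract_rows(path)
  [[File.basename(path, ".htm"), path]]
end

# Opens the output once in write mode, so the header appears exactly once
# and re-running the script does not pile rows onto an old output file.
def write_csv(output_path, input_paths)
  CSV.open(output_path, "wb:UTF-8") do |csv|
    csv << HEADER
    input_paths.each do |path|
      extract_rows(path).each { |row| csv << row }
    end
  end
end
```

The trade-off is that the whole run has to finish in one go; append mode is only really needed if the files are processed in separate runs.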
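If you have many directories and can't move all the .htm files into one place, the same logic applies, but you first have to loop over the parent directory and then over every .htm file in each subdirectory: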
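Dir.foreach("parent_directory") do |folder|
next if folder.start_with?(".") # skip ".", ".." and hidden entries
subdir = File.join("parent_directory", folder) # Dir.foreach yields bare names
next unless File.directory?(subdir)
Dir.foreach(subdir) do |file|
next unless file.end_with?(".htm")
# insert script here, opening File.join(subdir, file)
end
end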
The Dir class and the FileUtils module are very useful for working with files and folders.
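As an alternative to nested Dir.foreach calls, Dir.glob can collect every .htm file below a root directory in a single call. A minimal sketch; the htm_files helper name is an assumption made for illustration:

```ruby
# Returns the paths of all .htm files anywhere below root,
# sorted so that the processing order is stable.
# "**" matches directories recursively, so the nesting depth does not matter.
def htm_files(root)
  Dir.glob(File.join(root, "**", "*.htm")).sort
end

# Usage sketch: htm_files("parent_directory").each { |path| puts path }
```

Unlike Dir.foreach, the returned paths already include the root directory, and the pattern never matches the "." and ".." entries, so each path can be opened as-is.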