How do I loop my Ruby/Nokogiri parser over multiple local HTML files and write the results to a single CSV file?



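I have just written my first Ruby program, a simple parser. I plan to use Ruby and Nokogiri to parse a set of roughly 200 local .htm files and write everything to a single .csv file.

The local files are organized as follows: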

root/region_name1/city_name1.htm
root/region_name1/city_name2.htm
root/region_name1/city_name3.htm
root/region_name2/city_name1.htm
...

The relevant HTML source in these .htm files looks like this:

<div class="media-body">
    <h4 class="list-group-item-heading"><a ng-href="#/clubs/2001103" class="ng-binding" href="http://www.vereinssuche-nrw.de/#/clubs/2001103">DJK Arminia Eilendorf 1919 e. V.</a> <small ng-show="item.distance > 0" class="ng-binding" style="display: none;">0 km</small></h4>
        <div class="row">
            <div class="col-12 col-lg-6 ng-binding">
                <span ng-show="item.geoadresse.strasse" class="ng-binding">Ulmenstraße 12<br></span>52080 Aachen<br>
                <a ng-href="tel:0241 551424" ng-show="item.telefon" class="ng-binding" href="unsafe:tel:0241 551424">Tel.: 0241 551424<br></a>
                <a ng-href="http://www.DJK-Arminia-Eilendorf.de" ng-show="item.webseite" target="_blank" class="ng-binding" href="http://www.djk-arminia-eilendorf.de/">http://www.DJK-Arminia-Eilendorf.de</a>
            </div>
                <div class="col-lg-6 col-12 visible-lg event-list">
                    <b>Veranstaltungen</b>
                    <!-- ngRepeat: event in item.veranstaltungen | limitTo:3 -->
                <div ng-show="item.veranstaltungen.length == 0" class="text-muted">Keine Veranstaltungen angekündigt.</div>
            </div>
        </div>
</div>

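My Ruby code works for a single .htm file and extracts the data I need via XPath. I would like to automate the whole process rather than parsing each file individually and then merging the output .csv files of all 200 .htm files by hand, but I can't figure out how to do this.

Here is my Ruby code: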

require 'rubygems'
require 'nokogiri'
require 'csv'
# define arrays including a dummy array which is needed for reasons i do not yet know :P
# remember that you can easily adapt this parser to suit your needs by defining additional variables
# and by adding additional xpath steps (doc.xpath...) below
name = Array.new
strasse = Array.new
plzort = Array.new
tel = Array.new
website = Array.new
dummy = Array.new
doc = Nokogiri::HTML(open("aachen.htm"))
puts doc.class   # => Nokogiri::HTML::Document
# search elements via xpath and collect contents in arrays
name = doc.xpath("//div/h4/a").collect {|node| node.text.strip}
strasse = doc.xpath("//div/span[contains(@ng-show,'item.geoadresse.strasse')]").collect {|node| node.text.strip}
plzort = doc.xpath("//div[@id='searchResults']/div/div/div/div/div[1]/text()").collect {|node| node.text.strip}
tel = doc.xpath("//div/a[contains(@ng-show,'item.telefon')]").collect {|node| node.text.strip}
website = doc.xpath("//div/a[contains(@ng-show,'item.webseite')]").collect {|node| node.text.strip}
dummy = doc.xpath("//*[@id='searchResults']/div[39]/div/div/div/div[1]/br").collect {|node| node.text.strip}
plzort.delete("")
# generate CSV file output.csv and force UTF-8
CSV.open("output.csv", "wb:UTF-8") do |csv|
    # prepopulate CSV file with column headings
    csv << ["name", "strasse", "plzort", "tel", "website", "dummy"]
    # repeat extraction process until name array returns nothing i.e. no more elements on page
    until name.empty?
        # write everything to CSV file
        csv << [name.shift, strasse.shift, plzort.shift, tel.shift, website.shift, dummy.shift]
  end
end

I have read through the Ruby and Nokogiri documentation, but alas, I don't know how to proceed.

Here is how I would write parts of your code. This:

name = Array.new
strasse = Array.new
plzort = Array.new
tel = Array.new
website = Array.new
dummy = Array.new

can be written more cleanly as:

name = []
strasse = []
plzort = []
tel = []
website = []
dummy = []

However, Ruby does not require you to initialize variables before assigning to them. Assigning to them directly...

name = doc.xpath("//div/h4/a").collect {|node| node.text.strip}
strasse = doc.xpath("//div/span[contains(@ng-show,'item.geoadresse.strasse')]").collect {|node| node.text.strip}
plzort = doc.xpath("//div[@id='searchResults']/div/div/div/div/div[1]/text()").collect {|node| node.text.strip}
tel = doc.xpath("//div/a[contains(@ng-show,'item.telefon')]").collect {|node| node.text.strip}
website = doc.xpath("//div/a[contains(@ng-show,'item.webseite')]").collect {|node| node.text.strip}
dummy = doc.xpath("//*[@id='searchResults']/div[39]/div/div/div/div[1]/br").collect {|node| node.text.strip}

would do it, but that is inelegant and repetitive. Instead, use something like this:

name, strasse, plzort, tel, website, dummy = [
  "//div/h4/a",
  "//div/span[contains(@ng-show,'item.geoadresse.strasse')]",
  "//div[@id='searchResults']/div/div/div/div/div[1]/text()",
  "//div/a[contains(@ng-show,'item.telefon')]",
  "//div/a[contains(@ng-show,'item.webseite')]",
  "//*[@id='searchResults']/div[39]/div/div/div/div[1]/br"
].map { |s|
  doc.xpath(s).collect {|node| node.text.strip}
}

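The XPath expressions become data in an array that is iterated over, with the same operation applied to each one. That makes the code easier to understand and maintain.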

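plzort.delete("")

only works here because collect returns a plain Array of strings, from which delete("") removes the empty entries. If you ever assign the result of doc.xpath directly, plzort will be a NodeSet, which doesn't know how to delete(""):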

plzort = doc.xpath('//bar')
plzort.delete("") # => 
# ~> -:9:in `delete': node must be a Nokogiri::XML::Node or Nokogiri::XML::Namespace (ArgumentError)
# ~>  from -:9:in `<main>'
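On a related note, the until name.empty? loop in the question empties the arrays with shift as it writes. A non-destructive alternative is to pair the columns up with Array#zip. This is a minimal sketch with invented sample data; rows_from_columns is a hypothetical helper name. Like the original loop, it assumes the columns line up one entry per club: if a page lacks an element for one club in the middle, the later entries of that column shift against the others.

```ruby
require 'csv'

# Hypothetical helper: turns parallel column arrays into row arrays.
# The first column drives the row count; shorter columns are padded with nil.
def rows_from_columns(first, *rest)
  first.zip(*rest)
end

# Sample data standing in for the arrays produced by the XPath queries.
name    = ['Club A', 'Club B']
strasse = ['Ulmenstraße 12', 'Hauptstraße 1']
tel     = ['Tel.: 0241 551424'] # the second club has no phone entry

csv_text = CSV.generate do |csv|
  csv << %w[name strasse tel]
  rows_from_columns(name, strasse, tel).each { |row| csv << row }
end

puts csv_text
```

Inside CSV.open the same each call can write the rows straight to output.csv, and the source arrays stay intact for later use.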

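The simplest approach is probably to move all the files into one directory. You can then use Dir.foreach to loop over the entries in that directory and change your current script slightly so that it appends its results to the output file.

Assuming your script already works for one file, wrap it in a loop over the directory's entries, replace the hard-coded filename with the iterator variable, and change the output file's mode from "wb" (write) to "ab" (append):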

dir = 'root/region_name1'
Dir.foreach(dir) do |file|
   # Dir.foreach also yields "." and "..", so skip anything that is not an .htm file
   next unless file.end_with?('.htm')
   # Dir.foreach yields bare names, so join them with the directory.
   # Instead of hardcoding the filename, use the iterator variable.
   doc = File.open(File.join(dir, file)) { |f| Nokogiri::HTML(f) }
   # search elements via xpath and collect contents in arrays
   name = doc.xpath("//div/h4/a").collect {|node| node.text.strip}
   strasse = doc.xpath("//div/span[contains(@ng-show,'item.geoadresse.strasse')]").collect {|node| node.text.strip}
   plzort = doc.xpath("//div[@id='searchResults']/div/div/div/div/div[1]/text()").collect {|node| node.text.strip}
   tel = doc.xpath("//div/a[contains(@ng-show,'item.telefon')]").collect {|node| node.text.strip}
   website = doc.xpath("//div/a[contains(@ng-show,'item.webseite')]").collect {|node| node.text.strip}
   dummy = doc.xpath("//*[@id='searchResults']/div[39]/div/div/div/div[1]/br").collect {|node| node.text.strip}
   plzort.delete("")
   # write the column headings only if output.csv does not exist yet;
   # delete a stale output.csv before rerunning, or old rows will be kept
   write_header = !File.exist?("output.csv")
   # generate CSV file output.csv and force UTF-8;
   # "ab" appends to the output file instead of overwriting it
   CSV.open("output.csv", "ab:UTF-8") do |csv|
      csv << ["name", "strasse", "plzort", "tel", "website", "dummy"] if write_header
      # repeat until the name array is empty, i.e. no more elements on the page
      until name.empty?
         # write one row per club to the CSV file
         csv << [name.shift, strasse.shift, plzort.shift, tel.shift, website.shift, dummy.shift]
      end
   end
end

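If you have many directories and cannot move all the .htm files into one place, the same logic applies, but you first have to iterate over the parent directory and then over every .htm file in each subdirectory: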

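Dir.foreach("parent_directory") do |folder|
    next if folder.start_with?(".")              # skip ".", ".." and hidden entries
    path = File.join("parent_directory", folder) # Dir.foreach yields bare names
    next unless File.directory?(path)
    Dir.foreach(path) do |file|
       next unless file.end_with?(".htm")
       # insert script here, opening File.join(path, file)
    end
end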

The Dir class is very useful for iterating over files and folders, and the FileUtils module for moving or copying them.
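As an alternative to nested Dir.foreach calls, Dir.glob with a "**" pattern descends into every subdirectory in one call and returns usable paths. The sketch below opens the CSV file once in "wb" mode, so the heading row is written a single time and no append mode is needed. write_combined_csv is a hypothetical helper name, and the per-file extraction is left to a block so that the existing XPath code can be dropped in.

```ruby
require 'csv'

# Hypothetical helper: writes one CSV covering every .htm file below +root+.
# The block receives a file path and must return an array of rows
# (each row being an array of field values). Returns the number of files read.
def write_combined_csv(root, out_path, header)
  # "**" lets the glob match .htm files in root and in all of its subdirectories.
  paths = Dir.glob(File.join(root, '**', '*.htm')).sort
  CSV.open(out_path, 'wb:UTF-8') do |csv|
    csv << header # written exactly once, before any data rows
    paths.each do |path|
      yield(path).each { |row| csv << row }
    end
  end
  paths.size
end
```

The block passed to write_combined_csv would parse the file at the given path with Nokogiri, run the XPath queries, and return the resulting rows, for example by zipping the name, strasse, plzort, tel and website arrays together.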
