Scraping tables with Nokogiri, need JSON output



So, I have a table with multiple rows and columns.

<table>
  <tr>
    <th>Employee Name</th>
    <th>Reg Hours</th>
    <th>OT Hours</th>
  </tr>
  <tr>
    <td>Employee 1</td>
    <td>10</td>
    <td>20</td>
  </tr>
  <tr>
    <td>Employee 2</td>
    <td>5</td>
    <td>10</td>
  </tr>
</table>

And another table:

<table>
  <tr>
    <th>Employee Name</th>
    <th>Revenue</th>
  </tr>
  <tr>
    <td>Employee 2</td>
    <td>$10</td>
  </tr>
  <tr>
    <td>Employee 1</td>
    <td>$50</td>
  </tr>
</table>

Note that the order of the employees may differ between the tables.

How can I use Nokogiri to create a JSON file with each employee as an object, together with their total hours and revenue?

Currently, I can only get individual table cells with some XPath. For example:

puts page.xpath(".//*[@id='UC255_tblSummary']/tbody/tr[2]/td[1]/text()").inner_text

Edit:

Using the page-object gem and the link from @Dave_McNulla, I tried this code to see what I would get:

class MyPage
  include PageObject
  table(:report, :id => 'UC255_tblSummary')
  def get_some_information
    report_element[1][2].text
  end
end
puts get_some_information

But nothing was returned.

Data: https://gist.github.com/anonymous/d8cc0524160d7d03d37b

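The hours table appears twice; the first copy is fine. The other table I need is the accessory revenue table. (I will also need the activations table, but I will try to add that myself based on the code that merges the hours and accessory revenue tables.)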

I think the general approach is:

  1. Create a hash for each table, keyed by employee
  2. Merge the results of the two tables
  3. Convert to JSON

Create a hash for each table, keyed by employee

This part can be done in either Watir or Nokogiri. Using Nokogiri only makes sense if Watir performs poorly because the tables are large.

Watir:

# I assume you would have a better way to identify the tables than by index
hours_table = browser.table(index: 0)
wage_table = browser.table(index: 1)
#Turn the tables into a hash
employee_hours = {}
hours_table.trs.drop(1).each do |tr| 
    tds = tr.tds
    employee_hours[ tds[0].text ] = {"Reg Hours" => tds[1].text, "OT Hours" => tds[2].text}     
end
#=> {"Employee 1"=>{"Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Reg Hours"=>"5", "OT Hours"=>"10"}}
employee_wage = {}
wage_table.trs.drop(1).each do |tr| 
    tds = tr.tds
    employee_wage[ tds[0].text ] = {"Revenue" => tds[1].text}   
end
#=> {"Employee 2"=>{"Revenue"=>"$10"}, "Employee 1"=>{"Revenue"=>"$50"}}

Nokogiri:

require 'nokogiri'

page = Nokogiri::HTML.parse(browser.html)
hours_table = page.search('table')[0]
wage_table = page.search('table')[1]
employee_hours = {}
hours_table.search('tr').drop(1).each do |tr| 
    tds = tr.search('td')
    employee_hours[ tds[0].text ] = {"Reg Hours" => tds[1].text, "OT Hours" => tds[2].text}     
end
#=> {"Employee 1"=>{"Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Reg Hours"=>"5", "OT Hours"=>"10"}}
employee_wage = {}
wage_table.search('tr').drop(1).each do |tr| 
    tds = tr.search('td')
    employee_wage[ tds[0].text ] = {"Revenue" => tds[1].text}   
end
#=> {"Employee 2"=>{"Revenue"=>"$10"}, "Employee 1"=>{"Revenue"=>"$50"}}

Merge the results of the two tables

You want to merge the two hashes together so that, for a given employee, the hash includes both their hours and their revenue.

employee = employee_hours.merge(employee_wage){ |key, old, new| new.merge(old) }
#=> {"Employee 1"=>{"Revenue"=>"$50", "Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Revenue"=>"$10", "Reg Hours"=>"5", "OT Hours"=>"10"}}
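To illustrate how the merge block behaves, here is a small self-contained sketch using literal hashes in place of the scraped ones. "Employee 3" is a made-up employee who appears only in the revenue table, added to show what happens when a name is missing from one table.

```ruby
# Sample data shaped like the hashes built from the tables.
# "Employee 3" is hypothetical and exists only in the revenue data.
employee_hours = {
  "Employee 1" => { "Reg Hours" => "10", "OT Hours" => "20" },
  "Employee 2" => { "Reg Hours" => "5",  "OT Hours" => "10" }
}
employee_wage = {
  "Employee 2" => { "Revenue" => "$10" },
  "Employee 1" => { "Revenue" => "$50" },
  "Employee 3" => { "Revenue" => "$5" }
}

# The block runs only for names present in both hashes and combines the
# two inner hashes. Names found in just one hash are copied unchanged.
employee = employee_hours.merge(employee_wage) do |_name, hours, revenue|
  hours.merge(revenue)
end

p employee["Employee 1"]
p employee["Employee 3"]
```

Because the lookup is by name, the differing row order between the two tables does not matter. An employee missing from one table simply ends up with fewer fields, which later code may need to tolerate.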

Convert to JSON

As covered in an earlier question, you can then convert the hash to JSON.

require 'json'
employee.to_json
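Since the question asks for one object per employee with their total hours, the merged hash could first be reshaped into an array of records before serializing. This is only a sketch: the key names name, total_hours and revenue are assumptions chosen for illustration, and the hours are assumed to be whole numbers stored as strings.

```ruby
require 'json'

# Merged data in the same shape as the result of the merge step.
employee = {
  "Employee 1" => { "Revenue" => "$50", "Reg Hours" => "10", "OT Hours" => "20" },
  "Employee 2" => { "Revenue" => "$10", "Reg Hours" => "5",  "OT Hours" => "10" }
}

# Build one record per employee. The output key names are illustrative.
records = employee.map do |name, fields|
  reg = fields["Reg Hours"].to_i # nil.to_i is 0, so a missing field counts as 0
  ot  = fields["OT Hours"].to_i
  {
    "name"        => name,
    "total_hours" => reg + ot,
    "revenue"     => fields["Revenue"]
  }
end

# pretty_generate returns an indented JSON string.
json = JSON.pretty_generate(records)
puts json
```

To save the result, something like File.write("employees.json", json) should work, where the file name is a placeholder.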
