Getting HTML elements that have attributes



I'm trying to get the table containing the MMEL codes from this site, and I'm trying to do it with CSS selectors.

What I have so far is:

require_relative 'sources/Downloader'
require 'nokogiri'
html_content = Downloader.download_page('http://www.s-techent.com/ATA100.htm')
parsed_html = Nokogiri::HTML(html_content)
tmp = parsed_html.css("tr[*]")
puts tmp.text

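I get an error when I try to select these tr elements by attribute. How can I get this table in a simple form? I want to parse it into JSON. Ideally I would get it in parts and process each part in an .each block.


EDIT: I'd be happy if I could get things in blocks like this (see the page source):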

<TR><TD WIDTH="10%" VALIGN="TOP" ROWSPAN=5>
<B><FONT FACE="Arial" SIZE=2><P ALIGN="CENTER">11</B></FONT></TD>
<TD WIDTH="40%" VALIGN="TOP"  COLSPAN=2>
<B><FONT FACE="Arial" SIZE=2><P>PLACARDS AND MARKINGS</B></FONT></TD>
<TD WIDTH="50%" VALIGN="TOP">
<FONT FACE="Arial" SIZE=2><P ALIGN="LEFT">All procurable placards, labels, etc., shall be included in the illustrated Parts Catalog.  They shall be illustrated, showing the part number, Legend and Location.  The Maintenance Manual shall provide the approximate Location (i.e., FWD -UPPER -RH) and illustrate each placard, label, marking, self -illuminating sign, etc., required for safety information, maintenance significant information or by government regulations.  Those required by government regulations shall be so identified.</FONT></TD>
</TR>

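This should print all the TRs from the source, starting at line 96. The page has three tables, and table[1] contains all the text you need:

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where open-uri no longer patches Kernel#open
doc = Nokogiri::HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
doc.css("table")[1].css("tr").each do |row|
  puts row      # prints the row's HTML, including the <tr> tags
  puts row.text # prints only the text content
end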

For example:

puts doc.css("table")[1].css("tr")[2] 

prints the following:

<tr>
<td valign="TOP" colspan="3">
<b><font face="Arial" size="2"><p align="CENTER">GROUP DEFINITION - AIRCRAFT</p></font></b>
</td>
<td valign="TOP">
<font face="Arial" size="2"><p align="LEFT">The complete operational unit.  Includes dimensions and
areas, lifting and shoring,    leveling and weighing, towing and taxiing, parking and mooring, requi
red placards, servicing.</p></font>
</td>
</tr>

You can also select the table rows with XPath:

Here is the content of the first table on the web page given in the OP's post:

require 'nokogiri'
require 'open-uri'
doc = Nokogiri.HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
doc.xpath('(//table)[1]/tr').each do |tr|
  puts tr.to_html(:encoding => 'utf-8')
end

Output:

  <tr>
  <td width="33%" valign="MIDDLE" colspan="2">
  <p><img src="S-Tech-Logo-Blue2.gif" width="274" height="127"></p>
  </td>
  <td width="67%" valign="MIDDLE">
  <b><i><font face="Arial" color="#0000ff">
  <p align="CENTER"><big>AIRCRAFT PARTS MANUFACTURING ASSISTANCE (PMA)</big><br><big>DAR SERVICES</big></p></font></i></b>
  </td>
  </tr>

Now, if you want to collect the rows of the last table, do the following:

require 'nokogiri'
require 'open-uri'
doc = Nokogiri.HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
p doc.xpath('(//table)[3]/tr').to_a.size # => 1
doc.xpath('(//table)[3]/tr').each do |tr|
  puts tr.to_html(:encoding => 'utf-8')
end

Output:

<tr>
<td width="40%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2" color="#0000ff">149 AZALEA CIRCLE • LIMERICK, PA 19468-1330</font></b></p>
</td>
<td width="30%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2" color="#0000ff">610-495-6898 (Office) • 484-680-0507 (Cell)</font></b></p>
</td>
<td width="110%" valign="TOP" height="10">
<p align="CENTER"><a href="Contact.htm"><b><font face="Arial" size="2">E-mail S-Tech</font></b></a></p>
</td>
</tr>
