Need help finding the text of elements that have a class



I have a document obtained with the command page.css("table.vc_result span a"), and I cannot get at its second and third span elements:

Document

<table border="0" bgcolor="#FFFFFF" onmouseout="resDef(this)" onmouseover="resEmp(this)" class="vc_result">
<tbody>
  <tr>
    <td width="260" valign="top">
      <table>
        <tbody>
          <tr>
            <td width="40%" valign="top"><span><a class="cAddName" href="/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733">
            Gateway Megatech</a></span><br>
            <span class="cAddText">P.O. BOX 99682, Chicago IL 60696</span></td>
          </tr>
          <tr>
            <td><span class="cAddText">Cook County Illinois</span></td>
          </tr>
          <tr>
            <td><span class="cAddCategory">Yellow Page Advertising And Telephone
            Directory Publica Chicago</span></td>
          </tr>
        </tbody>
      </table>
    </td>
    <td width="260">
      <table align="center">
        <tbody>
          <tr>
            <td>
              <table>
                <tbody>
                  <tr>
                    <td>
                      <div style=
                      "background: url('images/listings.png');background-position: -0px -0px; width: 16px; height: 16px">
                      </div>
                    </td>
                    <td><font style="font-weight:bold">847-506-7800</font></td>
                  </tr>
                </tbody>
              </table>
            </td>
          </tr>
          <tr>
            <td>
              <table>
                <tbody>
                  <tr>
                    <td>
                      <div style=
                      "background: url('images/listings.png');background-position: -0px -78px; width: 16px; height: 16px">
                      </div>
                    </td>
                    <td><a href=
                    "/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733"
                    class="cAddNearby">Businesses near 60696</a></td>
                  </tr>
                </tbody>
              </table>
            </td>
          </tr>
          <tr>
            <td>
              <table>
                <tbody>
                  <tr>
                    <td></td>
                  </tr>
                </tbody>
              </table>
            </td>
          </tr>
        </tbody>
      </table>
    </td>
  </tr>
</tbody>
</table>

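This is not the complete document; it contains many more span entries.

The code I am using can locate the exact text, but it cannot associate that text with the text of the spans that follow the nested span > a element.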

require 'rubygems'
require 'nokogiri'
require 'open-uri'
name="yellow"
city="Chicago"
state="IL"
burl="http://www.sitename.com/"
url="#{burl}Business_Listings.php?name=#{name}&city=#{city}&state=#{state}&current=1&Submit=Search"
page = Nokogiri::HTML(URI.open(url))
rows = page.css("table.vc_result span a")
rows.each do |arow|
  if arow.text == "Gateway Megatech"
    puts(arow.next_element.text)
    puts("Capturing the next span text")
    found="Got it"
    break       
  else
    puts("Found nothing")
    found="None"
  end
end

Assuming each business is a new <tr> in the top-level table you provided, the following code gives you an array of hashes with the values:

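require 'nokogiri'
# `html` is assumed to hold the page source as a String
doc = Nokogiri.HTML(html)
business_rows = doc.css('table.vc_result > tbody > tr')
details = business_rows.map do |tr|
  # Inside the first <td> of the row, find a <td> with a.cAddName in it.
  # The leading dot in .//a keeps the test relative to that <td>; a bare //a
  # would match any a.cAddName anywhere in the document.
  business = tr.at_xpath('td[1]//td[.//a[@class="cAddName"]]')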
  name     = business.at_css('a.cAddName').text.strip
  address  = business.at_css('.cAddText').text.strip
  # Inside the second <td> of the row, find the first <font> tag
  phone    = tr.at_xpath('td[2]//font').text.strip
  # Return a hash of values for this row, using the capitalization requested
  { Name:name, Address:address, Phone:phone }
end
p details
#=> [
#=>   {
#=>     :Name=>"Gateway Megatech",
#=>     :Address=>"P.O. BOX 99682, Chicago IL 60696",
#=>     :Phone=>"847-506-7800"
#=>   }
#=> ]

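This is quite fragile, but it works for what you have provided, and there do not seem to be many semantic hooks to hang on to in this insane, horrible abuse of HTML.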

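Parsing HTML with regular expressions is a bad idea, because HTML is not a regular language. Ideally, you want to parse the DOM/XML into a tree structure.

http://nokogiri.org/ is very popular.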
