How to scrape data while ignoring an embedded tag


<div class="seperate">
    <h2>Public info</h2>
    <p>
        <strong>Property type:</strong> Semi-detached house |
        <strong>Tenure:</strong> Leasehold |
        <strong>Last sale:</strong> £71,000 | <strong>Sale date:</strong> 5th Dec 2007 - <a href="" class="toggle_sold_prices">Previous sales</a>
        <span id="sold-prices" class="none">
                        <br>
                            <strong>Property type:</strong>
                            Semi-detached house | 
                            <strong>Tenure:</strong>
                            Leasehold | 
                        <strong>Previous sale:</strong> £75,000 | 
                        <strong>Sale date:</strong> 
     3rd Oct 2006
                        <br>
                            <strong>Property type:</strong>
                            Semi-detached house | 
                            <strong>Tenure:</strong>
                            Leasehold | 
                        <strong>Previous sale:</strong> £36,000 | 
                        <strong>Sale date:</strong> 
    26th Sep 2002
                        <br>
                            <strong>Property type:</strong>
                            Semi-detached house | 
                            <strong>Tenure:</strong>
                            Leasehold | 
                        <strong>Previous sale:</strong> £39,950 | 
                        <strong>Sale date:</strong> 
    27th Jan 1995
                            <span class="new-build">New build</span>
        </span>
        | <a href="/for-sale/details/42175871"><i class="icon icon-home nolink"></i>Currently for sale</a>
    </p>
</div>

I'm trying to scrape the "Last sale", "Sale date", and "Currently for sale" values, but exclude everything inside

<span id="sold-prices" class="none">

I know I can do

html.search(".//div[@class='seperate']")

to get the HTML inside that div, but I don't know how to scrape the data for just the tags I want. Any ideas?

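Once Nokogiri has parsed the HTML, finding and manipulating nodes is really easy. Sometimes that means selectively removing nodes to simplify the DOM. This is one of those times: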

require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="seperate">
  <p>
    <strong>Property type:</strong> Semi-detached house |
    <strong>Tenure:</strong> Leasehold |
    <strong>Last sale:</strong> £71,000 | <strong>Sale date:</strong> 5th Dec 2007 - <a href="" class="toggle_sold_prices">Previous sales</a>
    <span id="sold-prices" class="none">
      <br>
          <strong>Property type:</strong>
          Semi-detached house | 
          <strong>Tenure:</strong>
          Leasehold | 
    </span>
  </p>
</div>
EOT
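# Remove the hidden "Previous sales" block so its <strong> tags are not matched.
doc.at('#sold-prices').remove
# Pair each remaining <strong> label with the text node that follows it.
data = doc.search('strong').map { |strong|
  [strong.text, strong.next_sibling.text.tr('|', '').strip]
}.to_h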
data # => {"Property type:"=>"Semi-detached house", "Tenure:"=>"Leasehold", "Last sale:"=>"£71,000", "Sale date:"=>"5th Dec 2007 -"}

The trick is:

doc.at('#sold-prices').remove

That gets rid of the forest so you can see the trees you want.

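The resulting data still needs a little more cleanup, but the rest of the code should be self-explanatory, so adapting it should be easy for you.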
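
As one possible way to do that cleanup, the sketch below starts from the hash shown in the example, drops the trailing colon from each label, and strips the dangling " -" left in front of the "Previous sales" link. It then converts the price and date to typed values. The helper name `clean` and the choice of Integer and Date are illustrative assumptions, not part of the original answer:

```ruby
require 'date'

# Raw hash as produced by the example above.
data = {
  "Property type:" => "Semi-detached house",
  "Tenure:"        => "Leasehold",
  "Last sale:"     => "£71,000",
  "Sale date:"     => "5th Dec 2007 -"
}

# Hypothetical helper: remove the trailing ":" from labels
# and a trailing " -" from values.
def clean(data)
  data.to_h do |label, value|
    [label.delete_suffix(':'), value.sub(/\s*-\z/, '').strip]
  end
end

cleaned = clean(data)

# "£71,000" -> 71000: keep only the digits, then convert.
price = cleaned["Last sale"].delete('^0-9').to_i

# "5th Dec 2007" -> Date: strip the ordinal suffix before parsing.
day_month_year = cleaned["Sale date"].sub(/(\d+)(st|nd|rd|th)/, '\1')
sale_date = Date.strptime(day_month_year, '%d %b %Y')
```

Keeping the string cleanup separate from the type conversion makes it easier to adjust either step if other pages format prices or dates differently.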
