<div class="seperate">
<h2>Public info</h2>
<p>
<strong>Property type:</strong> Semi-detached house |
<strong>Tenure:</strong> Leasehold |
<strong>Last sale:</strong> £71,000 | <strong>Sale date:</strong> 5th Dec 2007 - <a href="" class="toggle_sold_prices">Previous sales</a>
<span id="sold-prices" class="none">
<br>
<strong>Property type:</strong>
Semi-detached house |
<strong>Tenure:</strong>
Leasehold |
<strong>Previous sale:</strong> £75,000 |
<strong>Sale date:</strong>
3rd Oct 2006
<br>
<strong>Property type:</strong>
Semi-detached house |
<strong>Tenure:</strong>
Leasehold |
<strong>Previous sale:</strong> £36,000 |
<strong>Sale date:</strong>
26th Sep 2002
<br>
<strong>Property type:</strong>
Semi-detached house |
<strong>Tenure:</strong>
Leasehold |
<strong>Previous sale:</strong> £39,950 |
<strong>Sale date:</strong>
27th Jan 1995
<span class="new-build">New build</span>
</span>
| <a href="/for-sale/details/42175871"><i class="icon icon-home nolink"></i>Currently for sale</a>
</p>
</div>
I'm trying to scrape the "Last sale", "Sale date" and "Currently for sale" values, while skipping everything inside
<span id="sold-prices" class="none">
I know I can use
html.search(".//div[@class='seperate']")
to get the HTML inside that div, but I don't know how to scrape the data for just the tags I want. Any ideas?
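Once Nokogiri has parsed the HTML, finding and manipulating nodes is really easy. Sometimes that means selectively removing nodes to simplify the DOM. This is one of those times: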
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="seperate">
<p>
<strong>Property type:</strong> Semi-detached house |
<strong>Tenure:</strong> Leasehold |
<strong>Last sale:</strong> £71,000 | <strong>Sale date:</strong> 5th Dec 2007 - <a href="" class="toggle_sold_prices">Previous sales</a>
<span id="sold-prices" class="none">
<br>
<strong>Property type:</strong>
Semi-detached house |
<strong>Tenure:</strong>
Leasehold |
</span>
</p>
</div>
EOT
doc.at('#sold-prices').remove
data = doc.search('strong').map{ |strong|
[strong.text, strong.next_sibling.text.tr('|', '').strip]
}.to_h
data # => {"Property type:"=>"Semi-detached house", "Tenure:"=>"Leasehold", "Last sale:"=>"£71,000", "Sale date:"=>"5th Dec 2007 -"}
The trick is:
doc.at('#sold-prices').remove
That clears away the forest so you can see the trees you want.
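The resulting `data` hash still needs a little cleanup (note the trailing "-" on the sale date), but the rest of the code should be self-explanatory, so adapting it should be easy.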
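One possible way to do that cleanup, using only the standard library, is sketched below. The helper name `clean_sale_data` is hypothetical, and the code assumes prices are whole pounds and dates always look like "5th Dec 2007":

```ruby
require 'date'

# Hash in the shape produced by the Nokogiri snippet above.
data = {
  'Property type:' => 'Semi-detached house',
  'Tenure:'        => 'Leasehold',
  'Last sale:'     => '£71,000',
  'Sale date:'     => '5th Dec 2007 -'
}

# Hypothetical helper: tidies labels and converts values to Ruby types.
def clean_sale_data(data)
  # Drop the trailing colon from labels and the trailing " -" from values.
  cleaned = data.map { |label, value|
    [label.sub(/:\z/, ''), value.sub(/\s*-\z/, '')]
  }.to_h

  # Assumption: whole pounds, so keeping only the digits is safe.
  cleaned['Last sale'] = cleaned['Last sale'].delete('^0-9').to_i

  # Strip the ordinal suffix ("5th" becomes "5") before parsing the date.
  plain_date = cleaned['Sale date'].sub(/(\d+)(st|nd|rd|th)/, '\1')
  cleaned['Sale date'] = Date.strptime(plain_date, '%d %b %Y')

  cleaned
end

cleaned = clean_sale_data(data)
```

After this, `cleaned['Last sale']` is an Integer and `cleaned['Sale date']` is a `Date`, which makes sorting or comparing sales straightforward.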