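I have a plain HTML document with no CSS classes or ids. I need to export some of its content to an Excel sheet. I tried Nokogiri, but it works on the basis of CSS selectors.
Has anyone done this?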
<html>
<head></head>
<body>
***NOTE***
<br>
Items
<br>
<br>
Invoice Number : [78945824] PO Number : [4587958]
<br>
Track It : <a href="abc.com"> 12345</a>
<br>
<br>
Items
<br>
<br>
Invoice Number : [79546828] PO Number : [4567892]
<br>
<br>
<br>
Items
<br>
<br>
Invoice Number : [78976824] PO Number : [897569]
<br>
Track It : <a href="abc.com"> 12345</a>
<br>
</body>
</html>
I am able to retrieve the PO numbers and the tracking numbers:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
PAGE_URL = "a.html"
page = Nokogiri::HTML(open(PAGE_URL))
data = page.css("body").text
po_numbers = data.scan(/Invoice Number : \[\d+\] PO Number : \[(\d+)\]/).flatten
tracking_numbers = page.css("a").text.split
[["PO Number", "Tracking Number"]].concat(po_numbers.zip(tracking_numbers))
puts po_numbers
puts tracking_numbers
=> po_numbers = ["4587958", "4567892", "4587958"]
=> tracking_numbers = ["12543", "12356"]
When we zip these together, we get:
=> po_numbers.zip(tracking_numbers)
=> [["4587958", "12543"], ["4567892", "12356"], ["4587958", "nil"]]
What we want is:
=> [["4587958", "12543"], ["4567892", "nil"], ["4587958", "12356"] ]
Try this:
data = page.css("body").text
# Remove spaces, then split the body text into lines
data = data.gsub(" ","").split(/\n/)
po=[]
track=[]
data.each do |i|
  if i.include? "PONumber"
    # First run of digits after "PONumber:"
    po << i.split("PONumber:").last.scan(/\d+/)[0]
  end
  if i.include? "TrackIt"
    track << i.split("TrackIt:").last
  end
end
po.zip(track)
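Collecting the two lists separately only lines up if every item has a "Track It" line. A minimal sketch of an alternative, assuming each record starts with a line that contains only `Items`, is to split the body text into one chunk per item and look for both values inside each chunk. The `text` heredoc below is a hand-written stand-in for what `page.css("body").text` returns for the sample HTML (so the sketch runs without Nokogiri), and the names `blocks` and `rows` are illustrative:

```ruby
# Stand-in for page.css("body").text on the sample HTML (assumption:
# the real text has the same lines, possibly with extra blank lines).
text = <<~TEXT
  ***NOTE***
  Items
  Invoice Number : [78945824] PO Number : [4587958]
  Track It :  12345
  Items
  Invoice Number : [79546828] PO Number : [4567892]
  Items
  Invoice Number : [78976824] PO Number : [897569]
  Track It :  12345
TEXT

# One chunk per "Items" line; drop whatever precedes the first item.
blocks = text.split(/^Items$/).drop(1)

# For each chunk, pull out the PO number and the tracking number.
# String#[] with a regex and a group index returns nil when nothing matches.
rows = blocks.map do |block|
  po       = block[/PO Number : \[(\d+)\]/, 1]
  tracking = block[/Track It :\s*(\d+)/, 1]
  [po, tracking]
end

p rows
```

Because each pair is built from a single chunk, an item without a tracking number gets `nil` in its own row instead of shifting the later tracking numbers up.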
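If you can scan all the invoice lines (po_numbers) with a regex, you can do the same for the tracking numbers (tracking_numbers):
tracking_numbers = data.scan(/Tracking no : (\d*)/).flatten
The returned array then includes an empty entry for each item without a tracking number, so you can iterate over both arrays by index to pair each PO number with its tracking number: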
po_numbers.each_with_index do |elm, index|
p "PO Number: #{elm}, Tracking Number: #{tracking_numbers[index]}"
end
This regex matches the updated HTML:
/Track It :\s*(?:<a href=".*">\s*(\d+)\s*<\/a>|$)/
It matches both an empty tracking number and one wrapped in a link.
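As a rough illustration of how that pattern behaves, the sketch below runs it with `String#scan` over raw HTML. The updated HTML is not shown in this thread, so the `html` string is a hypothetical sample in which every item has a "Track It :" line and the second one is empty; the tracking values are made up as well:

```ruby
# Hypothetical sample: the second item has "Track It :" with no link.
html = <<~HTML
  Invoice Number : [78945824] PO Number : [4587958]
  <br>
  Track It : <a href="abc.com"> 12345</a>
  <br>
  Invoice Number : [79546828] PO Number : [4567892]
  <br>
  Track It :
  <br>
  Invoice Number : [78976824] PO Number : [897569]
  <br>
  Track It : <a href="abc.com"> 12356</a>
HTML

# The capture group is nil when the "$" branch (empty tracking) matches.
tracking_numbers = html.scan(/Track It :\s*(?:<a href=".*">\s*(\d+)\s*<\/a>|$)/).flatten
po_numbers = html.scan(/PO Number : \[(\d+)\]/).flatten

# Both arrays now have one entry per item, so zip keeps them aligned.
pairs = po_numbers.zip(tracking_numbers)
p pairs
```

Note that this scans the raw HTML string rather than the text returned by Nokogiri, since the pattern depends on the `<a>` tag being present.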