ruby on rails - Parsing with Nokogiri: missing elements cause problems



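I have a plain HTML document with no CSS. I need to export some of its content to an Excel sheet. I tried Nokogiri, but it works on the basis of CSS selectors.

Has anyone tried this?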

<html>
 <head></head>
  <body>
    ***NOTE***
   <br>
      Items 
   <br>
   <br>
      Invoice Number : [78945824] PO Number : [4587958]
   <br>
       Track It : <a href="abc.com"> 12345</a>
   <br>
   <br>
      Items 
   <br>
   <br>
      Invoice Number : [79546828] PO Number : [4567892]
   <br>
   <br>
   <br>
      Items 
   <br>
   <br>
      Invoice Number : [78976824] PO Number : [897569]
   <br>
      Track It : <a href="abc.com"> 12345</a>
   <br>
   </body>
   </html>

I am able to retrieve the PO numbers and the tracking numbers:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

PAGE_URL = "a.html"
page = Nokogiri::HTML(open(PAGE_URL))

# Flatten the body to plain text and capture the number inside "PO Number : [...]"
data = page.css("body").text
po_numbers = data.scan(/Invoice Number : \[\d+\] PO Number : \[(\d+)\]/).flatten

# Concatenated text of every <a> link, split on whitespace
tracking_numbers = page.css("a").text.split

[["PO Number", "Tracking Number"]].concat(po_numbers.zip(tracking_numbers))
puts po_numbers
puts tracking_numbers

=> po_numbers = ["4587958", "4567892", "4587958"]
=> tracking_numbers = ["12543", "12356"]

When we zip these together, we get:

=> po_numbers.zip(tracking_numbers)
=> [["4587958", "12543"], ["4567892", "12356"], ["4587958", nil]]
What we want is:
=> [["4587958", "12543"], ["4567892", nil], ["4587958", "12356"]]

Try this:

data = page.css("body").text
# Strip spaces so the labels become "PONumber:" and "TrackIt:", then split into lines
data = data.gsub(" ", "").split(/\n/)
po = []
track = []
data.each do |line|
  if line.include? "PONumber"
    po << line.split("PONumber:").last.scan(/\d+/)[0]
  end
  if line.include? "TrackIt"
    track << line.split("TrackIt:").last
  end
end
po.zip(track)
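
When an item has no Track It line at all, as with the second item in the sample HTML, collecting PO numbers and tracking numbers into two separate arrays can still leave them misaligned. One possible variant is to split the text into per-item chunks first, so each PO number stays paired with its own tracking number. This is only a sketch: it assumes every item starts with a line containing just "Items", and it uses a shortened plain string as a stand-in for page.css("body").text.

```ruby
# Stand-in for page.css("body").text, shortened from the sample HTML (assumption).
data = <<~TEXT
  ***NOTE***
  Items
  Invoice Number : [78945824] PO Number : [4587958]
  Track It :  12345
  Items
  Invoice Number : [79546828] PO Number : [4567892]
  Items
  Invoice Number : [78976824] PO Number : [897569]
  Track It :  12345
TEXT

# Split on the "Items" marker lines and drop the preamble before the first item.
chunks = data.split(/^\s*Items\s*$/).drop(1)

rows = chunks.map do |chunk|
  po = chunk[/PO Number : \[(\d+)\]/, 1]
  tracking = chunk[/Track It :\s*(\d+)/, 1] # nil when the item has no Track It line
  [po, tracking]
end

p rows
```

Because each chunk is searched on its own, a missing tracking number shows up as nil in the row it belongs to instead of shifting the later values.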

If you can scan all the PO numbers (po_numbers) with a regex, you can do the same for the tracking numbers (tracking_numbers):

tracking_numbers = data.scan(/Tracking no : (\d*)/).flatten

The returned array includes nil entries, so you can iterate over both arrays to look up each PO number and its tracking number:

po_numbers.each_with_index do |elm, index| 
  p "PO Number: #{elm}, Tracking Number: #{tracking_numbers[index]}"
end

This regex matches the updated HTML:

/Track It :\s*(?:<a href=".*">\s*(\d+)\s*<\/a>|$)/

It matches both an empty tracking number and one that comes with a link.
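
To illustrate how this regex behaves, here is a small sketch run against a raw HTML string rather than a Nokogiri document. The sample markup is hypothetical: it assumes the updated HTML always contains a "Track It :" line for every item and leaves it empty when there is no tracking number. The constant name TRACK_RE is likewise just for illustration.

```ruby
# Hypothetical sample: every item has a "Track It :" line, the second one is empty.
html = <<~HTML
  Invoice Number : [78945824] PO Number : [4587958]
  <br>
  Track It : <a href="abc.com"> 12345</a>
  <br>
  Invoice Number : [79546828] PO Number : [4567892]
  <br>
  Track It :
  <br>
HTML

# Either a link containing digits, or nothing up to the end of the line.
TRACK_RE = /Track It :\s*(?:<a href=".*">\s*(\d+)\s*<\/a>|$)/

# scan yields one capture per "Track It :" line; the empty line gives nil.
tracking_numbers = html.scan(TRACK_RE).flatten
po_numbers = html.scan(/PO Number : \[(\d+)\]/).flatten

pairs = po_numbers.zip(tracking_numbers)
p pairs
```

Since the empty line still produces a match (with a nil capture), the two arrays have the same length and zip keeps each PO number next to its own tracking number.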
