Parsing the HTML between tags with Nokogiri



My HTML file looks like this:

<a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico,&nbsp;VA<br />
<a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock,&nbsp;NC<br />
<a href='http://crossfitpentagon.com/' target='_blank'>CrossFit Pentagon</a> - Washington,&nbsp;DC<br />
<a href='http://crossfitwtbn.blogspot.com/' target='_blank'>CrossFit WTBN</a> - Quantico,&nbsp;VA<br />
<a href='http://cfnewriver.blogspot.com/' target='_blank'>CrossFit New River</a> - Jacksonville,&nbsp;NC<br />
<a href='http://xfitmiramar.com' target='_blank'>CrossFit Miramar</a> - San Diego,&nbsp;CA<br />
<a href='http://www.crossfitfortmeade.com/' target='_blank'>CrossFit Fort Meade</a> - Odenton,&nbsp;MD<br />

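I can extract the link text and the URL, but I also need the information between the closing </a> and the start of the next <a>, that is, whatever comes before the <br />. For example, from the first line I need to extract "Quantico,&nbsp;VA".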

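Here is the part of my code that extracts some of what I need. This is what I have so far (once I have the page object, I will loop over each line of the HTML source to extract all the data I need):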

page = Nokogiri::HTML(open("http://www.crossfit.com/cf-info/main_affil.htm")) 
if page.text != ""
    ## Get the URL and Name
    if page.css("a")[i] != nil
        name = page.css("a")[i].text
    else
        name = 'NA'
    end
    if page.css("a")[i] != nil
        url = page.css("a")[i]["href"]
    else
        url = 'NA'
    end
end

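Read through the XML::Node and XML::NodeSet documentation. The methods available there make it possible to navigate between nodes and extract them: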

require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico,&nbsp;VA<br />
<a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock,&nbsp;NC<br />
</body>
</html>
EOT
data = doc.search('a').map{ |link|
  link_href = link['href']
  link_text = link.text
  trailing_text = link.next_sibling.text
  {
    href: link_href,
    text: link_text,
    trailing_text: trailing_text
  }
}

data will contain:

data 
# => [{:href=>"http://crossfitquantico.blogspot.com/",
#      :text=>"CrossFit Quantico",
#      :trailing_text=>" - Quantico,\u00A0VA"},
#     {:href=>"http://www.crossfitcherrypoint.com",
#      :text=>"CrossFit Cherry Point",
#      :trailing_text=>" - Havelock,\u00A0NC"}]
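The trailing text still carries the leading " - " separator and the non-breaking space that &nbsp; decodes to. If a cleaner value is wanted, a minimal sketch in plain Ruby could look like the following. clean_trailing_text is a hypothetical helper name, not part of Nokogiri, and replacing the non-breaking space with an ordinary space is an assumption about the desired output:

```ruby
# Hypothetical helper (not part of Nokogiri): tidy the text that follows a link.
def clean_trailing_text(raw)
  raw.tr("\u00A0", ' ')      # turn non-breaking spaces into ordinary spaces
     .sub(/\A\s*-\s*/, '')   # drop the leading " - " separator
     .strip                  # trim any leftover whitespace
end

puts clean_trailing_text(" - Quantico,\u00A0VA")  # prints "Quantico, VA"
```

The tr call runs first because neither \s nor String#strip treats U+00A0 as whitespace.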

Don't do this:

page = Nokogiri::HTML(open("http://www.crossfit.com/cf-info/main_affil.htm")) 
if page.text != ""
    ## Get the URL and Name
    if page.css("a")[i] != nil
        name = page.css("a")[i].text
    else
        name = 'NA'
    end
    if page.css("a")[i] != nil
        url = page.css("a")[i]["href"]
    else
        url = 'NA'
    end
end

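if page.text != "" doesn't really tell you what you want to know, which is whether there are any links. Just search the document for them and you'll know.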

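Every call to page.css("a") searches the DOM for links all over again, which wastes CPU. Testing page.css("a")[i] != nil is wasteful too. If you iterate over a syntactically valid document that actually contains links, you will never hit a case where a link can't be found, because search and its related methods hand them to you.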

Here is a small adjustment to the code above so that it supplies the "NA" values:

require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <a href='http://crossfitquantico.blogspot.com/' target='_blank'>CrossFit Quantico</a> - Quantico,&nbsp;VA<br />
    <a href='http://www.crossfitcherrypoint.com' target='_blank'>CrossFit Cherry Point</a> - Havelock,&nbsp;NC<br />
    <a ></a>  
  </body>
</html>
EOT
doc.search('a').class # => Nokogiri::XML::NodeSet
doc.search('a').size # => 3
data = doc.search('a').map{ |link|
  link_href = link['href']
  link_text = link.text
  trailing_text = link.next_sibling.text
  {
    href: link_href || 'NA',
    text: link_text.empty? ? 'NA' : link_text,
    trailing_text: trailing_text
  }
}
data.size # => 3
data.class # => Array
data.first.class # => Hash
data 
# => [{:href=>"http://crossfitquantico.blogspot.com/",
#      :text=>"CrossFit Quantico",
#      :trailing_text=>" - Quantico,\u00A0VA"},
#     {:href=>"http://www.crossfitcherrypoint.com",
#      :text=>"CrossFit Cherry Point",
#      :trailing_text=>" - Havelock,\u00A0NC"},
#     {:href=>"NA", :text=>"NA", :trailing_text=>"  \n  "}]
