How to parse a date using Nokogiri in Ruby

I am trying to parse this page and extract the date that follows

<p>From Date:

I get the error

Invalid predicate: //b[text() = '<p>From Date: ' (Nokogiri::XML::XPath::SyntaxError)

The path from "Inspect Element" is

/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p

Here is a sample of the code:

#/usr/bin/ruby
require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
noko.xpath("//b[text() = '<p>From Date: ").each do |b|
puts b.next_sibling.content.strip
end

This is file://china.html


    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        
        <title>File </title>
    
      </head>
      <body>
        
            <div id ="timelineItems">
    <H2 id="telegram1"> Title </H2>
            <p><table cellspacing="0">
    <tr>
    <td width="2%">&nbsp;</td>
    <td width="75%">
    <table cellspacing="0" cellpadding="0" class="resultsTypes">
    <tr>
    <td width="5%" class="hide">&nbsp;</td>
    <td width="70%">
    <p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
    <p>Title: <a href="http://www.bing.com" title=""><span class="bidi">Meeting in China</span></a></p>
    <p>recipient: David Ben Gurion</p>
    <p>sender: Prime Minister of Union of Burma, Rangoon</p>
    <p>  Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
    <p>From Date: 02/14/1936</p>
    <p>Link to file: <span class="bidi">תיק התכתבות  1956 ינואר</span></p>
    </td>
    </tr>
    <tr>
    <td colspan="2">
    </td>
    </tr>
    </table></td>
    <td class="actions">&nbsp;</td>
    </tr>
    </table>
    </p>
          </div>
          
    
    </body></html>
Amadan's answer, original.rb:
 
#/usr/bin/ruby
require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
date = noko.at_xpath("//p[starts-with(text(),'From Date: ')]").text()
puts date
formatted = date[/From Date: (.*)/, 1]
puts formatted
gives the error original.rb:5:in '<main>': undefined method 'text' for nil:NilClass (NoMethodError)


 
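You can't use
noko = Nokogiri::HTML('china.html')
Nokogiri::HTML is a shortcut for Nokogiri::HTML::Document.parse. The documentation says:
.parse(string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML) {|options| ... } ⇒ Object
string_or_io may be a String, or any object that responds to read and close such as an IO, or StringIO. ...
Although 'china.html' is a String, it is not HTML. It looks like you think a file name will suffice, but Nokogiri doesn't open anything; it only understands strings containing markup (HTML or XML), or an IO-type object that responds to the read method. Compare these: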
require 'nokogiri'
doc = Nokogiri::HTML('china.html')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>china.html</p></body></html>\n"
versus:
doc = Nokogiri::HTML('<html><body><p>foo</p></body></html>')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo</p></body></html>\n"
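and:
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.example.org'))
doc.to_html[0..99]
# => "<!DOCTYPE html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset=\"utf-8\">\n    <met"
The last one works because OpenURI adds URI.open, which can read URLs and returns an object that responds to read (before Ruby 3.0 it also patched the bare open used in older examples):
URI.open('http://www.example.org').respond_to?(:read) # => true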
Moving on to the question:
require 'nokogiri'
require 'open-uri'
html = <<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>File </title>

  </head>
  <body>
        <div id ="timelineItems">
<H2 id="telegram1"> Title </H2>
        <p><table cellspacing="0">
<tr>
<td width="2%">&nbsp;</td>
<td width="75%">
<table cellspacing="0" cellpadding="0" class="resultsTypes">
<tr>
<td width="5%" class="hide">&nbsp;</td>
<td width="70%">
<p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
<p>Title: <a href="http://www.bing.com" title=""><span class="bidi">Meeting in China</span></a></p>
<p>recipient: David Ben Gurion</p>
<p>sender: Prime Minister of Union of Burma, Rangoon</p>
<p>  Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
<p>From Date: 02/14/1936</p>
<p>Link to file: <span class="bidi">תיק התכתבות  1956 ינואר</span></p>
</td>
</tr>
<tr>
<td colspan="2">
</td>
</tr>
</table></td>
<td class="actions">&nbsp;</td>
</tr>
</table>
</p>
      </div>

</body></html>
EOT
doc = Nokogiri::HTML(html)
After parsing the document, locate the <p> you want, using
<table cellspacing="0" cellpadding="0" class="resultsTypes">
as a landmark:
from_date = doc.at('table.resultsTypes p[6]').text
# => "From Date: 02/14/1936"
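    It looks like pulling the title = "Meeting in China" and the link = "bing.com" will be harder, because they are on the same line.
I'm using CSS selectors to define the path to the desired text. CSS is easier to read than XPath, though XPath is more powerful and descriptive. Nokogiri lets us use either one, and lets us use search or at. at is equivalent to search('some selector').first. There are also CSS- and XPath-specific versions of search and at, described in Nokogiri::XML::Node.
title_link = doc.at('table.resultsTypes p[2] a')['href'] # => "http://www.bing.com"
title = doc.at('table.resultsTypes p[2] span').text # => "Meeting in China"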
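You were trying to use the XPath:
/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p
However, it is not valid for the HTML you are working with.
Notice the tbody in the selector. Look at the HTML immediately after either of the <table> tags: neither one is followed by a <tbody> tag, so the XPath is wrong. I suspect it was generated by your browser, which fixes up the HTML according to the specification and adds <tbody>. Nokogiri does not do that fix-up, so the HTML does not match the path and the search fails. So don't rely on a selector defined by the browser, and don't trust the browser's idea of what the actual HTML source looks like.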
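Rather than use an explicit selector, it is better, easier, and smarter to look for specific waypoints in the markup and use them to navigate to the nodes you want. Here is an example of doing everything above using only placeholders and a mix of XPath and CSS:
doc.at('//p[starts-with(., "Title:")]').text  # => "Title: Meeting in China"
title_node = doc.at('//p[starts-with(., "Title:")]')
title_url = title_node.at('a')['href'] # => "http://www.bing.com"
title = title_node.at('span').text # => "Meeting in China"
So it is fine to mix and match CSS and XPath.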
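from_date = noko.at_xpath('//p[starts-with(text(), "From Date:")]').text()
date = from_date[/From Date: (.*)/, 1]
# => "02/14/1936"
Edit:
Explanation: Get the first (#at_xpath) p node anywhere in the document (//) such that ([...]) its text content (text()) starts with (starts-with(string, stringStart)) "From Date:" ("From Date:"), and store (=) its text content (#text()) in the variable from_date (from_date). Then extract the first group (#[regexp, 1]) from that text (from_date), using a regular expression (/.../) that matches the literal characters "From Date: " followed by any number (*) of any characters (.), which are captured ((...)) into the first capture group to be extracted.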
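Also:
    Amadan's answer [...] gives an error
I did not notice that your Nokogiri construction was broken, as the Tin Man explained. The line noko = Nokogiri::HTML('china.html') (which was not part of my answer) gives you a single-node document containing only the text "china.html", with no <p> node that matches.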
