Extract text between <br> tags

To extract URLs, I use the following:

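require 'open-uri'  # also loads 'uri'; provides URI.open for HTTP URLs
html = URI.open('http://lab/links.html').read  # read the body into a String
urls = URI.extract(html)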

This works well.

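Now I need to extract a list of URLs that have no http or https prefix and are separated by <br > tags. Because the http or https scheme is missing, URI.extract does not work.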

domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php

Each unprefixed URL sits between <br > tags.

~~I have been looking at the Nokogiri XPath question about retrieving text after <br> within <td>, but could not get it to work.~~

Output

domain1.com/index.html
domain2.com/home/~john/index.html
domain3.com/a/b/c/d/index.php

~~Intermediate solution~~

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open("http://lab/noprefix_domains.html"))
# Replace every <br> element with a newline text node
doc.search('br').each do |n|
  n.replace("\n")
end
puts doc

I still need to strip the remaining HTML markup (!DOCTYPE, html, body, p).

Solution

str = ""
doc.traverse { |n| str << n.to_s if (n.name == "text" or n.name == "br") }
puts str.split(/\s*<\s*br\s*>\s*/)

Thanks.

Assuming you already have a way to obtain the sample string shown in the question, you can call split on it:

str = "domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php"
str.split(/\s*<\s*br\s*>\s*/)
#=> ["domain1.com/index.html", 
#    "domain2.com/home/~john/index.html",
#    "domain3.com/a/b/c/d/index.php"]

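This splits the string at every <br> tag. It also strips whitespace before and after the <br>, and allows whitespace inside the tag itself (e.g. <br > or < br>). If you also need to handle self-closing tags (e.g. <br />), use this regex instead: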

/\s*<\s*br\s*\/?\s*>\s*/
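As a rough illustration of how the self-closing variant behaves, here is a minimal sketch. The helper name `split_on_br`, the constant `BR_SPLIT`, and the mixed-format sample string are made up for this example, and the case-insensitive `i` flag is an extra assumption that is not part of the regex above.

```ruby
# Hypothetical constant and helper, for illustration only.
# The trailing `i` flag (match <BR> as well as <br>) is an added assumption.
BR_SPLIT = /\s*<\s*br\s*\/?\s*>\s*/i

def split_on_br(str)
  # Split on any <br> variant, then drop empty pieces
  # (for example, the one produced by a leading <br>).
  str.split(BR_SPLIT).reject(&:empty?)
end

# Made-up sample mixing "<br >", "<br/>" and "< BR />".
sample = "domain1.com/index.html<br >domain2.com/home/~john/index.html <br/> " \
         "domain3.com/a/b/c/d/index.php< BR />"

puts split_on_br(sample)
```

This is still plain string splitting, so it is only suitable when the input is known to be a flat list separated by <br> tags; for arbitrary HTML, a parser such as Nokogiri is usually the safer choice.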
