Splitting content in Nokogiri when a tag is encountered

Given the following portion of an HTML page, I would like to be able to treat "us" and "John" as separate.

            <div id="ad-description" class="c-clear c-word-wrap">
Room for rent in Sydney.
<br/><br/>For more information please contact us<br/>John :- 0491 570 156<br/>Jane :- (02) 5550 1234</div>
    <!-- google_ad_section_end(name=description) -->
        </div>

When I access the ad description node with Nokogiri and then call content on that node, I get usJohn as part of the resulting string:

document = Nokogiri::HTML(text)
ad_description_xpath = './/div[contains(@id, "ad-description")]'
ad_description_nodes = document.xpath(ad_description_xpath)
ad_description_node = ad_description_nodes.first
ad_description_node.content # "...please contact usJohn :- ..."

How can I get Nokogiri to either return the string with some kind of whitespace between "us" and "John", or put "us" and "John" in separate strings?

Ideally, the approach would handle any tag inside the node, without the code I write having to mention specific tags.

The text() node selector selects text nodes, which gives you each piece of text in its own node. You can then use map to get an array of strings:

document = Nokogiri::HTML(text)
# Note text() added to end of XPath here:
ad_description_nodes = document.xpath('.//div[contains(@id, "ad-description")]/text()')
strings = ad_description_nodes.map(&:content)

With the sample data, strings will now look like this:

["\n\nRoom for rent in Sydney.\n", "For more information please contact us", "John :- 0491 570 156", "Jane :- (02) 5550 1234"]

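As you can see, you may get some extra leading or trailing whitespace, and possibly some nodes that consist only of whitespace, so you will probably need some further processing. This also misses any text that is not a direct child of the div, for example text inside a strong or em tag. If that is a possibility, you can use //text() instead of /text().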

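You can call #children to get the child nodes of ad_description_node, then filter for text nodes with text?. This gives you the text nodes inside ad_description_node, and mapping content over them yields an array of strings: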

ad_description_node.children.select( &:text? ).map( &:content )
# [
#   [0] "\n\n  Room for rent in Sydney.\n  ",
#   [1] "For more information please contact us",
#   [2] "John :- 0491 570 156",
#   [3] "Jane :- (02) 5550 1234"
# ]
