With XPath and R, how to extract only part of a text string when the strings are not consistent

Is there a way, using XPath and R (not PHP), to extract only one part (the city) from a longer address string?

Here is the relevant part of the content of the following web page:

http://www.kentmcbride.com/offices/

<table id="offices" cellspacing="8" width="700" height="100" border="0">
<tbody>
<tr>
<td valign="top">
<h2>
<img width="122" height="22" src="/_common/sub_philadelphia.png">
</h2>
<p>
1617 JFK Boulevard
<br>
Suite 1200
<br>
Philadelphia, PA 19103
</p>
</td>
<td valign="top">
<td valign="top">
</tr>

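After parsing the content and applying an XPath expression, R returns the whole address string (the remaining results are omitted here), but I want only the city. I do not know the cities before looking at what is returned.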

require(XML)
doc <- htmlTreeParse('http://www.kentmcbride.com/offices/', useInternal = TRUE)
xpathSApply(doc, "//table[@id = 'offices']//p", xmlValue, trim = TRUE)
[1] "1617 JFK Boulevard\n                Suite 1200\n                Philadelphia, PA 19103"
[2] "1040 Kings Highway North\n                Suite 600\n                Cherry Hill, NJ 08034"
[3] "824 North Market Street\n                Suite 805 \n                Wilmington, DE 19801"

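An earlier question assumes that I know the city name; I don't. That question is "XPath - how to extract a specific part of the text from one text node".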

Is there a way to get only the city?

If we can assume that the city is on the last line, then you can select the last text node that comes after a <br> node. In XPath, that would be

text()[preceding-sibling::br][last()]

That is, the text nodes that have a preceding br sibling, of which we want only the last one:

require(XML)
doc <- htmlTreeParse('http://www.kentmcbride.com/offices/', useInternal = TRUE)
xpathSApply(doc, "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]")
> xpathSApply(doc, "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]")
[[1]]
                Philadelphia, PA 19103               
[[2]]
                Cherry Hill, NJ 08034 
[[3]]
                Wilmington, DE 19801 
[[4]]
                Blue Bell, PA 19422

[[5]]
                Iselin, NJ 08830 
[[6]]
                New York, NY 10170 
[[7]]
              Pittsburgh, PA 15222
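
For anyone who wants to try the expression without fetching the live page, here is a minimal offline sketch. It assumes the markup looks like the excerpt in the question; the inline `html` string is a trimmed copy of that excerpt (one office only), and `htmlParse(..., asText = TRUE)` and `trimws` are used here for convenience rather than taken from the answer above.

```r
library(XML)

# Inline copy of the markup from the question (one office only), so the
# example does not depend on the page still being online.
html <- '<table id="offices"><tr><td valign="top">
<p>
1617 JFK Boulevard
<br>
Suite 1200
<br>
Philadelphia, PA 19103
</p>
</td></tr></table>'

# Parse the string itself rather than treating it as a file name or URL
doc <- htmlParse(html, asText = TRUE)

# Text nodes inside <p> that have a <br> before them; keep only the last one
last_line <- xpathSApply(
  doc,
  "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]",
  xmlValue
)

# The raw value still carries the surrounding newlines, so trim it
trimws(last_line)
# [1] "Philadelphia, PA 19103"
```

Passing `xmlValue` as the function returns plain character strings instead of the internal node objects printed above.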

@jdharrison did the hard XPath work (that is, his answer deserves the credit). This extra bit, which cannot be done with XPath alone, grabs the whole city name:

require(stringr)
unlist(lapply(xpathSApply(doc, "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]", xmlValue), function(x) {
  str_match(x, "^[[:space:]]*([[:alnum:][:blank:]]+),")[,2]
}))
## [1] "Philadelphia" "Cherry Hill"  "Wilmington"   "Blue Bell"    "Iselin"       "New York"     "Pittsburgh"

Suggested edit:

xpathSApply(doc, "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]"
            , function(x){
              str_match(xmlValue(x), "^[[:space:]]*([[:alnum:][:blank:]]+),")[,2]
            }
)

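That's a good idea, actually. In fact, I should have stuck with a new idiom I have been trying ever since dplyr came out, and done away with the anonymous function entirely: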

# to be used in xpathSApply below
extractCity <- function(last_line) {
  str_match(xmlValue(last_line), "^[[:space:]]*([[:alnum:][:blank:]]+),")[,2]
}
xpathSApply(doc, 
            "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]", 
            extractCity)
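
The character class in the pattern above allows only letters, digits, spaces, and tabs before the comma, so a city name containing a period or a hyphen would not match and would come back as NA. As a hedged alternative, here is a base R sketch that keeps everything before the first comma. The helper name `extract_city_base` is hypothetical, the first three sample strings are shaped like the output shown earlier, and the "St. Louis" line is made up for illustration and is not on the page.

```r
# Last-line strings shaped like the xpathSApply output shown earlier.
# The final entry is a hypothetical address, not one from the page.
last_lines <- c(
  "                Philadelphia, PA 19103               ",
  "                Cherry Hill, NJ 08034 ",
  "              Pittsburgh, PA 15222\n",
  "   St. Louis, MO 63101"
)

# Trim surrounding whitespace first, then drop the first comma and
# everything after it; what remains is taken to be the city.
extract_city_base <- function(x) {
  sub(",.*$", "", trimws(x))
}

extract_city_base(last_lines)
# [1] "Philadelphia" "Cherry Hill"  "Pittsburgh"   "St. Louis"
```

This still relies on the same assumption as the answers above: the city is the first comma-separated field of the last line.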
