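Is there a way, using XPath and R (rather than PHP), to extract only one part (the city) of a longer address string?
Here is the relevant portion of the following web page:
http://www.kentmcbride.com/offices/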
<table id="offices" cellspacing="8" width="700" height="100" border="0">
<tbody>
<tr>
<td valign="top">
<h2>
<img width="122" height="22" src="/_common/sub_philadelphia.png">
</h2>
<p>
1617 JFK Boulevard
<br>
Suite 1200
<br>
Philadelphia, PA 19103
</p>
</td>
<td valign="top">
<td valign="top">
</tr>
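When I parse the content and apply an XPath expression, R returns the whole address string (remaining results omitted), but I only want the city (and I don't know the city until I look at what is returned).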
require(XML)
doc <- htmlTreeParse('http://www.kentmcbride.com/offices/', useInternal = TRUE)
xpathSApply(doc, "//table[@id = 'offices']//p", xmlValue, trim = TRUE)
[1] "1617 JFK Boulevard\n Suite 1200\n Philadelphia, PA 19103"
[2] "1040 Kings Highway North\n Suite 600\n Cherry Hill, NJ 08034"
[3] "824 North Market Street\n Suite 805 \n Wilmington, DE 19801"
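An earlier question ("XPath - how to extract a specific part of the text from a text node") assumes I already know the city name; I don't.
Is there a way to get only the city?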
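If we can assume the "city" is always on the last line, then you can select the last text node that follows a <br> node. In XPath, that would be
text()[preceding-sibling::br][last()]
i.e. the text nodes that have a preceding br sibling, of which we then keep only the last one within each <p>: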
require(XML)
doc <- htmlTreeParse('http://www.kentmcbride.com/offices/', useInternal = TRUE)
xpathSApply(doc, "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]")
> xpathSApply(doc, "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]")
[[1]]
Philadelphia, PA 19103
[[2]]
Cherry Hill, NJ 08034
[[3]]
Wilmington, DE 19801
[[4]]
Blue Bell, PA 19422
[[5]]
Iselin, NJ 08830
[[6]]
New York, NY 10170
[[7]]
Pittsburgh, PA 15222
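Since the live page may change or go away, here is a minimal offline sketch of the same XPath. The inline HTML is a made-up fragment that only mirrors the structure shown in the question, and trimws() is added here to strip stray whitespace; neither is part of the original answer.

```r
library(XML)

# Hypothetical inline HTML that mimics the structure of the offices table
html <- '<table id="offices"><tr>
  <td><p>1617 JFK Boulevard<br>Suite 1200<br>Philadelphia, PA 19103</p></td>
  <td><p>1040 Kings Highway North<br>Suite 600<br>Cherry Hill, NJ 08034</p></td>
</tr></table>'

# Parse the string itself rather than fetching a URL
doc <- htmlParse(html, asText = TRUE)

# For each <p>, keep the last text node that has a <br> before it
last_lines <- xpathSApply(
  doc,
  "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]",
  xmlValue
)

# Remove leading/trailing whitespace left over from the markup
last_lines <- trimws(last_lines)
print(last_lines)
```

Passing xmlValue as the third argument returns a character vector instead of a list of node objects, which is usually easier to post-process.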
@jdharrison did the hard XPath work (i.e. his answer deserves the credit). This extra bit (which can't be done with XPath alone) pulls out the full city name:
require(stringr)
unlist(lapply(xpathSApply(doc, "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]", xmlValue), function(x) {
str_match(x, "^[[:space:]]*([[:alnum:][:blank:]]+),")[,2]
}))
## [1] "Philadelphia" "Cherry Hill" "Wilmington" "Blue Bell" "Iselin" "New York" "Pittsburgh"
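To see what the regular expression does in isolation, here is a small sketch that applies the same pattern to plain strings with base R's sub() instead of stringr. The sample strings are made up for illustration, and sub() is a swapped-in alternative, not what the answer above uses.

```r
# Hypothetical last lines, with leading whitespace left in on purpose
last_lines <- c("  Philadelphia, PA 19103",
                "Cherry Hill, NJ 08034",
                "\n New York, NY 10170")

# Skip leading whitespace, capture letters/digits/blanks up to the first
# comma, then replace the whole string with the captured group
cities <- sub("^[[:space:]]*([[:alnum:][:blank:]]+),.*$", "\\1", last_lines)
print(cities)
```

One difference worth knowing: when a line does not match (for example a city name containing a period or hyphen), sub() returns the input unchanged, whereas str_match() returns NA, so the stringr version makes failures easier to spot.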
Suggested edit:
xpathSApply(doc, "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]"
, function(x){
str_match(xmlValue(x), "^[[:space:]]*([[:alnum:][:blank:]]+),")[,2]
}
)
Actually, that's a good idea. In fact, I should stick with a new idiom I've been trying since dplyr came out and do away with the anonymous function altogether:
# to be used in xpathSApply below
extractCity <- function(last_line) {
str_match(xmlValue(last_line), "^[[:space:]]*([[:alnum:][:blank:]]+),")[,2]
}
xpathSApply(doc,
"//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]",
extractCity)