小贝子编程

r - Using xpathApply to take only the first h2 node of each row in a vector of webscrapes

Updated: 2023-08-21

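I am trying to parse a (large) vector of scraped HTML, more specifically the h2 header of each page, but some pages in the vector contain two of them, so the replacement ends up with more rows than the data. My question is: how do I take only the first //h2 in each observation?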

Here is the code I tried:

data$header = unlist(xpathSApply(htmlParse(data$html, asText=TRUE), '(//h2)[1]', xmlValue))

This gave me only the first occurrence in the whole vector. This code gave me all the h2s:

data$header = xpathApply(htmlParse(philly$html, asText=TRUE), '//descendant::h2[1]', xmlValue)

Thanks for any help.

Two samples:

<div id="tutors">
 <h1>Tutors</h1>
<div class="tutor">
<h2>John</h2>
 <p>...</p>

<div class="tutor">
<h2>Mary</h2>
<p>...</p>
</div>
<div class="tutor">
<h2>David</h2>
<p>...</p>
</div>
</div>

I solved this by including the full root path:

data$header = unlist(xpathApply(htmlParse(data$html, asText=TRUE), '/html/body/h2', xmlValue))
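An alternative worth considering, offered only as a sketch: as far as I can tell, htmlParse() pastes a multi-element character vector into a single document when asText=TRUE, which would explain why '(//h2)[1]' returned just one value for the whole vector. Parsing each element on its own and applying '(//h2)[1]' per page gives exactly one result per observation, with NA where a page has no h2. The `html` vector and the `first_h2` helper below are hypothetical stand-ins for data$html and are not part of the original post.

```r
library(XML)

# Hypothetical stand-in for data$html: one HTML string per observation.
html <- c(
  "<html><body><h2>John</h2><p>...</p><h2>Extra</h2></body></html>",
  "<html><body><h2>Mary</h2><p>...</p></body></html>",
  "<html><body><p>no heading here</p></body></html>"
)

# Hypothetical helper: parse ONE page and return the text of its first h2,
# or NA when the page contains no h2 at all.
first_h2 <- function(page) {
  doc <- htmlParse(page, asText = TRUE)
  hits <- xpathSApply(doc, "(//h2)[1]", xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# vapply guarantees exactly one string per input element,
# so the result always has the same length as the input vector.
header <- vapply(html, first_h2, character(1), USE.NAMES = FALSE)
print(header)
```

With the real data this would be assigned as data$header <- vapply(data$html, first_h2, character(1), USE.NAMES = FALSE), assuming data$html is a character column rather than a factor.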
