r - How to get HTML element considering later content of another tag and not the class?

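I am converting HTML into a nice, clean CSV. I have a file full of tables, and it uses very few classes. There are three types of table, and their structure is the same. The only difference is the content of the "th" elements that come after the element I am interested in. How can I get only the content of the tables that have certain text in "th" ("text_that_I_want_to_get")? Is there a way to insert a class into each type of table with R?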

Table type 1

<tr>
<th class="array">text_that_I_want_to_get</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string">name</th>
<th class="string">mean</th>
<th class="string">stdev</th>
</tr>
</thead>
<tbody>

Table type 2

<tr>
<th class="array">text_that_I_want_to_get</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string">name</th>
<th class="array">answers</th>
</tr>
</thead>
<tbody>

Table type 3

<tr>
<th class="array">text_that_I_want_to_get</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string">Reference</th>
</tr>
</thead>
<tbody>

You need the following three XPath expressions:

xpath1 <- "//td[table[./thead/tr/th/text() = 'stdev']]/preceding-sibling::th"
xpath2 <- "//td[table[./thead/tr/th/text() = 'answers']]/preceding-sibling::th"
xpath3 <- "//td[table[./thead/tr/th/text() = 'Reference']]/preceding-sibling::th"

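Each expression finds the td node that wraps the table of one of the three types (identified by a header cell unique to that type), then selects that td's preceding th sibling, which holds the text you want.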
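To make the two steps visible, here is a minimal sketch that runs them separately with xml2 on a made-up fragment shaped like table type 1. The label `scores_block` and the cell values are invented for illustration, and the fragment is closed off so that it parses on its own.

```r
library(xml2)

# Hypothetical fragment shaped like table type 1; label and cell values are invented
fragment <- '<table><tr>
<th class="array">scores_block</th>
<td class="array">
<table>
<thead><tr><th class="string">name</th><th class="string">mean</th><th class="string">stdev</th></tr></thead>
<tbody><tr><td>q1</td><td>3.2</td><td>0.8</td></tr></tbody>
</table>
</td>
</tr></table>'

doc <- read_html(fragment)

# Step 1: the td whose child table has a header cell reading exactly 'stdev'
td <- xml_find_all(doc, "//td[table[./thead/tr/th/text() = 'stdev']]")

# Step 2: from that td, step back to the th that precedes it in the same row
label <- xml_find_all(td, "./preceding-sibling::th")
xml_text(label)
```

Chaining both steps into one path gives the expressions listed above.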

So, to get "text_that_I_want_to_get" for table type 1, you would do the following:

library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
read_html(url) %>% html_nodes(xpath = xpath1) %>% html_text()
#> [1] "text_that_I_want_to_get"

You can do the same with xpath2 and xpath3 to get the text for table types 2 and 3.
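The question also asks whether a class can be inserted into each type of table. The following is one possible sketch, not the answerer's method: it runs all three lookups in a loop and then tags each inner table using `xml_set_attr()` from xml2. The helper `make_row()`, the labels `label_a` to `label_c`, and the class names `type1` to `type3` are all hypothetical, and the sample tables are stripped down to their header rows.

```r
library(rvest)
library(xml2)

# Hypothetical helper: builds one outer row, i.e. a label th plus a td wrapping an inner table
make_row <- function(label, headers) {
  ths <- paste0("<th>", headers, "</th>", collapse = "")
  paste0("<tr><th class=\"array\">", label, "</th><td class=\"array\">",
         "<table><thead><tr>", ths, "</tr></thead></table>",
         "</td></tr>")
}

# Invented sample document containing one table of each type
html <- paste0(
  "<table>",
  make_row("label_a", c("name", "mean", "stdev")),
  make_row("label_b", c("name", "answers")),
  make_row("label_c", "Reference"),
  "</table>"
)
doc <- read_html(html)

# Header text that distinguishes each table type, as used in the three XPaths
markers <- c(type1 = "stdev", type2 = "answers", type3 = "Reference")

# Look up the label th for every type in one pass
labels <- vapply(markers, function(m) {
  xp <- sprintf("//td[table[./thead/tr/th/text() = '%s']]/preceding-sibling::th", m)
  doc %>% html_nodes(xpath = xp) %>% html_text()
}, character(1))

# Tag each inner table with a class named after its type (modifies doc in place)
for (type in names(markers)) {
  xp <- sprintf("//table[./thead/tr/th/text() = '%s']", markers[[type]])
  xml_set_attr(xml_find_all(doc, xp), "class", type)
}

# After tagging, plain CSS selectors can pick out a table type
type2_headers <- doc %>% html_nodes("table.type2") %>% html_nodes("th") %>% html_text()
```

Once the classes are set, the rest of the conversion can use selectors such as `table.type1` instead of repeating the XPath predicates.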
