How do I extract an HTML table that follows a specific heading?

I'm using BeautifulSoup to parse an HTML file. The file looks something like this:

<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>

<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key B</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>

<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>

<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>

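I want to extract the string "I WANT THIS STRING". The ideal solution would be to get the first table after the h3 heading called "THE GOOD STUFF". I don't know how to do this with BeautifulSoup - I only know how to extract a table that has a specific class, or a table nested inside a particular tag, but not one that follows a particular tag.

I think a fallback solution could use the string "Key C", assuming it is unique (it almost certainly is) and appears in only one table, but I would feel better selecting on the specific h3 heading.

Following the logic of @Zroq's answer to another question, this code gives you the table after the heading you specify ("THE GOOD STUFF"). Note that I have simply put all of the HTML in a variable named "html".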

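import re
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(html, "lxml")
# string= is the current name of the older text= argument
for header in soup.find_all('h3', string=re.compile('THE GOOD STUFF')):
    nextNode = header
    while True:
        # Walk forward through the siblings that follow the heading
        nextNode = nextNode.next_sibling
        if nextNode is None:
            break
        if isinstance(nextNode, Tag):
            # Stop at the next h3 so only this section is printed
            if nextNode.name == "h3":
                break
            print(nextNode)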

Output:

<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
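
A more compact alternative is `find_next`, which returns the first matching element that appears after a given node in document order. The following is a minimal sketch, not a definitive implementation: the `html` string is a shortened copy of the sample above, and the helper name `table_after_heading` is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the sample document (illustrative)
html = """
<h3>Unimportant heading</h3>
<table class="foo">
<tr><td>Key B</td></tr>
<tr><td>A value I don't want</td></tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr><td>Key C</td></tr>
<tr><td>I WANT THIS STRING</td></tr>
</table>
"""

def table_after_heading(soup, heading_text):
    """Return the first <table> after the matching <h3>, or None (hypothetical helper)."""
    header = soup.find("h3", string=re.compile(re.escape(heading_text)))
    if header is None:
        return None
    # find_next searches forward in document order, not only among siblings
    return header.find_next("table")

soup = BeautifulSoup(html, "html.parser")
table = table_after_heading(soup, "THE GOOD STUFF")
cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells[-1])  # I WANT THIS STRING
```

Unlike the loop above, `find_next` does not stop at the next h3, so if the heading had no table of its own it would return the table from a later section.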

Cheers!

The documentation explains that, if you don't want to use find_all, you can iterate over the siblings that follow a tag:

for sibling in soup.a.next_siblings:
    print(repr(sibling))
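
Applied to this question, the same `next_siblings` iteration could start from the h3 heading rather than from `soup.a`. This is a rough sketch under a few assumptions: the heading text matches exactly, the tables are direct siblings of the headings, and the `html` string is a shortened copy of the sample document.

```python
from bs4 import BeautifulSoup, Tag

# Shortened copy of the sample document (illustrative)
html = """
<h3>Unimportant heading</h3>
<table class="foo">
<tr><td>Key B</td></tr>
<tr><td>A value I don't want</td></tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr><td>Key C</td></tr>
<tr><td>I WANT THIS STRING</td></tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr><td>Key A</td></tr>
<tr><td>A value I don't want</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
header = soup.find("h3", string="THE GOOD STUFF")

first_table = None
for sibling in header.next_siblings:
    # next_siblings also yields whitespace strings, so skip non-tags
    if not isinstance(sibling, Tag):
        continue
    if sibling.name == "h3":
        break  # reached the next section without finding a table
    if sibling.name == "table":
        first_table = sibling
        break

# The wanted value is the last cell of that table
wanted = first_table.find_all("td")[-1].get_text(strip=True)
print(wanted)  # I WANT THIS STRING
```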

I'm sure there are more efficient ways to do this, but this is what I can think of right now:

from bs4 import BeautifulSoup

# Read the HTML file; the context manager closes it afterwards
with open("/Users/Downloads/train.html", "r") as f:
    html_data = f.read()

soup = BeautifulSoup(html_data, 'html.parser')
all_td = soup.find_all("td")
flag = 'no_print'
for td in all_td:
    # Print the first cell that comes after the 'Key C' cell, then stop
    if flag == 'print':
        print(td.text)
        break
    if td.text == 'Key C':
        flag = 'print'

Output:

I WANT THIS STRING
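
A variation on the same "Key C" idea is to confine the search to the table that contains the key, so that a cell from a neighbouring table can never be picked up by accident. This is only a sketch: it assumes "Key C" occurs exactly once, as the question supposes, and it uses a shortened inline copy of the sample HTML instead of reading a file.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (illustrative)
html = """
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr><td>Key C</td></tr>
<tr><td>I WANT THIS STRING</td></tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr><td>Key A</td></tr>
<tr><td>A value I don't want</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the cell holding the key, then climb to its enclosing table
key_cell = soup.find("td", string="Key C")
table = key_cell.find_parent("table")

# Collect the cell texts of that table only; the value follows the key
texts = [td.get_text(strip=True) for td in table.find_all("td")]
value = texts[texts.index("Key C") + 1]
print(value)  # I WANT THIS STRING
```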
