Filtering extracted text by HTML tag with Selenium in Python

I have to work with the following HTML structure:

<div class='divClass'>
<h5>Article 1</h5>
<p>Paragraph one written in 2022</p>
<p>(1) This <sup>1</sup>paragraph <sup>2</sup>has footnotes.</p>
<p>This paragraph has a different <a class='footnotelink'>3</a>footnote.</p>
</div>

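I need to extract the text from this div, but the footnotes must be filtered out.

Here are more details about the structure:

There can be 0 or more <p> tags
Each <p> tag may or may not contain footnotes of either kind
Each <p> tag can contain wanted numbers that must not be removed
<h5> may be replaced by <h4>
Footnotes can be in a <sup> tag or in an <a> tag with the class 'footnotelink'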

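If I use driver.find_element(By.CLASS_NAME, 'divClass').text, I get the unfiltered version, like this:

Article 1\nParagraph one written in 2022\n(1) This 1paragraph 2has footnotes.\nThis paragraph has a different 3footnote.

What I need is:

Article 1\nParagraph one written in 2022\n(1) This paragraph has footnotes.\nThis paragraph has a different footnote.

I can't simply filter out the digits, because they may also appear in the text outside of footnotes.

This question is similar, but it filters out the text of all text nodes rather than only specific ones.

Edit: specified that <p> tags can contain wanted numbers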

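What you can do here is:

1. Get the entire text
2. Get the text of the p elements that contain a sup, or an a with the class name footnotelink
3. Remove the digits from those texts
4. In the entire text, replace the text received in step 2 with the text received in step 3, as shown below: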

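from selenium.webdriver.common.by import By

# `driver` is an already-initialized WebDriver with the page loaded
entire_text = driver.find_element(By.CLASS_NAME, 'divClass').text
# find_elements (plural) returns a list that can be iterated over
psups = driver.find_elements(By.XPATH, "//div[@class='divClass']//p[.//sup]")
pas = driver.find_elements(By.XPATH, "//div[@class='divClass']//p[.//a[@class='footnotelink']]")
sub_texts = []
for ps in psups:
    sub_texts.append(ps.text)
for pa in pas:
    sub_texts.append(pa.text)
sut_text_cleaned = []
for sub in sub_texts:
    # Caveat: this drops every digit in the paragraph, including wanted ones such as "(1)"
    res = ''.join([i for i in sub if not i.isdigit()])
    sut_text_cleaned.append(res)
for i in range(len(sub_texts)):
    # str.replace returns a new string, so the result has to be assigned back
    entire_text = entire_text.replace(sub_texts[i], sut_text_cleaned[i])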
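
Because removing digits from the rendered text cannot tell a footnote marker from a wanted number, another option is to work on the markup instead of the text. The following is only a minimal sketch under that assumption: it uses the standard library's html.parser, and the names FootnoteStripper, strip_footnotes and SAMPLE_HTML are hypothetical helpers invented for this illustration, not part of Selenium. It treats <p>, <h4> and <h5> as line-producing tags and skips everything inside <sup> or <a class='footnotelink'>.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """<div class='divClass'>
<h5>Article 1</h5>
<p>Paragraph one written in 2022</p>
<p>(1) This <sup>1</sup>paragraph <sup>2</sup>has footnotes.</p>
<p>This paragraph has a different <a class='footnotelink'>3</a>footnote.</p>
</div>"""


class FootnoteStripper(HTMLParser):
    """Collect text line by line, ignoring footnote markers."""

    BLOCK_TAGS = {"p", "h4", "h5"}  # tags whose end closes a line

    def __init__(self):
        super().__init__()
        self.lines = []        # finished lines
        self.current = []      # text pieces of the line being built
        self.skip_tag = None   # name of the footnote tag being skipped
        self.skip_depth = 0    # nesting depth of that tag

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "sup" or (tag == "a" and "footnotelink" in classes):
            self.skip_tag, self.skip_depth = tag, 1

    def handle_endtag(self, tag):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        if tag in self.BLOCK_TAGS:
            self.flush_line()

    def handle_data(self, data):
        if self.skip_tag is None:
            self.current.append(data)

    def flush_line(self):
        # Collapse runs of whitespace and drop empty lines
        line = " ".join("".join(self.current).split())
        if line:
            self.lines.append(line)
        self.current = []


def strip_footnotes(html):
    parser = FootnoteStripper()
    parser.feed(html)
    parser.close()
    parser.flush_line()  # keep any text left after the last block tag
    return "\n".join(parser.lines)


if __name__ == "__main__":
    print(strip_footnotes(SAMPLE_HTML))
```

With Selenium, the markup could be obtained with driver.find_element(By.CLASS_NAME, 'divClass').get_attribute('innerHTML') and passed to strip_footnotes. Note that the whitespace handling here is a simplification and will not match the browser's .text output in every case.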

Using regex

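import re
from selenium.webdriver.common.by import By

article = driver.find_element(By.CLASS_NAME, 'divClass').text
# The backslash is required: \d matches a digit, a bare d matches the letter "d"
article = re.sub(r'\d{1,}footnote', 'footnote', article)
print(article)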

\d{1,} matches one or more digits immediately before footnote.
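
To see what this pattern does and does not catch, here is a small standalone check that needs no browser. The sample string is the unfiltered output quoted in the question, and \d+ is used as an equivalent spelling of \d{1,}. Since the pattern is tied to the literal word footnote, only the last marker is removed in this sample.

```python
import re

# Unfiltered text as returned by .text in the question
text = (
    "Article 1\n"
    "Paragraph one written in 2022\n"
    "(1) This 1paragraph 2has footnotes.\n"
    "This paragraph has a different 3footnote."
)

# Remove digits only when they directly precede the word "footnote"
cleaned = re.sub(r"\d+footnote", "footnote", text)
print(cleaned)
```

The markers in "1paragraph" and "2has" are left in place, so this approach only fits pages where every marker is followed by a known word.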
