Python Beautiful Soup HTML tag problem [updated]



I have the following lines in an md file:

<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a>
</td>
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
<td class="IDtd">Some_text EEE-123411 Other text</td>

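My questions are:

  1. How can I use Beautiful Soup to check whether the next line after a td is an HTML tag or text?
  2. How can I add an HTML comment around all links that contain an ID (both HTML and md)?

The expected output for the second question is: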

<td colspan="1" class="IDtd">
<p>
<!-- 
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> 
--> #ID - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<!--
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a> 
--> #ID
</td>
<!--
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
--> #ID

For the first question, I found this:

from bs4 import BeautifulSoup

html = """
<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://_jira_link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<a href="https://_jira_link/jira/browse/EEEE-2543" class="external-link" rel="nofollow">https://_jira_link/browse/EEEEE-2543</a>
</td>
"""
soup = BeautifulSoup(html, "html.parser")
tds = soup.find_all("td", {"class": "IDtd"})
for td in tds:
    p = td.find_all("p")  # find_all returns a list
    if p:
        a = td.find_all("a")  # search inside this td, not the whole document
        if a:
            print("Anchor text is: " + a[0].get_text())
            continue
        print("P text is: " + p[0].get_text())
        continue
    else:
        print("No P and A tags found")

Thanks in advance.

Your first question, how to find out what follows a given tag, can be answered with the next_element attribute, for example:

from bs4 import BeautifulSoup, Comment
html = """<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a>
</td>
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
<td class="IDtd">Some_text EEE-123411 Other text</td>"""
soup = BeautifulSoup(html, "html.parser")
element = soup.td
for _ in range(5):
    element = element.next_element
    print(type(element), element.name)

This prints the type and name of the next five elements after the <td> tag:

<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> p
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> a
<class 'bs4.element.NavigableString'> None

As you can see, the next element is actually a string (containing the newline), followed by the <p> tag.
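Because of those whitespace-only strings, a direct "is it a tag or text?" check usually has to skip them first. The following is a minimal sketch of one way to do that, not the only approach; the helper names first_meaningful_child and describe are hypothetical, and the shortened HTML sample is only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def first_meaningful_child(tag):
    """Return the first child that is not a whitespace-only string, or None."""
    for child in tag.children:
        if isinstance(child, NavigableString) and not child.strip():
            continue  # skip the newline strings that sit between tags
        return child
    return None


def describe(td):
    """Classify what directly follows the opening <td> tag."""
    child = first_meaningful_child(td)
    if child is None:
        return "empty"
    if isinstance(child, Tag):
        return f"tag:{child.name}"
    return f"text:{child.strip()}"


html = """<td class="IDtd">
<p><a href="https://link/browse/DDDD-3194">DDDD-3194</a></p>
</td>
<td class="IDtd">Some_text EEE-123411 Other text</td>"""

soup = BeautifulSoup(html, "html.parser")
results = [describe(td) for td in soup.find_all("td", class_="IDtd")]
print(results)
```

Here the first td is reported as starting with a p tag and the second as starting with plain text.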


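For the second question, you can use BeautifulSoup to insert or extract tags as needed. First iterate over all of the required <a> tags, then create a Comment whose content is the <a> tag. This can then be inserted before the tag. Finally, remove the existing <a> tag: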

from bs4 import BeautifulSoup, Comment
html = """<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a>
</td>
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
<td class="IDtd">Some_text EEE-123411 Other text</td>"""
soup = BeautifulSoup(html, "html.parser")
for td in soup.find_all('td', class_="IDtd"):
    for a_tag in td.find_all('a'):
        # Wrap the serialized <a> tag in a comment, with a newline on each side
        a_tag.insert_before(Comment(f'\n{a_tag}\n'))
        a_tag.extract()
print(soup)

The updated HTML will be:

<td class="IDtd" colspan="1">
<p>
<!--
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a>
--> - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<!--
<a class="external-link" href="https://link/browse/EEEE-2543" rel="nofollow">https://link/browse/EEEEE-2543</a>
-->
</td>
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
<td class="IDtd">Some_text EEE-123411 Other text</td>
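The output above leaves the Markdown link untouched and does not add the #ID marker from the expected output. One possible way to cover both is sketched below. It assumes that an ID looks like uppercase letters, a hyphen, and digits (the question does not define the format), that a Markdown link starts its line, and that the marker is the literal text " #ID"; the function name comment_out_links is hypothetical.

```python
import re

from bs4 import BeautifulSoup, Comment
from bs4.element import NavigableString

# Assumed shape of an issue ID such as DDDD-3194 (not specified in the question).
ID_PATTERN = re.compile(r"[A-Z]+-\d+")
# A line that starts with a Markdown link whose text is an ID.
MD_LINK_LINE = re.compile(r"^\[[A-Z]+-\d+\]\([^)]*\).*$", re.MULTILINE)


def comment_out_links(text, marker=" #ID"):
    """Wrap ID links (HTML and Markdown) in comments and append the marker."""
    soup = BeautifulSoup(text, "html.parser")
    for a_tag in soup.find_all("a"):
        if not ID_PATTERN.search(a_tag.get_text()):
            continue  # leave links without an ID alone
        comment = Comment(f"\n{a_tag}\n")
        a_tag.replace_with(comment)  # swap the tag for the comment in one step
        comment.insert_after(NavigableString(marker))
    # Markdown is plain text to the HTML parser, so handle it on the serialized output.
    return MD_LINK_LINE.sub(
        lambda m: f"<!--\n{m.group(0)}\n-->{marker}", str(soup)
    )


sample = """<td class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
<td class="IDtd">Some_text EEE-123411 Other text</td>"""

result = comment_out_links(sample)
print(result)
```

replace_with is used here as a shorter alternative to insert_before followed by extract; the last td is left unchanged because it contains no link.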
