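How do I find a parent element by class name using BeautifulSoup?

How do I correctly access a tag's parent element with BeautifulSoup? I have the following structure: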

<div class="TableContainer">
  <div class="CaptionContainer">
    <div class="CaptionInnerContainer">
      <!-- There are N other span elements here -->
      <span class="CaptionVerticalLeft"></span>
      <div class="Text">Character Information</div> <!-- Content that I am filtering -->
      <span class="CaptionVerticalRight"></span>
      <!-- There are N other span elements here -->
    </div>
  </div>
  <table class="Table3" cellpadding="0" cellspacing="0">
    <tbody>
      <tr>
        <td>
          <div class="InnerTableContainer">
            <table style="width: 100%">
              <!-- Content that I want -->
            </table>
          </div>
        </td>
      </tr>
    </tbody>
  </table>
</div>

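The site contains N of these structures. To get the right information, I first filter by the text "Character Information", and then I want to get the nearest parent with the class TableContainer, so that I can then find the table that holds the content I want.

My current code returns None when I try find_parent: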

soup = BeautifulSoup(page.content, "html.parser")
char_title = soup.find("div", string="Character Information")
# result is: <div class="Text">Character Information</div>
parent = char_title.find_parent("div", {"class": "TableContainer"})
# result is: None
parent = char_title.find_parent("div", _class="TableContainer")
# result is: None

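How do I find a parent that has a specific class?

First find the containing div of each block. Then search inside it for the title to decide whether it is the block you want. If it is, use the same kind of search to locate the required table within that block. For example: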

import requests
from bs4 import BeautifulSoup
from unicodedata import normalize

url = "https://www.tibia.com/community/?name=Rubini"
req = requests.get(url)
soup = BeautifulSoup(req.content, "lxml")

# Walk every block container and check its caption text
for div in soup.find_all('div', class_="TableContainer"):
    title = div.find('div', class_="Text").text

    if title == "Character Information":
        # The first table nested inside Table3 holds the data rows
        table_inner = div.find('table', class_="Table3").table

        for tr in table_inner.find_all('tr'):
            row = [normalize('NFKD', td.get_text(strip=True)) for td in tr.find_all('td')]
            print(row)

Output:

['Name:RubiniTitle:Exalted (21 titles unlocked)Sex:maleVocation:Elder DruidLevel:1305Achievement Points:411World:LibertabraResidence:ThaisGuild Membership:Leader of theLibertabra PuneLast Login:May 18 2022, 22:17:16 CESTComment:"Frase motivacional de Lideres antigos de Guerras"First Level 1000 on RetroHardcore Servers.Rubini formerly Shanera and Calvera@Since 2007836a95f1d69460734766489384f641cbAccount Status:Free Account', 'Name:', 'Rubini', 'Title:', 'Exalted (21 titles unlocked)', 'Sex:', 'male', 'Vocation:', 'Elder Druid', 'Level:', '1305', 'Achievement Points:', '411', 'World:', 'Libertabra', 'Residence:', 'Thais', 'Guild Membership:', 'Leader of theLibertabra Pune', 'Last Login:', 'May 18 2022, 22:17:16 CEST', 'Comment:', '"Frase motivacional de Lideres antigos de Guerras"First Level 1000 on RetroHardcore Servers.Rubini formerly Shanera and Calvera@Since 2007836a95f1d69460734766489384f641cb', 'Account Status:', 'Free Account']
['Name:', 'Rubini']
['Title:', 'Exalted (21 titles unlocked)']
['Sex:', 'male']
['Vocation:', 'Elder Druid']
['Level:', '1305']
['Achievement Points:', '411']
['World:', 'Libertabra']
['Residence:', 'Thais']
['Guild Membership:', 'Leader of theLibertabra Pune']
['Last Login:', 'May 18 2022, 22:17:16 CEST']
['Comment:', '"Frase motivacional de Lideres antigos de Guerras"First Level 1000 on RetroHardcore Servers.Rubini formerly Shanera and Calvera@Since 2007836a95f1d69460734766489384f641cb']
['Account Status:', 'Free Account']
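
As a side note on the bottom-up approach from the question: the keyword BeautifulSoup uses for the class attribute is class_ (trailing underscore), not _class. The following is only a minimal sketch run against a trimmed, well-formed copy of the markup from the question; the sample row inside the inner table is made up for illustration. It does not explain why the dictionary form returned None on the live page, where the real markup may parse differently.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the structure from the question (assumed well-formed);
# the Name/Rubini row is illustrative filler, not scraped data.
html = """
<div class="TableContainer">
  <div class="CaptionContainer">
    <div class="CaptionInnerContainer">
      <span class="CaptionVerticalLeft"></span>
      <div class="Text">Character Information</div>
      <span class="CaptionVerticalRight"></span>
    </div>
  </div>
  <table class="Table3">
    <tbody><tr><td>
      <div class="InnerTableContainer">
        <table style="width: 100%"><tr><td>Name:</td><td>Rubini</td></tr></table>
      </div>
    </td></tr></tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
char_title = soup.find("div", string="Character Information")

# class_ filters on the CSS class of the ancestor.
parent = char_title.find_parent("div", class_="TableContainer")

# _class is not special: it filters on an attribute literally named "_class",
# which no ancestor has, so nothing matches.
wrong = char_title.find_parent("div", _class="TableContainer")

# From the container, walk back down to the inner table.
inner_table = parent.find("div", class_="InnerTableContainer").table
cells = [td.get_text(strip=True) for td in inner_table.find_all("td")]
print(parent["class"], wrong, cells)
```

If find_parent still returns None on the real page, iterating over the TableContainer blocks from the top down, as in the answer's code, avoids depending on how the parser nests the caption.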

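The cell text is normalized with unicodedata.normalize() (form NFKD) in the list comprehension, which replaces compatibility characters such as non-breaking spaces with their plain equivalents.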
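
To see what that normalization step does in isolation, here is a small sketch. The input string is a hypothetical cell value containing a non-breaking space (\xa0), presumably the kind of character the call is meant to clean up; it is not taken from the page.

```python
from unicodedata import normalize

# Hypothetical cell text with a non-breaking space between label and value
raw = "Level:\xa01305"

# NFKD applies compatibility decomposition, which maps U+00A0 to a plain space
clean = normalize("NFKD", raw)

print(repr(raw), "->", repr(clean))
```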
