How do I find a parent element by class name using BeautifulSoup?



How do I correctly access a tag's parent element using BeautifulSoup? I have the following structure:

<div class="TableContainer">
  <div class="CaptionContainer">
    <div class="CaptionInnerContainer">
      <!-- There is N other span elements here -->
      <span class="CaptionVerticalLeft"></span>
      <div class="Text">Character Information</div> <!-- Content that I am filtering -->
      <span class="CaptionVerticalRight"></span>
      <!-- There is N other span elements here -->
    </div>
  </div>
  <table class="Table3" cellpadding="0" cellspacing="0">
    <tbody>
      <tr>
        <td>
          <div class="InnerTableContainer">
            <table style="width: 100%">
              <!-- Content that I want -->
            </table>
          </div>
        </td>
      </tr>
    </tbody>
  </table>
</div>

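The site has N structures like this one. To get the right information, I first filter by the text "Character Information", and then I want to get the first parent with the class TableContainer, so that from there I can find the table that holds the content I want.

The current code returns None when I try find_parent: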

soup = BeautifulSoup(page.content, "html.parser")
char_title = soup.find("div", string="Character Information")
# result is: <div class="Text">Character Information</div>
parent = char_title.find_parent("div", {"class": "TableContainer"})
# result is: None
parent = char_title.find_parent("div", _class="TableContainer")
# result is: None

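How can I find a parent that has a specific class?

First find the containing div of each block. Then look up its caption to determine whether it is the block you want. If it is, search from that same div to locate the required table inside the block. For example: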

import requests
from bs4 import BeautifulSoup
from unicodedata import normalize

url = "https://www.tibia.com/community/?name=Rubini"
req = requests.get(url)
soup = BeautifulSoup(req.content, "lxml")

# Walk every container block and keep the one whose caption matches.
for div in soup.find_all('div', class_="TableContainer"):
    title = div.find('div', class_="Text").text
    if title == "Character Information":
        # The wanted table is the one nested inside the outer Table3 table.
        table_inner = div.find('table', class_="Table3").table

        for tr in table_inner.find_all('tr'):
            row = [normalize('NFKD', td.get_text(strip=True)) for td in tr.find_all('td')]
            print(row)

Output:

['Name:RubiniTitle:Exalted (21 titles unlocked)Sex:maleVocation:Elder DruidLevel:1305Achievement Points:411World:LibertabraResidence:ThaisGuild Membership:Leader of theLibertabra PuneLast Login:May 18 2022, 22:17:16 CESTComment:"Frase motivacional de Lideres antigos de Guerras"First Level 1000 on RetroHardcore Servers.Rubini formerly Shanera and Calvera@Since 2007836a95f1d69460734766489384f641cbAccount Status:Free Account', 'Name:', 'Rubini', 'Title:', 'Exalted (21 titles unlocked)', 'Sex:', 'male', 'Vocation:', 'Elder Druid', 'Level:', '1305', 'Achievement Points:', '411', 'World:', 'Libertabra', 'Residence:', 'Thais', 'Guild Membership:', 'Leader of theLibertabra Pune', 'Last Login:', 'May 18 2022, 22:17:16 CEST', 'Comment:', '"Frase motivacional de Lideres antigos de Guerras"First Level 1000 on RetroHardcore Servers.Rubini formerly Shanera and Calvera@Since 2007836a95f1d69460734766489384f641cb', 'Account Status:', 'Free Account']
['Name:', 'Rubini']
['Title:', 'Exalted (21 titles unlocked)']
['Sex:', 'male']
['Vocation:', 'Elder Druid']
['Level:', '1305']
['Achievement Points:', '411']
['World:', 'Libertabra']
['Residence:', 'Thais']
['Guild Membership:', 'Leader of theLibertabra Pune']
['Last Login:', 'May 18 2022, 22:17:16 CEST']
['Comment:', '"Frase motivacional de Lideres antigos de Guerras"First Level 1000 on RetroHardcore Servers.Rubini formerly Shanera and Calvera@Since 2007836a95f1d69460734766489384f641cb']
['Account Status:', 'Free Account']
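
For comparison, the bottom-up lookup attempted in the question can be sketched on a trimmed copy of the posted markup. This is only an illustration under assumptions: the sample row (`Name:` / `Rubini`) is placeholder content, and the live page may parse differently from this snippet. Note that the keyword argument is `class_` (trailing underscore); `_class` is treated as a filter on an attribute literally named `_class`, so it matches nothing.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the structure from the question; the table row is a placeholder.
html = """
<div class="TableContainer">
  <div class="CaptionContainer">
    <div class="CaptionInnerContainer">
      <span class="CaptionVerticalLeft"></span>
      <div class="Text">Character Information</div>
      <span class="CaptionVerticalRight"></span>
    </div>
  </div>
  <table class="Table3">
    <tbody><tr><td>
      <div class="InnerTableContainer">
        <table style="width: 100%">
          <tr><td>Name:</td><td>Rubini</td></tr>
        </table>
      </div>
    </td></tr></tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the caption by its exact text, then climb to the enclosing container.
char_title = soup.find("div", string="Character Information")
parent = char_title.find_parent("div", class_="TableContainer")

# From the container, descend to the nested table and collect its cells.
inner = parent.find("div", class_="InnerTableContainer").table
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in inner.find_all("tr")]
print(rows)
```

If `find_parent` still returns None on the real page, the caption is probably not nested inside a `TableContainer` div in the tree the chosen parser built, which is one reason the top-down loop above is the more robust option.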

The text is normalized using unicodedata.normalize().
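
A small sketch of what that normalization does, assuming the page text contains non-breaking spaces (the sample string here is made up for illustration): under the NFKD form, U+00A0 is replaced by an ordinary space.

```python
from unicodedata import normalize

# Hypothetical cell text containing a non-breaking space (U+00A0).
raw = "Last\xa0Login:"

# NFKD applies compatibility decomposition, which maps U+00A0 to a plain space.
clean = normalize("NFKD", raw)

print(repr(raw))
print(repr(clean))
```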
