Get the contents of a table cell without removing the tags inside it



Python beginner here. I'm trying to write a script that retrieves data from a table and organizes it in a dictionary.

The HTML is structured like this:

[...previous code]
<table class="waffle" cellspacing="0" cellpadding="0">
<tbody>
<tr style='height:39px;'>
<td class="s0" dir="ltr">Gold</td>
<td class="freezebar-cell"></td>
<td class="s1" dir="ltr">Johnny <span style="bold">M.</span></td>
</tr>
<tr style='height:39px;'>
<td class="s0" dir="ltr">Silver</td>
<td class="freezebar-cell"></td>
<td class="s1" dir="ltr">Maria <span style="bold">R.</span></td>
</tr>
[rest of the code...]

My current script looks like this:

from bs4 import BeautifulSoup
itemTypeList = [] # Create list of item types
itemContentList = [] # Create list of item contents
soup = BeautifulSoup(open("test/myfile.html"), "lxml") # Open the file
table_body = soup.find("tbody") # Find the table
rows = table_body.find_all("tr") # Find the rows
for row in rows: # For each row
    itemType = row.find_all("td")[0].text # Define the first cell as item type
    itemContent = row.find_all("td")[2] # Define the third cell as item content
    itemTypeList.append(itemType) # Add item type to the item types list
    itemContentList.append(itemContent) # Add item content to the item contents list
mailContent = {itemTypeList[i]: itemContentList[i] for i in range(len(itemTypeList))} # Create a dictionary with type and content for each item

This is the result I get with this script:

{'Gold': <td class="s1" dir="ltr">Johnny <span style="bold">M.</span></td>, 'Silver': <td class="s1" dir="ltr">Maria <span style="bold">R.</span></td>}

I'd like to remove the <td></td> tags around my itemContent items, but I can't use .text as I did for itemType, because I need to keep the <span style="bold"> tag so I can reuse it later in my code.

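What is the best way to do this? I searched for three hours without finding an answer. Apparently .unwrap() could be useful, but when I add it to my code I get an error.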

Thanks for reading!

Julien

You can use element.decode_contents() to get the innerHTML, that is, the cell's contents as a string with the inner tags preserved:

for row in rows: # For each row
    itemType = row.find_all("td")[0].text # Define the first cell as item type
    itemContent = row.find_all("td")[2] # Define the third cell as item content
    itemTypeList.append(itemType) # Add item type to the item types list
    itemContentList.append(itemContent.decode_contents()) # Add item content (inner HTML as a string) to the item contents list
mailContent = {itemTypeList[i]: itemContentList[i] for i in range(len(itemTypeList))} # Create a dictionary with type and content for each item

Output:

{'Gold': 'Johnny <span style="bold">M.</span>', 'Silver': 'Maria <span style="bold">R.</span>'}
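
For reference, here is a minimal self-contained sketch of the same idea that can be run without the original file. It makes a few assumptions that are not in the question: the two sample rows are pasted inline instead of being read from test/myfile.html, the built-in "html.parser" is used in place of "lxml" so no extra parser is needed, and the name mail_content is only illustrative.

```python
from bs4 import BeautifulSoup

# Sample rows copied from the question; the rest of the page is omitted.
html = """
<table class="waffle" cellspacing="0" cellpadding="0">
<tbody>
<tr style='height:39px;'>
<td class="s0" dir="ltr">Gold</td>
<td class="freezebar-cell"></td>
<td class="s1" dir="ltr">Johnny <span style="bold">M.</span></td>
</tr>
<tr style='height:39px;'>
<td class="s0" dir="ltr">Silver</td>
<td class="freezebar-cell"></td>
<td class="s1" dir="ltr">Maria <span style="bold">R.</span></td>
</tr>
</tbody>
</table>
"""

# Assumption: "html.parser" instead of the "lxml" parser used in the question.
soup = BeautifulSoup(html, "html.parser")

mail_content = {}
for row in soup.find("tbody").find_all("tr"):
    cells = row.find_all("td")  # look the cells up once per row
    # First cell: plain text. Third cell: inner HTML, <span> kept, <td> dropped.
    mail_content[cells[0].get_text()] = cells[2].decode_contents()

print(mail_content)
```

Building the dictionary directly inside the loop avoids the two intermediate lists, but the two-list version above works just as well.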

You can also use row.contents:

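mailContent = []
for row in rows:
    # row.contents also holds the newline strings between the cells,
    # so the <td> elements sit at the odd indices 1, 3 and 5
    itemType = row.contents[1].text
    itemContent = row.contents[5]
    mailContent.append({
        # Join the cell's children as strings; combining itemContent.text
        # with itemContent.span would repeat the text inside the <span>
        itemType: "".join(str(child) for child in itemContent.contents)
    })
print(mailContent)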
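
The fixed positions used with row.contents depend on the whitespace between the cells, so they shift if the markup is formatted differently. As a hedged sketch (one sample row pasted inline, "html.parser" assumed instead of "lxml", and the names cells and inner_html purely illustrative), the element children can be filtered first so that the indices match the cell positions:

```python
from bs4 import BeautifulSoup, Tag

# One sample row from the question, trimmed to the essentials.
html = """<table><tbody><tr>
<td>Gold</td>
<td></td>
<td>Johnny <span style="bold">M.</span></td>
</tr></tbody></table>"""

row = BeautifulSoup(html, "html.parser").find("tr")

# row.contents mixes <td> tags with the newline strings between them;
# keeping only Tag instances leaves exactly the three cells.
cells = [child for child in row.contents if isinstance(child, Tag)]

item_type = cells[0].get_text()
inner_html = cells[2].decode_contents()

print(item_type, inner_html)
```

This is only one way to make the lookup robust; row.find_all("td"), as in the first snippet, does the same job.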
