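Python beginner here. I am trying to write a script that retrieves data from a table and organizes it into a dictionary.
The HTML is structured like this: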
[...previous code]
<table class="waffle" cellspacing="0" cellpadding="0">
<tbody>
<tr style='height:39px;'>
<td class="s0" dir="ltr">Gold</td>
<td class="freezebar-cell"></td>
<td class="s1" dir="ltr">Johnny <span style="bold">M.</span></td>
</tr>
<tr style='height:39px;'>
<td class="s0" dir="ltr">Silver</td>
<td class="freezebar-cell"></td>
<td class="s1" dir="ltr">Maria <span style="bold">R.</span></td>
</tr>
[rest of the code...]
My current script looks like this:
from bs4 import BeautifulSoup
itemTypeList = [] # Create list of item types
itemContentList = [] # Create list of item contents
soup = BeautifulSoup(open("test/myfile.html"), "lxml") # Open the file
table_body = soup.find("tbody") # Find the table
rows = table_body.find_all("tr") # Find the rows
for row in rows: # For each row
    itemType = row.find_all("td")[0].text # Define the first cell as item type
    itemContent = row.find_all("td")[2] # Define the third cell as item content
    itemTypeList.append(itemType) # Add item type to the item types list
    itemContentList.append(itemContent) # Add item content to the item contents list
mailContent = {itemTypeList[i]: itemContentList[i] for i in range(len(itemTypeList))} # Create a dictionary with type and content for each item
Here is the result I get with this script:
{'Gold': <td class="s1" dir="ltr">Johnny <span style="bold">M.</span></td>, 'Silver': <td class="s1" dir="ltr">Maria <span style="bold">R.</span></td>}
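I would like to remove the <td></td> tags around my itemContent items, but I cannot use .text as I did for itemType, because I need to keep the <span style="bold"> tag so that I can reuse it later in my code.
What is the best way to do this? I searched for three hours without finding an answer. Apparently .unwrap() might be useful, but when I add it to my code I get an error.
Thanks for reading!
Julian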
You can use element.decode_contents() to get the innerHTML of the cell, that is, its contents as a string without the enclosing <td> tag:
for row in rows: # For each row
    itemType = row.find_all("td")[0].text # Define the first cell as item type
    itemContent = row.find_all("td")[2] # Define the third cell as item content
    itemTypeList.append(itemType) # Add item type to the item types list
    itemContentList.append(itemContent.decode_contents()) # Add the cell's inner HTML to the item contents list
mailContent = {itemTypeList[i]: itemContentList[i] for i in range(len(itemTypeList))} # Create a dictionary with type and content for each item
Output:
{'Gold': 'Johnny <span style="bold">M.</span>', 'Silver': 'Maria <span style="bold">R.</span>'}
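For anyone who wants to try this without the original file, here is a minimal self-contained sketch of the same idea. It assumes the sample markup from the question pasted into a string, uses the built-in "html.parser" instead of "lxml" so nothing extra needs to be installed, and builds the dictionary directly rather than through two lists. The names html, cells and mail_content are just illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumption: the real file has the same shape).
html = """
<table class="waffle" cellspacing="0" cellpadding="0">
<tbody>
<tr style='height:39px;'>
<td class="s0" dir="ltr">Gold</td>
<td class="freezebar-cell"></td>
<td class="s1" dir="ltr">Johnny <span style="bold">M.</span></td>
</tr>
<tr style='height:39px;'>
<td class="s0" dir="ltr">Silver</td>
<td class="freezebar-cell"></td>
<td class="s1" dir="ltr">Maria <span style="bold">R.</span></td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

mail_content = {}
for row in soup.find("tbody").find_all("tr"):
    cells = row.find_all("td")  # look the cells up once per row
    item_type = cells[0].get_text(strip=True)  # plain text of the first cell
    item_content = cells[2].decode_contents()  # inner HTML of the third cell, <span> kept
    mail_content[item_type] = item_content

print(mail_content)
```

Calling find_all("td") once per row and filling the dictionary inside the loop avoids keeping two parallel lists in sync, but the two-list version above works just as well.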
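You can also use row.contents:
mailContent = []
for row in rows:
    # row.contents can also contain the whitespace strings between cells,
    # so these indices depend on the markup layout and the parser
    itemType = row.contents[1].text
    itemContent = row.contents[5]
    mailContent.append({
        # .text would already include the span's text ("Johnny M."), so the
        # inner HTML is used here to avoid repeating "M."
        itemType: itemContent.decode_contents()
    })
print(mailContent)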
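One caveat with indexing into row.contents: the list holds every child node, including the newline strings between the tags, so the positions of the cells shift if the HTML is written without line breaks. The sketch below illustrates this with two hypothetical, simplified rows (pretty and compact are made up for the example, as is the helper cell_tags) and shows one way to pick out only the tag children so that the index no longer depends on whitespace. It uses "html.parser"; other parsers may treat whitespace differently.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Two hypothetical versions of the same row: one with line breaks, one without.
pretty = "<table><tr>\n<td>Gold</td>\n<td></td>\n<td>Johnny <span>M.</span></td>\n</tr></table>"
compact = "<table><tr><td>Gold</td><td></td><td>Johnny <span>M.</span></td></tr></table>"

def cell_tags(row):
    """Return only the tag children of a row, skipping whitespace strings."""
    return [child for child in row.contents if isinstance(child, Tag)]

pretty_row = BeautifulSoup(pretty, "html.parser").tr
compact_row = BeautifulSoup(compact, "html.parser").tr

# The raw child counts differ because of the newline strings...
print(len(pretty_row.contents), len(compact_row.contents))

# ...but after filtering, the third cell is at index 2 in both cases.
for row in (pretty_row, compact_row):
    cells = cell_tags(row)
    print(cells[0].text, "->", cells[2].decode_contents())
```

If the layout of the file is not under your control, row.find_all("td") as in the first snippet is probably the safer choice.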