Here is the full HTML I am using.
Here is a simplified version of the HTML above:
<table class="premium">
<tr class="retailer top-offer" data-pricer="47.84" saler-id="123">...</td>
<tr class="retailer" data-pricer="57.11" saler-id="234">...</td>
</table>
<table class="basic-supp">
<tr class="retailer top-offer" data-pricer="41.87" saler-id="456">...</td>
<tr class="retailer" data-pricer="58.12" saler-id="567">...</td>
</table>
I need to extract the values of the data-pricer="…" attribute from the tr tags inside the table with class="basic-supp".
I tried this on the simplified HTML:
from bs4 import BeautifulSoup
with open('file.html', 'r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, 'lxml')
tags = soup.find_all('tr')
for tag in tags:
    print(tag.attrs['data-pricer'])
> 47.84
> 57.11
> 41.87
> 58.12
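This is almost what I need, except that it picks up values from both tables instead of only the table with class="basic-supp". Any idea how to fix that?
The main problem is that it does not work at all on the full HTML I posted above. The error: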
print(tag.attrs['data-pricer'])
KeyError: 'data-pricer'
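Can anyone give me some advice? Thanks for your time! Note: this is not a duplicate of "Extracting an attribute value with beautifulsoup".

It is simpler to use CSS selectors: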
data = []
for tr in soup.select('table.basic-supp tr'):
    data.append([tr['data-pricer'], tr['saler-id']])
print(data)
Or, if you prefer a list comprehension, a single line is enough:
[[tr['data-pricer'],tr['saler-id']] for tr in soup.select('table.basic-supp tr')]
Either way, the output should be:
[['41.87', '456'], ['58.12', '567']]
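
As for the KeyError on the full page: the full HTML is not shown here, so this is only a guess, but a common cause is that some tr rows (a header row, for example) do not carry a data-pricer attribute. Under that assumption, adding an attribute selector restricts the match to rows that actually have it. The sample HTML below, including the header row, is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the header row has no data-pricer attribute,
# which is assumed (not confirmed) to be what triggers the KeyError.
html = """
<table class="premium">
  <tr class="retailer" data-pricer="47.84" saler-id="123"><td>...</td></tr>
</table>
<table class="basic-supp">
  <tr><th>Retailer</th></tr>
  <tr class="retailer top-offer" data-pricer="41.87" saler-id="456"><td>...</td></tr>
  <tr class="retailer" data-pricer="58.12" saler-id="567"><td>...</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# tr[data-pricer] matches only rows that actually have the attribute,
# so indexing tr["data-pricer"] cannot raise KeyError here.
rows = soup.select("table.basic-supp tr[data-pricer]")

# .get() returns None instead of raising if saler-id happens to be missing.
data = [[tr["data-pricer"], tr.get("saler-id")] for tr in rows]
print(data)
```

If the real page is structured differently, the selector will need adjusting, but the idea of filtering on the attribute itself should carry over.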
First find the <tr>, then use tr['data-pricer'] to get what you want.
Try this:
from bs4 import BeautifulSoup
html = '''
<table class="premium">
<tr class="retailer top-offer" data-pricer="47.84" saler-id="123">...</td>
<tr class="retailer" data-pricer="57.11" saler-id="234">...</td>
</table>
<table class="basic-supp">
<tr class="retailer top-offer" data-pricer="41.87" saler-id="456">...</td>
<tr class="retailer" data-pricer="58.12" saler-id="567">...</td>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
for table in soup.find_all("table", {"class": "basic-supp"}):
    for tr in table.find_all('tr'):
        print(tr['data-pricer'])
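
If some rows on the real page lack the attribute (an assumption, since the full HTML is not shown), tr['data-pricer'] will still raise KeyError. A defensive variant of the same loop could use tr.get(), which returns None for a missing attribute, and convert the values to float along the way. The sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample with one row that has no data-pricer attribute.
html = """
<table class="premium">
  <tr class="retailer" data-pricer="47.84" saler-id="123"><td>...</td></tr>
</table>
<table class="basic-supp">
  <tr><th>Retailer</th></tr>
  <tr class="retailer top-offer" data-pricer="41.87" saler-id="456"><td>...</td></tr>
  <tr class="retailer" data-pricer="58.12" saler-id="567"><td>...</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

prices = []
for table in soup.find_all("table", class_="basic-supp"):
    for tr in table.find_all("tr"):
        value = tr.get("data-pricer")  # None when the attribute is absent
        if value is not None:
            prices.append(float(value))

print(prices)
```

Whether the prices should stay as strings or become floats depends on what you do with them next; the conversion here is just one option.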