I am trying to use BeautifulSoup to get the value (1.212,00) from inside a table, but the tr has no class defined. This is my attempt:
import requests
from bs4 import BeautifulSoup
url = "http://www-sdc/ResultadoSalCon.asp"
bs = BeautifulSoup(requests.get(url).content, "html.parser")
trs = (
    bs.find("td", {"b": "SALÁRIO-BASE"})
    .find("table", {"class": "titulo"})
    .findAll("tr")
)
for tr in trs:
    if trs.index(tr) == 2:
        tds = tr.findAll("td")
        for td in tds:
            if tds.index(td) == 3:
                valor = td.get_text()
                print(valor)
This is the error I get:
bs.find("td", {"b": "SALÁRIO-BASE"})
AttributeError: 'NoneType' object has no attribute 'find'
This is the HTML of the site I am trying to collect data from:
<p align="center" class="titulo"><b> COMPETÊNCIA: 01/2022</b></p>
<table border="1" width="80%" height="21" class="titulo" cellSpacing="0" cellPadding = "2" align="center">
<tr align="center" height="17">
<td><b>CLASSE</b></td>
<td><b>SALÁRIO-BASE </b></td>
<td><b>ALÍQUOTA-AUTÔNOMO (%)</b></td>
<td><b>ALÍQUOTA-EMPREGADOR (%)</b></td>
<td><b>CONTRIBUIÇÃO-AUTÔNOMO </b></td>
<td><b>CONTRIBUIÇÃO-EMPREGADOR </b></td>
</tr>
<tr>
<td align="center">1</td>
<td align="center"> 1.212,00</td>
<td align="center"> 20,00</td>
<td align="center"> 20,00</td>
<td align="center"> 242,40</td>
<td align="center"> 242,40</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center"> 7.087,22</td>
<td align="center"> 20,00</td>
<td align="center"> 20,00</td>
<td align="center"> 1.417,44</td>
<td align="center"> 1.417,44</td>
</tr>
You can try the following:
from io import StringIO
import pandas as pd
html = '''
<p align="center" class="titulo"><b> COMPETÊNCIA: 01/2022</b></p>
<table border="1" width="80%" height="21" class="titulo" cellSpacing="0" cellPadding = "2" align="center">
<tr align="center" height="17">
<td><b>CLASSE</b></td>
<td><b>SALÁRIO-BASE </b></td>
<td><b>ALÍQUOTA-AUTÔNOMO (%)</b></td>
<td><b>ALÍQUOTA-EMPREGADOR (%)</b></td>
<td><b>CONTRIBUIÇÃO-AUTÔNOMO </b></td>
<td><b>CONTRIBUIÇÃO-EMPREGADOR </b></td>
</tr>
<tr>
<td align="center">1</td>
<td align="center"> 1.212,00</td>
<td align="center"> 20,00</td>
<td align="center"> 20,00</td>
<td align="center"> 242,40</td>
<td align="center"> 242,40</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center"> 7.087,22</td>
<td align="center"> 20,00</td>
<td align="center"> 20,00</td>
<td align="center"> 1.417,44</td>
<td align="center"> 1.417,44</td>
</tr>
</table>
'''
# Recent pandas versions expect a file-like object rather than a raw HTML string
dfs = pd.read_html(StringIO(html))
df = dfs[0]
# The first row holds the column names, so promote it to the header
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header
print(df)
This returns:
   CLASSE  SALÁRIO-BASE  ALÍQUOTA-AUTÔNOMO (%)  ALÍQUOTA-EMPREGADOR (%)  CONTRIBUIÇÃO-AUTÔNOMO  CONTRIBUIÇÃO-EMPREGADOR
1       1      1.212,00                   2000                     2000                  24240                    24240
2      10      7.087,22                   2000                     2000               1.417,44                 1.417,44
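You can now access individual values from this DataFrame, for example df['SALÁRIO-BASE'][1] (which returns '1.212,00'). Note that read_html treats "," as a thousands separator by default, which is why 20,00 shows up as 2000 and 242,40 as 24240 in the output above. read_html accepts thousands and decimal parameters to control this.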
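The value comes back as a string in Brazilian number format (dot for thousands, comma for decimals), so it usually needs converting before any arithmetic. A minimal sketch using only the standard library is shown here. The helper name parse_brl_number is hypothetical and not part of pandas or BeautifulSoup, and it assumes the input always uses this format.

```python
from decimal import Decimal


def parse_brl_number(text: str) -> Decimal:
    """Convert a string such as '1.212,00' to a Decimal.

    Assumes '.' is the thousands separator and ',' the decimal separator.
    """
    normalized = text.strip().replace(".", "").replace(",", ".")
    return Decimal(normalized)


salario_base = parse_brl_number(" 1.212,00")
print(salario_base)  # 1212.00
```

Decimal is used instead of float here because these are monetary amounts, where exact decimal arithmetic tends to be the safer choice.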
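There are 2 problems with your code:
- How you locate the relevant table (you are looking for the table inside a td). The dictionary {"b": "SALÁRIO-BASE"} matches a td with an attribute named b, not a td containing a <b> tag, so find returns None and the next call raises the AttributeError:
Replace this:
trs = (
    bs.find("td", {"b": "SALÁRIO-BASE"})
    .find("table", {"class": "titulo"})
    .findAll("tr")
)
with this:
trs = (
    bs.find("table", {"class": "titulo"})
    .findAll("tr")
)
- And the way you navigate the table tree. Remember that tr and td indices start at zero:
Replace this:
if trs.index(tr) == 2:
if tds.index(td) == 3:
with this:
if trs.index(tr) == 1:
if tds.index(td) == 1: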
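To illustrate the first point, here is a small sketch on a trimmed-down copy of the table (the inline HTML is an assumption made so the example runs without network access). It shows that the attribute dictionary finds nothing, and one possible way to start from the header text instead, using the string argument of find and then find_parent.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the table from the question, used only for illustration
html = """
<table class="titulo">
<tr><td><b>CLASSE</b></td><td><b>SALÁRIO-BASE </b></td></tr>
<tr><td align="center">1</td><td align="center"> 1.212,00</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The dict filters on attributes: no <td b="SALÁRIO-BASE"> exists, so this is None
missing = soup.find("td", {"b": "SALÁRIO-BASE"})
print(missing)

# Searching by text instead: find the <b> whose text contains the header label
header_b = soup.find("b", string=lambda s: s is not None and "SALÁRIO-BASE" in s)

# Walk up from the header cell to the enclosing table
table = header_b.find_parent("table")
print(table.get("class"))
```

Checking the result of find for None before chaining further calls avoids the AttributeError shown in the question.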
Full code:
import requests
from bs4 import BeautifulSoup
url = "http://www-sdc/ResultadoSalCon.asp"
bs = BeautifulSoup(requests.get(url).content, "html.parser")
trs = (
    bs.find("table", {"class": "titulo"})
    .findAll("tr")
)
for tr in trs:
    if trs.index(tr) == 1:
        tds = tr.findAll("td")
        for td in tds:
            if tds.index(td) == 1:
                valor = td.get_text()
                print(valor)
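Since only one cell is needed, the two loops can also be replaced by direct indexing. The sketch below is one possible variant, not the only way to do it. It runs against an inline copy of the table (an assumption, so it works without access to the site) and looks up the column position from the header row instead of hard-coding it.

```python
from bs4 import BeautifulSoup

# Inline copy of part of the table from the question, used only for illustration
html = """
<table class="titulo">
<tr><td><b>CLASSE</b></td><td><b>SALÁRIO-BASE </b></td></tr>
<tr><td align="center">1</td><td align="center"> 1.212,00</td></tr>
<tr><td align="center">10</td><td align="center"> 7.087,22</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", class_="titulo").find_all("tr")

# Header texts, stripped of surrounding whitespace
headers = [td.get_text(strip=True) for td in rows[0].find_all("td")]
col = headers.index("SALÁRIO-BASE")

# rows[1] is the first data row (index 0 is the header row)
valor = rows[1].find_all("td")[col].get_text(strip=True)
print(valor)  # 1.212,00
```

Looking up the column by its header text keeps the code working if the site later reorders the columns, at the cost of depending on the exact header label.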