Beautiful Soup: getting the text inside a table (Python)



I am trying to use BeautifulSoup to get a value (1.212,00) from inside a table, but the tr has no class defined. Here is my attempt:

import requests
from bs4 import BeautifulSoup
url = "http://www-sdc/ResultadoSalCon.asp"
bs = BeautifulSoup(requests.get(url).content, "html.parser")
trs = (
    bs.find("td", {"b": "SALÁRIO-BASE"})
    .find("table", {"class": "titulo"})
    .findAll("tr")
)
for tr in trs:
    if trs.index(tr) == 2:
        tds = tr.findAll("td")
        for td in tds:
            if tds.index(td) == 3:
                valor = td.get_text()
                print(valor)

This is the error I get:

bs.find("td", {"b": "SALÁRIO-BASE"})
AttributeError: 'NoneType' object has no attribute 'find'

This is the HTML of the site I am trying to collect data from:

<p align="center" class="titulo"><b> COMPETÊNCIA: 01/2022</b></p>
<table border="1" width="80%" height="21" class="titulo"  cellSpacing="0" cellPadding = "2" align="center">
<tr align="center" height="17">
<td><b>CLASSE</b></td>
<td><b>SALÁRIO-BASE </b></td>
<td><b>ALÍQUOTA-AUTÔNOMO (%)</b></td>
<td><b>ALÍQUOTA-EMPREGADOR (%)</b></td>
<td><b>CONTRIBUIÇÃO-AUTÔNOMO </b></td>
<td><b>CONTRIBUIÇÃO-EMPREGADOR </b></td>
</tr>
<tr>
<td align="center">1</td>
<td align="center">              1.212,00</td>
<td align="center">     20,00</td>
<td align="center">     20,00</td>
<td align="center">                242,40</td>
<td align="center">                242,40</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center">              7.087,22</td>
<td align="center">     20,00</td>
<td align="center">     20,00</td>
<td align="center">              1.417,44</td>
<td align="center">              1.417,44</td>
</tr>

You can try the following:

from io import StringIO

import pandas as pd

html = '''
<p align="center" class="titulo"><b> COMPETÊNCIA: 01/2022</b></p>
<table border="1" width="80%" height="21" class="titulo"  cellSpacing="0" cellPadding = "2" align="center">
<tr align="center" height="17">
<td><b>CLASSE</b></td>
<td><b>SALÁRIO-BASE </b></td>
<td><b>ALÍQUOTA-AUTÔNOMO (%)</b></td>
<td><b>ALÍQUOTA-EMPREGADOR (%)</b></td>
<td><b>CONTRIBUIÇÃO-AUTÔNOMO </b></td>
<td><b>CONTRIBUIÇÃO-EMPREGADOR </b></td>
</tr>
<tr>
<td align="center">1</td>
<td align="center">              1.212,00</td>
<td align="center">     20,00</td>
<td align="center">     20,00</td>
<td align="center">                242,40</td>
<td align="center">                242,40</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center">              7.087,22</td>
<td align="center">     20,00</td>
<td align="center">     20,00</td>
<td align="center">              1.417,44</td>
<td align="center">              1.417,44</td>
</tr>
</table>
'''
# read_html returns one DataFrame per <table>. The literal is wrapped in
# StringIO because passing a raw HTML string is deprecated in recent pandas.
dfs = pd.read_html(StringIO(html))
df = dfs[0]
# The header cells are <td>, not <th>, so promote the first row to column names.
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header
print(df)

This returns:

   CLASSE  SALÁRIO-BASE  ALÍQUOTA-AUTÔNOMO (%)  ALÍQUOTA-EMPREGADOR (%)  CONTRIBUIÇÃO-AUTÔNOMO  CONTRIBUIÇÃO-EMPREGADOR
1       1      1.212,00                   2000                     2000                  24240                    24240
2      10      7.087,22                   2000                     2000               1.417,44                 1.417,44

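You can now access any value from this DataFrame, for example df['SALÁRIO-BASE'][1] (which returns the string '1.212,00'). Note that values such as 20,00 and 242,40 show up as 2000 and 24240 because read_html treats the comma as a thousands separator by default.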
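One way to avoid the mangled 2000 and 24240 values in the output above is to tell read_html about the Brazilian number format. The sketch below is not part of the original answer: it assumes a trimmed three-column copy of the table, and it uses the header, decimal and thousands parameters of pd.read_html so that the first row becomes the column names and the cells are parsed as numbers.

```python
from io import StringIO

import pandas as pd

# Trimmed copy of the table from the question (assumed to be representative).
html = """
<table class="titulo">
<tr><td><b>CLASSE</b></td><td><b>SALÁRIO-BASE </b></td><td><b>ALÍQUOTA-AUTÔNOMO (%)</b></td></tr>
<tr><td>1</td><td>              1.212,00</td><td>     20,00</td></tr>
<tr><td>10</td><td>              7.087,22</td><td>     20,00</td></tr>
</table>
"""

# header=0 uses the first row as column names; decimal="," and thousands="."
# make "1.212,00" parse as a float instead of staying a string.
df = pd.read_html(StringIO(html), header=0, decimal=",", thousands=".")[0]

salario = df["SALÁRIO-BASE"].iloc[0]
aliquota = df["ALÍQUOTA-AUTÔNOMO (%)"].iloc[0]
print(salario, aliquota)
```

With this variant the values come back as floats, so the original '1.212,00' string formatting is lost; keep the string-based version if you need the text exactly as it appears on the page.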

There are 2 problems with your code:

  1. How you locate the table (you are looking for the table inside a td):

Replace this:

trs = (
    bs.find("td", {"b": "SALÁRIO-BASE"})
    .find("table", {"class": "titulo"})
    .findAll("tr")
)

With this:

trs = (
    bs.find("table", {"class": "titulo"})
    .findAll("tr")
)
  2. How you navigate the table tree. Remember that tr and td indexes start at zero:

Replace this:

if trs.index(tr) == 2:
if tds.index(td) == 3:

With this:

if trs.index(tr) == 1:
if tds.index(td) == 1:
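To see why the original lookup fails, it may help to run both searches against a small inline copy of the table. The sketch below assumes a trimmed two-column version of the question's HTML. The dictionary passed to find filters on tag attributes, so {"b": "SALÁRIO-BASE"} looks for a td with an attribute named b, which does not exist, and find returns None.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (assumed to be representative).
html = """
<table class="titulo">
<tr><td><b>CLASSE</b></td><td><b>SALÁRIO-BASE </b></td></tr>
<tr><td align="center">1</td><td align="center">              1.212,00</td></tr>
<tr><td align="center">10</td><td align="center">              7.087,22</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Looks for <td b="SALÁRIO-BASE">, not for a <b> child, so nothing matches.
missing = soup.find("td", {"b": "SALÁRIO-BASE"})
print(missing)

# Filtering on the table's own class attribute does match.
table = soup.find("table", {"class": "titulo"})
rows = table.find_all("tr")

# Indexes are zero-based: rows[0] is the header, rows[1] the first data row,
# and cells[1] is the SALÁRIO-BASE column.
cells = rows[1].find_all("td")
print(cells[1].get_text(strip=True))
```

Chaining .find onto that None result is what raises the AttributeError shown in the question.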

Full code:

import requests
from bs4 import BeautifulSoup
url = "http://www-sdc/ResultadoSalCon.asp"
bs = BeautifulSoup(requests.get(url).content, "html.parser")
trs = (
    bs.find("table", {"class": "titulo"})
    .findAll("tr")
)
for tr in trs:
    if trs.index(tr) == 1:
        tds = tr.findAll("td")
        for td in tds:
            if tds.index(td) == 1:
                valor = td.get_text()
                print(valor)
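Since the row and column positions are known, the two loops could also be replaced by direct indexing. The following is only a sketch under that assumption: cell_text and parse_pt_br_number are hypothetical helper names, not part of BeautifulSoup, and the inline HTML stands in for the page fetched with requests.

```python
from bs4 import BeautifulSoup


def cell_text(html, row, col):
    """Hypothetical helper: text of one cell in the 'titulo' table, or None."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "titulo"})
    if table is None:  # guard against the NoneType error from the question
        return None
    rows = table.find_all("tr")
    if row >= len(rows):
        return None
    cells = rows[row].find_all("td")
    if col >= len(cells):
        return None
    # strip=True removes the padding spaces around the value
    return cells[col].get_text(strip=True)


def parse_pt_br_number(text):
    """Hypothetical helper: turn '1.212,00' into the float 1212.0."""
    return float(text.replace(".", "").replace(",", "."))


# Trimmed stand-in for the real page content.
sample = """
<table class="titulo">
<tr><td><b>CLASSE</b></td><td><b>SALÁRIO-BASE </b></td></tr>
<tr><td align="center">1</td><td align="center">              1.212,00</td></tr>
</table>
"""
valor = cell_text(sample, 1, 1)
print(valor)
```

Returning None instead of raising keeps the caller in control when the page layout changes; parse_pt_br_number is only needed if you want the value as a number rather than as text.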
