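BeautifulSoup says tag has no attributes while looking for sibling or parent tags

I am trying to extract the tax bill from the following web page (http://www.sarasotataxcollector.com/ecomm/proc.php?r=eBillingInvitation&a=0429120051).

The tax bill is the $8,084.54 value that appears directly after the "Taxes & Assessments" string.

I need to anchor on some static object, because the code has to work across multiple pages.

The "Taxes & Assessments" string is constant across all pages and always comes right before the full tax bill, whereas the tax bill amount changes from page to page.

My thought was that I could find the "Taxes & Assessments" string, then traverse the BeautifulSoup tree to find the tax bill. This is my code: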

soup = BeautifulSoup(html_content,'html.parser') #Soupify the HTML content
tagTandA = soup.body.find(text = "Taxes & Assessments")
taxBill = tagTandA.find_next_sibling.text

It returns this error:

AttributeError: 'NoneType' object has no attribute 'find_next_sibling'

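In fact, parent, next_sibling, find_next_sibling, or any similar function raises this same "object has no attribute" error.

I tried searching for other explicit text, just to check that this specific text is not the problem, and the "no attribute" error is still thrown.

When I run only the following line, it returns "None":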

tagTandA = soup.body.find(text = "Taxes & Assessments")

How can I find the "Taxes & Assessments" tag so that I can navigate the tree to find and return the tax bill?

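If I'm not mistaken, you are trying to use a requests & bs based solution to scrape a (very) JS-heavy website, with redirects and some iframes.

I don't think that will work.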
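One quick way to check that suspicion before reaching for a browser is to see whether the fetched HTML contains the label at all, or whether it is only a frameset pointing somewhere else. The sketch below uses just the standard library; `FrameFinder`, `diagnose`, and the sample markup are hypothetical names and data for illustration, not the real page.

```python
import html
from html.parser import HTMLParser


class FrameFinder(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def diagnose(html_text, label="Taxes & Assessments"):
    """Report whether the label is in the raw HTML and which frames it references."""
    finder = FrameFinder()
    finder.feed(html_text)
    # The raw markup may carry the ampersand escaped as &amp;, so check both forms.
    present = label in html_text or html.escape(label) in html_text
    return {"label_present": present, "frame_sources": finder.sources}


# Made-up frameset page: the visible content lives in another document.
sample = '<html><frameset><frame name="body" src="inner.php?a=1"></frameset></html>'
print(diagnose(sample))
```

If `label_present` is false and `frame_sources` is not empty, `find()` returning None is expected: the text is simply not in the document that was parsed.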

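Here is one way to get that information using Selenium (there are some unused imports, which you can remove; you can also improve on the hardcoded wait if you wish, I just didn't have time to fiddle with it):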

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time as t
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver_linux64/chromedriver") ## path to where you saved chromedriver binary
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(driver, 25)
url = 'http://www.sarasotataxcollector.com/ecomm/proc.php?r=eBillingInvitation&a=0429120051'
driver.get(url)
t.sleep(15)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@name="body"]')))
total_taxes = wait.until(EC.element_to_be_clickable((By.XPATH, "//font[contains(text(), 'Taxes & Assessments')]/ancestor::td/following-sibling::td")))
print('Tax bill: ', total_taxes.text)

Result in terminal:

Tax bill:  $8,084.54

See the Selenium documentation for more details.

A very kind person (u/commandlineuser) answered this with BS code here: https://www.reddit.com/r/learnpython/comments/10hywbs/beautifulsoup_saying_tag_has_no_attributes_while/

The code is as follows:

import re
import requests
from bs4 import BeautifulSoup
url = ""
r1 = requests.get(url)
soup1 = BeautifulSoup(r1.content, "html.parser")
base = r1.url[:r1.url.rfind("/") + 1]
href1 = soup1.find("frame").get("src")
r2 = requests.get(base + href1)
soup2 = BeautifulSoup(
    r2.content
    .replace(b"<!--", b"")  # basic attempt at stripping comments
    .replace(b"-->", b""),
    "html.parser"
)
href2 = soup2.find("voicemax").get_text(strip=True)
r3 = requests.get(base + href2)
soup3 = BeautifulSoup(r3.content, "html.parser")
total = (
    soup3.find(text=re.compile("Taxes & Assessments"))
    .find_next()
    .get_text(strip=True)
)
print(total)
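The `base + href` concatenation above works when each `src` is a plain relative filename. A slightly more defensive sketch, assuming a frame `src` could also be root-relative or absolute, is to resolve it with `urllib.parse.urljoin` from the standard library. The URLs and the `resolve` helper below are made up for illustration.

```python
from urllib.parse import urljoin


def resolve(page_url, href):
    """Resolve a frame src against the URL of the page that contained it."""
    return urljoin(page_url, href)


# Hypothetical page URL, shaped like the one in the question.
page = "http://example.com/ecomm/proc.php?r=x&a=1"

print(resolve(page, "frame.php?id=2"))                 # relative to the page's directory
print(resolve(page, "/other/frame.php"))               # root-relative
print(resolve(page, "http://cdn.example.com/f.html"))  # already absolute
```

Unlike slicing at the last "/", this does not depend on where slashes happen to appear in the query string.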

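find_next_sibling is a function/method, so call it as find_next_sibling(). There is a similarly named attribute (next_sibling), so I can see where the confusion comes from.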
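To make the method-versus-attribute point concrete, here is a small sketch run against made-up markup that only imitates a label cell followed by an amount cell; it is not the real page, and `tax_bill` is a hypothetical helper name. It also guards against `find()` returning None, which is what happens when the text is not in the parsed document.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: a label cell followed by the amount cell.
SAMPLE = """
<table>
  <tr>
    <td><font>Taxes &amp; Assessments</font></td>
    <td>$8,084.54</td>
  </tr>
</table>
"""


def tax_bill(html_text):
    """Return the text of the cell after the label, or None if it is not found."""
    soup = BeautifulSoup(html_text, "html.parser")
    label = soup.find(string=re.compile(r"Taxes\s*&\s*Assessments"))
    if label is None:
        return None  # the label is not in this document at all
    cell = label.find_parent("td")
    if cell is None:
        return None
    amount_cell = cell.find_next_sibling("td")  # a method: note the parentheses
    if amount_cell is None:
        return None
    return amount_cell.get_text(strip=True)


print(tax_bill(SAMPLE))
print(tax_bill("<p>nothing here</p>"))
```

Writing `cell.find_next_sibling.text` without the parentheses would try to read `.text` from the method object itself rather than from the sibling tag.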
