BeautifulSoup (bs4) finds no elements with find_all, select, or select_one



Example URL to crawl: www.yelp.com/biz/dallas-marketing-rockstar-dallas?adjust_creative=3cZu3ieq3omptvF-Yfj2ow&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=3cZu3ieq3omptvF-Yfj2ow

My code:

import sys

import requests
from bs4 import BeautifulSoup

# Company is the asker's own model class (its definition is not shown)


def get_websites():
    for yso in Company.objects.filter(crawled=False, source='YAG'):
        r = requests.get(yso.url)

        soup = BeautifulSoup(r.content, 'lxml')
        # Stop the whole run if the site served a captcha page
        if soup.select_one(".g-recaptcha") is not None:
            sys.exit("Captcha")
        soup_select = soup.select_one("a[href*='biz_redir']")
        try:
            yso.website = soup_select.text
            print('website for %s added' % (yso.website))
        except Exception as e:
            print(e)
            print('no website for %s added' % yso.name)
        if not yso.crawled:
            yso.crawled = True
            yso.save()

With both lxml and html.parser, the CSS selector soup.select_one("a[href*='biz_redir']") returns None. soup.select("a[href*='biz_redir']") is also an empty list, and soup.find_all("a[href*='biz_redir']") is an empty list too.

lxml version 4.5.0
beautifulsoup version 4.9.3

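Edit: changing "a[href*='biz_redir']" to just a produces the same result. So even if the selector syntax is wrong, something more fundamental than syntax is going wrong.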
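One way to test that suspicion is to look at the raw response without BeautifulSoup at all. The following is only a minimal sketch using the standard library: count_anchors and both sample pages are hypothetical stand-ins for r.text, not actual Yelp markup. It counts the <a> start tags in the HTML and separately checks whether the string biz_redir occurs anywhere in the text.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Count <a> start tags in a document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def count_anchors(html):
    """Return the number of <a> elements present in the raw HTML."""
    parser = AnchorCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical samples: one page with the link in the markup, and one
# where the link only exists as data inside a <script> element.
static_page = '<a href="/biz_redir?url=https%3A%2F%2Fexample.com">example.com</a>'
dynamic_page = '<div id="root"></div><script>{"href": "/biz_redir?url=x"}</script>'

print(count_anchors(static_page))    # 1
print(count_anchors(dynamic_page))   # 0
print("biz_redir" in dynamic_page)   # True
```

If the anchor count is zero but the string is still present in the text, the link was delivered as data rather than as an element, and no selector will match it.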

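The data is loaded dynamically, so requests alone will not see it: requests does not execute JavaScript. However, the link is present in the page as JSON, which you can extract with the json module.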

import json
import re

import requests
from bs4 import BeautifulSoup

URL = "https://www.yelp.com/biz/dallas-marketing-rockstar-dallas?adjust_creative=3cZu3ieq3omptvF-Yfj2ow&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=3cZu3ieq3omptvF-Yfj2ow%27"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# The page data is embedded as JSON in a <script> under <yelp-react-root>
script = soup.select_one(
    "#wrap > div.main-content-wrap.main-content-wrap--full > yelp-react-root > script"
).string

# Take everything from the first "{" to the last "}" and parse it as JSON
json_data = json.loads(re.search(r"({.*})", script).group(1))

# The stored href is relative, so prefix it with the site root
print(
    "https://yelp.com"
    + json_data["bizDetailsPageProps"]["bizContactInfoProps"]["businessWebsite"]["href"]
)
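The lookup above raises an exception if the script tag is missing, the JSON cannot be parsed, or a business has no website entry. Here is a more defensive sketch of the same idea. It assumes the key path used above; extract_website_href and sample_script are hypothetical names, and the sample text only imitates the shape of the embedded data. It returns None instead of failing.

```python
import json
import re


def extract_website_href(script_text):
    """Return the business website href from embedded JSON, or None."""
    if not script_text:
        return None
    # re.DOTALL lets the greedy match span newlines inside the JSON blob.
    match = re.search(r"\{.*\}", script_text, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Walk the nested keys, stopping as soon as one is missing.
    node = data
    for key in ("bizDetailsPageProps", "bizContactInfoProps",
                "businessWebsite", "href"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Hypothetical script content shaped like the structure used above.
sample_script = """<!--{"bizDetailsPageProps": {"bizContactInfoProps": {
    "businessWebsite": {"href": "/biz_redir?url=https%3A%2F%2Fexample.com"}}}}-->"""

print(extract_website_href(sample_script))
print(extract_website_href("no json here"))
```

In a crawl loop, a None result can be treated the same way as the "no website" branch in the question's code.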

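Another option is to scrape the page with Selenium, which supports dynamic content because it drives a real browser.

Install it with: pip install selenium.

Download the ChromeDriver that matches your installed Chrome version.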

from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver

URL = "https://www.yelp.com/biz/dallas-marketing-rockstar-dallas?adjust_creative=3cZu3ieq3omptvF-Yfj2ow&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=3cZu3ieq3omptvF-Yfj2ow%27"

# Placeholder path to the ChromeDriver executable; replace it with your own
driver = webdriver.Chrome(r"c:\path\to\chromedriver.exe")
driver.get(URL)

# Wait for the page to fully render
sleep(5)

# Parse the rendered DOM, which now contains the website link
soup = BeautifulSoup(driver.page_source, "html.parser")
print("https://yelp.com" + soup.select_one("a[href*='biz_redir']")["href"])
driver.quit()

Output:

https://yelp.com/biz_redir?url=https%3A%2F%2Fwww.rockstar.marketing&website_link_type=website&src_bizid=CodEpKvY8ZM7IbCEWxpQ0g&cachebuster=1607826143&s=d214a1df7e2d21ba53939356ac6679631a458ec0360f6cb2c4699ee800d84520
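The printed value is a Yelp redirect link, not the business's own address. If the goal is to store the actual website, the target appears to be carried percent-encoded in the url query parameter, so it can be decoded with the standard library. This is a small sketch under that assumption; target_from_redirect is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlparse


def target_from_redirect(redirect_url):
    """Return the decoded `url` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(redirect_url).query)
    values = query.get("url")
    return values[0] if values else None


# The redirect link from the output above, split across lines for readability.
redirect = (
    "https://yelp.com/biz_redir?url=https%3A%2F%2Fwww.rockstar.marketing"
    "&website_link_type=website&src_bizid=CodEpKvY8ZM7IbCEWxpQ0g"
    "&cachebuster=1607826143"
    "&s=d214a1df7e2d21ba53939356ac6679631a458ec0360f6cb2c4699ee800d84520"
)

print(target_from_redirect(redirect))  # https://www.rockstar.marketing
```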
