I created a script that uses concurrent.futures to scrape some data points from a website. The script works perfectly the way I'm currently using it. However, I'd like to supply the links to the future_to_url block as a list, rather than one link at a time.
This is what I'm trying at the moment:
import concurrent.futures

import requests
from bs4 import BeautifulSoup

# headers is assumed to be defined elsewhere in the full script

links = [
    'first link',
    'second link',
    'third link',
]

def get_links(link):
    # follow the pagination until there is no "Next" link left
    while True:
        res = requests.get(link, headers=headers)
        soup = BeautifulSoup(res.text, "html.parser")
        for item in soup.select("[data-testid='serp-ia-card'] [class*='businessName'] a[href^='/biz/'][name]"):
            shop_name = item.get_text(strip=True)
            shop_link = item.get('href')
            yield shop_name, shop_link
        next_page = soup.select_one("a.next-link[aria-label='Next']")
        if not next_page:
            return
        link = next_page.get("href")

def get_content(shop_name, shop_link):
    res = requests.get(shop_link, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    try:
        phone = soup.select_one("p:-soup-contains('Phone number') + p").get_text(strip=True)
    except (AttributeError, TypeError):
        phone = ""
    return shop_name, shop_link, phone

if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
        """
        would like to supply the list of links to
        the "future_to_url" block instead of one link at a time
        """
        for link in links:
            future_to_url = {executor.submit(get_content, *elem): elem for elem in get_links(link)}
            for future in concurrent.futures.as_completed(future_to_url):
                shop_name, shop_link, phone = future.result()
                print(shop_name, shop_link, phone)
I think what you need is executor.map, to which you can pass an iterable. I've simplified your code, since you didn't provide the actual links, but this should give you the general idea.
Here's how:
import concurrent.futures
from itertools import chain

import requests
from bs4 import BeautifulSoup

links = [
    'https://stackoverflow.com/questions/tagged/beautifulsoup?sort=Newest&filters=NoAnswers&uqlId=30134',
    'https://stackoverflow.com/questions/tagged/python-3.x+web-scraping?sort=Newest&filters=NoAnswers&uqlId=27838',
]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
}

def get_links(source_url: str) -> list:
    # collect (question title, absolute URL) pairs from one tag page;
    # s is the requests.Session opened in __main__
    soup = BeautifulSoup(s.get(source_url, headers=headers).text, "html.parser")
    return [
        (a.getText(), f"https://stackoverflow.com{a['href']}") for a
        in soup.select(".s-post-summary--content .s-post-summary--content-title a")
    ]

def get_content(content_data: tuple) -> str:
    # visit the question page and pull the asker's user name
    question, url = content_data
    user = (
        BeautifulSoup(s.get(url, headers=headers).text, "html.parser")
        .select_one(".user-info .user-details a")
    )
    return f"{question}\n{url}\nAsked by: {user.getText()}"

if __name__ == '__main__':
    with requests.Session() as s:
        with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
            # the inner map fetches every tag page, the outer map visits
            # every question found on those pages
            results = executor.map(
                get_content,
                chain.from_iterable(executor.map(get_links, links)),
            )
            for result in results:
                print(result)
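The inner executor.map(get_links, links) produces one list of (title, url) pairs per tag page, and chain.from_iterable flattens those per-page lists into a single iterable for the outer executor.map to consume. A minimal illustration of just the flattening step, using made-up placeholder data:

from itertools import chain

# each inner list stands in for the result of one get_links call
pages = [[('q1', 'url1'), ('q2', 'url2')], [('q3', 'url3')]]
print(list(chain.from_iterable(pages)))
# [('q1', 'url1'), ('q2', 'url2'), ('q3', 'url3')]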
Running this prints all the question titles and who asked them; in this example, I visit each question page to get the user name:
Can't use find after find_all while making a loop for parsing
https://stackoverflow.com/questions/74570141/cant-use-find-after-find-all-while-making-a-loop-for-parsing
Asked by: wasdy
The results are different from the VS code when using AWS lambda.(selenium,BeautifulSoup)
https://stackoverflow.com/questions/74557551/the-results-are-different-from-the-vs-code-when-using-aws-lambda-selenium-beaut
Asked by: user20588340
Beautiful on Python not as expected
https://stackoverflow.com/questions/74554271/beautiful-on-python-not-as-expected
Asked by: Woody1193
Selenium bypass login
https://stackoverflow.com/questions/74551814/selenium-bypass-login
Asked by: Python12492
Tags not found with BeautifulSoup parsing
https://stackoverflow.com/questions/74551202/tags-not-found-with-beautifulsoup-parsing
Asked by: Reem Aljunaid
When I parse a large XML sitemap on Beautifulsoup in Python, it only parses part of the file
https://stackoverflow.com/questions/74543726/when-i-parse-a-large-xml-sitemap-on-beautifulsoup-in-python-it-only-parses-part
Asked by: JS0NBOURNE
How can I solve Http Error 308: Permanent Redirect in Data Crawling?
https://stackoverflow.com/questions/74541173/how-can-i-solve-http-error-308-permanent-redirect-in-data-crawling
Asked by: Illubith
and more ...
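If you'd rather keep your original future_to_url / as_completed pattern (for example, to handle each result as soon as it finishes), the same flattening idea applies to your own script. A rough sketch, assuming the get_links and get_content from your question and a headers dict defined elsewhere:

import concurrent.futures
from itertools import chain

if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
        # flatten the (shop_name, shop_link) pairs from every start link
        # into one iterable, then submit them all into a single dict
        items = chain.from_iterable(get_links(link) for link in links)
        future_to_url = {executor.submit(get_content, *elem): elem for elem in items}
        for future in concurrent.futures.as_completed(future_to_url):
            shop_name, shop_link, phone = future.result()
            print(shop_name, shop_link, phone)

Since get_links is a generator, the listing pages are still crawled sequentially while the dict is being built; it's the get_content calls that run concurrently.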