How do I scrape WSJ headlines with requests and BeautifulSoup?

WSJ doesn't want to be parsed. I have this function:

def get_wsj_news():
    global prev_news_wsj
    url = "https://www.wsj.com/news/world"
    news = []
    news_to_post = []
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        news_list = soup.find_all("h3", {"class": "WSJTheme--headline--unZqjb45"})
        for item in news_list[:15]:
            headline = item.text.strip()
            news.append(f"• {headline}")
            news_to_post.append(f"• <b>{headline}</b>")
        if not news or news == prev_news_wsj:
            return {"site": None, "message": None}
        else:
            prev_news_wsj = news
            print(url)
            return {"site": "The Wall Street Journal", "message": "\n".join(news_to_post)}
    except Exception as e:
        print(e)

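But when I try to parse the <h3> tags, I see this:

We can't find the page you're looking for. If you typed the URL into your browser, please check that you entered it correctly. If you reached this page through our site or a search, please let us know by emailing support@wsj.com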

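WSJ has its reasons, and they should be respected: these pages are made for humans, not bots, so if you behave like a human, the content will be open to you.

In this case, sending a user agent with the request is enough, but that may change if they detect your activity and classify it as unacceptable.

So, once again, treat the website and its content with respect, and don't harm it with careless behavior and unnecessary scraping.

This only demonstrates the technical side and does not reflect an ethical position.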

import requests
from bs4 import BeautifulSoup
url = "https://www.wsj.com/news/world"
response = requests.get(url, headers={'user-agent':'some agent'})
soup = BeautifulSoup(response.content, "html.parser")
soup.find_all("h3", {"class": "WSJTheme--headline--unZqjb45"})
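If you want to fold this back into a function like the one in the question, a minimal sketch could look like the following. It rests on two assumptions that are not confirmed anywhere above: that the trailing hash in the class name ("--unZqjb45") may change between site builds while the "WSJTheme--headline" prefix stays the same, and that a 10 second timeout is reasonable. The names extract_headlines and fetch_headlines are made up for illustration.

```python
import re

import requests
from bs4 import BeautifulSoup

URL = "https://www.wsj.com/news/world"
# Placeholder value, as in the snippet above.
HEADERS = {"user-agent": "some agent"}


def extract_headlines(html, limit=15):
    """Return up to `limit` headline strings found in matching <h3> tags."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: only the class prefix is stable, so match on the prefix
    # instead of the full hashed class name.
    tags = soup.find_all("h3", class_=re.compile(r"^WSJTheme--headline"))
    return [tag.text.strip() for tag in tags[:limit]]


def fetch_headlines(url=URL, limit=15):
    """Request the page with a user agent and parse the headlines."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    # Raise on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return extract_headlines(response.content, limit)
```

Keeping the parsing separate from the request means you can check the selector against a saved copy of the page without hitting the site again.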
