Inconsistent status codes (200 or 403) when scraping a site



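I am trying to scrape the news section of seekingalpha.com as a personal project. However, I can't seem to imitate a browser successfully: once I reach about page 8, I get a 403 Forbidden status code. If I open a browser in private mode, I can click through all the pages manually, so my IP is not blocked.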

I am using Requests and BeautifulSoup with Python 3.8.

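I have:

  • Added a legitimate user agent, and also tried random user agents

  • Used a requests Session, which I believe should update cookies automatically (?)

  • Added a Referrer header

  • Increased the delay between requests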

Here is my code:

import requests
import time
import random
import webbrowser
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import re
import sys
import os


class SeekingAlpha:
    ua = UserAgent()  # source of random user agents; not used below
    BASE_URL = 'https://seekingalpha.com/'
    NEWS_URL = BASE_URL + 'articles?page={}'

    def __init__(self):
        self.session = requests.Session()
        self.session.headers['User-Agent'] = 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:52.0) Gecko/20100101 Firefox/52.0'
        # Visit the home page first so the session picks up cookies.
        response = self.session.get(self.BASE_URL)
        response.raise_for_status()
        # The HTTP header is spelled "Referer" (one "r"), not "Referrer".
        self.session.headers['Referer'] = 'https://seekingalpha.com/'
        print(self.session.headers)
        self.master_urls = []
        for i in range(1, 100):
            page = self.session.get(self.NEWS_URL.format(i))
            time.sleep(random.randint(3, 5))
            page.raise_for_status()
            soup = BeautifulSoup(page.content, 'html.parser')
            links = soup.find_all('a', href=True)
            # Keep only the anchors tagged as article links.
            links = [link for link in links if link.has_attr("sasource") and link['sasource'] == 'all_articles']
            self.master_urls.extend(links)


if __name__ == "__main__":
    master_urls = SeekingAlpha()

Edit:

Here is page 8 as I see it in the browser (header removed so it doesn't take up too much space in the post):

"Latest Articles

Highlights:

All
Top Ideas
Editors' Picks
Small-Cap Insight
Outstanding Contribution
Most Popular

Articles | News | Transcripts

Should I Open A Roth IRA Right Now? That Depends
Charles Lewis Sizemore, CFA • Thu, Apr. 30, 11:15 AM
China Continues To Lead World's Major Equity Regions In 2020
James Picerno • MCHI, SPY, VT• Thu, Apr. 30, 11:09 AM
Gold And Gas: 2 Anti-Recession Trades
Atlas Research • QQQ, UNG, SAND• Thu, Apr. 30, 11:05 AM
Excellent Total Return Bond Funds For Momentum-Based Fixed Income Portfolios
MyPlanIQ • TGMNX, BOND, DLTNX• Thu, Apr. 30, 11:04 AM
NXP's Share Price Already Assumes A Lot Of Growth And Improvement
Stephen Simpson, CFA • MCHP, RNECY, TXN• Thu, Apr. 30, 11:01 AM
[This article is one of the editors' picks] Chart Industries Worth Another Look With LNG Mostly Washed Out
Stephen Simpson, CFA • GTLS• Thu, Apr. 30, 10:53 AM
Dana Incorporated 2020 Q1 - Results - Earnings Call Presentation
SA Transcripts • DAN• Thu, Apr. 30, 10:43 AM
Don't Panic! Coronavirus, GDP, And Unemployment
CFA Institute Contributors • SPY, QQQ, DIA• Thu, Apr. 30, 10:42 AM
Predicting Depressions For Dummies, Part II
John Overstreet • SPY, QQQ, DIA• Thu, Apr. 30, 10:37 AM
Cognex Already Trading On Recovery Prospects
Stephen Simpson, CFA • FANUY, CGNX• Thu, Apr. 30, 10:29 AM
Meritor, Inc. 2020 Q2 - Results - Earnings Call Presentation
SA Transcripts • MTOR• Thu, Apr. 30, 10:28 AM

"

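Have you tried increasing the random sleep? I think 3 to 5 seconds is too low, and the site may be cutting you off after your 8th request. Either increase it, or, if you get a 403, sleep and try again after a while.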
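The "sleep on 403 and try again" idea could look roughly like the sketch below. The function name get_with_backoff, the retry count, and the 30-second starting delay are my own assumptions, not values the site is known to require; tune them to what you observe.

```python
import random
import time


def get_with_backoff(session, url, max_retries=5, base_delay=30.0, sleep=time.sleep):
    """Fetch url with session.get, sleeping and retrying whenever the answer is 403.

    The wait doubles after each 403 (base_delay, 2*base_delay, ...) plus up to
    5 seconds of random jitter. All numbers here are illustrative guesses.
    """
    for attempt in range(max_retries + 1):
        response = session.get(url)
        if response.status_code != 403:
            # Any other error status is raised immediately; success is returned.
            response.raise_for_status()
            return response
        if attempt == max_retries:
            break  # out of retries, give up below
        delay = base_delay * (2 ** attempt) + random.uniform(0, 5)
        sleep(delay)
    # Still 403 after the last attempt: let the caller see the HTTP error.
    response.raise_for_status()
    return response
```

In the question's loop, this would replace the bare self.session.get(self.NEWS_URL.format(i)) call, with the normal random pause between pages kept as it is.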

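If you really need the data as quickly as possible, set up a Tor proxy and use it for a while. (That gives you a different external IP; drop your session as well, just in case.)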
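A minimal sketch of the Tor route, assuming a Tor daemon is already running locally on its default SOCKS port 9050 (Tor Browser uses 9150 instead) and that Requests was installed with SOCKS support (pip install requests[socks]). The helper name route_through_tor is made up for this example.

```python
def route_through_tor(session, host="127.0.0.1", port=9050):
    """Point a requests-style session at a local Tor SOCKS proxy.

    The socks5h scheme makes the proxy resolve hostnames too, so DNS
    lookups also go through Tor. host and port are assumed defaults.
    """
    proxy_url = f"socks5h://{host}:{port}"
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


# Typical use (needs the requests package and a running Tor daemon):
#   import requests
#   session = route_through_tor(requests.Session())
#   page = session.get("https://seekingalpha.com/articles?page=8")
```

Passing a brand-new requests.Session() rather than the old one is what "drop your session" means here: cookies from the blocked session are not carried over to the new IP.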

Sometimes, if your bot is too annoying, the site's owners will simply kick you out (at least, that's my experience :/ ).
