Web scraping error message: 'int' object has no attribute 'get'



Hello Stack Overflow contributors!

I want to scrape multiple pages of a news site; an error message is raised at this step:

response = requests.get(page, headers = user_agent)

The error message is:

AttributeError: 'int' object has no attribute 'get'

The relevant code is:

user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
# controlling the crawl-rate
start_time = time()
request = 0

def scrape(url):
    urls = [url + str(x) for x in range(0, 10)]
    for page in urls:
        response = requests.get(page, headers=user_agent)
        print(page)

print(scrape('https://nypost.com/search/China+COVID-19/page/'))

More specifically, this page and the ones next to it are what I want to scrape:

https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance

Any help would be greatly appreciated!

This code runs fine for me. I did have to move request into your function. Make sure you don't confuse the requests module with the request variable.

import requests
from random import randint
from time import sleep, time
from warnings import warn
from bs4 import BeautifulSoup as bs

user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
# controlling the crawl-rate
start_time = time()

def scrape(url):
    request = 0
    urls = [f"{url}{x}" for x in range(0, 10)]
    params = {
        "orderby": "relevance",
    }
    for page in urls:
        response = requests.get(url=page,
                                headers=user_agent,
                                params=params)
        # pause the loop
        sleep(randint(8, 15))
        # monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request/elapsed_time))
        # clear_output(wait = True)
        # throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(request, response.status_code))
        # break the loop if the number of requests is greater than expected
        if request > 72:
            warn('Number of requests was greater than expected.')
            break
        # parse the content
        soup_page = bs(response.text, 'lxml')

print(scrape('https://nypost.com/search/China+COVID-19/page/'))
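
As for where the original AttributeError comes from: 'int' object has no attribute 'get' means the name being called was bound to an integer, so somewhere in the full script the counter most likely shadowed the module name (e.g. requests = 0 instead of request = 0). Below is a minimal sketch of that situation, a hypothetical reproduction rather than your actual script:

import requests

requests = 0  # hypothetical slip: the counter now shadows the imported requests module

try:
    requests.get('https://example.com')
except AttributeError as exc:
    print(exc)  # prints: 'int' object has no attribute 'get'

Keeping the counter named request (and defining it inside the function, as in the code above) avoids the clash.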
