How to download the HTML content of an infinite-scroll website that works via PUT requests?



Problem downloading the HTML content

I am working on an academic project that requires collecting data from the Yahoo Answers pages for the category "Politics & Government". I was able to extract the data in JSON format using the code below. Can someone please help me download the HTML content of the complete web page?

import json
import scrapy
from scrapy.crawler import CrawlerProcess

category_dict = {'Arts&Humanities': '396545012', 'Beauty&Style': '396545144', 'Business&Finance': '396545013',
                 'Cars&Transportation': '396545311', 'Computers&Internet': '396545660',
                 'ConsumerElectronics': '396545014',
                 'DiningOut': '396545327', 'Education&Reference': '396545015', 'Entertainment&Music': '396545016',
                 'Environment': '396545451', 'Family&RelationShips': '396545433', 'Food&Drink': '396545367',
                 'Games&Recreation': '396545019', 'Health': '396545018', 'Home&Garden': '396545394',
                 'LocalBusinesses': '396545401', 'News&Events': '396545439', 'Pets': '396545443',
                 'Politics&Government': '396545444', 'Pregnancy&Parenting': '396546046',
                 'Science&Mathematics': '396545122',
                 'SocialScience': '396545301', 'Society&Culture': '396545454', 'Sports': '396545213',
                 'Travel': '396545469',
                 'YahooProducts': '396546089'
                 }


class YahooAnswers(scrapy.Spider):

    name = "test"
    # API URL
    api_url = 'https://answers.yahoo.com/_reservice_/'
    # API headers
    api_headers = {
        'content-type': 'application/json',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    }
    # custom headers
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    }
    # HTTP PUT request payload
    payload = {
        "type": "CALL_RESERVICE",
        "payload": {
            # change the category ID to retrieve proper questions
            # e.g. you have URL: https://answers.yahoo.com/dir/index/discover?sid=396545443
            # so you need to look at "?sid=396545443" string query parameter
            # and extract the number 396545443 to use it as the "categoryId" below
            "categoryId": "396545444",
            "lang": "en-US",
            "count": 20,
            "offset": "pc00~p:0"
        },
        "reservice": {
            "name": "FETCH_DISCOVER_STREAMS_END",
            "start": "FETCH_DISCOVER_STREAMS_START",
            "state": "CREATED"
        }
    }
    # data offset
    data_offset = 0

    # crawler's entry point
    def start_requests(self):
        # make HTTP PUT request to API URL
        yield scrapy.Request(
            url=self.api_url,
            method='PUT',
            headers=self.api_headers,
            body=json.dumps(self.payload),
            callback=self.parse
        )

    # parse questions callback method
    def parse(self, response):
        json_data = json.loads(response.text)
        filename = "NewsandEvents.txt"
        # check if next bunch of data available
        if json_data['payload']['canLoadMore']:
            # update data offset
            self.data_offset += 20
            # update payload offset
            self.payload['payload']['offset'] = 'pc' + str(self.data_offset) + '~p:0'
            # crawl next bunch of data
            yield scrapy.Request(
                url=self.api_url,
                method='PUT',
                headers=self.api_headers,
                body=json.dumps(self.payload),
                callback=self.parse
            )
        # append the raw JSON response to the output file
        with open(filename, 'a') as f:
            f.write(response.text)
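
Side note: CrawlerProcess is imported above but never used. A minimal sketch of how the spider could be launched as a plain script, without the scrapy CLI (the LOG_LEVEL value is an assumption, purely for quieter output):

if __name__ == '__main__':
    # sketch: assumes the YahooAnswers class above lives in the same file
    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})  # assumed setting
    process.crawl(YahooAnswers)
    process.start()  # blocks until the crawl finishes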

Scenario #1 - you don't need the page HTML, you need the data

Since you already have the JSON, you have everything you need. In the parse method, simply read the values out of the JSON, build dictionaries, and yield them. Scrapy can then export them to CSV, JSON, and other formats for you.

Your code would look something like this:

for question in json_data['payload']['questions']:
    yield {
        'title': question['title'],
        'detail': question['detail'],
        # More attributes here
    }
# check if next bunch of data available ...
# Rest of your logic to fetch next page
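
Rather than opening a file by hand inside parse, Scrapy's feed exports can serialize the yielded items for you. A sketch of a custom_settings attribute you could add to the spider class (the file name is an example; the FEEDS setting requires Scrapy 2.1+):

# sketch: add to the spider class; every dict yielded from parse()
# is then written to this file (name is an example, Scrapy 2.1+)
custom_settings = {
    'FEEDS': {
        'politics_questions.json': {'format': 'json', 'encoding': 'utf8'},
    },
}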

Scenario #2 - you actually need the HTML of the page; in that case you have to render it with something like Selenium or Splash.
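
A minimal Selenium sketch of that second approach, assuming Chrome and chromedriver are available (the scroll count, delay, URL, and output file name are all placeholders):

# sketch: render the infinite-scroll page with Selenium and save its HTML
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://answers.yahoo.com/dir/index/discover?sid=396545444')

for _ in range(10):  # scroll 10 times; tune as needed
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the page time to load the next batch

with open('politics.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)

driver.quit()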

Hope this helps.
