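I am writing a Scrapy program that logs in to this website and scrapes data from it: http://www.starcitygames.com/buylist/. I only scrape the ID values from that URL, then use each ID to go to another URL and scrape the JSON page there, and I do this for all 207 card categories. It looks a little more authentic than going straight to the URL with the JSON data. I have written Scrapy programs with multiple URLs before, and I was able to set those up to rotate proxies and user agents, but how would I do that in this program? Since there is technically only one URL, is there a way to make it switch to a different proxy and user agent after it has scraped five or so JSON pages? I do not want it to rotate randomly. I would like it to scrape the same JSON page with the same proxy and user agent every time. I hope that all makes sense. This may be a little broad for Stack Overflow, but I have no idea how to do it, so I figured I would ask anyway and see whether anyone has a good idea.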
# Import the needed modules and the project's item class
import json

import scrapy

from ..items import DataItem


# Spider class
class LoginSpider(scrapy.Spider):
    # Name of the spider
    name = "LoginSpider"
    # URL where the data is located
    start_urls = ["http://www.starcitygames.com/buylist/"]

    # Login function
    def parse(self, response):
        # Log in with email and password, then proceed to after_login
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'example@email.com', 'ex_usr_pass': 'password'},
            callback=self.after_login
        )

    # Function to parse the buylist page
    def after_login(self, response):
        # Collect the ID of every card category, append each one to the
        # URL below, then hand the response to parse_data
        for category_id in response.xpath('//select[@id="bl-category-options"]/option/@value').getall():
            yield scrapy.Request(
                url="http://www.starcitygames.com/buylist/search?search-type=category&id={category_id}".format(category_id=category_id),
                callback=self.parse_data,
            )

    # Function to parse the JSON data
    def parse_data(self, response):
        # Decode the JSON body (response.text replaces the deprecated body_as_unicode())
        jsonresponse = json.loads(response.text)
        # Each entry in 'results' is itself a list of card records
        for result in jsonresponse['results']:
            for card in result:
                # Create a fresh item per card so yielded items do not share state
                items = DataItem()
                # Scrape the category name
                items['Category'] = jsonresponse['search']
                # Scrape the rest of the needed data
                items['Card_Name'] = card['name']
                items['Condition'] = card['condition']
                items['Rarity'] = card['rarity']
                items['Foil'] = card['foil']
                items['Language'] = card['language']
                items['Buy_Price'] = card['price']
                # Return the data for this card
                yield items
I would suggest the scrapy-useragents package:
pip install scrapy-useragents
In your settings.py file:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}
Example list of user agents to rotate (more user agents can be added to it):
USER_AGENTS = [
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/57.0.2987.110 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.79 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) '
     'Gecko/20100101 '
     'Firefox/55.0'),  # firefox
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.91 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/62.0.3202.89 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/63.0.3239.108 '
     'Safari/537.36'),  # chrome
]
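Be careful: this middleware cannot handle the case where COOKIES_ENABLED is True and the website binds its cookies to the User-Agent. That combination may cause unpredictable results in the spider.
Proxies: I would use a company that provides a rotator so you don't have to mess with it, but you can also write a custom middleware, and I will show you how. What you want to do is edit the process_request method. You do this both to change the proxy and to change the user agent.
User agents: you can use the Scrapy random user agent middleware, https://github.com/cleocn/scrapy-random-useragent, or here is how you can use a middleware to change anything you want on the request object, including the proxy or any other header.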
# middlewares.py
import random

from scrapy import signals

# Keep your user agents in a file or elsewhere; this assumes you can load them into a list.
user_agents = ['agent1', 'agent2', 'agent3', 'agent4']
# Placeholder proxies; Scrapy's docs show proxy values with the scheme included.
proxies = ['http://ip1:port1', 'http://ip2:port2', 'http://ip3:port3', 'http://ip4:port4']


class MyMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        request.headers['User-Agent'] = random.choice(user_agents)  # !! These 2 lines
        request.meta['proxy'] = random.choice(proxies)  # !! These 2 lines
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


# settings.py
DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.MyMiddleware': 543,
}
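The middleware above picks a user agent and proxy at random, while the question asks for the same page to always get the same pair. A minimal sketch of a deterministic variant follows. The names `IDENTITIES`, `identity_key`, `pick_identity`, `StickyIdentityMiddleware`, and the `identity_key` meta key are all made up for illustration, and the pool values are placeholders.

```python
import zlib
from urllib.parse import parse_qs, urlparse

# Placeholder (user agent, proxy) pairs; replace them with real values.
IDENTITIES = [
    ('agent1', 'http://ip1:port1'),
    ('agent2', 'http://ip2:port2'),
    ('agent3', 'http://ip3:port3'),
    ('agent4', 'http://ip4:port4'),
]


def identity_key(url, meta):
    """Return the string that decides which identity a request gets."""
    # 'identity_key' is a hypothetical meta key the spider may set itself.
    if 'identity_key' in meta:
        return str(meta['identity_key'])
    # Otherwise use the category id from the query string, else the whole URL.
    ids = parse_qs(urlparse(url).query).get('id')
    return ids[0] if ids else url


def pick_identity(key):
    """Map a key to one (user agent, proxy) pair, identically on every run."""
    # zlib.crc32 is stable across processes; the built-in hash() of a str
    # is randomized per process by default.
    return IDENTITIES[zlib.crc32(key.encode('utf-8')) % len(IDENTITIES)]


class StickyIdentityMiddleware:
    """Downloader middleware sketch: same page, same user agent and proxy."""

    def process_request(self, request, spider):
        user_agent, proxy = pick_identity(identity_key(request.url, request.meta))
        request.headers['User-Agent'] = user_agent
        request.meta['proxy'] = proxy
        return None  # let Scrapy continue processing the request
```

To try it, the class would be registered in DOWNLOADER_MIDDLEWARES in place of MyMiddleware. Keying on the category ID spreads pages across the pool rather than switching exactly every five pages; a spider that wants fixed groups could set the hypothetical `identity_key` meta value to a group number instead.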
Reference: https://docs.scrapy.org/en/latest/topics/request-response.html
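Since scrapy.Request accepts `headers` and `meta` arguments, the switch "after five or so pages" could also be decided in the spider itself, without a middleware. The helper below is only a sketch: `assign_identities` and `EXAMPLE_IDENTITIES` are hypothetical names, and the pool values are placeholders.

```python
def assign_identities(category_ids, identities, batch_size=5):
    """Yield (category_id, user_agent, proxy), switching identity every batch_size ids."""
    for position, category_id in enumerate(category_ids):
        # Ids 0..4 share the first identity, 5..9 the second, and so on,
        # wrapping around when the pool is exhausted.
        slot = (position // batch_size) % len(identities)
        user_agent, proxy = identities[slot]
        yield category_id, user_agent, proxy


# Placeholder identities; replace them with real user agents and proxies.
EXAMPLE_IDENTITIES = [('agent1', 'http://ip1:port1'), ('agent2', 'http://ip2:port2')]

# Inside after_login, the triples could be used like this (not executed here):
#     for category_id, user_agent, proxy in assign_identities(ids, EXAMPLE_IDENTITIES):
#         yield scrapy.Request(url=category_url, headers={'User-Agent': user_agent},
#                              meta={'proxy': proxy}, callback=self.parse_data)
```

The assignment depends only on a category's position in the list, so a page keeps the same user agent and proxy between runs as long as the order of the category IDs does not change.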
User agents: I have used this tool, which keeps your list of user agents up to date with the newest and most widely used ones: https://pypi.org/project/shadow-useragent/
from shadow_useragent import ShadowUserAgent
shadow_useragent = ShadowUserAgent()
print(shadow_useragent.firefox)
# Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0
print(shadow_useragent.chrome)
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
print(shadow_useragent.safari)
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
print(shadow_useragent.edge)
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134
print(shadow_useragent.ie)
# Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
print(shadow_useragent.android)
# Mozilla/5.0 (Linux; U; Android 4.3; en-us; SM-N900T Build/JSS15J) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
print(shadow_useragent.ipad)
# Mozilla/5.0 (iPad; CPU OS 12_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Mobile/15E148 Safari/604.1
# and the best one: a random user agent drawn from real-world browser usage statistics
print(shadow_useragent.random)
# Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0
# if you want to exclude mobile devices (some websites will display different pages)
print(shadow_useragent.random_nomobile)
# Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36
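Because the question asks for a fixed rather than random assignment, one option is to draw from such a source only once at startup and freeze the result into a list. The sketch below assumes nothing about the library beyond the properties printed above; `build_fixed_pool` is a hypothetical helper that accepts any zero-argument callable.

```python
def build_fixed_pool(next_user_agent, size, max_attempts=100):
    """Collect up to `size` distinct user agents from a callable, in first-seen order."""
    pool = []
    for _ in range(max_attempts):
        if len(pool) == size:
            break
        candidate = next_user_agent()
        # Skip duplicates so every slot in the pool is a different user agent.
        if candidate not in pool:
            pool.append(candidate)
    return pool


# Possible wiring with the library shown above (an assumption, not executed here):
#     pool = build_fixed_pool(lambda: shadow_useragent.random_nomobile, size=4)
```

The `max_attempts` limit keeps the loop from running forever when the source offers fewer distinct values than requested; in that case the returned pool is simply shorter.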