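I'm creating my first Scrapy project with Splash, working against http://quotes.toscrape.com/js/. I want to store each page's quotes as a separate file on disk (in the code below I first try to store the whole page). The code worked when I wasn't using SplashRequest, but with the new code below, nothing is stored on disk when I "Run and Debug" it in Visual Studio Code. Also, self.log doesn't write anything to my Visual Studio Code Terminal window. I'm new to Splash, so I'm sure I'm missing something, but what?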

I already checked here and here.

import scrapy
from scrapy_splash import SplashRequest


class QuoteItem(scrapy.Item):
    author = scrapy.Field()
    quote = scrapy.Field()


class MySpider(scrapy.Spider):
    name = "jsscraper"

    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

    def parse(self, response):
        for q in response.css("div.quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote

        #cycle through all available pages
        for a in response.css('ul.pager a'):
            yield SplashRequest(url=a,callback=self.parse,endpoint='render.html',args={ 'wait': 0.5 })

        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Update 1

How I debug it:

In Visual Studio Code, press F5
Select "Python File"

The Output tab is empty

The Terminal tab contains:

PS C:\scrapytutorial>  cd 'c:\scrapytutorial'; & 'C:\Users\Mark\AppData\Local\Programs\Python\Python38-32\python.exe' 'c:\Users\Mark\.vscode\extensions\ms-python.python-2020.9.114305\pythonFiles\lib\python\debugpy\launcher' '58582' '--' 'c:\scrapytutorial\spiders\quotes_spider_js.py'
PS C:\scrapytutorial>

Also, nothing is logged in my Docker container instance, which I believe Splash needs in order to work.

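Update 2

I ran scrapy crawl jsscraper and a file "quotes-js.html" was stored on disk. However, it contains the page's source HTML without any of the JavaScript having been executed. I want the JS code on 'http://quotes.toscrape.com/js/' to be executed and only the quote content to be stored. How can I do that?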

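Writing the output to a JSON file:

I've done my best to solve your problem. Here is a working version of your code. I hope this is what you are trying to achieve.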

import json

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "jsscraper"
    # Pages 1 to 10 of the JavaScript-rendered quotes site.
    start_urls = ["http://quotes.toscrape.com/js/page/" + str(i + 1) for i in range(10)]

    def start_requests(self):
        for url in self.start_urls:
            # Render through Splash and wait 0.5 s so the page's JavaScript can run.
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.html',
                args={'wait': 0.5}
            )

    def parse(self, response):
        quotes = {"quotes": []}
        for q in response.css("div.quote"):
            quote = dict()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            quotes["quotes"].append(quote)
        # Everything after "page/" in the URL is the page number.
        page = response.url[response.url.index("page/") + 5:]
        print("page=", page)
        filename = 'quotes-%s.json' % page
        with open(filename, 'w') as outfile:
            outfile.write(json.dumps(quotes, indent=4, separators=(',', ":")))

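Update: the spider code above has been updated to scrape the results from all pages and save them in separate JSON files, one for each of pages 1 through 10.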
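The spider above hardcodes pages 1 to 10. The question's code instead tried to follow the pager links, but passed the selector object itself as the URL. A minimal sketch of the link handling, using only the standard library so it can be run on its own; the helper names `next_page_url` and `page_label` are hypothetical, and the `li.next a` selector in the comment is an assumption about the site's markup:

```python
from urllib.parse import urljoin, urlparse


def next_page_url(current_url, href):
    """Resolve a pager link's href against the page it was found on."""
    if not href:
        return None  # no "Next" link, so this is the last page
    return urljoin(current_url, href)


def page_label(url):
    """Return the last non-empty path segment, e.g. '2' for .../js/page/2/."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[-1] if segments else "index"


# Inside the spider's parse() this could be used roughly as follows
# (assumed selector, not taken from the original answer):
#   href = response.css("li.next a::attr(href)").get()
#   url = next_page_url(response.url, href)
#   if url:
#       yield SplashRequest(url=url, callback=self.parse,
#                           endpoint="render.html", args={"wait": 0.5})

print(next_page_url("http://quotes.toscrape.com/js/", "/js/page/2/"))
print(page_label("http://quotes.toscrape.com/js/page/2/"))
```

Taking the last non-empty path segment also works whether or not the URL ends with a slash, which the `index("page/")` slicing does not guarantee.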

The spider writes each page's list of quotes to a separate JSON file, like this:

{
    "quotes":[
        {
            "author":"Albert Einstein",
            "quote":"\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
        },
        {
            "author":"J.K. Rowling",
            "quote":"\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"
        },
        {
            "author":"Jane Austen",
            "quote":"\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
        },
        {
            "author":"Marilyn Monroe",
            "quote":"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cTry not to become a man of success. Rather become a man of value.\u201d"
        },
        {
            "author":"Andr\u00e9 Gide",
            "quote":"\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"
        },
        {
            "author":"Thomas A. Edison",
            "quote":"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
        },
        {
            "author":"Eleanor Roosevelt",
            "quote":"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
        },
        {
            "author":"Steve Martin",
            "quote":"\u201cA day without sunshine is like, you know, night.\u201d"
        }
    ]
}
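The u201c and u201d sequences in the quotes are JSON Unicode escapes (`\u201c`, `\u201d`) for curly quotation marks; `json.dumps` escapes non-ASCII characters by default. If readable characters in the file are preferred, something along these lines may help. The sample data is illustrative only:

```python
import json

# Illustrative sample; in the spider this would be the `quotes` dict built in parse().
quotes = {
    "quotes": [
        {
            "author": "Andr\u00e9 Gide",
            "quote": "\u201cIt is better to be hated for what you are "
                     "than to be loved for what you are not.\u201d",
        }
    ]
}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(quotes, indent=4, separators=(",", ":"))

# ensure_ascii=False keeps the characters as-is in the output string.
readable = json.dumps(quotes, indent=4, separators=(",", ":"), ensure_ascii=False)

# Both strings decode to the same data, so either file loads identically.
same_data = json.loads(escaped) == json.loads(readable)
print(same_data)
```

When writing the `readable` form to disk, it is safer to open the file with `encoding="utf-8"` explicitly, since the platform default encoding on Windows may not be able to represent these characters.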

Problem

The JavaScript on the site you want to scrape is not being executed.

Solution

Increase the SplashRequest wait time to give the JavaScript time to run.

Example

yield SplashRequest(
    url=url,
    callback=self.parse,
    endpoint='render.html',
    args={'wait': 0.5}
)
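The `wait` argument only has an effect if the request actually reaches a running Splash instance. The question reports un-rendered HTML and no log output in the Docker container, which would be consistent with the project's settings not routing requests through Splash, though that is only one possible cause. A minimal settings.py sketch following the scrapy-splash README, assuming Splash is listening on localhost port 8050 (the address is an assumption; adjust it to your container):

```python
# settings.py (fragment)

# Assumed address of the Splash container; change it to match your setup.
SPLASH_URL = "http://localhost:8050"

# Send SplashRequest objects through Splash instead of fetching pages directly.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments for queued requests.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With settings like these in place, each crawl should produce request lines in the Splash container's log, which is a quick way to confirm that rendering is happening.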
