How do I pass data from a Flask API to a web scraper?



I'm working on an application that lets users enter a set of keywords and get back web search results for them from Ask. To do this I built an API with Flask and Scrapy, inspired by an article on the subject. However, the API doesn't work, because I can't pass the data used as keywords from my API to my scraper. Here is my Flask API file:

import crochet
crochet.setup()
from flask import Flask, render_template, jsonify, request, redirect, url_for
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher
import time
import os

# Importing our Scraping Function from the askScraping file
from scrap.askScraping import AskScrapingSpider

# Creating Flask App Variable
app = Flask(__name__)

output_data = []
crawl_runner = CrawlerRunner()

# By default Flask will come into this when we run the file
@app.route('/')
def index():
    return render_template("index.html")  # Returns index.html file in templates folder.


# After clicking the Submit Button FLASK will come into this
@app.route('/', methods=['POST'])
def submit():
    if request.method == 'POST':
        s = request.form['url']  # Getting the input search keywords
        global baseURL
        baseURL = s
        # This will remove any existing file with the same name so that scrapy will not append the data to any previous file.
        if os.path.exists("<path_to_outputfile.json>"):
            os.remove("<path_to_outputfile.json>")

        return redirect(url_for('scrape'))  # Passing to the Scrape function


@app.route("/scrape")
def scrape():
    scrape_with_crochet(baseURL="https://www.ask.com/web?q={baseURL}")  # Passing that URL to our Scraping Function
    time.sleep(20)  # Pause the function while the scrapy spider is running

    return jsonify(output_data)  # Returns the scraped data after running for 20 seconds.


@crochet.run_in_reactor
def scrape_with_crochet(baseURL):
    # This will connect to the dispatcher that will kind of loop the code between these two functions.
    dispatcher.connect(_crawler_result, signal=signals.item_scraped)

    # This will connect to the AskScrapingSpider in our scrapy file and after each yield will pass to the crawler_result function.
    eventual = crawl_runner.crawl(AskScrapingSpider, category=baseURL)
    return eventual


# This will append the data to the output data list.
def _crawler_result(item, response, spider):
    output_data.append(dict(item))


if __name__ == "__main__":
    app.run(debug=True)

And here is my scraper:

import scrapy
import datetime


class AskScrapingSpider(scrapy.Spider):
    name = 'ask_scraping'

    def start_requests(self):
        myBaseUrl = ''
        start_urls = []

        def __init__(self, category='', **kwargs):  # The category variable will have the input URL.
            self.myBaseUrl = category
            self.start_urls.append(self.myBaseUrl)
            super().__init__(**kwargs)

        custom_settings = {'FEED_URI': 'scrap/outputfile.json', 'CLOSESPIDER_TIMEOUT': 15}  # This will tell scrapy to store the scraped data to outputfile.json and for how long the spider should run.

        yield scrapy.Request(start_urls, callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        print('url:', response.url)

        start_pos = response.meta['pos']
        print('start pos:', start_pos)
        dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

        items = response.css('div.PartialSearchResults-item')

        for pos, result in enumerate(items, start_pos + 1):
            yield {
                'title':    result.css('a.PartialSearchResults-item-title-link.result-link::text').get().strip(),
                'snippet':  result.css('p.PartialSearchResults-item-abstract::text').get().strip(),
                'link':     result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href'),
                'position': pos,
                'date':     dt,
            }
        # --- after loop ---

        next_page = response.css('.PartialWebPagination-next a')

        if next_page:
            url = next_page.attrib.get('href')
            print('next_page:', url)  # relative URL
            # use `follow()` to add `https://www.ask.com/` to URL and create absolute URL
            yield response.follow(url, callback=self.parse, meta={'pos': pos + 1})

It runs with absolutely no errors. After seeing a user's answer to my question, I changed my scraper's code as follows, but without success: after the data is passed to the scraper, my browser ends up at localhost:5000/scrape showing empty brackets [], where the brackets should normally contain the data returned by my scraper:

import scrapy
import datetime


class AskScrapingSpider(scrapy.Spider):
    name = 'ask_scraping'

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})

    custom_settings = {'FEED_URI': 'scrap/outputfile.json', 'CLOSESPIDER_TIMEOUT': 15}

    def __init__(self, category='', **kwargs):
        self.myBaseUrl = category
        self.start_urls.append(self.myBaseUrl)
        super().__init__(**kwargs)

    def parse(self, response):
        print('url:', response.url)

        start_pos = response.meta['pos']
        print('start pos:', start_pos)
        dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

        items = response.css('div.PartialSearchResults-item')

        for pos, result in enumerate(items, start_pos + 1):
            yield {
                'title':    result.css('a.PartialSearchResults-item-title-link.result-link::text').get().strip(),
                'snippet':  result.css('p.PartialSearchResults-item-abstract::text').get().strip(),
                'link':     result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href'),
                'position': pos,
                'date':     dt,
            }
        # --- after loop ---

        next_page = response.css('.PartialWebPagination-next a')

        if next_page:
            url = next_page.attrib.get('href')
            print('next_page:', url)  # relative URL
            # use `follow()` to add `https://www.ask.com/` to URL and create absolute URL
            yield response.follow(url, callback=self.parse, meta={'pos': pos + 1})
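As a quick sanity check that the keyword actually arrives in the spider, the category argument can be logged as soon as the spider is instantiated; a minimal debugging sketch of the same __init__, where the self.logger.info line is the only addition:

import scrapy


class AskScrapingSpider(scrapy.Spider):
    name = 'ask_scraping'
    start_urls = []

    def __init__(self, category='', **kwargs):
        # Debug: confirm the value handed over by
        # crawl_runner.crawl(AskScrapingSpider, category=baseURL)
        self.logger.info('category received: %r', category)
        self.myBaseUrl = category
        self.start_urls.append(self.myBaseUrl)
        super().__init__(**kwargs)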

I also replaced

crawl_runner = CrawlerRunner()

in my main.py file with:

project_settings = get_project_settings()
crawl_runner = CrawlerProcess(settings = project_settings)

and added the following imports:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

But when I reload my Flask server, I get the following error:

2022-06-21 11:44:55 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-06-21 11:44:57 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform Windows-10-10.0.19044-SP0
2022-06-21 11:44:57 [werkzeug] WARNING:  * Debugger is active!
2022-06-21 11:44:57 [werkzeug] INFO:  * Debugger PIN: 107-226-838
2022-06-21 11:44:57 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_TIMEOUT': 15}
2022-06-21 11:44:57 [werkzeug] INFO: 127.0.0.1 - - [21/Jun/2022 11:44:57] "GET / HTTP/1.1" 200 -
2022-06-21 11:44:58 [twisted] CRITICAL: Unhandled error in EventualResult
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1315, in run
    self.mainLoop()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1325, in mainLoop
    reactorBaseSelf.runUntilCurrent()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 964, in runUntilCurrent
    f(*a, **kw)
  File "C:\Python310\lib\site-packages\crochet\_eventloop.py", line 420, in runs_in_reactor
    d = maybeDeferred(wrapped, *args, **kwargs)
--- <exception caught here> ---
  File "C:\Python310\lib\site-packages\twisted\internet\defer.py", line 190, in maybeDeferred
    result = f(*args, **kwargs)
  File "C:\Users\user\Documents\AAprojects\Whelpsgroups1\API\main.py", line 62, in scrape_with_crochet
    eventual = crawl_runner.crawl(AskScrapingSpider, category = baseURL)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 82, in __init__
    default.install()
  File "C:\Python310\lib\site-packages\twisted\internet\selectreactor.py", line 194, in install
    installReactor(reactor)
  File "C:\Python310\lib\site-packages\twisted\internet\main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed
2022-06-21 11:45:54 [werkzeug] INFO: 127.0.0.1 - - [21/Jun/2022 11:45:54] "POST / HTTP/1.1" 302 -
2022-06-21 11:45:54 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_TIMEOUT': 15}
2022-06-21 11:45:54 [twisted] CRITICAL: Unhandled error in EventualResult
...
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

I looked at this Stack Overflow question, but without success.

You should not yield a scrapy.Request inside the __init__ method; a yield there turns __init__ into a generator function, so the initialization code never actually runs. Remove this line:

yield scrapy.Request(start_urls, callback=self.parse, meta={'pos': 0})

and change the __init__ method as follows:

custom_settings = {'FEED_URI': 'scrap/outputfile.json', 'CLOSESPIDER_TIMEOUT': 15}

def __init__(self, category='', **kwargs):
    self.myBaseUrl = category
    self.start_urls.append(self.myBaseUrl)
    super().__init__(**kwargs)

It might work.

Update:

If you want to pass parameters with the request, then after making those changes you can override the start_requests() method like this:

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})
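Put together, the revised spider would look roughly like this — a sketch based on the question's spider, with start_urls declared as a class attribute so the append in __init__ has a list to work on:

import scrapy


class AskScrapingSpider(scrapy.Spider):
    name = 'ask_scraping'
    start_urls = []  # must exist before __init__ appends to it

    custom_settings = {'FEED_URI': 'scrap/outputfile.json', 'CLOSESPIDER_TIMEOUT': 15}

    def __init__(self, category='', **kwargs):
        self.myBaseUrl = category
        self.start_urls.append(self.myBaseUrl)
        super().__init__(**kwargs)

    def start_requests(self):
        # The request is yielded here, not in __init__, so Scrapy can schedule it.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        ...  # same parse method as in the question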

Update 2: if your Scrapy spider runs in the background of a Flask app, try writing these lines:

project_settings = get_project_settings()
crawl_runner = CrawlerProcess(settings = project_settings)

instead of:

crawl_runner = CrawlerRunner()

And of course you should import CrawlerProcess and get_project_settings like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
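One caveat, which matches the traceback in the question: crochet.setup() installs a Twisted reactor as soon as main.py is imported, and CrawlerProcess then tries to install its own, which is what raises ReactorAlreadyInstalledError. When the crawl is driven through crochet, the project settings can instead be handed to a CrawlerRunner, which reuses the already-running reactor — a sketch of that variant:

import crochet
crochet.setup()  # installs the Twisted reactor once, at import time

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

# CrawlerRunner runs inside an existing reactor, so it does not attempt
# a second install() the way CrawlerProcess does.
crawl_runner = CrawlerRunner(settings=get_project_settings())

Separately, note that in the question's scrape() route the string "https://www.ask.com/web?q={baseURL}" has no f prefix, so the literal text {baseURL} is sent to Ask instead of the submitted keyword; an f-string, f"https://www.ask.com/web?q={baseURL}", would interpolate it.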

Update 3: I have written some projects like this and they work fine; you can check this repo.
