Calling the spider again with a new URL



This is my spider and it works, but how do I send another spider to the newly found URLs? Right now I'm storing every link that starts with http or https, and if it starts with / I prepend the base URL.

Then I iterate over that array and call a new spider on each new URL (it's at the end of the code).

I can't crawl the new URLs (I know this because no print() shows up on the console).

import scrapy
import re

class GeneralSpider( scrapy.Spider ):
    name = "project"
    start_urls = ['https://www.url1.com/',
                  'http://url2.com']

    def parse( self, response ):
        lead = {}
        lead['url'] = response.request.url
        lead['data'] = {}
        lead['data']['mail'] = []
        lead['data']['number'] = []
        selectors = ['//a', '//p', '//label', '//span', '//i', '//b', '//div',
                     '//h1', '//h2', '//h3', '//h4', '//h5', '//h6', '//tbody/tr/td']
        atags = []
        for selector in selectors:
            for item in response.xpath( selector ):
                name = item.xpath( 'text()' ).extract_first()
                href = item.xpath( '@href' ).extract_first()
                if selector == '//a' and href is not None and href != '' and href != '#':
                    if href.startswith("http") or href.startswith("https"):
                        atags.append( href )
                    elif href.startswith("/"):
                        atags.append( response.request.url + href )
                if href is not None and href != '' and href != '#':
                    splitted = href.split(':')
                    if splitted[0] not in lead['data']['mail'] and splitted[0] == 'mailto':
                        lead['data']['mail'].append(splitted[1])
                    elif splitted[0] not in lead['data']['number'] and splitted[0] == 'tel':
                        lead['data']['number'].append(splitted[1])
                else:
                    if name is not None and name != '':
                        mail_regex = re.compile( r'^(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$' )
                        number_regex = re.compile( r'^(?:\(\+?\d{2,3}\)|\+?\d{2,3})\s?(?:\d{4}[\s*.-]?\d{4}|\d{3}[\s*.-]?\d{3}|\d{2}([\s*.-]?)\d{2}\1?\d{2}(?:\1?\d{2})?)(?:\1?\d{2})?$' )
                        if name not in lead['data']['mail'] and re.match( mail_regex, name ):
                            lead['data']['mail'].append(name)
                        elif name not in lead['data']['number'] and re.match( number_regex, name ):
                            lead['data']['number'].append(name)
        print( lead )
        # I want to call the parse method here again, but with a new url
        for tag in atags:
            scrapy.Request( tag, callback=self.parse )
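As an aside, building the absolute link with `response.request.url + href` can produce broken URLs (doubled slashes, the current page's path left in). `urllib.parse.urljoin` resolves an href against a base URL correctly, and Scrapy's `response.urljoin()` is a convenience wrapper around it. A minimal sketch with a made-up base URL:

```python
from urllib.parse import urljoin

base = "https://www.url1.com/contact/"  # hypothetical page URL

# Naive concatenation keeps the page path and doubles the slash:
print(base + "/about")          # https://www.url1.com/contact//about

# urljoin resolves the href the way a browser would:
print(urljoin(base, "/about"))  # https://www.url1.com/about
print(urljoin(base, "team"))    # https://www.url1.com/contact/team
```

Inside the spider this becomes `atags.append(response.urljoin(href))`, which also covers relative hrefs that don't start with `/`.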

You need to return the Request objects from the function. Since you are generating several of them, you can use yield, like so:

yield scrapy.Request(tag, callback=self.parse)

"在回调函数中,你解析响应(网页(并返回带有提取数据的字典、Item 对象、Request 对象这些对象的迭代对象。"查看刮擦文档
