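I have a list of URLs in a txt file and a list of keywords in contact_page_pattern. I only want to scrape emails from pages whose URL matches one of those keywords, not from every page.
Please suggest some possible ways to do this. I am new to Python and Scrapy. Thanks in advance.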
import re
import scrapy

# FinaltestItem is the project's own Item class (its import is not shown here)
class FinalspiderSpider(scrapy.Spider):
    name = "finalspider"
    # read one URL per line, skipping blank lines
    with open("/Users/NiveRam/Documents/urllist.txt") as source_urls:
        start_urls = [url.strip() for url in source_urls if url.strip()]
    contact_page_pattern = ['help','office','global','feedback','branch','contact','about']

    def parse(self, response):
        # response.text is the decoded body, so a str pattern can be used on it
        emails = re.findall(r'[\w.-]+@[\w.-]+', response.text)
        story = FinaltestItem()
        story["url"] = response.url
        story["title"] = response.xpath("//title/text()").extract()
        story["email"] = emails
        return story
This retrieves the emails from the entire body of the web page and outputs them:
email: [info@abc.com, infor@abc.com, yourname@abc.com]
You can access the current URL through the url attribute of the response object:
import re
import scrapy

class MySpider(scrapy.Spider):
    url_keywords = ['stackoverflow', 'tea']

    def parse(self, response):
        story = FinaltestItem()
        # check if any of the defined keywords can be found in response.url
        get_email = any(k in response.url for k in self.url_keywords)
        if get_email:  # if yes, extract the emails
            # response.text is the decoded body, so a str pattern can be used on it
            emails = re.findall(r'[\w.-]+@[\w.-]+', response.text)
            story["email"] = emails
        story["url"] = response.url
        story["title"] = response.xpath("//title/text()").extract()
        return story
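
Applied to the keyword list from the question, the same check could be pulled out into two small helpers that can be tried without running a spider. This is only a sketch: the names is_contact_page, extract_emails, CONTACT_PAGE_PATTERNS and EMAIL_RE are made up for illustration, and matching against the URL path only (rather than the whole URL) is an assumption, made so that a keyword appearing in the domain name does not count as a match.

```python
import re
from urllib.parse import urlparse

# Keywords copied from the question's contact_page_pattern list.
CONTACT_PAGE_PATTERNS = ['help', 'office', 'global', 'feedback',
                         'branch', 'contact', 'about']

# A deliberately simple email pattern; it can also match strings
# that are not valid addresses.
EMAIL_RE = re.compile(r'[\w.-]+@[\w.-]+')


def is_contact_page(url, patterns=CONTACT_PAGE_PATTERNS):
    """Return True if any keyword occurs in the path part of the URL."""
    # Assumption: only the path is checked, not the host or query string.
    path = urlparse(url).path.lower()
    return any(p in path for p in patterns)


def extract_emails(text):
    """Return the email-like strings in text, de-duplicated, in first-seen order."""
    return list(dict.fromkeys(EMAIL_RE.findall(text)))
```

Inside parse, the keyword test would then become is_contact_page(response.url), and the email field would be filled with extract_emails(response.text) only when that test passes.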