Scraping emails from specific pages of a given list of URLs



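I have a list of URLs in a txt file and a list of keywords in contact_page_pattern. For each URL, I only need to check the pages that match those patterns and scrape emails from them.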

Please suggest some possible ways to do this. I am new to Python and Scrapy. Thanks in advance.

   import re

   import scrapy

   # FinaltestItem is assumed to be the project's own Item class
   # (fields: url, title, email); import it from your items module.


   class FinalspiderSpider(scrapy.Spider):
       name = "finalspider"
       # Read one URL per line, skipping blank lines; the file is closed afterwards.
       with open("/Users/NiveRam/Documents/urllist.txt") as source_urls:
           start_urls = [url.strip() for url in source_urls if url.strip()]
       contact_page_pattern = ['help', 'office', 'global', 'feedback', 'branch', 'contact', 'about']

       def parse(self, response):
           # \w must be escaped; a bare "w" only matches the letter w.
           emails = re.findall(r'[\w.-]+@[\w.-]+', response.text)
           story = FinaltestItem()
           story["url"] = response.url
           story["title"] = response.xpath("//title/text()").extract()
           story["email"] = emails
           return story

This extracts emails from the entire body of each page and outputs them:

Email: [info@abc.com, infor@abc.com, yourname@abc.com]

You can access the current URL through the url attribute of the response object:

import re

import scrapy


class MySpider(scrapy.Spider):
    # name and start_urls omitted for brevity
    url_keywords = ['stackoverflow', 'tea']

    def parse(self, response):
        story = FinaltestItem()
        # check whether any of the defined keywords occurs in response.url
        get_email = any(k in response.url for k in self.url_keywords)
        if get_email:  # if so, extract the emails
            # response.text is the decoded body; \w must be escaped in the pattern
            emails = re.findall(r'[\w.-]+@[\w.-]+', response.text)
            story["email"] = emails
        story["url"] = response.url
        story["title"] = response.xpath("//title/text()").extract()
        return story
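
The keyword check and the email extraction can also be pulled out into plain functions, which makes them easy to try without running a spider. The following is only a rough sketch: `is_contact_page`, `extract_emails` and `EMAIL_RE` are hypothetical names, not part of Scrapy, and the pattern is a simplified one that will miss some valid addresses. Unlike the `k in response.url` check above, this variant looks only at the URL path, on the assumption that a keyword appearing in the domain name (e.g. `global.com`) should not count as a match.

```python
import re
from urllib.parse import urlparse

# Simplified email pattern (an assumption, not a full RFC 5322 validator).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Keywords taken from the contact_page_pattern list in the question.
CONTACT_PAGE_PATTERNS = ["help", "office", "global", "feedback",
                         "branch", "contact", "about"]


def is_contact_page(url, patterns=CONTACT_PAGE_PATTERNS):
    """Return True if the URL path (not the domain) contains any keyword."""
    path = urlparse(url).path.lower()
    return any(p in path for p in patterns)


def extract_emails(text):
    """Return unique email-like strings in order of first appearance."""
    return list(dict.fromkeys(EMAIL_RE.findall(text)))
```

Inside `parse` you could then write something like `if is_contact_page(response.url): story["email"] = extract_emails(response.text)`.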
