To sharpen my Python and Spark GraphX skills, I have been trying to build a graph of Pinboard users and bookmarks. To that end, I crawl Pinboard bookmarks recursively as follows:
- Start with a user and scrape all of their bookmarks.
- For each bookmark, identified by its url_slug, find all users who also saved the same bookmark.
- For each user found in step 2, repeat the process (go to 1, ...).
Despite trying suggestions from several threads here (including using Rules), when I implement this logic I get the following error:
ERROR: Spider must return Request, BaseItem, dict or None, got 'generator'
I strongly suspect it has to do with the mixing of yield and return in my code.
Here is a quick description of my code:
My main parse method finds all bookmark items of a user (also following any earlier pages of bookmarks by the same user) and yields the parse_bookmark method to scrape those bookmarks.
class PinSpider(scrapy.Spider):
    name = 'pinboard'

    # before = datetime after 1970-01-01 in seconds, used to separate the bookmark pages of a user
    def __init__(self, user='notiv', before='3000000000', *args, **kwargs):
        super(PinSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['https://pinboard.in/u:%s/before:%s' % (user, before)]
        self.before = before

    def parse(self, response):
        # fetches the JSON representation of the bookmarks instead of using css or xpath
        bookmarks = re.findall(r'bmarks\[\d+\] = ({.*?});', response.body.decode('utf-8'), re.DOTALL | re.MULTILINE)
        for b in bookmarks:
            bookmark = json.loads(b)
            yield self.parse_bookmark(bookmark)

        # Get bookmarks on earlier pages
        previous_page = response.css('a#top_earlier::attr(href)').extract_first()
        if previous_page:
            previous_page = response.urljoin(previous_page)
            yield scrapy.Request(previous_page, callback=self.parse)
This method scrapes a bookmark's information, including the corresponding url_slug, stores it in a PinscrapyItem, and then yields a scrapy.Request to parse the url_slug:
def parse_bookmark(self, bookmark):
    pin = PinscrapyItem()

    pin['url_slug'] = bookmark['url_slug']
    pin['title'] = bookmark['title']
    pin['author'] = bookmark['author']

    # IF I REMOVE THE FOLLOWING LINE THE PARSING OF ONE USER WORKS (STEP 1) BUT NO STEP 2 IS PERFORMED
    yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)

    return pin
Finally, the parse_url_slug method finds the other users who saved this bookmark and recursively yields a scrapy.Request to parse each of them.
def parse_url_slug(self, response):
    url_slug = UrlSlugItem()

    if response.body:
        soup = BeautifulSoup(response.body, 'html.parser')

        users = soup.find_all("div", class_="bookmark")
        user_list = [re.findall('/u:(.*)/t:', element.a['href'], re.DOTALL) for element in users]
        user_list_flat = sum(user_list, [])  # Change from list of lists to list

        url_slug['user_list'] = user_list_flat

        for user in user_list:
            yield scrapy.Request('https://pinboard.in/u:%s/before:%s' % (user, self.before), callback=self.parse)

    return url_slug
(To present the code more concisely, I removed parts that store other interesting fields, check for duplicates, and so on.)
Any help would be greatly appreciated!
The problem is this line of your code:
yield self.parse_bookmark(bookmark)
Since in your parse_bookmark method you have these lines:
# IF I REMOVE THE FOLLOWING LINE THE PARSING OF ONE USER WORKS (STEP 1) BUT NO STEP 2 IS PERFORMED
yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)
return pin
Because of the yield, the return value of this function is a generator. You hand that generator back to Scrapy, and it doesn't know what to do with it.
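A minimal, Scrapy-free sketch of what is happening (the function name is illustrative, not your actual spider):

```python
def parse_bookmark():
    # A function containing `yield` is a generator function:
    # calling it runs none of its body and returns a generator object.
    yield "request"
    # In a generator, `return` only ends iteration;
    # the returned value is never yielded to the caller.
    return "pin"

result = parse_bookmark()
print(type(result).__name__)  # -> generator
print(list(result))           # -> ['request']  (the returned "pin" is lost)
```

This also shows a side effect of the yield/return mix: the `return pin` value is silently dropped, so the item never reaches Scrapy.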
The fix is simple. Change your code to the following:
yield from self.parse_bookmark(bookmark)
This yields the values from the generator one at a time, instead of the generator itself. Alternatively, you can do the same with a loop:
for ret in self.parse_bookmark(bookmark):
    yield ret
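The two forms are equivalent for this case; a small standalone illustration (names are made up for the demo):

```python
def inner():
    yield 1
    yield 2

def with_yield_from():
    # Delegates to the sub-generator, re-yielding each of its values
    yield from inner()

def with_loop():
    # Manual equivalent of `yield from` for simple iteration
    for ret in inner():
        yield ret

print(list(with_yield_from()))  # -> [1, 2]
print(list(with_loop()))        # -> [1, 2]
```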
Edit-1
Change the functions to yield the item first:
yield pin
yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)
And the same in the other one:
url_slug['user_list'] = user_list_flat
yield url_slug

for user in user_list:
    yield scrapy.Request('https://pinboard.in/u:%s/before:%s' % (user, self.before), callback=self.parse)
Yielding the items later means that many other requests get scheduled first, and it takes a while before you start seeing scraped items. I ran your code with these changes and it works fine for me:
2017-08-20 14:02:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pinboard.in/u:%5B'semanticdreamer'%5D/before:3000000000>
{'url_slug': 'e1ff3a9fb18873e494ec47d806349d90fec33c66', 'title': 'Flair Conky Offers Dark & Light Version For All Linux Distributions - NoobsLab | Ubuntu/Linux News, Reviews, Tutorials, Apps', 'author': 'semanticdreamer'}
2017-08-20 14:02:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pinboard.in/url:d9c16292ec9019fdc8411e02fe4f3d6046185c58>
{'user_list': ['ronert', 'notiv']}