Scrapy - XPath: return the parent node whose content matches a regular expression

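Hi,

I am trying to use Scrapy to iteratively scrape information from websites. The starting point is a site that lists URLs. I get those URLs with Scrapy using the following code:

Step 1: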

def parse(self, response):
    # Collect every link found inside elements with class "column".
    for href in response.css('.column a::attr(href)'):
        full_url = response.urljoin(href.extract())
        yield {'url': full_url}

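Then, for each of those URLs, I look for specific URLs that contain a keyword (I am new to Scrapy, so for now I am doing each step separately; eventually I want to run everything with one spider):

Step 2: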

def parse(self, response):
    # translate() lowercases @href so the keyword match is case-insensitive.
    for href in response.xpath('//a[contains(translate(@href,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz"),"keyword")]/@href'):
        full_url = response.urljoin(href.extract())
        yield {'url': full_url}

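So far so good, but then the last step:

Step 3: I want to get specific information from the returned URLs, if it is there. Now I am running into trouble ;o) What I am trying to accomplish:

Use a regular expression to search for the element whose value/content matches the regex: ([0-9][0-9][0-9][0-9].*[A-Z][A-Z]) >> this matches 1234AB and/or 1234 AB
Return the entire parent div (later on, if possible, I would like to return the two parents above when there is no parent div, but that is for later)

So, given the HTML below, I would like to get back the contents of the parent div (<div class="contenttxt">). Note that I do not know the class in advance, so I cannot match on it.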

<html>
<head>
<title>Webpage</title>
</head>
<body>
<h1 class="bookTitle">A very short ebook</h1>
<p style="text-align:right">some text</p>
<div class="contenttxt">
<h1>Info</h1>
<h4>header text</h4>
<p>something<br />
1234 AB</p>
<p>somthing else</p>
</div>
<h2 class="chapter">Chapter One</h2>
<p>This is a truly fascinating chapter.</p>
<h2 class="chapter">Chapter Two</h2>
<p>A worthy continuation of a fine tradition.</p>
</body>
</html>

The code I tried:

2016-05-31 18:59:32 [scrapy] INFO: Spider opened
2016-05-31 18:59:32 [scrapy] DEBUG: Crawled (200) <GET http://localhost/test/test.html> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f6bc2be0e90>
[s]   item       {}
[s]   request    <GET http://localhost/test/test.html>
[s]   response   <200 http://localhost/test/test.html>
[s]   settings   <scrapy.settings.Settings object at 0x7f6bc2be0d10>
[s]   spider     <DefaultSpider 'default' at 0x7f6bc2643b90>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>> response.xpath('//*').re('([0-9][0-9][0-9][0-9].*[A-Z][A-Z])')
[u'1234 AB', u'1234 AB', u'1234 AB', u'1234 AB']

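To begin with, it returns the match 4 times, so at least it finds something. I searched for "scrapy xpath return parent node", but that only gave me a "solution" that yields a single result: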

>>> response.xpath('//*/../../../..').re('([0-9][0-9][0-9][0-9].*[A-Z][A-Z])')
[u'1234 AB']

I also tried something like:

>>> for nodes in response.xpath('//*').re('([0-9][0-9][0-9][0-9].*[A-Z][A-Z])'):
...     for i in nodes.xpath('ancestor:://*'):
...         print i
... 
Traceback (most recent call last):
File "<console>", line 2, in <module>
AttributeError: 'unicode' object has no attribute 'xpath'

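But that did not help either. Hopefully someone can point me in the right direction: first, because I have no idea why the regex matches 4 times, and second, because I am out of ideas on how to get where I want to be. I have just gone through most of the promising results shown under "Questions that may already have your answer", but did not find my solution. My best guess is that I have to build some kind of loop, but again, I am clueless.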

In the end, I am trying to get one spider that outputs the URLs found in steps 1 and 2 together with the data from step 3.

Thanks! KR, Onno.

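The re method extracts data after the XPath selector has extracted the elements of interest; see the documentation for more information. If you know the element (probably a div in this case), you can either loop over all divs and check their content, or use Scrapy's built-in support for regular expressions inside XPath. Using your earlier example, something like the following: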

response.xpath(r'//div[re:test(., "[0-9]{4}\s?[A-Z]{2}")]').extract()

[u'<div class="contenttxt">\n            <h1>Info</h1>\n        <h4>header text</h4>\n\n        <p>something<br>\n        1234 AB</p>\n\n        <p>somthing else</p>\n      </div>']
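
As for why the regex matched four times: //* selects every element, and each selected element is checked together with everything nested inside it, so the text "1234 AB" is found once in each of html, body, div and p. The sketch below reproduces that count using only Python's standard library (xml.etree.ElementTree rather than Scrapy's selectors), then walks up from the innermost match to the nearest enclosing div. The trimmed HTML copy and the helper names text_of, nearest_div and container_for are assumptions made for illustration and are not part of Scrapy; the two-levels-up fallback is one possible reading of the "two parents above" idea in the question.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the question's test page.
HTML = """<html>
<head><title>Webpage</title></head>
<body>
<h1 class="bookTitle">A very short ebook</h1>
<div class="contenttxt">
<h1>Info</h1>
<p>something<br />
1234 AB</p>
<p>somthing else</p>
</div>
<h2 class="chapter">Chapter One</h2>
</body>
</html>"""

PATTERN = re.compile(r"[0-9]{4}\s?[A-Z]{2}")
root = ET.fromstring(HTML)


def text_of(el):
    """Text of an element including all of its descendants."""
    return "".join(el.itertext())


# Every element whose own or nested text matches: this is why a query
# over all elements reports the same value several times.
matches = [el for el in root.iter() if PATTERN.search(text_of(el))]
matching_tags = [el.tag for el in matches]
print(matching_tags)

# Keep only the innermost matches (no child that also matches).
innermost = [
    el for el in matches
    if not any(PATTERN.search(text_of(child)) for child in el)
]

# ElementTree has no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def nearest_div(el):
    """Return el itself or its closest ancestor that is a <div>, else None."""
    while el is not None and el.tag != "div":
        el = parent_of.get(el)
    return el


def container_for(el, levels=2):
    """Nearest <div>; if there is none, go up `levels` parents instead."""
    div = nearest_div(el)
    if div is not None:
        return div
    for _ in range(levels):
        el = parent_of.get(el, el)  # stay put once the root is reached
    return el


for el in innermost:
    print(container_for(el).get("class"))
```

Inside Scrapy itself, an XPath along the lines of //text()[re:test(., "[0-9]{4}\s?[A-Z]{2}")]/ancestor::div[1] may give the nearest div directly, without walking the tree in Python; treat that expression as an untested suggestion.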
