XPath/Scrapy scrape DOCTYPE

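Hi guys,

I'm building a scraper with Scrapy and XPath. What I want to scrape is the DOCTYPE of every site I crawl. I'm having trouble finding documentation on this, and I feel it should be possible since it is a relatively simple request. Any suggestions?

Cheers

Joey

Here is my code so far: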

import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
from tutorial.items import DanishItem
from scrapy.http import Request
import csv

class DanishSpider(scrapy.Spider):
  name = "dmoz"
  allowed_domains = []
  start_urls = [very long list of websites]
  def parse(self, response):
    for sel in response.xpath(???):
      item = DanishItem()
      item['website'] = response
      item['DOCTYPE'] = sel.xpath('????').extract()
      yield item

New spider: it retrieves the DOCTYPE, but for some reason it writes my response to the specified .json file 15 times instead of once.

class DanishSpider(scrapy.Spider):
  name = "dmoz"
  allowed_domains = []
  start_urls = ["http://wwww.example.com"]
  def parse(self, response):
    for sel in response.selector._root.getroottree().docinfo.doctype:
      el = response.selector._root.getroottree().docinfo.doctype
      item = DanishItem()
      item['website'] = response
      item['doctype'] = el
      yield item

Since Scrapy uses lxml as its default selector, you can use the response.selector handle to get this information from lxml, like this:

response.selector._root.getroottree().docinfo.doctype

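This should be enough, but if you would prefer another approach, read on.

You should be able to extract the same information with Scrapy's regular-expression extractor: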

response.selector.re(r"<!\s*DOCTYPE\s*(.*?)>")

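Unfortunately, this will not work, because lxml has a rather questionable behavior (a bug?) around the doctype: it does not appear in the markup the selector works on. That is why you can't get it directly from selector.re.
You can easily get around this small obstacle by using the re module directly on the response.body text, which is serialized correctly: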

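import re
# response.body is the raw page; on Python 3 it is bytes, so use response.text (or a bytes pattern) there
s = re.search(r"<!\s*doctype\s*(.*?)>", response.body, re.IGNORECASE)
doctype = s.group(1) if s else ""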
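
To try the regex approach outside a spider, here is a minimal sketch. The helper name extract_doctype and the sample markup are made up for illustration and are not part of Scrapy. It accepts either bytes (what response.body is on Python 3) or text, and it adds re.DOTALL, which the snippet above does not use, so that a declaration split across two lines still matches.

```python
import re

# Pattern from the answer, compiled once for text and once for raw bytes.
# re.DOTALL is an addition here so multi-line declarations are matched too.
_DOCTYPE_TEXT = re.compile(r"<!\s*doctype\s*(.*?)>", re.IGNORECASE | re.DOTALL)
_DOCTYPE_BYTES = re.compile(br"<!\s*doctype\s*(.*?)>", re.IGNORECASE | re.DOTALL)


def extract_doctype(body):
    """Return what follows 'DOCTYPE' in the first declaration, or ''."""
    if isinstance(body, bytes):
        match = _DOCTYPE_BYTES.search(body)
        # Decode leniently: the declaration is normally plain ASCII.
        return match.group(1).decode("ascii", "replace") if match else ""
    match = _DOCTYPE_TEXT.search(body)
    return match.group(1) if match else ""


# Hypothetical page body standing in for response.body.
sample = b"<!DOCTYPE html>\n<html><head></head><body></body></html>"
print(extract_doctype(sample))  # prints: html
```

Inside a spider, the same helper could be called as extract_doctype(response.body). Note that it returns only the part after the DOCTYPE keyword, not the full declaration.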

Update:

As for your other question, here is the reason. The line:

response.selector._root.getroottree().docinfo.doctype

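returns a string, not a list or a similar iterator. So when you iterate over it, you are really iterating over the characters of that string. For example, if your DOCTYPE is <!DOCTYPE html>, the string has 15 characters, which is why your loop runs 15 times. You can verify it like this: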

for sel in response.selector._root.getroottree().docinfo.doctype:
    print(sel)

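You should see your DOCTYPE string printed one character per line.

What you should do is remove the for loop entirely and fetch the data without looping. Also, if item['website'] = response is meant to collect the site's URL, change it to item['website'] = response.url. So it basically becomes: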

def parse(self, response):
  doctype = response.selector._root.getroottree().docinfo.doctype
  item = DanishItem()
  item['website'] = response.url
  item['doctype'] = doctype
  yield item
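
To see the difference without running a spider, here is a small sketch in plain Python. The literal string stands in for the value returned by docinfo.doctype, and plain dicts stand in for DanishItem; both are assumptions made only for this illustration.

```python
# Stand-in for response.selector._root.getroottree().docinfo.doctype
doctype = "<!DOCTYPE html>"

# Buggy pattern: looping over a string visits one character at a time,
# so one item is produced per character.
items_from_loop = [{"doctype": doctype} for _ in doctype]

# Fixed pattern: use the string directly and produce a single item.
items_without_loop = [{"doctype": doctype}]

print(len(doctype), len(items_from_loop), len(items_without_loop))  # prints: 15 15 1
```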
