Web scraping - McKinsey articles



I am trying to get the article titles, but I don't know how to extract the title text. Could you take a look at my code below and suggest a solution?

I am new to Scrapy. Thanks for your help!

Screenshot of the page in the browser developer tools: https://i.stack.imgur.com/bpn4w.jpg

import scrapy

class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['https://www.mckinsey.com/search?q=Agile&start=1']
    def parse(self, response):
        for quote in response.css('div.text-wrapper'):
            item = {
                'text': quote.css('h3.headline::text').extract(),
            }
            print(item)
            yield item

Looks good for a new Scrapy developer! I would only change the selectors in your parse function:

for quote in response.css('div.block-list div.item'):
    yield {
        'text': quote.css('h3.headline::text').get(),
    }

upd: hm, it looks like your site loads the data with an additional request.

Open the developer tools and look at the request to https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search with the payload {"q":"Agile","page":1,"app":"","sort":"default","ignoreSpellSuggestion":false}. You can build a scrapy.Request with these parameters and the appropriate headers and get the data back as JSON, which is then easy to handle with the json lib.
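Before wiring this into Scrapy, it can help to confirm the endpoint and payload outside of the spider. A quick sketch with the requests library, assuming the endpoint answers a plain JSON POST the same way it does in the browser:

import requests

# payload taken from the request visible in developer tools
payload = {"q": "Agile", "page": 1, "app": "", "sort": "default",
           "ignoreSpellSuggestion": False}

# json= serializes the payload and sets the content-type header for us
resp = requests.post(
    "https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search",
    json=payload,
)
print(resp.status_code)
print(resp.json())  # the search results as a Python dict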

upd2: as I can see from this curl command curl 'https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search' -H 'content-type: application/json' --data-binary '{"q":"Agile","page":1,"app":"","sort":"default","ignoreSpellSuggestion":false}' --compressed, we need to make the request this way:

from scrapy import Request
import json
data = {"q": "Agile", "page": 1, "app": "", "sort": "default", "ignoreSpellSuggestion": False}
headers = {"content-type": "application/json"}
url = "https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search"
yield Request(url, method='POST', headers=headers, body=json.dumps(data), callback=self.parse_api)

and then just parse the response in the parse_api callback:

def parse_api(self, response):
    data = json.loads(response.body)
    # and then extract what you need

This way you can iterate over the page parameter in the request and fetch all of the pages.

upd3: working solution:

from scrapy import Spider, Request
import json

class BrickSetSpider(Spider):
    name = "brickset_spider"
    # payload and headers copied from the request seen in developer tools
    data = {"q": "Agile", "page": 1, "app": "", "sort": "default", "ignoreSpellSuggestion": False}
    headers = {"content-type": "application/json"}
    url = "https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search"

    def start_requests(self):
        # POST the JSON payload to the search API instead of scraping the HTML page
        yield Request(self.url, headers=self.headers, method='POST',
                      body=json.dumps(self.data), meta={'page': 1})

    def parse(self, response):
        data = json.loads(response.body)
        results = data.get('data', {}).get('results')
        if not results:
            # no results left: stop paginating
            return
        for row in results:
            yield {'title': row.get('title')}
        # request the next page with the same payload
        page = response.meta['page'] + 1
        self.data['page'] = page
        yield Request(self.url, headers=self.headers, method='POST',
                      body=json.dumps(self.data), meta={'page': page})
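If you want to run the spider as a standalone script and write the titles to a file, here is a minimal sketch using CrawlerProcess (assuming Scrapy 2.1 or newer for the FEEDS setting):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEEDS': {'titles.json': {'format': 'json'}},  # write yielded items to titles.json
})
process.crawl(BrickSetSpider)
process.start()  # blocks until the crawl finishes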

If you just want to select the text of the h1 tags, all you need is:

[tag.css('::text').extract_first(default='') for tag in response.css('.attr')]

Here is the same thing using XPath, which might be easier:

 //h1[@class='state']/text()
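Either selector drops straight into a Scrapy callback. A minimal sketch, assuming the page actually contains <h1 class="state"> elements (the class name is just taken from the example above):

def parse(self, response):
    # CSS and XPath versions of the same extraction
    titles_css = response.css('h1.state::text').getall()
    titles_xpath = response.xpath("//h1[@class='state']/text()").getall()
    for title in titles_css:
        yield {'title': title.strip()}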

Also, I suggest taking a look at Python's BeautifulSoup. It is very easy and effective for reading a page's whole HTML and extracting text: https://www.crummy.com/software/beautifulsoup/bs4/doc/

A very simple example looks like this:

from bs4 import BeautifulSoup
text = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
soup = BeautifulSoup(text, 'html.parser')  # specify a parser to avoid the warning
print(soup.get_text())
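Applied to the original question, extracting the headlines with BeautifulSoup would look roughly like this. It is only a sketch: it assumes the h3.headline elements show up in the downloaded HTML, which, as noted above, they may not, since the results are loaded through the search API:

import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.mckinsey.com/search?q=Agile&start=1').text
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('h3.headline'):  # same selector as in the Scrapy spider
    print(tag.get_text(strip=True))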
