Using Scrapy's built-in selectors on local HTML files

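I have some local HTML files and need to extract some elements from them. I'm used to writing Scrapy spiders and extracting elements with its built-in xpath and css selectors together with .extract() and .extract_first().

Is there a library that can do this?

I looked at BeautifulSoup and lxml, but their syntax is different from Scrapy's.

For example, I'd like to do something like this: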

sample_file = "../raw_html_text/sample.html"
with open(sample_file, 'r', encoding='utf-8-sig', newline='') as f:
    page = f.read()
html_object = # convert string to html or something
print(html_object.css("h2 ::text").extract_first())

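I regularly import Scrapy's selectors into other projects because I like them so much. Just import the Selector class and pass it a string, and it works exactly as it does inside Scrapy.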

from scrapy import Selector
sample_file = "../raw_html_text/sample.html"
with open(sample_file, 'r', encoding='utf-8-sig', newline='') as f:
    page = f.read()
data = Selector(text=str(page))
title = data.css('h2::text').get()
# used to be data.css('h2::text').extract_first()

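I know you specifically mentioned that BeautifulSoup's syntax differs from Scrapy's, but it is definitely a suitable tool for this job, and it does have a method that accepts CSS selectors.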

from bs4 import BeautifulSoup
sample_file = "../raw_html_text/sample.html"
with open(sample_file, 'r', encoding='utf-8-sig', newline='') as f:
    page = f.read()
# name the parser explicitly; omitting it makes bs4 guess and emit a warning
html_object = BeautifulSoup(page, "html.parser")
print(html_object.select("h2")[0].text)
# or print(html_object.select("div.container")[0].text) for div class="container", etc.

FWIW, accessing the output is also very easy. The select method returns a list of the matching elements (bs4 Tag objects), and each of them has a .text attribute.

Use Parsel, which is what Scrapy uses under the hood.
