Parsing HTML with lxml: decoding error



http://www.findmice.org/repository

$ file /tmp/repository.html
/tmp/repository.html: HTML document text, ISO-8859 text

I am trying to parse the above file with the following Python code.

import sys
from lxml import html
doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='iso-8859-1'))

But I get the following error.

Traceback (most recent call last):
  File "../imsrrepo.py", line 14, in <module>
    doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='iso-8859-1'))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/lxml/html/__init__.py", line 939, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/etree.pyx", line 3467, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1860, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1880, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1775, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 707, in lxml.etree._handleParseResult
  File "src/lxml/etree.pyx", line 318, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/parser.pxi", line 370, in lxml.etree._FileReaderContext.copyToBuffer
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
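UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 3597: invalid start byte

from lxml import html
# Open the file with its actual encoding so Python does not fall back to UTF-8.
with open(file, "r", encoding="iso-8859-1") as html_file:
    doc = html.parse(html_file)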
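
One possible reading of the traceback: the exception is raised in codecs.py by a UTF-8 decoder. That would fit sys.stdin being a text stream that decodes its input before lxml ever sees it. The sketch below reproduces that behaviour with the standard library only. The sample markup and the names `raw` and `text_stream` are made up for illustration, and `io.TextIOWrapper` merely stands in for a text-mode stdin.

```python
import io

# Hypothetical sample page; "ü" is the single byte 0xfc in ISO-8859-1.
raw = "<html><body><p>Müller</p></body></html>".encode("iso-8859-1")

# A text-mode stream configured for UTF-8 decodes the bytes on read().
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
try:
    text_stream.read()
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    # The byte the UTF-8 decoder rejected.
    failing_byte = exc.object[exc.start]

# Decoding the same bytes with the encoding reported by `file` succeeds.
decoded = raw.decode("iso-8859-1")
```

If this reading is right, the encoding given to the parser never comes into play, because the stream fails before any data reaches it.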

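I don't think lxml ever says that lxml.html.parse supports an encoding argument. It may be buried in the source code, but the documentation never mentions it. It is also possible that lxml's HTML parser cannot handle "iso-8859-1", or that the encoding declared in the META tag does not match the actual encoding.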
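
For what it is worth, html.HTMLParser does accept an encoding argument, but it can only take effect when the parser is handed bytes rather than already-decoded text. A minimal sketch, assuming the input really is ISO-8859-1: `io.BytesIO` stands in here for a binary stream such as sys.stdin.buffer, and the sample markup is invented.

```python
import io
from lxml import html

# Hypothetical sample page encoded as ISO-8859-1 (contains the byte 0xfc).
raw = "<html><body><p>Müller</p></body></html>".encode("iso-8859-1")

# Tell the parser how to interpret the bytes it receives.
parser = html.HTMLParser(encoding="iso-8859-1")

# A binary file-like object; in the original script this could be
# sys.stdin.buffer instead of sys.stdin.
doc = html.parse(io.BytesIO(raw), parser=parser)
text = doc.getroot().text_content()
```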

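Maybe someone can give you a better explanation, but as far as I know, we should either read the file with the correct encoding or use BeautifulSoup to read it. This is noted in the documentation:

"However, note that the most common problem with web pages is the lack of (or the existence of incorrect) encoding declarations. It is therefore often sufficient to only use the encoding detection of BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster."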
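
The approach in that quote could be sketched roughly as follows, assuming the bs4 package is installed alongside lxml. The helper name `decode_html` and the sample markup are hypothetical. The candidate encoding is passed only to make the sketch deterministic; leaving it out lets UnicodeDammit guess on its own.

```python
from bs4 import UnicodeDammit
from lxml import html


def decode_html(raw_bytes, candidates=()):
    """Return the markup as str, letting UnicodeDammit pick the encoding."""
    # Candidate encodings (if any) are tried before UnicodeDammit's own guesses.
    converted = UnicodeDammit(raw_bytes, list(candidates))
    if not converted.unicode_markup:
        raise ValueError("could not detect the encoding of the document")
    return converted.unicode_markup


# Hypothetical sample page encoded as ISO-8859-1.
raw = "<html><body><p>Müller</p></body></html>".encode("iso-8859-1")

# Decode with UnicodeDammit, then leave the parsing to lxml.
root = html.fromstring(decode_html(raw, ["iso-8859-1"]))
text = root.text_content()
```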
