html5lib: TypeError: __init__() got an unexpected keyword argument 'encoding'



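I am trying to install html5lib. At first I tried the latest version (8 or 9 nines), but it conflicted with my BeautifulSoup, so I decided to try an older version (0.9999999, seven nines). I installed it, but when I try to use it: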

>>> with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, encoding=f.info().get_content_charset())

I get an error:

Traceback (most recent call last):
  File "<pyshell#11>", line 2, in <module>
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 35, in parse
    return p.parse(doc, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\_inputstream.py", line 151, in HTMLInputStream
    return HTMLBinaryInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'

What went wrong, and what should I do?

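I see that the latest version of html5lib has a problem with bs4: html5lib.treebuilders._base no longer exists. With bs4 4.4.1, the latest compatible version seems to be the one with seven nines. Once you install it as follows, it works fine: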

 pip3 install -U html5lib=="0.9999999"

Tested with bs4 4.4.1:

In [1]: import bs4
In [2]: bs4.__version__
Out[2]: '4.4.1'
In [3]: import html5lib
In [4]: html5lib.__version__
Out[4]: '0.9999999'
In [5]: from urllib.request import  urlopen
In [6]: with urlopen("http://example.com/") as f:
   ...:         document = html5lib.parse(f, encoding=f.info().get_content_charset())
   ...:     
In [7]: 

You can see the change in this commit: _base was renamed to base to reflect its public status.

The error you are seeing occurs because you are still using the latest version. In html5lib/_inputstream.py, HTMLBinaryInputStream has no encoding argument:

class HTMLBinaryInputStream(HTMLUnicodeInputStream):
    """Provides a unicode stream of characters to the HTMLTokenizer.
    This class takes care of character encoding and removing or replacing
    incorrect byte-sequences and also provides column and line tracking.
    """
    def __init__(self, source, override_encoding=None, transport_encoding=None,
                 same_origin_parent_encoding=None, likely_encoding=None,
                 default_encoding="windows-1252", useChardet=True):

Setting override_encoding=f.info().get_content_charset() should do the trick.

Upgrading to the latest version of bs4 also lets you use the latest version of html5lib:

In [16]: bs4.__version__
Out[16]: '4.5.1'
In [17]: html5lib.__version__
Out[17]: '0.999999999'
In [18]: with urlopen("http://example.com/") as f:
   ....:     document = html5lib.parse(f, override_encoding=f.info().get_content_charset())
   ....:     
In [19]: 
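
If the same script has to run against both the seven-nines release and a newer one, one option is to try the new keyword first and fall back to the old one. The following is only a minimal sketch: parse_with_charset is a hypothetical helper name, not part of html5lib, and it assumes that passing an unknown keyword raises a TypeError naming that keyword, as in the traceback above.

```python
def parse_with_charset(parse, source, charset):
    """Call an html5lib-style parse function with a known charset.

    `parse` is expected to be html5lib.parse (passed in by the caller).
    Hypothetical helper, not part of html5lib itself.
    """
    try:
        # Newer releases: keyword taken from the HTMLBinaryInputStream
        # signature quoted above.
        return parse(source, override_encoding=charset)
    except TypeError as exc:
        # Re-raise anything that is not the "unexpected keyword" case,
        # so unrelated TypeErrors are not silently swallowed.
        if "override_encoding" not in str(exc):
            raise
        # Older releases (for example 0.9999999) accept `encoding`.
        return parse(source, encoding=charset)
```

It would be called as parse_with_charset(html5lib.parse, f, f.info().get_content_charset()) inside the with urlopen(...) block. The signature quoted above also lists transport_encoding, which is the parameter the html5lib README uses for a charset taken from the HTTP headers; swapping it in for override_encoding is worth considering if the charset declared in the document itself should be allowed to take precedence.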
