lxml cannot parse my pages: etree.fromstring() raises a different error each time



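I'm trying to use lxml's parsing functions, and I get a different error every time. I thought the problem was with the website, but when I tried it on Google and Wikipedia it didn't work there either.

Can anyone help? If a different error appears each time, is the problem in my environment or in my program?

I'm running this code: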

from lxml import etree

driver.get('https://www.google.com.br/')
data = driver.page_source
tree = etree.fromstring(data)

On Google I get this error:

File "src\lxml\etree.pyx", line 3257, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1796, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
File "src\lxml\parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 728, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 657, in lxml.etree._raiseParseError
File "<string>", line 2
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 2, column 55

On Wikipedia I get this error:

File "src\lxml\etree.pyx", line 3257, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1796, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
File "src\lxml\parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 728, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 657, in lxml.etree._raiseParseError
File "<string>", line 19
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 19 and head, line 19, column 57

For the site I actually want lxml to work with, I get this error:

File "src\lxml\etree.pyx", line 3257, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1796, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
File "src\lxml\parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 728, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 657, in lxml.etree._raiseParseError
File "<string>", line 5
lxml.etree.XMLSyntaxError: error parsing attribute name, line 5, column 301

If anyone knows an alternative to lxml, I'd appreciate that too.

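As John Gordon said, this is HTML, not XML, so you have to parse it as HTML. etree.fromstring() uses a strict XML parser by default, and real-world HTML is rarely well-formed XML (bare & characters, unclosed tags such as <link>, attributes without values), so each page fails at whichever of these constructs comes first. That is most likely why the error differs from site to site; neither your environment nor lxml itself is broken.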
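To see why the message changes from page to page, here is a minimal sketch that feeds the XML parser three tiny snippets. The snippets are made up for illustration (they are not taken from the pages in the question); each contains one construct that browsers accept but XML forbids.

```python
from lxml import etree

# Hypothetical snippets: each is acceptable HTML but not well-formed XML.
snippets = {
    "bare ampersand": "<p>a & b</p>",
    "unclosed void tag": "<head><link rel='stylesheet'></head>",
    "attribute without value": "<input disabled>",
}

errors = {}
for label, markup in snippets.items():
    try:
        # etree.fromstring() uses the strict XML parser by default.
        etree.fromstring(markup)
    except etree.XMLSyntaxError as exc:
        # Record and show whichever complaint the XML parser raised.
        errors[label] = str(exc)
        print(f"{label}: {exc}")
```

Each snippet should fail with its own XMLSyntaxError, which mirrors what happens with full pages: the parser stops at the first construct it cannot accept.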

Try this:

from lxml import html
driver.get('https://www.google.com.br/') 
data = driver.page_source
tree = html.fromstring(data)
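If you want to check this without a browser, the sketch below runs the same idea on a static string. The page_source string is a hypothetical stand-in for driver.page_source, and the XPath queries are only examples. It also shows what I believe is an equivalent route through etree, passing an explicit etree.HTMLParser().

```python
from lxml import etree, html

# Hypothetical stand-in for driver.page_source: HTML that is not valid XML
# (bare "&", unclosed <link>, attribute without a value).
page_source = """<!DOCTYPE html>
<html>
<head><link rel="stylesheet" href="style.css"><title>Demo & test</title></head>
<body><a href="/search?q=lxml&hl=pt-BR">Search</a><input disabled></body>
</html>"""

# lxml.html uses the lenient HTML parser, so this does not raise.
tree = html.fromstring(page_source)
title = tree.findtext(".//title")
links = tree.xpath("//a/@href")

# Alternative: keep using etree, but pass an HTML parser explicitly.
tree2 = etree.fromstring(page_source, parser=etree.HTMLParser())
links2 = tree2.xpath("//a/@href")

print(title)
print(links, links2)
```

Both trees support the usual xpath() and find() calls, so the rest of your scraping code should not need to change.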
