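I'm trying to use lxml's parsing functions, and I get a different error every time. I thought the problem was with the website, but when I tried it on Google and Wikipedia it didn't work there either!
Can someone help me? If I get a different error each time, is the problem in my environment or in my program?
I'm running this code: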
driver.get('https://www.google.com.br/')
data = driver.page_source
tree = etree.fromstring(data)
On Google I get this error:
File "src\lxml\etree.pyx", line 3257, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1796, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
File "src\lxml\parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 728, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 657, in lxml.etree._raiseParseError
File "<string>", line 2
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 2, column 55
On Wikipedia I get this error:
File "src\lxml\etree.pyx", line 3257, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1796, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
File "src\lxml\parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 728, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 657, in lxml.etree._raiseParseError
File "<string>", line 19
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 19 and head, line 19, column 57
On the site I actually want lxml to work with, I get this error:
File "src\lxml\etree.pyx", line 3257, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1796, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
File "src\lxml\parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 728, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 657, in lxml.etree._raiseParseError
File "<string>", line 5
lxml.etree.XMLSyntaxError: error parsing attribute name, line 5, column 301
If anyone knows an alternative to lxml, I'd appreciate that too.
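As John Gordon said, this is HTML, not XML, so you have to parse it as HTML. etree.fromstring uses a strict XML parser, and real-world pages are rarely well-formed XML: a bare & or an unclosed <link> tag is enough to make it fail, which is why each site gives you a different error. The problem is in the code, not in your environment.
Try this: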
from lxml import html
driver.get('https://www.google.com.br/')
data = driver.page_source
tree = html.fromstring(data)
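
To see the difference without a browser, here is a minimal sketch that feeds the same markup to both parsers. The data string is a made-up stand-in for driver.page_source (it is not taken from any of the pages above); it just contains the two kinds of problems the tracebacks complain about, an unclosed <link> tag and a bare &.

```python
from lxml import etree, html

# Hypothetical stand-in for driver.page_source: ordinary HTML that is not
# well-formed XML (unclosed <link> tag, unescaped "&" in the text).
data = (
    "<html><head><link rel='stylesheet' href='style.css'></head>"
    "<body><p>Tom & Jerry</p></body></html>"
)

# The strict XML parser rejects this input.
try:
    etree.fromstring(data)
    xml_ok = True
except etree.XMLSyntaxError as exc:
    xml_ok = False
    print("XML parser failed:", exc)

# The lenient HTML parser recovers from the same input.
tree = html.fromstring(data)
p_text = tree.findtext(".//p")
print("HTML parser read:", p_text)

# etree can also parse HTML if it is given an explicit HTMLParser.
tree2 = etree.fromstring(data, parser=etree.HTMLParser())
link_rels = tree2.xpath("//link/@rel")
print("link rel values:", link_rels)
```

Either html.fromstring or etree.fromstring with an etree.HTMLParser should work on page_source; the first is shorter and returns elements with a few extra HTML-specific helpers.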