lxml.html does not find the body tag

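I am using lxml.html to parse various HTML pages. I have now noticed that, at least for some pages, it does not find the body tag even though the tag is present, while Beautiful Soup does find it (even though it uses lxml as its parser).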

Example page: https://plus.google.com/ (what remains of it)

import lxml.html
import bs4
html_string = """
    ... source code of https://plus.google.com/ (manually copied) ...
"""
# lxml fails (body is None)
body = lxml.html.fromstring(html_string).find('body')
# Beautiful soup using lxml parser succeeds
body = bs4.BeautifulSoup(html_string, 'lxml').find('body')

Any guesses as to what is happening here are welcome :)

Update:

The problem seems to be related to encoding.

# working version
body = lxml.html.document_fromstring(html_string.encode('unicode-escape')).find('body')

You can use something like this:

import requests
import lxml.html
html_string = requests.get("https://plus.google.com/").content
body = lxml.html.document_fromstring(html_string).find('body')

The body variable then contains the body HTML element.
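For trying this out without a network request, here is a minimal sketch of the same idea: hand lxml the raw bytes of the page (which is what `requests` returns from `.content`) and parse them as a full document. The `SAMPLE_HTML` markup and the `find_body` helper are made up for illustration; they are not the Google+ source and will not necessarily reproduce the original failure.

```python
import lxml.html

# Hypothetical stand-in for a downloaded page (NOT the real Google+ source).
# It is encoded to bytes so that the parser handles the encoding itself,
# just as it would with the bytes from requests' response.content.
SAMPLE_HTML = (
    "<!DOCTYPE html>"
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello</p></body></html>"
).encode("utf-8")


def find_body(raw_html):
    """Parse raw_html as a complete document and return its <body> element.

    Returns None if no <body> child is found under the root element.
    """
    root = lxml.html.document_fromstring(raw_html)
    return root.find("body")


body = find_body(SAMPLE_HTML)
if body is None:
    print("no body element found")
else:
    print("found element:", body.tag)
```

Checking the result against `None` before using it keeps the failure visible instead of surfacing later as an `AttributeError`.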
