BeautifulSoup: 'HTTPResponse' object has no attribute 'encode'

I'm trying to use BeautifulSoup with a URL, like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://proxies.org")
soup = BeautifulSoup(html.encode("utf-8"), "html.parser")
print(soup.find_all('a'))

However, I get an error:

 File "c:Python3ProxyList.py", line 3, in <module>
    html = urlopen("http://proxies.org").encode("utf-8")
AttributeError: 'HTTPResponse' object has no attribute 'encode'

Any idea why? Could it have something to do with the urlopen function? Why does it need utf-8?

Evidently Python 3 and BeautifulSoup4 behave differently from what the examples I found assume (they now seem outdated or wrong)...

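It doesn't work because urlopen returns an HTTPResponse object, and you are treating it as if it were the HTML itself. You need to call the .read() method on the response to get the HTML: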

from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen("http://proxies.org")
html = response.read()  # read() returns bytes, not str
soup = BeautifulSoup(html.decode("utf-8"), "html.parser")
print(soup.find_all('a'))

You probably also want html.decode("utf-8") rather than html.encode("utf-8"): read() returns bytes, and bytes are decoded into a str, not encoded.
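To make the bytes-versus-str point concrete, here is a minimal sketch that runs without a network connection. The helper name decode_body and the sample markup are made up for illustration; with a real response you could pass response.headers.get_content_charset() as the charset instead of assuming UTF-8.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Turn raw response bytes into text, falling back to UTF-8 (an assumption)."""
    # errors="replace" substitutes a marker for bytes that are invalid in the charset
    return raw.decode(charset or "utf-8", errors="replace")


# Stand-in for what response.read() would return: a bytes object
raw = "<a href='/list'>caf\u00e9</a>".encode("utf-8")

# bytes objects have decode() but no encode(), which is why .encode() fails
print(hasattr(raw, "encode"), hasattr(raw, "decode"))

html = decode_body(raw)
print(html)
```

The resulting str can then be handed to BeautifulSoup(html, "html.parser") as in the answer above.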

Try this (note that read() returns bytes, so it must be decoded, not encoded):

soup = BeautifulSoup(html.read().decode('utf-8'), "html.parser")

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://proxies.org")
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all('a'))

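urlopen returns a file-like object.
BeautifulSoup accepts a file-like object and decodes it automatically, so you don't have to worry about the encoding yourself.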
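The same behavior can be checked offline. The sketch below uses io.BytesIO as a stand-in for the response object (an assumption for illustration: like the object urlopen returns, it has a read() method that yields bytes), and the sample markup is invented.

```python
import io

from bs4 import BeautifulSoup

# Stand-in for the object urlopen() returns: anything with a read() that yields bytes
fake_response = io.BytesIO(
    b"<html><body><a href='/a'>one</a><a href='/b'>two</a></body></html>"
)

# BeautifulSoup calls read() on file-like input and decodes the bytes itself
soup = BeautifulSoup(fake_response, "html.parser")

links = [a["href"] for a in soup.find_all("a")]
print(links)
```

No explicit read(), encode(), or decode() call is needed in this form.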

From the documentation:

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.
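As a small illustration of the entity conversion described in that quote, the following sketch parses an invented snippet containing two HTML entities. It passes a str and names "html.parser" explicitly, which the quoted documentation example does not do.

```python
from bs4 import BeautifulSoup

# Markup with two HTML entities: &eacute; and &amp;
markup = "<p>caf&eacute; &amp; tea</p>"

soup = BeautifulSoup(markup, "html.parser")

# The parsed text holds the Unicode characters, not the entity references
text = soup.p.get_text()
print(text)
```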
