BeautifulSoup doesn't recognize UTF-8 characters even with fromEncoding="UTF-8"

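I've written a simple script that just fetches a web page and extracts its contents into a tokenized list. However, I've run into a problem: when I convert the BeautifulSoup object to a string, the UTF-8 characters for ", ', and so on aren't converted. Instead, they remain in Unicode format.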

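When I create the BeautifulSoup object I declare the source as UTF-8, and I've even tried running the Unicode conversion separately, but nothing works. Does anyone know why this is happening?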

from urllib2 import urlopen
from bs4 import BeautifulSoup
import nltk, re, pprint
url = "http://www.bloomberg.com/news/print/2013-07-05/softbank-s-21-6-billion-bid-for-sprint-approved-by-u-s-.html"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, fromEncoding="UTF-8")
result = soup.find_all(id="story_content")
str_result = str(result)
notag = re.sub("<.*?>", " ", str_result)
output = nltk.word_tokenize(notag)
print(output)

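The characters you're having trouble with aren't " (U+0022) and ' (U+0027), but the curly quotes “ (U+201C) and ” (U+201D), plus ’ (U+2019). Convert them to their straight versions first and you should get the result you expect: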

raw = urlopen(url).read()
original = raw.decode('utf-8')
# Replace curly quotes with their straight ASCII equivalents
replacement = original.replace(u'\u201c', '"').replace(u'\u201d', '"').replace(u'\u2019', "'")
soup = BeautifulSoup(replacement)  # Don't need fromEncoding if we're passing in Unicode

This should turn the quote characters into the form you expect.
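If more characters need the same treatment, the chained replace calls can be folded into a single lookup table. The following is only a minimal sketch, assuming Python 3 (where every str is Unicode); the names CURLY_TO_STRAIGHT and straighten_quotes are hypothetical, and the U+2018 entry is an addition that the answer above does not mention. The sample bytes stand in for the downloaded page so that the snippet runs without network access.

```python
# Hypothetical lookup table: code point -> straight ASCII replacement.
CURLY_TO_STRAIGHT = {
    0x201C: '"',  # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',  # RIGHT DOUBLE QUOTATION MARK
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK (assumed addition)
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
}


def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(CURLY_TO_STRAIGHT)


# Stand-in for urlopen(url).read(): UTF-8 bytes containing curly quotes.
raw = "\u201cSprint\u2019s board\u201d".encode("utf-8")
decoded = raw.decode("utf-8")
print(straighten_quotes(decoded))
```

The straightened string can then be passed to BeautifulSoup in place of the replacement variable used above.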
