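Beautiful Soup does not recognize UTF-8 encoding on Python 3, IPython 6 console

I am trying to read an XML document with Beautiful Soup on Python 3.6.2, IPython 6.1.0, Windows 10, but I cannot get the encoding right.

Here is my test XML, saved as a UTF-8 encoded file: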

<?xml version="1.0" encoding="UTF-8"?>
<root>
<info name="愛よ">ÜÜÜÜÜÜÜ</info>
<items>
<item thing="ÖöÖö">"23Äßßß"</item>
</items>
</root>

First, checking the XML with ElementTree:

import xml.etree.ElementTree as ET

def printXML(xml, indent=''):
    # Print the tag and its text, with newlines stripped
    print(indent + str(xml.tag) + ': ' + (xml.text if xml.text is not None else '').replace('\n', ''))
    # Print each attribute one level deeper
    for k, v in xml.attrib.items():
        print(indent + '\t' + k + ' - ' + v)
    # Recurse into child elements (iterating directly; getchildren() is deprecated)
    for child in xml:
        printXML(child, indent + '\t')

xml0 = ET.parse("test.xml").getroot()
printXML(xml0)

The output is correct:

root: 
	info: ÜÜÜÜÜÜÜ
		name - 愛よ
	items: 
		item: "23Äßßß"
			thing - ÖöÖö
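
For context, ElementTree gets this right because ET.parse opens the file in binary mode and decodes it according to the encoding declared in the XML header, so the system's default text encoding never comes into play. A minimal sketch of the same behaviour, using a shortened stand-in for the test document (the string below is an assumption for illustration, not the original file):

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for test.xml, held as UTF-8 bytes (as it would be on disk)
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<root><info name="愛よ">ÜÜÜÜÜÜÜ</info></root>'
).encode("utf-8")

# The parser receives raw bytes and honours the declared encoding
root = ET.fromstring(xml_bytes)
info = root.find("info")
print(info.get("name"), info.text)
```

Because the parser sees bytes rather than pre-decoded text, there is no opportunity for a wrong locale encoding to be applied first.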

Now reading the same file with Beautiful Soup and pretty-printing it:

import bs4
with open("test.xml") as ff:
    xml = bs4.BeautifulSoup(ff, "html5lib")
print(xml.prettify())

Output:

<!--?xml version="1.0" encoding="UTF-8"?-->
<html>
<head>
</head>
<body>
<root>
<info name="æ„›ã‚ˆ">
ÃœÃœÃœÃœÃœÃœÃœ
</info>
<items>
<item thing="Ã–Ã¶Ã–Ã¶">
"23Ã„ÃŸÃŸÃŸ"
</item>
</items>
</root>
</body>
</html>

This is wrong. Passing the encoding explicitly, i.e. calling bs4.BeautifulSoup(ff,"html5lib",from_encoding="UTF-8"), does not change the result.

Running

print(xml.original_encoding)

prints

None

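So Beautiful Soup is apparently unable to detect the original encoding, even though the file is encoded in UTF-8 (according to Notepad++), the header also says UTF-8, and I did install chardet as the documentation suggests.

Am I making a mistake here? What could be causing this?

EDIT: When I call the code without specifying html5lib, I get the following warning: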

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). 
This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, 
it may use a different parser and behave differently.
The code that caused this warning is on line 241 of the file C:\Users\My.Name\AppData\Local\Continuum\Anaconda2\envs\Python3\lib\site-packages\spyder\utils\ipython\start_kernel.py. 
To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "html5lib")
markup_type=markup_type))

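EDIT 2:

As suggested in the comments, I tried bs4.BeautifulSoup(ff,"html.parser"), but the problem persists.

Then I installed lxml and tried bs4.BeautifulSoup(ff,"lxml-xml"); the output is still the same.

What also strikes me as odd is that even when I specify an encoding, as in bs4.BeautifulSoup(ff,"lxml-xml",from_encoding='UTF-8'), the value of xml.original_encoding is None, contrary to what the documentation says.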
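
A likely explanation, offered as a sketch rather than a definitive diagnosis: a file opened in text mode hands Beautiful Soup already-decoded str data, and as far as I understand the documentation, from_encoding only applies to byte input while original_encoding stays None for Unicode input. The snippet below uses a shortened stand-in for the test document and the built-in html.parser so that no extra parser is needed; both choices are assumptions made for illustration.

```python
import bs4

# Shortened stand-in for the test document (not the original file)
xml_text = '<root><info name="愛よ">ÜÜÜÜÜÜÜ</info></root>'
xml_bytes = xml_text.encode("utf-8")

# str input: the text is already decoded, so there is nothing left to detect
soup_from_str = bs4.BeautifulSoup(xml_text, "html.parser")

# bytes input: Beautiful Soup does the decoding itself and records the encoding
soup_from_bytes = bs4.BeautifulSoup(xml_bytes, "html.parser", from_encoding="utf-8")

print(soup_from_str.original_encoding)
print(soup_from_bytes.original_encoding)
print(soup_from_bytes.find("info")["name"])
```

If this reading is right, the None value says nothing about the file itself: it only indicates that the decoding happened before Beautiful Soup ever saw the data.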

EDIT 3:

I put my XML content into a string

xmlstring = '<?xml version="1.0" encoding="UTF-8"?><root><info name="愛よ">ÜÜÜÜÜÜÜ</info><items><item thing="ÖöÖö">"23Äßßß"</item></items></root>'

and parsed it with bs4.BeautifulSoup(xmlstring,"lxml-xml"). Now I get the correct output:

<?xml version="1.0" encoding="utf-8"?>
<root>
<info name="愛よ">
ÜÜÜÜÜÜÜ
</info>
<items>
<item thing="ÖöÖö">
"23Äßßß"
</item>
</items>
</root>

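So it looks like the problem lies in how the file is read after all.

Found the error: the encoding must be specified when opening the file:

with open("test.xml", encoding='UTF-8') as ff:
    xml = bs4.BeautifulSoup(ff, "html5lib")

Since I am using Python 3, I assumed that encoding defaults to UTF-8, but it turns out to be system dependent, and on my system it is cp1252.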
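
The locale-dependent default can be checked directly, and the garbled output shown earlier can be reproduced without Beautiful Soup at all. The sketch below is only an illustration: the sample string and the temporary file are stand-ins (not the original test.xml), and cp1252 is hard-coded to mimic the Windows default reported here.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when no encoding argument is given
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

original = "ÜÜÜÜÜÜÜ 愛よ ÖöÖö"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")  # hypothetical stand-in file

    # Save the text as UTF-8, as the editor did with test.xml
    with open(path, "w", encoding="utf-8") as f:
        f.write(original)

    # Reading the UTF-8 bytes as cp1252 produces the mojibake from the question
    with open(path, encoding="cp1252") as f:
        garbled = f.read()

    # Reading with the matching encoding restores the text
    with open(path, encoding="utf-8") as f:
        correct = f.read()

print(garbled)
print(correct)
```

An alternative that avoids the locale default entirely is to open the file in binary mode ("rb") and let the parser decode the bytes itself.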
