Counting the number of bytes of Unicode characters in Python

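I am writing a Python script that reads Unicode characters from a file and inserts them into a database. Each string may occupy at most 30 bytes. How can I compute the size of a string before inserting it into the database?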

If you need to know the number of bytes (the file size), just call
bytes_count = os.path.getsize(filename)

If you want to know how many bytes a Unicode character may require, the answer depends on the character encoding:

>>> print(u"\N{EURO SIGN}")
€
>>> u"\N{EURO SIGN}".encode('utf-8') # 3 bytes
'\xe2\x82\xac'
>>> u"\N{EURO SIGN}".encode('cp1252') # 1 byte
'\x80'
>>> u"\N{EURO SIGN}".encode('utf-16le') # 2 bytes
'\xac '
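
For the 30-byte limit in the question, the number that matters is the length of the encoded string, not the number of characters. A minimal sketch, assuming the database column stores UTF-8; the function names and the default limit are illustrative and not part of the original answer:

```python
def encoded_size(text, encoding="utf-8"):
    """Return how many bytes `text` occupies once encoded."""
    return len(text.encode(encoding))


def fits(text, max_bytes=30, encoding="utf-8"):
    """Check whether `text` fits into `max_bytes` bytes (30 comes from the question)."""
    return encoded_size(text, encoding) <= max_bytes
```

If the database uses a different encoding, pass that encoding instead; the same string can have a different size in each one.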

To find out how many Unicode characters a file contains, you do not need to read the whole file into memory at once (in case it is a large file):

with open(filename, encoding=character_encoding) as file:
    unicode_character_count = sum(len(line) for line in file)

If you are using Python 2, add from io import open at the top.
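
The same line-by-line approach can be used to check each string against the byte limit before inserting it. This is only a sketch: it assumes one string per line, UTF-8 storage, and a hypothetical helper name oversized_lines.

```python
def oversized_lines(lines, max_bytes=30, encoding="utf-8"):
    """Yield (line_number, byte_count) for lines that exceed max_bytes when encoded."""
    for number, line in enumerate(lines, start=1):
        # The trailing newline is not part of the stored string.
        size = len(line.rstrip("\n").encode(encoding))
        if size > max_bytes:
            yield number, size
```

Any iterable of text lines works here, including a file object opened with the right encoding.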

The exact count for the same human-readable text may depend on Unicode normalization (different environments may use different settings):

>>> import unicodedata
>>> print(u"\u212b")
Å
>>> unicodedata.normalize("NFD", u"\u212b") # 2 Unicode codepoints
u'A\u030a'
>>> unicodedata.normalize("NFC", u"\u212b") # 1 Unicode codepoint
u'\xc5'
>>> unicodedata.normalize("NFKD", u"\u212b") # 2 Unicode codepoints
u'A\u030a'
>>> unicodedata.normalize("NFKC", u"\u212b") # 1 Unicode codepoint
u'\xc5'

As the example shows, a single character (Å) can be represented by several Unicode code points.
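
Normalization also changes the encoded size, which matters when a byte limit applies. A small sketch, assuming UTF-8 as the target encoding (the variable names are illustrative):

```python
import unicodedata

angstrom = "\u212b"  # ANGSTROM SIGN, 3 bytes in UTF-8 before normalization

# UTF-8 byte length of the same visible character under each normalization form.
sizes = {
    form: len(unicodedata.normalize(form, angstrom).encode("utf-8"))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
```

Here the composed forms (NFC, NFKC) take 2 bytes and the decomposed forms (NFD, NFKD) take 3, so normalizing before measuring gives a more predictable result.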

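To find out how many user-perceived characters a file contains, you can use the \X regular expression (it matches extended grapheme clusters):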

import regex # $ pip install regex
with open(filename, encoding=character_encoding) as file:
    character_count = sum(len(regex.findall(r'\X', line)) for line in file)

Example:

>>> import regex
>>> char = u'A\u030a'
>>> print(char)
Å
>>> len(char)
2
>>> regex.findall(r'\X', char)
['Å']
>>> len(regex.findall(r'\X', char))
1

Suppose you are reading the Unicode characters from the file into a variable named byteString. You can then do the following:

unicode_string = byteString.decode("utf-8")
# Number of Unicode code points; len(byteString) is the number of bytes.
print(len(unicode_string))
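
If a string turns out to be longer than the limit, one option is to truncate it at a character boundary. The following is a sketch under the assumption that the data is stored as UTF-8; truncate_utf8 is a hypothetical helper, not something from the answers above:

```python
def truncate_utf8(text, max_bytes=30):
    """Cut `text` so its UTF-8 encoding fits in max_bytes without splitting a code point."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing bytes may leave an incomplete multi-byte sequence at the end;
    # errors="ignore" drops that partial sequence when decoding.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

Note that this keeps code points intact but may still separate a base letter from a following combining mark, so normalizing to NFC first is probably worthwhile.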
