UTF-16 representation of arbitrarily long binary strings



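I have binary strings that I want to store as compactly as possible (in terms of disk space). They have the form "011010010110100110010101" and can be anywhere from 1 to ~1000 bits long. Storing them as "TEXT" is wasteful. I want to be able to retrieve them and convert them back to the original binary strings. SQLite's TEXT type can be UTF-8, UTF-16BE or UTF-16LE.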

Example:

"0110101011011000001101010010111001101010011101101010100011001101001001010110010010010101010100"

would be converted to:

'櫘㔮橶꣍锔'

for storage in the database, and then converted back to the original binary string.

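I have tried many solutions, but either the resulting binary string is malformed, or the characters come out in a form such as '\xff\xfe0\x001', which is still wasteful, or the approach only works for lengths that are multiples of 8 or 16 bits.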

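Convert the binary string to an integer and keep track of the number of bits. Store the resulting bytes as a BLOB rather than as an INTEGER: a SQLite INTEGER holds at most 8 bytes (a 64-bit signed value), and a string of thousands of bits would exceed that.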
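To illustrate why the bit count has to be stored next to the value, here is a minimal sketch. The helper name `restore` is hypothetical and only used for illustration; it simply zero-pads the integer back to the recorded width.

```python
def restore(value, bit_count):
    """Format an integer as a binary string, zero-padded to bit_count digits."""
    return f'{value:0{bit_count}b}'

# Two different bit strings collapse to the same integer value.
a = '000011'
b = '11'
same_value = int(a, 2) == int(b, 2)
print(same_value)                    # True

# Without the length, the leading zeros of `a` cannot be recovered.
print(f'{int(a, 2):b}')              # 11

# With the recorded length, both strings round-trip exactly.
print(restore(int(a, 2), len(a)))    # 000011
print(restore(int(b, 2), len(b)))    # 11
```

This is also why the approach is not limited to multiples of 8 or 16 bits: the padding bits in the last byte are discarded when the string is rebuilt from the stored length.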

Example:

import sqlite3
import math

# store string of binary data packed into bytes
def insert(cur, binary):
    bit_count = len(binary)
    # round up so a partial final byte is still stored
    byte_count = math.ceil(bit_count / 8)
    byte_data = int(binary, 2).to_bytes(byte_count, 'little')
    cur.execute("insert into test values(?, ?)", (bit_count, byte_data))

# set up database
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('create table test(bit_count, binary)')
con.commit()

# insert some strings
for binary in ['101', '000011', '101010111011011011110001', '0110' * 250]:
    insert(cur, binary)

# retrieve, unpack and display
res = cur.execute('select * from test')
for bit_count, byte_data in res:
    integer = int.from_bytes(byte_data, 'little')
    # zero-pad to the stored bit count to restore leading zeros
    binary = f'{integer:0{bit_count}b}'
    print(binary)

Output:

101
000011
101010111011011011110001
0110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110
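As a rough check of the space saving, the packing step can be pulled out of the database code and compared against the size of the same string stored as UTF-8 text. This is only a sketch: the function names `pack_bits` and `unpack_bits` are made up for illustration, and the figures cover the payload only, not SQLite's per-row overhead or the extra column for the bit count.

```python
import math

def pack_bits(binary):
    """Return (bit_count, packed bytes) for a string of '0' and '1' characters."""
    bit_count = len(binary)
    byte_count = math.ceil(bit_count / 8)
    return bit_count, int(binary, 2).to_bytes(byte_count, 'little')

def unpack_bits(bit_count, byte_data):
    """Rebuild the original bit string, including leading zeros."""
    return f"{int.from_bytes(byte_data, 'little'):0{bit_count}b}"

binary = '0110' * 250                      # 1000 bits
bit_count, byte_data = pack_bits(binary)

text_size = len(binary.encode('utf-8'))    # one byte per '0' or '1' character
blob_size = len(byte_data)                 # eight bits per byte
print(text_size, blob_size)                # 1000 125

round_trip = unpack_bits(bit_count, byte_data)
print(round_trip == binary)                # True
```

Packed this way, the payload takes one eighth of the space of the text form, rounded up to a whole byte.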
