Finding contiguous ranges in a list of Unicode code points



I have a list of Unicode code points, along these lines (not an actual set, just an illustration of the problem):

uni050B
uni050C
uni050D
uni050E
uni050F
uni0510
uni0511
uni0512
uni0513
uni1E00
uni1E01
uni1E3E
uni1E3F
uni1E80
uni1E81
uni1E82
uni1E83
uni1E84
uni1E85
uni1EA0
and so forth…

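I need to find the unicode-range for these. Some parts of the set are contiguous and some points are missing, so the range is not simply U+050B-1EA0. Is there a reasonable way to extract those contiguous "sub-ranges"?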

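I don't know of anything "off the shelf", but it is simple enough to compute. The following finds runs of consecutive numbers and builds the unicode-range using Python: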

import re

def build_range(uni):
    '''Pass a list of sorted positive integers to include in the unicode-range.
    '''
    # Work on a copy so the caller's list is not modified; the trailing -1
    # sentinel prevents having to special case the last element.
    uni = list(uni) + [-1]
    start, uni = uni[0], uni[1:]
    current = start
    strings = []
    for u in uni:
        if u == current:  # in case of duplicates
            continue
        if u == current + 1:  # in a consecutive range...
            current = u
        elif start == current:  # single element
            strings.append(f'U+{current:X}')
            start = current = u
        else:  # range
            strings.append(f'U+{start:X}-{current:X}')
            start = current = u

    return 'unicode-range: ' + ', '.join(strings) + ';'

data = '''
uni050B
uni050C
uni050D
uni050E
uni050F
uni0510
uni0511
uni0512
uni0513
uni1E00
uni1E01
uni1E3E
uni1E3F
uni1E80
uni1E81
uni1E82
uni1E83
uni1E84
uni1E85
uni1EA0'''
# parse out the hexadecimal values into an integer list
uni = sorted([int(x,16) for x in re.findall(r'uni([0-9A-F]{4})',data)])
print(build_range(uni))

Output:

unicode-range: U+50B-513, U+1E00-1E01, U+1E3E-1E3F, U+1E80-1E85, U+1EA0;
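
A more compact variant, offered only as a sketch, groups the values with `itertools.groupby`. It relies on the observation that within a run of consecutive integers the difference between each value and its position in the sorted list stays constant. The names `contiguous_runs` and `format_unicode_range` are made up for this illustration, and the sample list is a shortened version of the data above.

```python
from itertools import groupby


def contiguous_runs(code_points):
    """Yield (first, last) pairs for each run of consecutive integers."""
    ordered = sorted(set(code_points))  # drop duplicates, ensure ascending order
    # Within a run of consecutive values, value - index stays constant,
    # so that difference can serve as the grouping key.
    for _, group in groupby(enumerate(ordered), key=lambda pair: pair[1] - pair[0]):
        run = [value for _, value in group]
        yield run[0], run[-1]


def format_unicode_range(code_points):
    """Build a CSS unicode-range declaration from an iterable of integers."""
    parts = [
        f'U+{first:X}' if first == last else f'U+{first:X}-{last:X}'
        for first, last in contiguous_runs(code_points)
    ]
    return 'unicode-range: ' + ', '.join(parts) + ';'


sample = [0x50B, 0x50C, 0x50D, 0x1E00, 0x1E01, 0x1EA0]
print(format_unicode_range(sample))
# prints: unicode-range: U+50B-50D, U+1E00-1E01, U+1EA0;
```

Unlike the loop with a sentinel, this version sorts and de-duplicates its input itself, so the caller does not need to pass a sorted list.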
