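Regex for a special character (S with a line on top)

I was trying to write a regex in Python to replace every non-ASCII character with an underscore, but if one of the characters is "S̄" (an "S" with a line on top), I get an extra "S" in the output. Is there a way to account for this character? I believe it is a valid UTF-8 character, but it is not ASCII.

Here is the code: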

import re
line = "ra*ndom wordS̄"
print(re.sub('[\W]', '_', line))

I would expect it to output:

ra_ndom_word_

But instead I get:

ra_ndom_wordS__

Python behaves this way because you are actually looking at two distinct characters: an S, followed by a combining macron, U+0304.
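To check this yourself, here is a small sketch using the standard unicodedata module that lists each code point in the string (the variable name s is only for illustration):

```python
import unicodedata

s = "S\u0304"  # "S" followed by U+0304, rendered as a single glyph

# One visible glyph, but two code points
print(len(s))  # 2

for ch in s:
    # combining() returns a non-zero class for combining marks
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
# U+0053 LATIN CAPITAL LETTER S 0
# U+0304 COMBINING MACRON 230
```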

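In the general case, if you want to replace a sequence of a base character and its combining characters with an underscore, try something like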

import unicodedata

def cleanup(line):
    cleaned = []
    for char in line:
        if unicodedata.combining(char):
            # Combining mark: the base character before it becomes "_"
            if cleaned:
                cleaned[-1] = "_"
            else:
                cleaned.append("_")
            continue
        if unicodedata.category(char) not in ("Ll", "Lu"):
            # Anything that is not a lowercase or uppercase letter
            char = "_"
        cleaned.append(char)
    return ''.join(cleaned)

By the way, \W does not need square brackets around it; it is already a regex character class.
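A quick way to confirm this is to run both spellings on the string from the question. The last line is a side note: on UTF-8 bytes the macron occupies two bytes, which may be why the question shows two trailing underscores (this is an assumption; the question does not say which Python version or string type was used).

```python
import re

line = "ra*ndom wordS\u0304"  # same string as in the question

# With and without the brackets, the result is identical
print(re.sub(r'[\W]', '_', line))  # ra_ndom_wordS_
print(re.sub(r'\W', '_', line))    # ra_ndom_wordS_

# On UTF-8 bytes the macron is two bytes, so it yields two underscores
print(re.sub(rb'\W', b'_', line.encode('utf-8')))  # b'ra_ndom_wordS__'
```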

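Python's re module lacks support for important Unicode properties; if you really want to use a regular expression for this specifically, the third-party regex library has proper support for Unicode categories.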

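"Ll" is lowercase letter and "Lu" is uppercase letter. There are other Unicode L categories, so you may want to tweak this to suit your needs (perhaps unicodedata.category(char).startswith("L")?); see also https://www.fileformat.info/info/unicode/category/index.htm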
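As a rough illustration of that tweak, the sketch below prints the categories of a few sample characters and then applies a variant of the loop that keeps every "L" category. The function name cleanup_letters is made up for this example, and whether you want to keep non-ASCII letters such as Ä at all depends on your requirements.

```python
import unicodedata

# Categories of a few sample characters: Ll, Lu, Lu, Lo, Nd, Po, Mn
for ch in ("a", "S", "\u00c4", "\u4e2d", "1", "*", "\u0304"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))

def cleanup_letters(line):
    """Hypothetical variant: keep any Unicode letter, replace the rest."""
    cleaned = []
    for char in line:
        if unicodedata.combining(char):
            # A combining mark turns the base character before it into "_"
            if cleaned:
                cleaned[-1] = "_"
            else:
                cleaned.append("_")
            continue
        if not unicodedata.category(char).startswith("L"):
            char = "_"
        cleaned.append(char)
    return "".join(cleaned)

print(cleanup_letters("ra*ndom wordS\u0304"))  # ra_ndom_word_
```

With this variant, precomposed letters such as U+00C4 (Ä) and letters from other scripts (category "Lo") are kept, while digits, punctuation, and whitespace still become underscores.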

You can use the following script to get the desired output:

import re
line="ra*ndom wordS̄"
print(re.sub('[^[-~]+]*','_',line))

Output:

ra_ndom_word_

This approach also works for other non-ASCII characters:

import re
line="ra*ndom ¡¢£Ä wordS̄.  another non-ascii: Ä and Ï"
print(re.sub('[^[-~]+]*','_',line))

Output:

ra_ndom_word_another_non_ascii_and_
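One thing to keep in mind with this pattern: the range [-~ only covers the code points from [ (0x5B) to ~ (0x7E), so uppercase ASCII letters and digits are replaced as well, which is why the S in front of the macron disappears. If that is not what you want, one possible alternative is sketched below. It assumes the combining marks you care about all fall in the Combining Diacritical Marks block (U+0300 to U+036F); the pattern and the name PATTERN are illustrative and not part of the answer above.

```python
import re

# Same pattern as above with the inner brackets escaped:
# uppercase letters and digits are replaced too
print(re.sub(r'[^\[-~]+\]*', '_', "Word2 ok"))  # _ord_ok

# Assumed alternative: a base character followed by combining marks,
# or any single character that is not an ASCII letter or digit
PATTERN = re.compile(r'.[\u0300-\u036f]+|[^A-Za-z0-9]')

print(PATTERN.sub('_', "ra*ndom wordS\u0304"))  # ra_ndom_word_
print(PATTERN.sub('_', "Word2 \u00c4"))         # Word2__
```

Unlike the pattern above, this sketch emits one underscore per replaced character rather than one per run.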
