Splitting stacked entities with the regex re.split in Python

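I am having trouble splitting run-together strings into more sensible parts:

For example, "MarieMüller" should become "Marie Müller"

So far I have used this, which works as long as no special characters occur: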

' '.join([a for a in re.split(ur'([A-Z][a-z]+)', ''.join(entity)) if a])

For example, it outputs "TinaTurner" -> "Tina Turner", but it does not work for "MarieMüller", where the output is: "MarieMüller" -> "Marie M \utf8 ller"

I then tried the regex \p{L} classes:

 ' '.join([a for a in re.split(ur'([\p{Lu}][\p{Ll}]+)', ''.join(entity)) if a])

But this produces strange results, for example: "JenniferLawrence" -> "Jennifer...

Can anyone help me out?

If you work with Unicode and need Unicode categories, you should consider using the PyPI regex module. It supports all Unicode categories:

>>> import regex
>>> p = regex.compile(ur'(?<=\p{Ll})(?=\p{Lu})')
>>> test_str = u"Tina Turner\nMarieM\u00FCller\nJacek\u0104cki"
>>> result = p.sub(u" ", test_str)
>>> result
u'Tina Turner\nMarie M\xfcller\nJacek \u0104cki'
      ^             ^                ^

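Here, the (?<=\p{Ll})(?=\p{Lu}) regex finds every position between a lowercase letter (\p{Ll}) and an uppercase letter (\p{Lu}), and regex.sub then inserts a space there. Note that if the pattern is a Unicode string (u prefix), the regex module automatically compiles the regex with the regex.UNICODE flag.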
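If installing the regex module is not an option, the same idea can be roughly sketched with the standard library alone. This is only an illustration, assuming Python 3 (where str is Unicode); the function name split_on_case_boundary is made up for this example, and it walks the string with unicodedata.category instead of using a regex:

```python
import unicodedata


def split_on_case_boundary(text):
    """Insert a space wherever a lowercase letter (category Ll) is
    directly followed by an uppercase letter (category Lu)."""
    pieces = []
    previous = ""
    for char in text:
        if (previous
                and unicodedata.category(previous) == "Ll"
                and unicodedata.category(char) == "Lu"):
            pieces.append(" ")
        pieces.append(char)
        previous = char
    return "".join(pieces)


print(split_on_case_boundary("MarieM\u00fcller"))  # Marie Müller
print(split_on_case_boundary("Jacek\u0104cki"))    # Jacek Ącki
```

Like the lookaround pattern, this leaves strings that already contain a space at the boundary (such as "Tina Turner") unchanged.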

It won't work for extended characters

You can use re.sub() for this. It is much simpler.

(?=(?!^)[A-Z])

To handle spaces

print(re.sub(r'(?<=[^\s])(?=(?!^)[A-Z])', ' ', '   Tina Turner'.strip()))

To handle consecutive capital letters

print(re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', '   TinaTXYurner'.strip()))


Regex breakdown

(?= #Lookahead to find all the positions of capital letters
 (?!^) #Ignore the first capital letter for substitution
 [A-Z]
)
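To see how the three patterns above differ, here is a small side-by-side sketch. It assumes Python 3 and the standard re module; the constant names are invented for this example only:

```python
import re

# Pattern 1: a position before any capital letter, except at the start.
BEFORE_CAPITAL = re.compile(r'(?=(?!^)[A-Z])')
# Pattern 2: same, but only if the previous character is not whitespace.
AFTER_NON_SPACE = re.compile(r'(?<=[^\s])(?=(?!^)[A-Z])')
# Pattern 3: only between an ASCII lowercase and an ASCII uppercase letter.
LOWER_THEN_UPPER = re.compile(r'(?<=[a-z])(?=[A-Z])')

print(BEFORE_CAPITAL.sub(' ', 'TinaTurner'))      # Tina Turner
print(BEFORE_CAPITAL.sub(' ', 'Tina Turner'))     # Tina  Turner (double space)
print(AFTER_NON_SPACE.sub(' ', 'Tina Turner'))    # Tina Turner (unchanged)
print(AFTER_NON_SPACE.sub(' ', 'TinaTXYurner'))   # Tina T X Yurner
print(LOWER_THEN_UPPER.sub(' ', 'TinaTXYurner'))  # Tina TXYurner

# [A-Z] is ASCII-only, so a capital such as U+0104 is not detected
# and the string comes back unchanged.
print(LOWER_THEN_UPPER.sub(' ', 'Jacek\u0104cki') == 'Jacek\u0104cki')  # True
```

The last line shows the limitation with extended characters: the ASCII ranges never see the non-ASCII capital, so no space is inserted.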

With a function built from Python's string operations instead of regular expressions, this should work:

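def split_combined_words(combined):
    # Nothing to split in an empty string.
    if not combined:
        return combined
    # Start with the first character; the loop handles the rest.
    separated = [combined[0]]
    for letter in combined[1:]:
        # Lowercase letters, and capitals that follow another capital,
        # stay attached to the current word.
        if letter.islower() or (letter.isupper() and separated[-1].isupper()):
            separated.append(letter)
        else:
            # Anything else starts a new word.
            separated.extend((" ", letter))
    return "".join(separated)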
