How to find all uppercase and lowercase occurrences of a unicode character in Python using a regular expression and re.sub

Here is my code in a django view (simplified on purpose) (Python 2.7):

# -*- coding: utf-8 -*-
from django.shortcuts import render
import re
def index(request):
    found_verses = [] 
    pattern = re.compile('ю')
    with open('d.txt', 'r') as doc:
        for line in doc:
            found = pattern.search(line)
            if found:
                modified_line = pattern.sub('!' + '\g<0>' + '!', line)
                found_verses.append(modified_line)
    context = {'found_verses': found_verses}
    return render(request, 'myapp/index.html', context)

d.txt (also utf-8) contains this line (simplified on purpose):

1. Я сказал Юлию одному.

When rendered, the above gives me the expected result:

1. Я сказал Юли!ю! одному.

When I change the pattern to the uppercase letter, pattern = re.compile('Ю'), it also gives me the expected result:

1. Я сказал !Ю!лию одному.

But when I change it to a character class, pattern = re.compile('[юЮ]') or pattern = re.compile('[Юю]') or pattern = re.compile('[ю]') or pattern = re.compile('[Ю]'), it gives me nothing. What I want to get is:

1. Я сказал !Ю!ли!ю! одному.

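Please help me get this result. I have been struggling for more than a day, trying different configurations such as pattern = re.compile('[юЮ]', re.UNICODE) and pattern = re.compile('ю', re.UNICODE|re.I) and countless others, all in vain.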

Use unicode strings.

import io

with io.open('d.txt', 'r', encoding='utf-8') as doc:
   ...

And compile the pattern from a unicode literal:

pattern = re.compile(u'[юЮ]', re.UNICODE)
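Putting the two pieces above together, a minimal sketch might look like the following. It assumes a temporary file standing in for d.txt (the path and sample line here are only for illustration) and leaves out the Django view. The code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

# Hypothetical stand-in for d.txt, created only so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as doc:
    doc.write(u'1. Я сказал Юлию одному.\n')

# Unicode pattern: the class now holds two characters, not four bytes.
pattern = re.compile(u'[юЮ]', re.UNICODE)

found_verses = []
# io.open with an encoding yields unicode lines instead of byte strings.
with io.open(path, 'r', encoding='utf-8') as doc:
    for line in doc:
        if pattern.search(line):
            # \g<0> re-inserts the whole match between the two markers.
            found_verses.append(pattern.sub(u'!\\g<0>!', line))

os.remove(path)
```

Both the pattern and the text are unicode here, so each member of the character class is one whole letter and both Ю and ю get wrapped.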

Just a guess, but try this:

with open('d.txt', 'rb') as doc:  # I guess you probably don't need the b flag for utf8, but meh
    for line in doc:
        line = line.decode("utf8")
        ...
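As a rough sketch of how the decoded line could then be used, the snippet below decodes a byte string (standing in for a line read in 'rb' mode) and applies a case-insensitive unicode pattern, which is an alternative to listing both letters in a character class. The sample bytes and variable names are assumptions for illustration only.

```python
# -*- coding: utf-8 -*-
import re

# Bytes as they would arrive from a file opened in 'rb' mode (sample data).
raw_line = u'1. Я сказал Юлию одному.'.encode('utf-8')

# Decode once, then work with unicode text from here on.
line = raw_line.decode('utf-8')

# With a unicode pattern, IGNORECASE | UNICODE matches both ю and Ю.
pattern = re.compile(u'ю', re.UNICODE | re.IGNORECASE)
marked = pattern.sub(u'!\\g<0>!', line)
```

Remember to encode again only at the output boundary, if the consumer needs bytes.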

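The problem is probably that you are using regular strings rather than unicode strings. The re library needs to know how to handle the bytes in the RE. Try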

re.compile(u'ю')

(Note that this is what @Ignacio does in his answer.)
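To illustrate why the byte-string character class misbehaves, here is a small sketch that builds the byte-level pattern explicitly. In UTF-8, ю and Ю are two bytes each, so a class made from them matches single bytes rather than letters; a plain 'ю' pattern still works on bytes because it matches the two-byte sequence as a whole. The variable names are illustrative assumptions.

```python
# -*- coding: utf-8 -*-
import re

raw = u'1. Я сказал Юлию одному.'.encode('utf-8')

# Equivalent of re.compile('[юЮ]') on a Python 2 byte string:
# the class contains the four bytes D1 8E D0 AE, not two letters.
byte_pattern = re.compile(b'[' + u'юЮ'.encode('utf-8') + b']')

# Every matching byte gets wrapped, including the D0/D1 lead bytes
# shared by other Cyrillic letters, which splits characters apart.
broken = byte_pattern.sub(b'!\\g<0>!', raw)

try:
    broken.decode('utf-8')
    still_valid = True
except UnicodeDecodeError:
    still_valid = False  # the result is no longer valid UTF-8
```

The substituted bytes are no longer valid UTF-8, which is consistent with nothing useful being rendered in the question.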
