Python 2 regex operations from the re package don't handle UTF-8 encoded diacritics

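I am performing a series of regex operations on a utf-8 encoded text file. The file contains a list of lines made up of alphabetic and non-alphabetic characters, including non-Latin characters with diacritics. Here is a snippet from the file (note the non-Latin characters):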

oro[=]sia[=]łeś
oszust[=]ką

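My script first opens the text file, reads each line, and strips unneeded characters. My regex operations then capture a word that matches a specified pattern and insert the non-alphabetic sequence [=] at the appropriate position. Here is a snippet of my script: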

# -*- coding: utf-8 -*-
import re
with open(r'...input.txt', "rb") as input, open(r'...output.txt', "wb") as output:
    for line in input:
        word = line.strip('\r\n')
        # Rule 1: ^VCV -> V[=]CV
        match = re.match('^[AEIOUYaeiouy]([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy].*(.*[=].*)*', word)
        result = match.group() if match else None
        if result == word:
            word = re.sub('(?<=^[AEIOUYaeiouy])(?=([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy])', '[=]', word)
        outLine = word + "\n"
        output.write(outLine)

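The rule seems to fail on input whose rule environment involves non-Latin characters with diacritics. For example, when the input to Rule 1 above is 'oszust[=]ką', re.match.group() re-encodes it as 'oszust[=]k\xc4'. Converting the last character changes the environment, and with it the input that the subsequent regex operations match.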

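The problem clearly lies with the utf-8 encoding, because the script handles oro[=]sia[=]łeś, where the rule environment contains no characters with diacritics, just fine. After reading around this site, I tried re-encoding the input to utf-8 so that it would conform to the environment of the regex operations, but I get this error: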

'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)

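If I am trying to encode to utf-8, why does the error mention ascii? How can I change the encoding so that it fits the environment the regex operations need?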

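When processing Unicode characters, use Unicode strings. Convert to Unicode strings at the I/O boundaries of your program. If possible, switch to the latest Python 3, which handles Unicode better.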
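
As for why the error mentions ascii: in Python 2, calling .encode('utf-8') on a byte string (str) makes Python first decode it to unicode implicitly, and that implicit step uses the default ASCII codec, which fails on the first non-ASCII byte. The sketch below reproduces the same message with an explicit decode so that it also runs on Python 3. The sample word comes from the question; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes of a word from the question, as they arrive from a file opened in "rb" mode.
raw = u'oszust[=]ką'.encode('utf-8')

# Python 2 performs this ASCII decode implicitly when raw.encode('utf-8') is called.
try:
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding with the codec the file was actually written in succeeds.
text = raw.decode('utf-8')

print(message)
```

Decoding once when the file is read, as the script below does with io.open, avoids the implicit step entirely.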

# -*- coding: utf-8 -*-
import re
import io
with io.open('input.txt', 'r', encoding='utf8') as input, \
    io.open('output.txt', 'w', encoding='utf8') as output:
    for line in input:
        word = line.strip()  # this will remove all leading/trailing whitespace.
        # Rule 1: ^VCV -> V[=]CV
        match = re.match(u'^[AEIOUYaeiouy]([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy].*(.*[=].*)*', word)
        result = match.group() if match else None
        if result == word:
            word = re.sub(u'(?<=^[AEIOUYaeiouy])(?=([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy])', u'[=]', word)
        outLine = word + u'\n'
        output.write(outLine)
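
To illustrate why byte strings and character classes do not mix, the following sketch compares a byte pattern with a Unicode pattern on a word from the question. In UTF-8, ą is stored as two bytes (\xc4\x85), so a character class written over bytes can only match one of them at a time. The shortened pattern k[aąeę] is an assumption made for this demonstration, not one of the original rules, and it is not claimed to be the exact point at which the original script went wrong.

```python
# -*- coding: utf-8 -*-
import re

text_word = u'oszust[=]ką'
byte_word = text_word.encode('utf-8')

# Hypothetical mini-pattern: 'k' followed by one vowel from a small class.
text_pattern = re.compile(u'k[aąeę]')
# The same pattern as UTF-8 bytes: the class now lists individual bytes, not letters.
byte_pattern = re.compile(u'k[aąeę]'.encode('utf-8'))

text_match = text_pattern.search(text_word).group()
byte_match = byte_pattern.search(byte_word).group()

print(repr(text_match))
print(repr(byte_match))
```

The byte pattern stops halfway through ą and returns k\xc4, whereas the Unicode pattern returns the whole letter.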
