Join lines in a Unicode file, except where the next line starts with a number

I have a UTF-8 .txt file, greek.txt:

Blessed is a Man
1. μακάριος
ανήρ
2. ότι
γινώσκει
κύριος

I want to get greek_r.txt:

Blessed is a Man
1. μακάριος ανήρ
2. ότι γινώσκει κύριος

I used:

# -*- coding: utf-8 -*-
import io
import re
f1 = io.open('greek.txt','r',encoding='utf8')
f2 = io.open('greek_r.txt','w',encoding='utf8')
for line in f1:
    f2.write(re.sub(r'\n((?=^[^\d]))', r'\1', line))
f1.close()
f2.close()

But it doesn't work. Any ideas?

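You are reading the input file line by line, so your regex cannot "see" across lines: \n is the last character of each line, so nothing follows it within that line. The lookahead (?=^[^\d]) is also meaningless there, because without re.M the ^ anchor requires the start of the string, followed by a character other than a digit, and that can never hold right after a \n.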
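To see the line-by-line problem concretely, here is a minimal sketch, assuming Python 3 and a short in-memory string standing in for the file contents. The simplified pattern `\n(\D)` and the variable names are illustrative, not part of the question's code.

```python
import re

# Stand-in for two lines of the file; the real input comes from greek.txt.
text = "1. μακάριος\nανήρ\n"
pattern = re.compile(r'\n(\D)')

# Processed line by line, every chunk ends with its own "\n",
# so there is never a character after it for (\D) to match.
per_line_hits = [bool(pattern.search(line)) for line in text.splitlines(True)]

# Processed as one string, the "\n" before "ανήρ" is followed by a letter.
whole_hit = pattern.search(text)

print(per_line_hits)          # [False, False]
print(whole_hit is not None)  # True
```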

Use something like this instead:

import re, io
with io.open('greek.txt','r',encoding='utf8') as f1:
    with io.open('greek_r.txt','w',encoding='utf8') as f2:
        # Read the whole file at once so the regex can match across line breaks
        f2.write(re.sub(r'\r?\n(\D)', r' \1', f1.read()))

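The \r? is there to match an optional CR symbol (in case the line breaks are Windows-style). r'\r?\n(\D)' can be replaced with r'(?u)\r?\n([^\W\d_])' to match only line breaks that are followed by a letter ([^\W\d_] matches any character other than a non-word character, a digit, or _, i.e. any letter). (?u) is the inline version of the re.U modifier; it is needed to match any Unicode letter in Python 2.x (in Python 3 it is the default for str patterns).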

Output:

Blessed is a Man
1. μακάριος ανήρ
2. ότι γινώσκει κύριος
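The two patterns can be compared without touching any files. The following is a small sketch, assuming Python 3; `join_lines`, `JOIN_NON_DIGIT`, and `JOIN_LETTER` are hypothetical names introduced here for illustration, and the sample string mirrors greek.txt.

```python
import re

# Join a line break onto the previous line when the next character is a non-digit.
JOIN_NON_DIGIT = re.compile(r'\r?\n(\D)')
# Stricter variant: join only when the next character is a letter.
JOIN_LETTER = re.compile(r'\r?\n([^\W\d_])')

def join_lines(text, pattern=JOIN_NON_DIGIT):
    """Replace each matching line break with a space, keeping the captured character."""
    return pattern.sub(r' \1', text)

sample = "Blessed is a Man\n1. μακάριος\nανήρ\n2. ότι\nγινώσκει\nκύριος\n"

print(join_lines(sample))
# The variants differ when a line starts with punctuation:
print(join_lines("1. foo\n- bar"))               # 1. foo - bar
print(join_lines("1. foo\n- bar", JOIN_LETTER))  # left unchanged, still two lines
```

On the sample text both patterns should give the same result; the letter-only variant is the more conservative choice if some lines may begin with punctuation that should stay on its own line.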
