Splitting joined words at capital letters



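Using Python, I have to write a script that essentially "cleans" a text file of data. So far I've removed all the unwanted characters or replaced them with acceptable ones (for example, a dash - can be replaced with a space). Now I've reached the point where I have to separate words that are joined together. Here is a snippet of the first 15 lines of the text file: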

AccessibleComputing  Computer accessibility
AfghanistanHistory  History of Afghanistan
AfghanistanGeography  Geography of Afghanistan
AfghanistanPeople  Demographics of Afghanistan
AfghanistanCommunications  Communications in Afghanistan
AfghanistanMilitary  Afghan Armed Forces
AfghanistanTransportations  Transport in Afghanistan
AfghanistanTransnationalIssues  Foreign relations of Afghanistan
AssistiveTechnology  Assistive technology
AmoeboidTaxa  Amoeba
AsWeMayThink  As We May Think
AlbaniaHistory  History of Albania
AlbaniaPeople  Demographics of Albania
AlbaniaEconomy  Economy of Albania
AlbaniaGovernment  Politics of Albania

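What I want to do is separate the joined words at the point where a capital letter occurs. For example, I want the first line to look like this: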

Accessible Computing  Computer accessibility

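The script has to take a file as input and write the result to an output file. Here's what I have so far, and it doesn't work at all! (I'm not sure whether I'm even on the right track.)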

import re
input_file = open("C:\Users\Lucas\Documents\Python\pagelinkSample_10K_cleaned2.txt",'r')
output_file = open("C:\Users\Lucas\Documents\Python\pagelinkSample_10K_cleaned3.txt",'w')
for line in input_file:
    if line.contains('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'):
        newline = line.
output_file.write(newline)
input_file.close()
output_file.close()

I'd suggest using the following regular expression to split the words:

import re
input_file = 'input.txt'
output_file = 'output.txt'
# A capitalized word, or else any other run of non-whitespace characters.
p = re.compile(r'[A-Z][a-z]+|\S+')
with open(input_file, 'r') as f_in:
    with open(output_file, 'w') as f_out:
        for line in f_in:
            matches = p.findall(line)
            # Text mode translates '\n' to the platform's line ending on write.
            f_out.write(' '.join(matches) + '\n')

Assuming input.txt contains the text you pasted into your post, it will write:

Accessible Computing Computer accessibility
Afghanistan History History of Afghanistan
Afghanistan Geography Geography of Afghanistan
Afghanistan People Demographics of Afghanistan
Afghanistan Communications Communications in Afghanistan
Afghanistan Military Afghan Armed Forces
Afghanistan Transportations Transport in Afghanistan
Afghanistan Transnational Issues Foreign relations of Afghanistan
Assistive Technology Assistive technology
Amoeboid Taxa Amoeba
As We May Think As We May Think
Albania History History of Albania
Albania People Demographics of Albania
Albania Economy Economy of Albania
Albania Government Politics of Albania
...
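
To see what each half of the pattern does, here is a small in-memory sketch (the helper name `split_words` is made up for illustration). `[A-Z][a-z]+` picks out a capital followed by lowercase letters, and `\S+` catches any other run of non-whitespace, such as `of`. Note that joining with a single space also collapses the double space between the two columns.

```python
import re

# A capital followed by lowercase letters, or else any run of non-whitespace.
WORD = re.compile(r'[A-Z][a-z]+|\S+')

def split_words(line):
    """Hypothetical helper: return the words of `line` joined by single spaces."""
    return ' '.join(WORD.findall(line))

print(split_words('AsWeMayThink  As We May Think'))
# As We May Think As We May Think
```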

This isn't the best approach, but it's simple.

>>> from string import ascii_uppercase  # was `uppercase` in Python 2
>>> s = 'AccessibleComputing Computer accessibility'
>>> ' '.join(''.join(' ' + c if n and c in ascii_uppercase else c
...                  for n, c in enumerate(word))
...          for word in s.split())
'Accessible Computing Computer accessibility'
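
If the nested generator expression is hard to read, the same logic can be unrolled into plain loops. This is only a sketch, and the function name `split_on_capitals` is an invented one. Be aware that this approach puts a space before every capital that is not the first letter of a word, so a run of capitals gets split letter by letter.

```python
from string import ascii_uppercase

def split_on_capitals(text):
    """Hypothetical helper: put a space before each capital that is not word-initial."""
    words = []
    for word in text.split():
        chars = []
        for n, c in enumerate(word):
            # n == 0 is the first letter of the word, which needs no space.
            if n and c in ascii_uppercase:
                chars.append(' ')
            chars.append(c)
        words.append(''.join(chars))
    return ' '.join(words)

print(split_on_capitals('AccessibleComputing Computer accessibility'))
# Accessible Computing Computer accessibility
```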

By the way, this is how you should do the file reading and writing:

# Raw strings keep the backslashes in Windows paths from being read as escapes.
f_in = r"C:\Users\Lucas\Documents\Python\pagelinkSample_10K_cleaned2.txt"
f_out = r"C:\Users\Lucas\Documents\Python\pagelinkSample_10K_cleaned3.txt"
def func(line):
    processed_line = ... # your line processing function
    return processed_line
with open(f_in, 'r') as fin:
    with open(f_out, 'w') as fout:
        for line in fin:
            fout.write(func(line))

You can do:

re.sub(r'(?P<end>[a-z])(?P<start>[A-Z])', r'\g<end> \g<start>', line)

This inserts a space between each adjacent lowercase/uppercase letter pair (assuming you only have English characters).
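
Putting the substitution together with the file handling might look roughly like the sketch below. The names `split_camel` and `clean_file` are made up for illustration, and the file names in the usage line are placeholders. Because `re.sub` only touches the lowercase/uppercase boundaries, the double space between the two columns is left alone, which matches the output asked for in the question.

```python
import re

# A lowercase letter immediately followed by an uppercase one.
BOUNDARY = re.compile(r'(?P<end>[a-z])(?P<start>[A-Z])')

def split_camel(line):
    """Hypothetical helper: insert a space at each lowercase/uppercase boundary."""
    return BOUNDARY.sub(r'\g<end> \g<start>', line)

def clean_file(src, dst):
    """Hypothetical helper: copy src to dst, splitting joined words line by line."""
    with open(src, 'r') as fin, open(dst, 'w') as fout:
        for line in fin:
            fout.write(split_camel(line))  # each line keeps its own newline

print(split_camel('AccessibleComputing  Computer accessibility'))
# Accessible Computing  Computer accessibility
```

It would be called as `clean_file('input.txt', 'output.txt')`. One caveat: the substitution applies to the whole line, so a lowercase/uppercase boundary in the second column would be split as well.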
