These are the index positions where I need to insert spaces:
s = [2, 8, 15, 23, 28, 43, 47, 55, 63, 72, 79, 82, 89, 97, 102, 112, 120, 125, 131, 141, 148, 156, 163, 167, 180, 188, 193, 210, 222, 227]
and this is the data I have:
d = 'CCCarbonCopyCAIComputerAidedInstructionCDMACodeDivisionMultipleAccessCRTCathodeRayTubeCADComputerAidedDesignCADDComputerAidedDesignDraftingCDCompactDiskCDRWCompactDiskRewritableCAMComputerAidedManufacturingCROMComputerizedRangeMotionCDROMCompactDiskReadOnlyMemory'
My overall goal is to split the string so that the short forms and the long forms are separated.
For example, this is the output I am trying to get:
CC Carbon Copy CAI Computer Aided Instruction
and so on.
I computed the indices like this:
s = []
for i in range(0, len(d)):
    if d[i].isupper() and d[i+1].islower() and d[i+2].islower():
        s.append(i)
When I insert a space using a single index on its own, it works:
d[0:s[0]] + ' ' + d[s[0]:]
I get the following, which is correct:
'CC CarbonCopyCAIComputerAidedInstructionCDMACodeDivisionMultipleAccessCRTCathodeRayTube
However, when I try to iterate over the indices, I get a "list index out of range" error:
temp = []
for i in s:
    print(i)
    temp.append(d[0:s[i]] + ' ' + d[s[i]:])
Traceback (most recent call last):
File "<input>", line 4, in <module>
IndexError: list index out of range
Instead of storing the indices, you can directly store the characters to print, like this:
l = []
for i in range(0, len(d)):
    if d[i].isupper() and d[i + 1].islower() and d[i + 2].islower():
        l.append(" ")
    l.append(d[i])
output = "".join(l)
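However, this code still has some problems. It misses cases such as "InstructionCDMA" and "AccessCRT", where a long form is followed directly by the next short form. It can also raise IndexError: string index out of range near the end of the string.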
...
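To fix both problems, you can do the following:
l = []
for i in range(0, len(d) - 2):
    # Space before an uppercase letter that starts a word (two lowercase letters follow).
    if d[i].isupper() and d[i + 1].islower() and d[i + 2].islower():
        l.append(" ")
    l.append(d[i])
    # Space after a lowercase letter that is followed by a short form (two uppercase letters).
    if d[i].islower() and d[i + 1].isupper() and d[i + 2].isupper():
        l.append(" ")
# The loop stops two characters early, so append the remainder.
l.append(d[-2:])
output = "".join(l)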
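The same scan can be written so that it never indexes past the end of the string. The following is only a sketch: the helper name add_spaces is made up for illustration, and it relies on slicing (which returns a shorter string instead of raising) to look at the next two characters.

```python
def add_spaces(text):
    """Insert a space at every short-form/long-form boundary (hypothetical helper)."""
    out = []
    for i, ch in enumerate(text):
        # Up to two following characters; slicing never raises IndexError.
        nxt = text[i + 1:i + 3]
        # An uppercase letter followed by two lowercase letters starts a word.
        if i > 0 and ch.isupper() and len(nxt) == 2 and nxt.islower():
            out.append(" ")
        out.append(ch)
        # A lowercase letter followed by two uppercase letters ends a long form.
        if ch.islower() and len(nxt) == 2 and nxt.isupper():
            out.append(" ")
    return "".join(out)


print(add_spaces('CCCarbonCopyCAIComputerAidedInstruction'))
# CC Carbon Copy CAI Computer Aided Instruction
```

Because every character is appended inside the loop, nothing at the end of the string is lost.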
You should first build a list of all the words from the indices. Then the join function will do the job.
s = [2, 8, 15, 23, 28, 43, 47, 55, 63, 72, 79, 82, 89, 97, 102, 112, 120, 125, 131, 141, 148, 156, 163, 167, 180, 188, 193, 210, 222, 227]
d = 'CCCarbonCopyCAIComputerAidedInstructionCDMACodeDivisionMultipleAccessCRTCathodeRayTubeCADComputerAidedDesignCADDComputerAidedDesignDraftingCDCompactDiskCDRWCompactDiskRewritableCAMComputerAidedManufacturingCROMComputerizedRangeMotionCDROMCompactDiskReadOnlyMemory'
last_elem = 0
lst = []
for el in s:
    lst.append(d[last_elem:el])
    last_elem = el
lst.append(d[last_elem:])  # keep the text after the last index
' '.join(lst)
Output: 'CC Carbon CopyCAI Computer Aided InstructionCDMA Code Division Multiple AccessCRT Cathode Ray TubeCAD Computer Aided DesignCADD Computer Aided Design DraftingCD Compact DiskCDRW Compact Disk RewritableCAM Computer Aided ManufacturingCROM Computerized Range MotionCDROMCompactDiskReadOnlyMemory'
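The slicing idea above can also be sketched by pairing each boundary with the next one. The helper name split_at is an assumption for illustration, and the example uses only the start of the string and the first five indices from the question to stay short.

```python
def split_at(text, indices):
    """Cut text at each index and return all pieces, including the tail."""
    bounds = [0, *indices, len(text)]
    # Pair each boundary with its successor: (0, 2), (2, 8), ...
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


d = 'CCCarbonCopyCAIComputerAidedInstruction'
s = [2, 8, 15, 23, 28]  # first five indices from the question
print(' '.join(split_at(d, s)))
# CC Carbon CopyCAI Computer Aided Instruction
```

Note that "CopyCAI" stays glued together because the index list has no entry for the start of a short form.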
You can convert the string to a list, insert the spaces, and then convert it back.
s = [2, 8, 15, 23, 28, 43, 47, 55, 63, 72, 79, 82, 89, 97, 102, 112, 120, 125, 131, 141, 148, 156, 163, 167, 180, 188, 193, 210, 222, 227]
d = 'CCCarbonCopyCAIComputerAidedInstructionCDMACodeDivisionMultipleAccessCRTCathodeRayTubeCADComputerAidedDesignCADDComputerAidedDesignDraftingCDCompactDiskCDRWCompactDiskRewritableCAMComputerAidedManufacturingCROMComputerizedRangeMotionCDROMCompactDiskReadOnlyMemory'
d = list(d)
for index, i in enumerate(s):
    # Each earlier insertion shifts the later positions by one, hence index + i.
    d.insert(index + i, " ")
d = ''.join(d)
print(d)
Output:
CC Carbon CopyCAI Computer Aided InstructionCDMA Code Division Multiple AccessCRT Cathode Ray TubeCAD Computer Aided DesignCADD Computer Aided Design DraftingCD Compact DiskCDRW Compact Disk RewritableCAM Computer Aided ManufacturingCROM Computerized Range MotionCDROMCompactDiskReadOnlyMemory
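An alternative to offsetting each position is to insert from the highest index down, so that earlier positions are never shifted. This is a sketch on a shortened copy of the string with the first five indices from the question.

```python
d = 'CCCarbonCopyCAIComputerAidedInstruction'
s = [2, 8, 15, 23, 28]  # first five indices from the question

chars = list(d)
# Insert from the back: positions before i are unaffected by the insertion.
for i in sorted(s, reverse=True):
    chars.insert(i, ' ')
result = ''.join(chars)
print(result)
# CC Carbon CopyCAI Computer Aided Instruction
```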
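You can do everything in one go with regular expressions, without computing the index list first.
First we separate the short-form and long-form parts, then we add spaces between the words of the long form.
This gives you your data in a more structured form than you asked for: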
import re
d = 'CCCarbonCopyCAIComputerAidedInstructionCDMACodeDivisionMultipleAccessCRTCathodeRayTubeCADComputerAidedDesignCADDComputerAidedDesignDraftingCDCompactDiskCDRWCompactDiskRewritableCAMComputerAidedManufacturingCROMComputerizedRangeMotionCDROMCompactDiskReadOnlyMemory'
lst = re.findall(r'([A-Z]+)((?:[A-Z][a-z]+)+)', d)
# [('CC', 'CarbonCopy'),
# ('CAI', 'ComputerAidedInstruction'), ...
lst = [(abbr, re.sub(r'([a-z])(?=[A-Z])', r'\1 ', long)) for abbr, long in lst]
print(lst)
# [('CC', 'Carbon Copy'),
# ('CAI', 'Computer Aided Instruction'), ...
If you really want everything in a single string, and are willing to lose this structure:
joined = ' '.join([item for tup in lst for item in tup])
print(joined)
# CC Carbon Copy CAI Computer Aided Instruction CDMA Code Division Multiple Access ...
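If the structured form is what you want to keep, the pairs could also be turned into a lookup table. This is only a sketch on a shortened copy of the string; the name glossary is made up, and a repeated short form would overwrite the earlier entry.

```python
import re

d = 'CCCarbonCopyCAIComputerAidedInstructionCDMACodeDivisionMultipleAccess'

# Same two patterns as above: split into (short, long) pairs, then space the long form.
pairs = re.findall(r'([A-Z]+)((?:[A-Z][a-z]+)+)', d)
glossary = {
    abbr: re.sub(r'([a-z])(?=[A-Z])', r'\1 ', long_form)
    for abbr, long_form in pairs
}
print(glossary['CDMA'])
# Code Division Multiple Access
```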
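You are iterating over the elements of the list and then still using each element as an index into the same list. That is wrong.
for i in s:
    print(i)  # iterates over the elements themselves, so no indexing is needed
for i in range(len(s)):
    print(s[i])  # iterates over positions, so indexing is needed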
You need to understand the difference between the two. Reference: https://www.geeksforgeeks.org/iterate-over-a-list-in-python/
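To make the difference concrete, here is a small sketch using the first five indices from the question. Using an element as a position works only as long as the value happens to be smaller than the length of the list.

```python
s = [2, 8, 15, 23, 28]  # first five indices from the question

values = [i for i in s]                   # iterate over elements
by_index = [s[i] for i in range(len(s))]  # iterate over positions
pairs = list(enumerate(s))                # both at once: (position, element)

try:
    # s[2] exists, but s[8] does not: the list has only five items.
    bad = [s[i] for i in s]
except IndexError as exc:
    error = str(exc)

print(values == by_index)  # True
print(error)               # list index out of range
```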
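For your problem, you can try this:
words = [d[:s[0]]]  # text before the first index
for i in range(len(s) - 1):
    words.append(d[s[i]:s[i+1]])
words.append(d[s[-1]:])  # text after the last index
" ".join(words)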
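Maybe a different approach would be simpler. Depending on how consistent your data is, you may be able to re.split() on the capitalized words, r'([A-Z][a-z]+)', and capture the delimiters. Like this: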
import re
d = 'CCCarbonCopyCAIComputerAidedInstructionCDMACodeDivisionMultipleAccessCRTCathodeRayTubeCADComputerAidedDesignCADDComputerAidedDesignDraftingCDCompactDiskCDRWCompactDiskRewritableCAMComputerAidedManufacturingCROMComputerizedRangeMotionCDROMCompactDiskReadOnlyMemory'
out = " ".join(filter(len, re.split(r'([A-Z][a-z]+)', d)))
This gives you:
'CC Carbon Copy CAI Computer Aided Instruction CDMA Code Division Multiple Access CRT Cathode Ray Tube CAD Computer Aided Design CADD Computer Aided Design Drafting CD Compact Disk CDRW Compact Disk Rewritable CAM Computer Aided Manufacturing CROM Computerized Range Motion CDROM Compact Disk Read Only Memory'
This seems to work for the string you gave, but there may be edge cases if your data looks different.
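A related sketch, not part of the answer above, inserts the spaces directly with lookaround assertions instead of splitting and filtering. It assumes the same data shape (all-uppercase short forms, capitalized long-form words) and is shown on a shortened string.

```python
import re

d = 'CCCarbonCopyCAIComputerAidedInstructionCDCompactDisk'

# Boundary 1: a lowercase letter followed by an uppercase letter.
# Boundary 2: an uppercase letter followed by the start of a capitalized word.
boundary = r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])'
spaced = re.sub(boundary, ' ', d)
print(spaced)
# CC Carbon Copy CAI Computer Aided Instruction CD Compact Disk

# The re.split version from above, for comparison.
split_version = " ".join(filter(len, re.split(r'([A-Z][a-z]+)', d)))
```

On this sample both versions give the same string; they share the same edge cases if a short form contains lowercase letters or digits.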
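Note the following:
s = []
for i in range(0, len(d)):
    if d[i].isupper() and d[i+1].islower() and d[i+2].islower():
        s.append(i)
Here, i runs from 0 to len(d) - 1. What do you think happens when the condition reaches d[i+2] on the last iteration, that is, d[len(d)+1]?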
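One way to keep every lookup in bounds is to stop the loop two characters early. The sketch below uses a shortened copy of the string. Note that the original loop only fails when d[i+1] or d[i+2] is actually evaluated past the end: because `and` short-circuits, a string ending in lowercase letters hides the problem.

```python
d = 'CCCarbonCopyCAIComputerAidedInstruction'

# i + 2 is at most len(d) - 1, so every lookup is in bounds.
s = [
    i for i in range(len(d) - 2)
    if d[i].isupper() and d[i + 1].islower() and d[i + 2].islower()
]
print(s)
# [2, 8, 15, 23, 28]

# Indexing past the end always raises, whatever the string contains.
try:
    d[len(d) + 1]
except IndexError as exc:
    error = str(exc)
print(error)
# string index out of range
```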