How to split a string by its numbering



I want to split the following corpus into parts:

corpus = '1  Write short notes on the anatomy of the Circle of Willis including normal variants.     2  Write short notes on the anatomy of the radiological spaces of the orbit excluding the eyeball.      3  Write short notes on the anatomy of the axis (C2 vertebra).      4  Write short notes on the anatomy of the corpus callosum.      5  Write short notes on the anatomy of the posterior division of the internal iliac artery  6  Write short notes on the anal canal including sphincters.'

into the following:

['Write short notes on the anatomy of the Circle of Willis including normal variants.', 'Write short notes on the anatomy of the radiological spaces of the orbit excluding the eyeball.', 'Write short notes on the anatomy of the axis (C2 vertebra).', 'Write short notes on the anatomy of the posterior division of the internal iliac artery', 'Write short notes on the anal canal including sphincters.']

I wrote this, but it doesn't work:

for i in [int(s) for s in corpus.split() if s.isdigit()]:
    answer = corpus.split(str(i))
    print(answer)

What can I do?

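For your example data, you can also split on a match of zero or more spaces, followed by one or more digits, followed by exactly two spaces:

' *\d+  '

import re
print(list(filter(None, re.split(r' *\d+  ', corpus))))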


For clarity, you can put the space in a character class followed by a quantifier: [ ]*\d+[ ]{2}
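As a minimal sketch of the character-class form of the pattern, the snippet below runs it on a shortened sample. The sample string and the names `sample`, `pattern`, and `parts` are assumptions made for brevity; they are not part of the original question.

```python
import re

# Shortened sample in the same shape as the question's corpus (hypothetical data)
sample = '1  First note.     2  Note on the axis (C2 vertebra).      3  Last note.'

# Zero or more spaces, one or more digits, then exactly two spaces
pattern = r'[ ]*\d+[ ]{2}'

# re.split leaves an empty string before the first number, so drop empty pieces
parts = [p for p in re.split(pattern, sample) if p]
print(parts)
# ['First note.', 'Note on the axis (C2 vertebra).', 'Last note.']
```

Because the pattern requires two spaces after the digits, the "2" in "C2 vertebra", which is followed by a single space, is not treated as a separator.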

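You tagged the question with regex but posted a non-regex attempt. Here is a working non-regex solution for your OP.

Splitting on whitespace is fine. Accumulate the text pieces in a temporary variable until you hit the next number, then add the accumulated part to the overall result.

Use a list to store the temporary pieces. That is more efficient than appending to a string, because strings are immutable.

Skip storing the number itself: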

corpus = '1  Write short notes on the anatomy of the Circle of Willis including normal variants.     2  Write short notes on the anatomy of the radiological spaces of the orbit excluding the eyeball.      3  Write short notes on the anatomy of the axis (C2 vertebra).      4  Write short notes on the anatomy of the corpus callosum.      5  Write short notes on the anatomy of the posterior division of the internal iliac artery  6  Write short notes on the anal canal including sphincters.'
allparts = []  # total result
part = []      # parts that belong to one number
for p in corpus.split():
    if p.isdigit():      # if a number
        if part:             # if stored something
            allparts.append(' '.join(part))   # add it to result
            part = []
        continue         # skip storing the number
    part.append(p)      # add to part
if part:   # add rest
    allparts.append(' '.join(part))
print(allparts)

Output:

['Write short notes on the anatomy of the Circle of Willis including normal variants.', 
'Write short notes on the anatomy of the radiological spaces of the orbit excluding the eyeball.', 
'Write short notes on the anatomy of the axis (C2 vertebra).', 
'Write short notes on the anatomy of the corpus callosum.', 
'Write short notes on the anatomy of the posterior division of the internal iliac artery', 
'Write short notes on the anal canal including sphincters.']

Using re.split and a list comprehension, with str.strip to remove surrounding whitespace:

import re
result = [
    phrase for phrase in map(str.strip, re.split(r'\d+\s\s', corpus)) if phrase
]

Result:

['Write short notes on the anatomy of the Circle of Willis including normal variants.',
'Write short notes on the anatomy of the radiological spaces of the orbit excluding the eyeball.',
'Write short notes on the anatomy of the axis (C2 vertebra).',
'Write short notes on the anatomy of the corpus callosum.',
'Write short notes on the anatomy of the posterior division of the internal iliac artery',
'Write short notes on the anal canal including sphincters.']

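Try using re.split() with a regular expression, together with strip():

import re
a = "1  hello.  2  my name is. 3  maat."
answer = [s.strip(" ") for s in filter(None, re.split(r" *\d+ ", a))]
print(answer)  # ['hello.', 'my name is.', 'maat.']

re.split() does the splitting, and strip(" ") removes the leading and trailing spaces from each piece.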
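The re.split() plus strip() approach can be wrapped in a small helper, sketched below. The function name `split_numbered` is hypothetical and introduced only for illustration; the pattern is the one from this answer.

```python
import re

def split_numbered(text):
    """Split text on numbering such as '1 ' or '  2 ' and trim each piece."""
    # Optional spaces, one or more digits, then a single space
    pieces = re.split(r" *\d+ ", text)
    # Trim spaces and drop pieces that end up empty
    return [p.strip(" ") for p in pieces if p.strip(" ")]

a = "1  hello.  2  my name is. 3  maat."
print(split_numbered(a))
# ['hello.', 'my name is.', 'maat.']
```

Note that this pattern needs only one space after the digits, so any digit followed by a space inside the text also acts as a separator. On the corpus from the question, "C2 vertebra" would be cut in two; requiring two spaces after the digits avoids that.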
