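I have an input file, and I need to extract groups of lines that are separated by two consecutive newlines (that is, a blank line).
For example, the text file looks like this: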
1. Sometext
Sometext
Sometext

2. Sometext
Sometext
Sometext

3. Sometext
Sometext
Sometext

Sometext which is not needed
Sometext which is not needed
Sometext which is not needed
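I need to extract one substring containing everything from "1." up to just before "2.", a second substring containing everything from "2." up to just before "3.", and so on. The script below produces output, but it also picks up all the "Sometext which is not needed" lines that I don't want. Here is the code: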
file_path = open("filename", "r")
content = file_path.read()
size = len(content)
start = 0
a = 1
b = 2
end = 0
ext = 0
while (start < size):
    if (end != -1):
        subString = content[content.find(str(a)+".")+0:content.find("\n"+str(b)+".")]
        print (subString)
        end = content.find(str(b)+".", start)
        print ("\n")
        a = int(a)+1  # increment to find the next start number
        b = int(b)+1  # increment to find the next end number
        start = end+1  # continue searching after the previous match
    else:
        break
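So I decided to use two consecutive newlines as the end position instead, with the line below, but that does not work:
subString = content[content.find(str(a)+".")+3:content.find("\n\n")]
Please help, and let me know if you have any questions. Thanks in advance.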
I'm not sure I understood your question correctly, but the code in this answer outputs:
['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']
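based on the text in your question. If you instead want each section (from 1 up to 2, and so on) as one whole substring, like this:
['1. Sometext\nSometext\nSometext']
['2. Sometext\nSometext\nSometext']
['3. Sometext\nSometext\nSometext']
then the if statement should be changed to: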
if is_number(i[0]):
    substring = []
    substring.append(i)
    print(substring)
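Otherwise, you can use the code below as it is:
def is_number(string):
    # True if the string can be parsed as a number, e.g. "1"
    try:
        float(string)
        return True
    except ValueError:
        return False

with open('testing.txt', 'r') as f:
    # Sections are separated by a blank line, i.e. two newlines in a row
    content = f.read().split('\n\n')

for i in content:
    # Keep only the sections whose first character is a digit
    if is_number(i[0]):
        c = i.split('\n')
        # Drop the leading "N. " prefix (assumes a single-digit number)
        substring = [line[3:] if is_number(line[0]) else line for line in c]
        print(substring)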
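The code above assumes single-digit numbering and exactly one blank line between sections. As a rough sketch of a more tolerant variant (the function name `numbered_blocks` and the inline `sample` string are made up for illustration, not part of the question), a regular expression can split on any run of blank lines and keep only the blocks that start with a number followed by a dot:

```python
import re

def numbered_blocks(text):
    """Return the blank-line-separated blocks of `text` that start with 'N.'.

    Hypothetical helper: assumes wanted sections begin with digits and a dot.
    """
    # One or more blank (or whitespace-only) lines act as the separator.
    blocks = re.split(r"\n\s*\n", text.strip())
    # re.match anchors at the start of each block, so "12." works as well as "1."
    return [block for block in blocks if re.match(r"\d+\.", block)]

# Stand-in for the file content shown in the question.
sample = (
    "1. Sometext\nSometext\nSometext\n\n"
    "2. Sometext\nSometext\nSometext\n\n"
    "3. Sometext\nSometext\nSometext\n\n"
    "Sometext which is not needed\nSometext which is not needed\n"
)

for block in numbered_blocks(sample):
    print([block])
```

Each printed item is one whole section as a single string, which matches the second output format shown earlier in this answer.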
You will still have to filter out the unwanted lines at the end, but this gets you what you want:
from itertools import groupby

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print([list(v) for k, v in grps if k])
Output:
[['1. Sometext\n', 'Sometext\n', 'Sometext\n'], ['2. Sometext\n', 'Sometext\n', 'Sometext\n'], ['3. Sometext\n', 'Sometext\n', 'Sometext\n'], ['Sometext which is not needed\n', 'Sometext which is not needed\n', 'Sometext which is not needed']]
Since all the sections you want to keep start with a digit:
from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print(list(takewhile(lambda x: x[0][0].isdigit(), (list(v) for k, v in grps if k))))
Output:
[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
['2. Sometext\n', 'Sometext\n', 'Sometext\n'],
['3. Sometext\n', 'Sometext\n', 'Sometext\n']]
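If you want each section as a single substring rather than a list of lines, the groups can be joined before filtering. This is only a sketch: `leading_numbered_groups` is a hypothetical name, and `io.StringIO` stands in for the opened file so the example runs without `in.txt`.

```python
from io import StringIO
from itertools import groupby, takewhile

def leading_numbered_groups(lines):
    """Join each blank-line-separated group into one string and stop at the
    first group that does not start with a digit (hypothetical helper)."""
    groups = (
        "".join(group).rstrip("\n")
        for has_text, group in groupby(lines, key=lambda line: bool(line.strip()))
        if has_text  # skip the runs of blank lines between sections
    )
    # lstrip() tolerates indentation before the number.
    return list(takewhile(lambda s: s.lstrip()[:1].isdigit(), groups))

# Stand-in for the file content shown in the question.
sample = (
    "1. Sometext\nSometext\nSometext\n\n"
    "2. Sometext\nSometext\nSometext\n\n"
    "3. Sometext\nSometext\nSometext\n\n"
    "Sometext which is not needed\nSometext which is not needed\n"
)

for section in leading_numbered_groups(StringIO(sample)):
    print(repr(section))
```

Because `takewhile` stops at the first non-matching group, any numbered section that appeared after the unwanted text would also be dropped; that matches the layout in the question but is an assumption about the real file.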
If you know there are n groups, you can slice instead:
from itertools import groupby, islice

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print(list(islice((list(v) for k, v in grps if k), 3)))
Output:
[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
['2. Sometext\n', 'Sometext\n', 'Sometext\n'],
['3. Sometext\n', 'Sometext\n', 'Sometext\n']]
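As for why the `find("\n\n")` attempt in the question does not work: `str.find` without a start argument always returns the first match in the whole string, so every slice ends at the first blank line. Passing the section's start position as the second argument is one way to fix it. The sketch below assumes sections are numbered consecutively from 1 and that "N." does not occur earlier in the text than the section it labels; `extract_sections` and `sample` are illustrative names.

```python
def extract_sections(content):
    """Collect 'N. ...' sections, each ending at the next blank line
    (hypothetical helper based on the question's find() approach)."""
    sections = []
    pos = 0
    n = 1
    while True:
        start = content.find(f"{n}.", pos)
        if start == -1:
            break  # no section with this number: stop
        # Search for the blank line from `start`, not from the beginning.
        end = content.find("\n\n", start)
        if end == -1:
            end = len(content)  # last section runs to the end of the text
        sections.append(content[start:end])
        pos = end
        n += 1
    return sections

# Stand-in for the file content shown in the question.
sample = (
    "1. Sometext\nSometext\nSometext\n\n"
    "2. Sometext\nSometext\nSometext\n\n"
    "3. Sometext\nSometext\nSometext\n\n"
    "Sometext which is not needed\nSometext which is not needed\n"
)

for section in extract_sections(sample):
    print(section)
    print()
```

The loop ends as soon as the next number is not found, so the trailing unwanted text is never sliced.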