Python - Extract strings from a text file up to the first 2 blank lines



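I have an input file, and I need to extract several lines from it, using 2 blank lines as the delimiter.

For example, the text file looks like this: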

1. Sometext
Sometext 
Sometext
2. Sometext
Sometext
Sometext
3. Sometext
Sometext
Sometext
Sometext which is not needed
Sometext which is not needed
Sometext which is not needed

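I need to extract one substring containing everything from "1." up to just before "2.", a second substring containing everything from "2." up to just before "3.", and so on. The script below produces output, but it also picks up all of the "Sometext which is not needed" lines that I don't want. Here is the code: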

with open("filename", "r") as file_path:
    content = file_path.read()
size = len(content)
start = 0
a = 1
b = 2
end = 0
ext = 0
while start < size:
    if end != -1:
        # slice from "a." up to the newline that precedes "b."
        subString = content[content.find(str(a) + "."):content.find("\n" + str(b) + ".")]
        print(subString)
        end = content.find(str(b) + ".", start)
        print("\n")
        a = int(a) + 1  # increment to find the next start number
        b = int(b) + 1  # increment to find the next end number
        start = end + 1  # continue searching from the next position
    else:
        break

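So I decided to look for 2 consecutive blank lines as the end position and used the line below, but it doesn't work.

subString = content[content.find(str(a) + ".") + 3:content.find("\n\n")]

Please help, and let me know if you have any questions. Thanks in advance.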
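A possible explanation for both attempts, offered as a sketch rather than a definitive diagnosis: `str.find` returns -1 when the marker is missing, so for the last numbered block the slice ends at index -1 and runs almost to the end of the file, and `content.find("\n\n")` without a start argument always returns the position of the first blank line. The version below passes a start position to `find` and checks for -1 explicitly. The names `extract_numbered_blocks` and `sample` are hypothetical, and the code assumes every wanted block is followed by a blank line and that "N." does not occur earlier inside another block.

```python
def extract_numbered_blocks(content):
    """Collect the text from each "N." marker up to the next blank line."""
    blocks = []
    number = 1
    pos = 0
    while True:
        start = content.find(f"{number}.", pos)
        if start == -1:
            break  # no further numbered block
        end = content.find("\n\n", start)  # search from the current block only
        if end == -1:
            end = len(content)  # last block runs to the end of the file
        blocks.append(content[start:end])
        pos = end
        number += 1
    return blocks


# Hypothetical sample: blocks separated by one blank line.
sample = (
    "1. Sometext\nSometext\nSometext\n\n"
    "2. Sometext\nSometext\nSometext\n\n"
    "3. Sometext\nSometext\nSometext\n\n"
    "Sometext which is not needed\nSometext which is not needed\n"
)

for block in extract_numbered_blocks(sample):
    print(repr(block))
```

Because the loop stops as soon as the next number cannot be found, the trailing unnumbered text is never sliced.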

I'm not sure I understood your question correctly, but here is code that will output:

['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']

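based on the text in your question. If instead you want each block from 1 to 2 as one whole substring, like this:

['1. Sometext\nSometext\nSometext']
['2. Sometext\nSometext\nSometext']
['3. Sometext\nSometext\nSometext']

then change the if statement to: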

if is_number(i[0]):
    substring = []
    substring.append(i)
    print(substring)

Otherwise you can use the code below:

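def is_number(string):
    """Return True if the string can be parsed as a number."""
    try:
        float(string)
        return True
    except ValueError:
        return False

with open('testing.txt', 'r') as f:
    # blocks are separated by a blank line
    content = f.read().split('\n\n')
for i in content:
    # skip empty pieces; keep only blocks that begin with a digit
    if i and is_number(i[0]):
        c = i.split('\n')
        # drop the "N. " prefix from the numbered line
        substring = [line[3:] if line and is_number(line[0]) else line for line in c]
        print(substring)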
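If the gaps between blocks can contain more than one blank line, or lines holding only spaces, splitting on the literal two-newline string can yield empty or newline-prefixed pieces. A minimal sketch that tolerates this uses `re.split`, assuming every wanted block starts with digits followed by a period. The names `numbered_blocks` and `sample` are illustrative only.

```python
import re


def numbered_blocks(content):
    """Split on runs of blank lines and keep blocks that start with "N."."""
    # \n\s*\n matches a newline, any whitespace-only lines, and a final newline
    blocks = re.split(r"\n\s*\n", content.strip())
    return [block.split("\n") for block in blocks if re.match(r"\d+\.", block)]


# Hypothetical sample with an irregular gap and a whitespace-only line.
sample = "1. Sometext\nSometext\n\n\n2. Sometext\n \nnot needed\n"

for lines in numbered_blocks(sample):
    print(lines)
```

Unlike the code above, this keeps the "N. " prefix on the first line of each block; slice it off afterwards if it is not wanted.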

You will have to filter out the unwanted lines at the end, but this gets you what you want:

from itertools import groupby
with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print([list(v) for k,v in grps if k])

Output:

[['1. Sometext\n', 'Sometext\n', 'Sometext\n'], ['2. Sometext\n', 'Sometext\n', 'Sometext\n'], ['3. Sometext\n', 'Sometext\n', 'Sometext\n'], ['Sometext which is not needed\n', 'Sometext which is not needed\n', 'Sometext which is not needed']]
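To see why this groups the file by blank lines: `groupby` starts a new run each time the key changes, and the key here is True for lines with text and False for blank ones. A small sketch on an in-memory list (the `lines` data is made up for illustration) shows the alternating runs:

```python
from itertools import groupby

# Hypothetical lines, as they would come from iterating over a file.
lines = ["1. Sometext\n", "Sometext\n", "\n", "\n", "2. Sometext\n"]

# Each run is (key, lines): True for text lines, False for blank lines.
runs = [(has_text, list(group))
        for has_text, group in groupby(lines, key=lambda x: bool(x.strip()))]

for has_text, group in runs:
    print(has_text, group)
```

Consecutive blank lines fall into a single False run, so one or several blank lines between blocks behave the same way.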

Since all the sections you want to keep start with a digit:

from itertools import groupby, takewhile
with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print (list(takewhile(lambda x: x[0][0].isdigit(),(list(v) for k,v in grps if k))))

Output:

[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['2. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['3. Sometext\n', 'Sometext\n', 'Sometext\n']]
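One thing to keep in mind: `takewhile` stops at the first group that does not start with a digit, so any numbered group after an unwanted one is dropped. If the unwanted text can appear between numbered groups, a plain filter may fit better. The sketch below contrasts the two on made-up data; `paragraphs`, `starts_with_digit`, and `text` are illustrative names, not part of the code above.

```python
import io
from itertools import groupby, takewhile

# Hypothetical input with an unwanted paragraph between two numbered ones.
text = (
    "1. Sometext\nSometext\n\n"
    "Sometext which is not needed\n\n"
    "2. Sometext\nSometext\n"
)


def paragraphs(lines):
    """Yield each run of non-blank lines as a list, newlines stripped."""
    for has_text, group in groupby(lines, key=lambda line: bool(line.strip())):
        if has_text:
            yield [line.rstrip("\n") for line in group]


def starts_with_digit(group):
    return group[0][0].isdigit()


# takewhile stops at the first non-numbered paragraph.
prefix_only = list(takewhile(starts_with_digit, paragraphs(io.StringIO(text))))
# A filter keeps every numbered paragraph, wherever it appears.
all_numbered = [g for g in paragraphs(io.StringIO(text)) if starts_with_digit(g)]

print(prefix_only)
print(all_numbered)
```

For the file in the question, where the unwanted lines come last, both approaches give the same result.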

If you know there are n groups, you can slice:

from itertools import groupby, islice
with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print (list(islice((list(v) for k,v in grps if k),3)))

Output:

[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['2. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['3. Sometext\n', 'Sometext\n', 'Sometext\n']]
