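I'd like to know how to extract data from a specific range in a large data file. Is there a way to read the content that starts and ends with particular "keywords"?
I want to read every line between *NODE and **: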
*NODE
13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517
13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065
13021147, 2634.226986419154, 54.98263035830583, 205.9520084547658
13021148, 2634.224808775879, 55.181857466932044, 205.96943880353476
**
There are thousands of lines before *NODE and after **.
I know it should look something like this:
a = []
with open('file.txt') as file:
    for line in file:
        if line.startswith('*NODE'):
            # NOW THERE SHOULD FOLLOW SOMETHING LIKE:
            # Go to next line and "a.append" till there comes the "magical"
            # "**"
Any ideas? I'm completely new to Python. Thanks for your help! I hope you understand what I mean.
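You almost had it. The only thing missing is that, once you have found the start, you look for the end of the sequence and, until you reach it, append every line you iterate over to your list. I.e.: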
data = None  # a placeholder to store your lines
with open("file.txt", "r") as f:  # do not shadow the built-in `file`
    for line in f:  # iterate over the lines
        if data is None:  # we haven't found `*NODE` yet
            if line[:5] == "*NODE":  # search for `*NODE` at the line beginning
                data = []  # make `data` an empty list to begin collecting
        elif line[:2] == "**":  # data initialized, we look for the sequence's end
            break  # no need to iterate over the file anymore
        else:  # data initialized but not at the end...
            data.append(line)  # append the line (including its trailing newline)
Now data will contain a list of the lines between *NODE and **, or None if the sequence was not found.
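To make the None versus list distinction easier to try out, here is a minimal sketch of the same loop wrapped in a function. The name read_block and its start/end parameters are hypothetical, and io.StringIO stands in for file.txt so the snippet runs without a real file.

```python
import io


def read_block(lines, start="*NODE", end="**"):
    """Return the lines between the first `start` line and the next `end` line.

    Returns None if no line beginning with `start` is found.
    """
    data = None  # stays None until the start marker has been seen
    for line in lines:
        if data is None:
            if line.startswith(start):
                data = []  # start marker found, begin collecting
        elif line.startswith(end):
            break  # end marker found, stop reading
        else:
            data.append(line.rstrip("\n"))  # drop the trailing newline
    return data


# In-memory stand-in for the real file (an assumption for this demo).
sample = io.StringIO(
    "some header line\n"
    "*NODE\n"
    "13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517\n"
    "13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065\n"
    "**\n"
    "some trailing line\n"
)
block = read_block(sample)
missing = read_block(io.StringIO("no markers here\n"))
print(block)
print(missing)
```

With a real file you would pass the open file object instead, e.g. read_block(f) inside the with block.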
Try this:
with open('file.txt') as file:
    a = []
    running = False  # avoid NameError when 'if' statement below isn't reached
    for line in file:
        if line.startswith('*NODE'):
            running = True  # show that we are starting to add values
            continue  # make sure we don't add '*NODE'
        if line.startswith('**'):
            running = False  # show that we're done adding values
            continue  # make sure we don't add '**'
        if running:  # only add the values if 'running' is True
            a.extend([i.strip() for i in line.split(',')])
The output is a list containing the following (printed with print('\n'.join(a))):
13021145
2637.6073002472617
55.011929824413045
206.0394346892517
13021146
2637.6051226039867
55.21115693303926
206.05686503802065
13021147
2634.226986419154
54.98263035830583
205.9520084547658
13021148
2634.224808775879
55.181857466932044
205.96943880353476
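The list above holds every value as a separate string, so the grouping into rows is lost. If you would rather keep one record per line with numeric types, something like the following sketch could work. It assumes each line is an integer ID followed by three floating-point values; that reading of the data, and the name parse_node_line, are my assumptions, not something stated in the question.

```python
def parse_node_line(line):
    """Split 'id, x, y, z' into (int, float, float, float).

    Assumes exactly four comma-separated fields per line.
    """
    node_id, x, y, z = (field.strip() for field in line.split(","))
    return int(node_id), float(x), float(y), float(z)


# Two lines from the question, as they would come out of the file.
rows = [
    "13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517\n",
    "13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065\n",
]
nodes = [parse_node_line(row) for row in rows]
print(nodes[0])
```

A line with a different number of fields raises ValueError here, which at least makes malformed input visible early.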
We can iterate over the lines until there are none left or we have reached the end of the block, e.g.
a = []
with open('file.txt') as file:
    for line in file:
        if line.startswith('*NODE'):
            # collect block-related lines
            while True:
                try:
                    line = next(file)
                except StopIteration:
                    # there are no lines left
                    break
                if line.startswith('**'):
                    # we've reached the end of block
                    break
                a.append(line)
            # stop iterating over file
            break
which will give us
print(a)
['13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517\n',
 '13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065\n',
 '13021147, 2634.226986419154, 54.98263035830583, 205.9520084547658\n',
 '13021148, 2634.224808775879, 55.181857466932044, 205.96943880353476\n']
Alternatively, we can write helper predicates, e.g.
def not_a_block_start(line):
    return not line.startswith('*NODE')


def not_a_block_end(line):
    return not line.startswith('**')
and then take advantage of the itertools module, e.g.
from itertools import (dropwhile,
                       takewhile)

with open('file.txt') as file:
    block_start = dropwhile(not_a_block_start, file)
    # skip block start line; the default avoids StopIteration
    # if the file contains no '*NODE' line at all
    next(block_start, None)
    a = list(takewhile(not_a_block_end, block_start))
This gives us the same value of a.
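All of the approaches above stop after the first block. If the file happens to contain several *NODE ... ** sections (an assumption, the question only shows one), a generator along these lines could yield each block in turn while still reading the file line by line. The name iter_blocks is hypothetical, and io.StringIO again stands in for the real file.

```python
import io


def iter_blocks(lines, start="*NODE", end="**"):
    """Yield each run of lines found between a `start` line and the next `end` line."""
    block = None  # None means we are currently outside a block
    for line in lines:
        if block is None:
            if line.startswith(start):
                block = []  # entering a new block
        elif line.startswith(end):
            yield block  # block finished, hand it to the caller
            block = None
        else:
            block.append(line.rstrip("\n"))
    if block is not None:
        # the input ended before the closing marker; yield what we have
        yield block


# In-memory stand-in with two blocks (made-up values, for the demo only).
sample = io.StringIO(
    "preamble\n"
    "*NODE\n"
    "1, 2.0\n"
    "**\n"
    "something in between\n"
    "*NODE\n"
    "3, 4.0\n"
    "5, 6.0\n"
    "**\n"
)
blocks = list(iter_blocks(sample))
print(blocks)
```

Because it is a generator, only one block is held in memory at a time when you loop over it directly instead of calling list().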