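I have a list of text lines made up of alternating section headers and section content. I want to parse it line by line and identify the sections and their associated content (eventually merging them into a dictionary).
The trouble I'm having is parsing the lines into pairs simply by iterating over the list and looking for headers. Every time I try I get close, but somehow my sections end up misaligned.
I think my algorithm should go as follows:
(0) Assume no header has been identified when the search starts; anything seen is therefore ignored until a section header is encountered.
(1) When "in" a section (i.e. a section header has been encountered), accumulate all of the section content that follows and append it together until a new section header is seen.
(2) On encountering a new section header, treat any following lines as part of the new section.
(3) Some sections may consist of a header only, so their content is empty. Others may span one or more lines.
In other words, given this: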
garbage
Section-A-Header
section A content line 1
section A content line 2
section A content line 3
Section-B-Header
section B content line 1
section B content line 2
Section-C-Header
Section-D-Header
section D content line 1
section D content line 2
section D content line 3
I'd like to be able to build:
{Section-A-Header: section A content line 1 + section A content line 2 + section A content line 3}
{Section-B-Header: section B content line 1 + section B content line 2}
{Section-C-Header: None}
{Section-D-Header: section D content line 1 + section D content line 2 + section D content line 3}
Can anyone help me come up with a reliable implementation?
UPDATE: Sample data from the real code I'm working on is located here.
I'm not sure exactly what problem you're facing.
Here is some pseudocode to take inspiration from:
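def is_section_header(line):
    # Some logic to identify a header. Placeholder: assumes headers
    # start with "Section-", as in the sample data.
    return line.startswith("Section-")

last_header = ''
output = {}
with open("sections.txt", 'r') as file:
    for line in file:
        if is_section_header(line):
            # Start a new section with empty content.
            last_header = line.strip()
            output[last_header] = ""
        elif last_header:
            # Lines seen before the first header are skipped.
            output[last_header] += line
print(output)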
This would be my approach:
result = dict()
with open('foo.txt') as foo:
    section = None
    for line in map(str.strip, foo):
        # identify start of section
        if line.startswith('Section-'):
            section = line
            result[section] = None
        else:
            if section:
                if result[section]:
                    result[section].append(line)
                else:
                    result[section] = [line]
Result:
{
    "Section-A-Header": [
        "section A content line 1",
        "section A content line 2",
        "section A content line 3"
    ],
    "Section-B-Header": [
        "section B content line 1",
        "section B content line 2"
    ],
    "Section-C-Header": None,
    "Section-D-Header": [
        "section D content line 1",
        "section D content line 2",
        "section D content line 3"
    ]
}
Note:
It is written this way only because the OP wants empty sections to be None.
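If the content should end up as one joined string per section, as in the output sketched in the question, the same loop can be wrapped in a function that takes the list of lines directly. This is only a minimal sketch: `parse_sections` and the `is_header` callback are hypothetical names, the `Section-` prefix test is an assumption taken from the sample data, and joining with a newline is one possible reading of the "+" in the question.

```python
def parse_sections(lines, is_header):
    """Group content lines under the most recent header line."""
    sections = {}
    current = None  # no header seen yet, so leading lines are ignored
    for raw in lines:
        line = raw.strip()
        if is_header(line):
            current = line
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    # Join each section's lines; header-only sections become None.
    return {
        header: "\n".join(body) if body else None
        for header, body in sections.items()
    }


lines = [
    "garbage",
    "Section-A-Header",
    "section A content line 1",
    "section A content line 2",
    "section A content line 3",
    "Section-B-Header",
    "section B content line 1",
    "section B content line 2",
    "Section-C-Header",
    "Section-D-Header",
    "section D content line 1",
    "section D content line 2",
    "section D content line 3",
]

# Assumed header test: headers start with "Section-" as in the sample.
result = parse_sections(lines, lambda line: line.startswith("Section-"))
print(result["Section-C-Header"])  # None
```

Passing the header test in as a callback keeps the loop unchanged when the real headers need a regular expression instead of a prefix check.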
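Some people wanted to see an example of the actual data I'm working with (I was trying to avoid that, since it is much more complex than the sample data I gave above). The data is output from Pytest during a test run, and because it is sent to the console, most of the text lines have ANSI codes embedded in them. I didn't include this before because my difficulty wasn't parsing the text but coming up with the overall algorithm for going through the output line by line.
This is all part of a Pytest plugin I'm developing, which provides an auto-launched text user interface that will hopefully make Pytest's verbose output easier to analyze.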
======================================================================================== FAILURES ========================================================================================
[31m[1m______________________________________________________________________________________ test_b_fail _______________________________________________________________________________________[0m
[94mdef[39;49;00m [92mtest_b_fail[39;49;00m():
> [94massert[39;49;00m [94m0[39;49;00m
[1m[31mE assert 0[0m
[1m[31mtests/test_pytest_fold_1.py[0m:26: AssertionError
[31m[1m___________________________________________________________________________ test_g_eval_parameterized[6*9-42] ____________________________________________________________________________[0m
test_input = '6*9', expected = 42
[37m@pytest[39;49;00m.mark.parametrize([33m"[39;49;00m[33mtest_input, expected[39;49;00m[33m"[39;49;00m, [([33m"[39;49;00m[33m3+5[39;49;00m[33m"[39;49;00m, [94m8[39;49;00m), ([33m"[39;49;00m[33m2+4[39;49;00m[33m"[39;49;00m, [94m6[39;49;00m), ([33m"[39;49;00m[33m6*9[39;49;00m[33m"[39;49;00m, [94m42[39;49;00m)])
[94mdef[39;49;00m [92mtest_g_eval_parameterized[39;49;00m(test_input, expected):
> [94massert[39;49;00m [96meval[39;49;00m(test_input) == expected
[1m[31mE AssertionError: assert 54 == 42[0m
[1m[31mE + where 54 = eval('6*9')[0m
[1m[31mtests/test_pytest_fold_1.py[0m:48: AssertionError
The code I finally had success with is based on 现象一's answer. My regex definition is:
r"\x1b\[31m\x1b\[1m__+\W(\S+)\W__+\x1b\[0m"
The processing code is:
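As a quick check of what this pattern is meant to capture, here is a small sketch. It assumes the pattern reads as ESC[31m ESC[1m, a run of underscores, the test name, another run of underscores, then ESC[0m, with the backslashes written out. The names `TRACEBACK_HEADER`, `header_line` and `body_line` are made up for illustration, and the sample strings are shortened imitations of the Pytest output above.

```python
import re

# Assumed form of the pattern, with every escape written explicitly.
TRACEBACK_HEADER = r"\x1b\[31m\x1b\[1m__+\W(\S+)\W__+\x1b\[0m"

# Hypothetical sample lines modelled on the Pytest output.
header_line = "\x1b[31m\x1b[1m______ test_b_fail ______\x1b[0m"
body_line = "\x1b[1m\x1b[31mE       assert 0\x1b[0m"

header_match = re.search(TRACEBACK_HEADER, header_line)
body_match = re.search(TRACEBACK_HEADER, body_line)

print(header_match.group(1))  # test_b_fail
print(body_match)             # None
```

The body line does not match because its colour codes come in the opposite order (bold first, then red) and it has no run of underscores.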
# Requires "import re" at the top of the module.
def _get_tracebacks(self, section_name: str, regex: str) -> dict:
    last_header = ""
    output = {}
    lines = re.split("\n", self.Sections[section_name].content)
    for line in lines:
        result = re.search(regex, line)
        if result:
            # A traceback header: start a new entry keyed by test name.
            last_header = result.groups()[0]
            output[last_header] = ""
        else:
            # Ignore everything before the first header.
            if not last_header:
                continue
            existing_data = output[last_header]
            output[last_header] = existing_data + "\n" + line
    return output
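For anyone who wants to try the idea outside the plugin class, here is a standalone sketch of the same loop. It is an approximation rather than the plugin's actual code: `split_tracebacks`, `HEADER_RE` and the sample text are hypothetical, the pattern is assumed to be the one given above with its backslashes written out, and the body lines are joined with newlines instead of being prefixed with one each.

```python
import re

# Assumed header pattern: ESC[31m ESC[1m ____ name ____ ESC[0m
HEADER_RE = re.compile(r"\x1b\[31m\x1b\[1m__+\W(\S+)\W__+\x1b\[0m")


def split_tracebacks(text: str) -> dict:
    """Map each failing test name to the text of its traceback."""
    output = {}
    current = None  # nothing is collected until the first header
    for line in text.splitlines():
        match = HEADER_RE.search(line)
        if match:
            current = match.group(1)
            output[current] = []
        elif current is not None:
            output[current].append(line)
    return {name: "\n".join(body) for name, body in output.items()}


ESC = "\x1b"
# Hypothetical, shortened imitation of the FAILURES section.
sample = "\n".join([
    "==== FAILURES ====",
    f"{ESC}[31m{ESC}[1m____ test_b_fail ____{ESC}[0m",
    ">   assert 0",
    f"{ESC}[31m{ESC}[1m____ test_g_eval_parameterized[6*9-42] ____{ESC}[0m",
    "test_input = '6*9', expected = 42",
    "E   AssertionError: assert 54 == 42",
])

tracebacks = split_tracebacks(sample)
print(list(tracebacks))
```

Collecting lines in a list and joining once at the end avoids the leading newline that repeated string concatenation leaves at the start of each traceback.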
Thanks to everyone who took part in the discussion!