Parse only selected records from a file separated by empty lines



I have a file with the following structure:

SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz

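Records (i.e. blocks) are separated by an empty line. Every line in a block starts with an SE tag. The text tag always appears on the first line of each block.

I would like to know how to properly extract only the blocks that have a relation tag, which is not necessarily present in every block. My attempt is pasted below: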

from itertools import groupby
with open('test.txt') as f:
    for nonempty, group in groupby(f, bool):
        if nonempty:
            process_block() ## ?

The desired output is a JSON dump:

{
    "result": [
        {
            "text": "Baz",
            "relation": ["Bla","Foo"]
        },
        {
            "text": "Zoo",
            "relation": ["Bla","Baz"]
        }
    ]
}

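I came up with a solution in pure Python that returns a block if it contains the value at any position. This could most likely be done more elegantly with a proper framework such as pandas.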

from pprint import pprint
fname = 'ex.txt'
# extract blocks
with open(fname, 'r') as f:
    blocks = [[]]
    for line in f:
        # a line holding only the newline character marks a block boundary
        if len(line) == 1:
            blocks.append([])
        else:
            blocks[-1] += [line.strip().split('|')]
# remove blocks that don't contain 'relation'
blocks = [block for block in blocks
          if any('relation' == x[1] for x in block)]
pprint(blocks)
# [[['SE', 'text', 'Baz'],
#   ['SE', 'entity', 'Bla'],
#   ['SE', 'relation', 'Bla'],
#   ['SE', 'relation', 'Foo']],
#  [['SE', 'text', 'Zoo'], ['SE', 'relation', 'Bla'], ['SE', 'relation', 'Baz']]]

# To export to proper json format the following can be done
import pandas as pd
import json
results = []
for block in blocks:
    df = pd.DataFrame(block)
    json_dict = {}
    json_dict['text'] = list(df[2][df[1] == 'text'])
    json_dict['relation'] = list(df[2][df[1] == 'relation'])
    results.append(json_dict)
print(json.dumps(results))
# '[{"text": ["Baz"], "relation": ["Bla", "Foo"]}, {"text": ["Zoo"], "relation": ["Bla", "Baz"]}]'

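Let's walk through it:

  1. Read the file into a list, splitting the blocks at each empty line and the columns at the | character
  2. Go through each block in the list and filter out any block that does not contain relation
  3. Print the output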
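
The three steps above can also be sketched as one small function that uses only the standard library. This is a minimal sketch rather than the answer's own code: the function name `blocks_with_tag` and the inline `SAMPLE` string are made up for illustration, and it assumes that blocks are separated by blank lines and that every non-blank line has at least two `|`-separated fields.

```python
import json

# Hypothetical inline copy of the input file, with blank lines between blocks.
SAMPLE = """\
SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
"""


def blocks_with_tag(text, tag="relation"):
    """Return the parsed blocks that contain at least one line with `tag`."""
    # Step 1: start a new block at every blank line, split each line at '|'.
    blocks = [[]]
    for line in text.splitlines():
        if not line.strip():
            blocks.append([])
        else:
            blocks[-1].append(line.strip().split("|"))
    # Step 2: keep only blocks in which some row carries the tag in column 1.
    return [block for block in blocks if any(row[1] == tag for row in block)]


kept = blocks_with_tag(SAMPLE)
# Step 3: print the result.
print(json.dumps(kept))
```

Checking `line.strip()` instead of `len(line) == 1` also treats lines that hold only spaces or a Windows line ending as block separators.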

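As mentioned in the comments, you cannot store the same key twice in a dictionary. You can read the file, split it into blocks at '\n\n', split the blocks into lines at '\n', and split the lines into data at '|'.

You can then put the data into a suitable data structure and serialize it to a string with the json module:

Create the data file: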

with open("f.txt", "w") as f:
    f.write('''SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz''')

Read the data and process it:

with open("f.txt") as f:
    all_text = f.read()
    as_blocks = all_text.split("\n\n")
    # skip SE when splitting and filter only with |relation|
    with_relation = [[k.split("|")[1:]
                      for k in b.split("\n")]
                     for b in as_blocks if "|relation|" in b]
    print(with_relation)

Create a suitable data structure, grouping multiple identical keys into a list:

result = []
for inner in with_relation:
    result.append({})
    for k,v in inner:
        # add as simple key
        if k not in result[-1]:
            result[-1][k] = v
        # got key 2nd time, read it as list
        elif k in result[-1] and not isinstance(result[-1][k], list):
            result[-1][k] = [result[-1][k], v]
        # got it a 3rd+ time, add to list
        else:
            result[-1][k].append(v)
print(result)
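
One consequence of the loop above is that a key maps to a plain string when it occurs once and to a list when it repeats, so code that consumes the result has to handle both types. If a uniform shape is preferable, a variant with `collections.defaultdict` keeps every value a list. This is a hedged alternative, not the answer's method; the helper name `group_pairs` is illustrative, and the sample value simply mirrors the `with_relation` output shown in this answer.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Group (key, value) pairs so that every key maps to a list of values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Convert back to a plain dict so it prints and serializes like one.
    return dict(grouped)


# Same shape as `with_relation`: one list of [tag, value] pairs per block.
with_relation = [
    [["text", "Baz"], ["entity", "Bla"], ["relation", "Bla"], ["relation", "Foo"]],
    [["text", "Zoo"], ["relation", "Bla"], ["relation", "Baz"]],
]
result = [group_pairs(block) for block in with_relation]
print(result)
```

The trade-off is that single values such as `text` become one-element lists, which differs from the output format requested in the question.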

Create JSON from the data structure:

import json
print(json.dumps({"result": result}, indent=4))

Output:

# with_relation
[[['text', 'Baz'], ['entity', 'Bla'], ['relation', 'Bla'], ['relation', 'Foo']],
 [['text', 'Zoo'], ['relation', 'Bla'], ['relation', 'Baz']]]
# result
[{'text': 'Baz', 'entity': 'Bla', 'relation': ['Bla', 'Foo']},
 {'text': 'Zoo', 'relation': ['Bla', 'Baz']}]
# json string
{
    "result": [
        {
            "text": "Baz",
            "entity": "Bla",
            "relation": [
                "Bla",
                "Foo"
            ]
        },
        {
            "text": "Zoo",
            "relation": [
                "Bla",
                "Baz"
            ]
        }
    ]
}

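In my opinion, this is a very good use case for a small parser.
This solution uses a PEG parser called parsimonious, but you could just as well use another one: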

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import json
data = """
SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
"""

class TagVisitor(NodeVisitor):
    grammar = Grammar(r"""
        content = (ws / block)+
        block   = line+
        line    = ~".+" nl?
        nl      = ~"[\n\r]"
        ws      = ~"\s+"
    """)

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_content(self, node, visited_children):
        # keep only the blocks that visit_block turned into a dict
        filtered = [child[0] for child in visited_children if isinstance(child[0], dict)]
        return {"result": filtered}

    def visit_block(self, node, visited_children):
        text, relations = None, []
        for child in visited_children:
            if child[1] == "text" and not text:
                text = child[2].strip()
            elif child[1] == "relation":
                relations.append(child[2])
        # blocks without a relation fall through and return None
        if relations:
            return {"text": text, "relation": relations}

    def visit_line(self, node, visited_children):
        tag1, tag2, text = node.text.split("|")
        return tag1, tag2, text.strip()

tv = TagVisitor()
result = tv.parse(data)
print(json.dumps(result))

This yields

{"result":
    [{"text": "Baz", "relation": ["Bla", "Foo"]},
     {"text": "Zoo", "relation": ["Bla", "Baz"]}]
}

The idea is to formulate a grammar, build an abstract syntax tree from it, and return each block's content in a suitable data format.
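
For comparison, the `groupby` attempt from the question can be completed without a parser. Note that using `bool` as the key treats a line that holds only a newline as truthy, so this sketch groups on whether the stripped line is empty instead. It is only a sketch under stated assumptions: the helper name `parse_blocks` and the inline `sample` are illustrative, and every non-blank line is assumed to have exactly three `|`-separated fields.

```python
import json
from itertools import groupby


def parse_blocks(lines):
    """Yield {'text': ..., 'relation': [...]} for each block with a relation."""
    # Consecutive lines are grouped by whether they are non-blank.
    for nonempty, group in groupby(lines, key=lambda line: bool(line.strip())):
        if not nonempty:
            continue  # a run of blank separator lines
        text, relations = None, []
        for line in group:
            _, tag, value = line.strip().split("|")
            if tag == "text" and text is None:
                text = value
            elif tag == "relation":
                relations.append(value)
        # Blocks without any relation line are skipped.
        if relations:
            yield {"text": text, "relation": relations}


# Hypothetical inline copy of the input; an open file object works the same way.
sample = """SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
""".splitlines()

output = {"result": list(parse_blocks(sample))}
print(json.dumps(output, indent=4))
```

Because `parse_blocks` only iterates over lines, it can be fed a file directly, for example `with open('test.txt') as f: list(parse_blocks(f))`, without reading the whole file into memory.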
