I have a file with the following structure:
SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
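Records (i.e. blocks) are separated by an empty line. Every line in a block starts with an SE tag. The text tag always appears in the first line of each block.
I would like to know how to correctly extract only the blocks that contain a relation tag, which is not necessarily present in every block. My attempt is pasted below: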
from itertools import groupby

with open('test.txt') as f:
    for nonempty, group in groupby(f, bool):
        if nonempty:
            process_block() ## ?
The desired output is a JSON dump:
{
    "result": [
        {
            "text": "Baz",
            "relation": ["Bla","Foo"]
        },
        {
            "text": "Zoo",
            "relation": ["Bla","Baz"]
        }
    ]
}
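I came up with a solution in pure Python that returns a block if it contains the value at any position. This could most likely be done more elegantly in a proper framework such as pandas.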
from pprint import pprint

fname = 'ex.txt'

# extract blocks: a blank line starts a new block
with open(fname, 'r') as f:
    blocks = [[]]
    for line in f:
        if not line.strip():
            blocks.append([])
        else:
            blocks[-1].append(line.strip().split('|'))

# remove blocks that don't contain 'relation'
blocks = [block for block in blocks
          if any('relation' == x[1] for x in block)]

pprint(blocks)
# [[['SE', 'text', 'Baz'],
#   ['SE', 'entity', 'Bla'],
#   ['SE', 'relation', 'Bla'],
#   ['SE', 'relation', 'Foo']],
#  [['SE', 'text', 'Zoo'], ['SE', 'relation', 'Bla'], ['SE', 'relation', 'Baz']]]
# To export to proper JSON format, the following can be done
import pandas as pd
import json

results = []
for block in blocks:
    df = pd.DataFrame(block)
    json_dict = {}
    # column 1 holds the tag, column 2 holds the value
    json_dict['text'] = list(df[2][df[1] == 'text'])
    json_dict['relation'] = list(df[2][df[1] == 'relation'])
    results.append(json_dict)

print(json.dumps(results))
# [{"text": ["Baz"], "relation": ["Bla", "Foo"]}, {"text": ["Zoo"], "relation": ["Bla", "Baz"]}]
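Let's walk through it:
- Read the file into a list, splitting blocks at each empty line and columns at the | character
- Go through each block in the list and filter out any block that does not contain relation
- Print the output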
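The steps above can also be sketched in the groupby style the question attempted, without pandas. This is only a minimal sketch, not code from this answer: the helper name relation_blocks is hypothetical, and it assumes every non-blank line has exactly three |-separated fields and that blocks are separated by blank lines. Note the key function: lines read from a file keep their trailing newline, so plain bool is true even for an empty line, which is why the lines are stripped first.

```python
import json
from itertools import groupby


def relation_blocks(lines):
    """Return one dict per block that contains at least one relation line."""
    results = []
    # Group consecutive lines by whether they are non-blank after stripping.
    for nonempty, group in groupby(lines, key=lambda line: bool(line.strip())):
        if not nonempty:
            continue  # this group is the blank separator between blocks
        rows = [line.strip().split('|') for line in group]
        relations = [value for _, tag, value in rows if tag == 'relation']
        if relations:
            # Assumption: the first text line of the block is used as its text.
            text = next((value for _, tag, value in rows if tag == 'text'), None)
            results.append({'text': text, 'relation': relations})
    return results


sample = """SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
"""

print(json.dumps({'result': relation_blocks(sample.splitlines())}, indent=4))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of sample.splitlines().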
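As mentioned in the comments, you cannot store the same key twice in a dictionary. You can read the file, split it at '\n\n' into blocks, split the blocks at '\n' into lines, and split the lines at '|' into data.
You can then put it into a suitable data structure and use the json module to serialize it to a string.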
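To illustrate the point about duplicate keys, here is a small sketch (not part of the original answer). A Python dict keeps only the last value assigned to a key, and json.loads does the same for repeated names in a JSON object, which is why repeated relation values have to be collected into a list.

```python
import json

# Building a dict from pairs with a repeated key: the later value overwrites the earlier one.
pairs = [("text", "Baz"), ("relation", "Bla"), ("relation", "Foo")]
as_dict = dict(pairs)
print(as_dict)

# The json module behaves the same way when decoding repeated names.
decoded = json.loads('{"relation": "Bla", "relation": "Foo"}')
print(decoded)
```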
Create the data file:
with open("f.txt", "w") as f:
    f.write('''SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz''')
Read the data and process it:
with open("f.txt") as f:
    all_text = f.read()

# blocks are separated by an empty line
as_blocks = all_text.split("\n\n")

# skip SE when splitting and keep only blocks containing |relation|
with_relation = [[k.split("|")[1:]
                  for k in b.split("\n")]
                 for b in as_blocks if "|relation|" in b]
print(with_relation)
Create a suitable data structure, grouping multiple identical keys into a list:
result = []
for inner in with_relation:
    result.append({})
    for k, v in inner:
        # add as simple key
        if k not in result[-1]:
            result[-1][k] = v
        # got key 2nd time, turn it into a list
        elif not isinstance(result[-1][k], list):
            result[-1][k] = [result[-1][k], v]
        # got it a 3rd+ time, add to list
        else:
            result[-1][k].append(v)
print(result)
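If the mixed value types (a string for a key seen once, a list for a repeated key) are inconvenient for whoever consumes the JSON, one alternative is to always store lists. A minimal sketch, assuming the same [key, value] pair layout as with_relation; the function name group_pairs is hypothetical and not part of this answer:

```python
def group_pairs(blocks):
    """Collect the values of each key in a block into a list, even for single values."""
    result = []
    for pairs in blocks:
        grouped = {}
        for key, value in pairs:
            # setdefault creates the list the first time a key is seen
            grouped.setdefault(key, []).append(value)
        result.append(grouped)
    return result


blocks = [[['text', 'Baz'], ['entity', 'Bla'], ['relation', 'Bla'], ['relation', 'Foo']],
          [['text', 'Zoo'], ['relation', 'Bla'], ['relation', 'Baz']]]
print(group_pairs(blocks))
```

The trade-off is that text becomes a one-element list, which differs from the output format requested in the question.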
Create JSON from the data structure:
import json

print(json.dumps({"result": result}, indent=4))
Output:
# with_relation
[[['text', 'Baz'], ['entity', 'Bla'], ['relation', 'Bla'], ['relation', 'Foo']],
 [['text', 'Zoo'], ['relation', 'Bla'], ['relation', 'Baz']]]

# result
[{'text': 'Baz', 'entity': 'Bla', 'relation': ['Bla', 'Foo']},
 {'text': 'Zoo', 'relation': ['Bla', 'Baz']}]

# json string
{
    "result": [
        {
            "text": "Baz",
            "entity": "Bla",
            "relation": [
                "Bla",
                "Foo"
            ]
        },
        {
            "text": "Zoo",
            "relation": [
                "Bla",
                "Baz"
            ]
        }
    ]
}
In my opinion, this is a very good use case for a small parser.
This solution uses a PEG parser called parsimonious, but you could just as well use another one:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import json

data = """
SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
"""

class TagVisitor(NodeVisitor):
    grammar = Grammar(r"""
        content = (ws / block)+
        block   = line+
        line    = ~".+" nl?
        nl      = ~"[\n\r]"
        ws      = ~"\s+"
    """)

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_content(self, node, visited_children):
        # keep only blocks that were turned into a dict (i.e. that have relations)
        filtered = [child[0] for child in visited_children if isinstance(child[0], dict)]
        return {"result": filtered}

    def visit_block(self, node, visited_children):
        text, relations = None, []
        for child in visited_children:
            if child[1] == "text" and not text:
                text = child[2].strip()
            elif child[1] == "relation":
                relations.append(child[2])

        if relations:
            return {"text": text, "relation": relations}

    def visit_line(self, node, visited_children):
        tag1, tag2, text = node.text.split("|")
        return tag1, tag2, text.strip()

tv = TagVisitor()
result = tv.parse(data)

print(json.dumps(result))
This yields
{"result":
    [{"text": "Baz", "relation": ["Bla", "Foo"]},
     {"text": "Zoo", "relation": ["Bla", "Baz"]}]
}
The idea is to formulate a grammar, build an abstract syntax tree from it, and return the content of the blocks in a suitable data format.