How to convert from CoNLL format to spaCy format



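I'm currently working on an NER model. I have a bunch of data stored in CoNLL format that I need to convert to spaCy format. In CoNLL, every word in a sentence has a tag next to it. In spaCy, annotations are listed only for the words that actually carry a label, as character offsets into the sentence. How do I convert from the format below (CoNLL)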

From    O
2001    B-DateTime
to  I-DateTime
2004    I-DateTime
,   O
I   O
was O
a   O
stagehand   O
for O
Hartford    B-Company
Stage   I-Company
Company O
.   O

to the format below (spaCy)?

TRAIN_DATA = [('what is the price of polo?', {'entities': [(21, 25, 'PrdName')]}), 
('what is the price of ball?', {'entities': [(21, 25, 'PrdName')]}), 
('what is the price of jegging?', {'entities': [(21, 28, 'PrdName')]}), 
('what is the price of t-shirt?', {'entities': [(21, 28, 'PrdName')]}), 
('what is the price of jeans?', {'entities': [(21, 26, 'PrdName')]}), 
('what is the price of bat?', {'entities': [(21, 24, 'PrdName')]}), 
('what is the price of shirt?', {'entities': [(21, 26, 'PrdName')]}), 
('what is the price of bag?', {'entities': [(21, 24, 'PrdName')]}), 
('what is the price of cup?', {'entities': [(21, 24, 'PrdName')]}), 
('what is the price of jug?', {'entities': [(21, 24, 'PrdName')]}), 
('what is the price of plate?', {'entities': [(21, 26, 'PrdName')]}), 
('what is the price of glass?', {'entities': [(21, 26, 'PrdName')]}),
('what is the price of watch?', {'entities': [(21, 26, 'PrdName')]})]

Just use spacy convert directly.

spacy convert input.conll -c conll ./output/

Note that by default this generates a binary .spacy file. The JSON format is deprecated in v3 and is not much help.
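If the tuple format shown in the question is what you actually need (for example, for older training scripts), the conversion can also be done by hand. The following is only a minimal sketch, not part of spaCy: the function names conll_to_spacy and sentence_to_example are hypothetical, and it assumes two whitespace-separated columns (token, BIO tag), blank lines between sentences, and that tokens may be rejoined with single spaces. The offsets therefore refer to the rejoined text, which has a space before punctuation, not to the original untokenized sentence.

```python
def sentence_to_example(tokens, tags):
    """Build one (text, {'entities': [...]}) tuple from parallel token/tag lists."""
    text = ""
    entities = []
    start, end, label = None, None, None  # currently open entity, if any

    for token, tag in zip(tokens, tags):
        if text:
            text += " "  # assumption: tokens are rejoined with single spaces
        tok_start = len(text)
        text += token
        tok_end = len(text)

        # An I- tag with the same label extends the open entity.
        if tag.startswith("I-") and label == tag[2:]:
            end = tok_end
            continue

        # Anything else closes the open entity first.
        if label is not None:
            entities.append((start, end, label))
            label = None

        # A B- tag (or a stray I- tag) opens a new entity.
        if tag != "O":
            start, end, label = tok_start, tok_end, tag[2:]

    if label is not None:
        entities.append((start, end, label))

    return (text, {"entities": entities})


def conll_to_spacy(lines):
    """Convert an iterable of CoNLL lines into a list of spaCy-style tuples."""
    train_data = []
    tokens, tags = [], []

    for line in lines:
        parts = line.split()
        if not parts:  # a blank line ends the current sentence
            if tokens:
                train_data.append(sentence_to_example(tokens, tags))
                tokens, tags = [], []
            continue
        tokens.append(parts[0])
        tags.append(parts[-1])

    if tokens:  # last sentence when the input has no trailing blank line
        train_data.append(sentence_to_example(tokens, tags))

    return train_data
```

Calling conll_to_spacy(open("input.conll", encoding="utf-8")) on the sample from the question should give one tuple whose entities are (5, 17, 'DateTime') and (42, 56, 'Company').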
