我有一些csv数据需要转换为特定的json格式。我写了一个代码,为一些嵌套的水平工作,但不是按要求
这是我的csv数据:title context answers question id
tit1 con1 text1 que1 id1
tit1 con1 text2 que2 id2
tit2 con2 text3 que3 id3
tit2 con2 text4 que4 id4
tit2 con3 text5 que5 id5
我代码:
df = pd.read_csv('processedOutput.csv')
finalList = []
finalDict = {}
grouped = df.groupby(['context'])
for key, value in grouped:
dictionary = {}
j = grouped.get_group(key).reset_index(drop=True)
dictionary['context'] = j.at[0, 'context']
dictList = []
anotherDict = {}
for i in j.index:
anotherDict['answers'] = j.at[i, 'answers']
anotherDict['question'] = j.at[i, 'question']
anotherDict['id'] = j.at[i, 'id']
dictList.append(anotherDict)
dictionary['qas'] = dictList
finalList.append(dictionary)
import json
data = json.dumps(finalList)
,其输出结构良好,但只取分组项的最后一个元素
[{"context": "con1",
"qas": [
{"answers": "text2", "question": "que2", "id": "id2"},
{"answers": "text2", "question": "que2", "id": "id2"}
]
},
{"context": "con2",
"qas": [
{"answers": "text4", "question": "que4", "id": "id4"},
{"answers": "text4", "question": "que4", "id": "id4"}
]
},
{"context": "con3",
"qas": [
{"answers": "text5", "question": "que5", "id": "id5"}
]
}
]
想让所有字段的数据嵌套一层,如下所示:
[
{
"title": "tit1",
"paragraph": [
{
"context": "con1",
"qas": [
{"answers": "text1","question": "que1","id": "id1"},
{"answers": "text2","question": "que2","id": "id2"}
]}]
},
{
"title": "tit2",
"paragraph": [
{
"context": "con2",
"qas": [
{"answers": "text3","question": "que3","id": "id3"},
{"answers": "text4","question": "que4","id": "id4"}
],
"context": "con3",
"qas": [
{"answers": "text5","question":"que5", "id": "id5"}
]
}
]
}
]
在这个问题上坚持了很长时间,任何建议都会很好
您的输出数据需要3个级别的分组:标题、段落和q&a。我建议使用df.groupby(['title', 'context', 'answers'])
来驱动环路。
然后,在循环中,每个组将由一个字典(假设)组成id
列只包含唯一的值)。为了构建更高层次的结构,你所需要做的只是做一些记录来检测关卡变化,并将其添加到适当的列表和字典中。我们将使用更多的groupby
级别来做到这一点:
...
g1 = df.groupby(['title'])
for k1, v1 in g1:
l2_para_list = []
l4_qas_list = []
g2 = v1.groupby(['context'])
for k2, v2 in g2:
g3 = v2.groupby(['answers'])
for _, v3 in g3:
qas_dict = {}
qas_dict['answers'] = v3.answers.item()
qas_dict['question'] = v3.question.item()
qas_dict['id'] = v3.id.item()
l4_qas_list.append(qas_dict)
l3_para_dict = {}
l3_para_dict['context'] = k2
l3_para_dict['qas'] = l4_qas_list
l4_qas_list = []
l2_para_list.append(l3_para_dict)
l3_para_dict = {}
l1_title_dict = {}
l1_title_dict['title'] = k1
l1_title_dict['paragraph'] = l2_para_list
finalList.append(l1_title_dict)
l1_title_dict = {}
l2_para_list = []
print(json.dumps(finalList))
...
输出(为显示而格式化)
[{"title": "tit1", "paragraph":
[{"context": "con1",
"qas": [{"answers": "text1", "question": "que1", "id": "id1"},
{"answers": "text2", "question": "que2", "id": "id2"}]}]},
{"title": "tit2", "paragraph":
[{"context": "con2",
"qas": [{"answers": "text3", "question": "que3", "id": "id3"},
{"answers": "text4", "question": "que4", "id": "id4"}]},
{"context": "con3",
"qas": [{"answers": "text5", "question": "que5", "id": "id5"}]}]}]