Suppose I have data in a DataFrame with the columns id, title, category, subcategory, and sub-subcategory, which looks like:
_________________________________________________________________
| id | title | cat | subcat | subsubcat |
|____|______________|______________|_____________|______________|
| 1 | My Book | cat1 | subcat1 | subsubcat1 |
| 1 | My Book | cat2 | subcat2 | subsubcat2 |
| 2 | My Other Book| othercat1 | othersubcat1| othersubcat1 |
| 2 | My Other Book| othercat2 | othersubcat2| null |
| 2 | My Other Book| othercat3 | null | null |
|_______________________________________________________________|
I want to turn it into (newline-delimited) JSON with a structure like:
[
{
'id' : '1',
'title' : 'My Book',
'categoryHiearchies': [
{'categories': ['category1', 'subcategory1', 'sub-subcategory1']},
{'categories': ['category2', 'subcategory2', 'sub-subcategory2']}
]
},
{
'id' : '2',
'title' : 'My Other Book',
'categoryHiearchies': [
{'categories': ['othercategory1', 'othersubcategory1', 'othersub-subcategory1']},
{'categories': ['othercategory2', 'othersubcategory2']},
{'categories': ['othercategory3']},
]
}
]
so that it can be uploaded to BigQuery correctly.
Any ideas on how to apply this transformation?
Assuming the null entries are NaN values:
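(df.set_index(['id', 'title'], append=True).stack()  # long format: one row per category level; NaN levels are dropped here
   .groupby(level=[0, 1, 2]).agg(lambda x: {'categories': list(x)})  # one dict per original row
   .groupby(level=[1, 2]).agg(list)  # collect each book's dicts into a list
   .reset_index(name='categoryHiearchies')
   .to_json(orient='records', indent=2)
)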
This gives:
[
{
"id":1,
"title":"My Book",
"categoryHiearchies":[
{
"categories":[
"cat1",
"subcat1",
"subsubcat1"
]
},
{
"categories":[
"cat2",
"subcat2",
"subsubcat2"
]
}
]
},
{
"id":2,
"title":"My Other Book",
"categoryHiearchies":[
{
"categories":[
"othercat1",
"othersubcat1",
"othersubcat1"
]
},
{
"categories":[
"othercat2",
"othersubcat2"
]
},
{
"categories":[
"othercat3"
]
}
]
}
]
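The output above is a single JSON array, whereas the question asks for newline-delimited JSON. Here is a minimal sketch of one way to get one record per line. It builds the same structure with an explicit loop rather than the chained groupby. The sample DataFrame, the column list `cat_cols`, and the helper `row_to_hierarchy` are assumptions made for illustration and are not part of the original answer.

```python
import json

import numpy as np
import pandas as pd

# Sample data mirroring the table in the question; null is assumed to be NaN.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "title": ["My Book", "My Book",
              "My Other Book", "My Other Book", "My Other Book"],
    "cat": ["cat1", "cat2", "othercat1", "othercat2", "othercat3"],
    "subcat": ["subcat1", "subcat2", "othersubcat1", "othersubcat2", np.nan],
    "subsubcat": ["subsubcat1", "subsubcat2", "othersubcat1", np.nan, np.nan],
})

# Hypothetical name: the columns that make up one category hierarchy.
cat_cols = ["cat", "subcat", "subsubcat"]


def row_to_hierarchy(values):
    """Keep only the non-null levels of one row, in column order."""
    return {"categories": [v for v in values if pd.notna(v)]}


records = []
for (book_id, title), group in df.groupby(["id", "title"], sort=False):
    hierarchies = [
        row_to_hierarchy(values)
        for values in group[cat_cols].itertuples(index=False, name=None)
    ]
    # int() converts the NumPy integer so that json.dumps can serialize it.
    records.append({
        "id": int(book_id),
        "title": title,
        "categoryHiearchies": hierarchies,
    })

# One JSON object per line, with no enclosing array.
ndjson = "\n".join(json.dumps(record) for record in records)
print(ndjson)
```

If the chained expression from the answer is preferred, replacing its last call with `.to_json(orient='records', lines=True)` should also write one record per line, since pandas supports `lines=True` for the `records` orientation.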