Converting a Pandas DataFrame to JSON with Struct and Array structures for upload to BigQuery



Suppose I have data from a DataFrame with the columns id, title, category, subcategory, and sub-subcategory, which looks like:

_________________________________________________________________
| id | title        | cat          |  subcat     |    subsubcat |
|____|______________|______________|_____________|______________|
| 1  | My Book      | cat1         |  subcat1    |  subsubcat1  |
| 1  | My Book      | cat2         |  subcat2    |  subsubcat2  |
| 2  | My Other Book| othercat1    | othersubcat1| othersubcat1 |
| 2  | My Other Book| othercat2    | othersubcat2|       null   |
| 2  | My Other Book| othercat3    |     null    |       null   |
|_______________________________________________________________|

I want to turn it into (newline-delimited) JSON structured like this:

[
  {
    'id' : '1',
    'title' : 'My Book',
    'categoryHiearchies': [
      {'categories': ['category1', 'subcategory1', 'sub-subcategory1']},
      {'categories': ['category2', 'subcategory2', 'sub-subcategory2']}
    ]
  },
  {
    'id' : '2',
    'title' : 'My Other Book',
    'categoryHiearchies': [
      {'categories': ['othercategory1', 'othersubcategory1', 'othersub-subcategory1']},
      {'categories': ['othercategory2', 'othersubcategory2']},
      {'categories': ['othercategory3']},
    ]
  }
]

so that it can be uploaded to BigQuery correctly.

Any ideas on how to apply this transformation?

Assuming the null values are NaN:

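# Assumes df holds the table above, with nulls stored as NaN.
(df.set_index(['id', 'title'], append=True)
   .stack()  # long format; relies on stack() dropping NaN (its default dropna=True)
   .groupby(level=[0, 1, 2]).agg(lambda x: {'categories': list(x)})  # one dict per original row
   .groupby(level=[1, 2]).agg(list)  # collect the dicts per (id, title)
   .reset_index(name='categoryHiearchies')
   .to_json(orient='records', indent=2)
)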

This gives:

[
  {
    "id":1,
    "title":"My Book",
    "categoryHiearchies":[
      {
        "categories":[
          "cat1",
          "subcat1",
          "subsubcat1"
        ]
      },
      {
        "categories":[
          "cat2",
          "subcat2",
          "subsubcat2"
        ]
      }
    ]
  },
  {
    "id":2,
    "title":"My Other Book",
    "categoryHiearchies":[
      {
        "categories":[
          "othercat1",
          "othersubcat1",
          "othersubcat1"
        ]
      },
      {
        "categories":[
          "othercat2",
          "othersubcat2"
        ]
      },
      {
        "categories":[
          "othercat3"
        ]
      }
    ]
  }
]
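
The output above is a single JSON array, while the question asks for newline-delimited JSON, which is the layout BigQuery expects when loading JSON files. A minimal sketch of one way to get there with an explicit loop rather than the stack/groupby chain: the sample frame is rebuilt from the question's table, None stands in for the nulls, and variable names such as hierarchies and ndjson are illustrative only.

```python
import json

import pandas as pd

# Sample frame mirroring the table in the question; None marks the nulls.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "title": ["My Book", "My Book",
              "My Other Book", "My Other Book", "My Other Book"],
    "cat": ["cat1", "cat2", "othercat1", "othercat2", "othercat3"],
    "subcat": ["subcat1", "subcat2", "othersubcat1", "othersubcat2", None],
    "subsubcat": ["subsubcat1", "subsubcat2", "othersubcat1", None, None],
})

level_cols = ["cat", "subcat", "subsubcat"]
records = []

# One output record per (id, title); sort=False keeps first-appearance order.
for (book_id, title), group in df.groupby(["id", "title"], sort=False):
    # Each source row becomes one {'categories': [...]} struct, nulls skipped.
    hierarchies = [
        {"categories": [value for value in row if pd.notna(value)]}
        for row in group[level_cols].itertuples(index=False, name=None)
    ]
    records.append({
        "id": int(book_id),  # plain int so json.dumps can serialize it
        "title": title,
        "categoryHiearchies": hierarchies,
    })

# Newline-delimited JSON: one object per line, no enclosing array.
ndjson = "\n".join(json.dumps(record) for record in records)
print(ndjson)
```

If the nested result is already in a DataFrame, as in the chain above, to_json(orient='records', lines=True) is the pandas option that writes one object per line instead of an array.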
