将项分解为行并将字典转换为列是无效的



给出以下样本数据:

data = [
{
"name": "Sampleproject 1",
"tasks": [
{
"order": 1,
"description": "Do something 1",
"status": "Done",
"active": False
},
{
"order": 3,
"description": "Do something 3",
"status": "Open",
"active": False
},
{
"order": 2,
"description": "Do something 2",
"status": "Open",
"active": True
}
]
},
{
"name": "Sampleproject 2",
"tasks": [
{
"order": 1,
"description": "Do something 1",
"status": "Done",
"active": False
},
{
"order": 1,
"description": "Do something 3",
"status": "Open",
"active": True
},
{
"order": 1,
"description": "Do something 2",
"status": "Open",
"active": True
},
{
"order": 2,
"description": "Do something 4",
"status": "Open",
"active": False
}
]
}
]

我需要规范化这些数据,以便将其写入CSV文件。任务中每个任务的单行与.deexplode((配合良好。之后,我有一列任务与任务对象。下一步是使用.apply(pd.Series(将此任务对象转换为单列,但这不起作用。我仍然得到完整的对象作为一列。

我为这个问题找到了许多类似的案例和教程,但对我的示例数据没有任何作用。我很确定这个解决方案很简单,但我看不出错误在哪里?

到目前为止我的代码:

import pandas as pd
df = pd.DataFrame(data)
df = df.explode("tasks")
df["tasks"].apply(pd.Series)
df.to_csv('testfile.csv', index=False)

CSV内容:

name,tasks
Sampleproject 1,"{'order': 1, 'description': 'Do something 1', 'status': 'Done', 'active': False}"
Sampleproject 1,"{'order': 3, 'description': 'Do something 3', 'status': 'Open', 'active': False}"
Sampleproject 1,"{'order': 2, 'description': 'Do something 2', 'status': 'Open', 'active': True}"
Sampleproject 2,"{'order': 1, 'description': 'Do something 1', 'status': 'Done', 'active': False}"
Sampleproject 2,"{'order': 1, 'description': 'Do something 3', 'status': 'Open', 'active': True}"
Sampleproject 2,"{'order': 1, 'description': 'Do something 2', 'status': 'Open', 'active': True}"
Sampleproject 2,"{'order': 2, 'description': 'Do something 4', 'status': 'Open', 'active': False}"

编辑:

要求/预期CSV输出:

name,order,description,status,active
Sampleproject 1,1,'Do something 1','Done',False
Sampleproject 1,3,'Do something 3','Open',False
Sampleproject 1,2,'Do something 2','Open',True
Sampleproject 2,1,'Do something 1','Done',False
Sampleproject 2,1,'Do something 3','Open',True
Sampleproject 2,1,'Do something 2','Open',True
Sampleproject 2,2,'Do something 4','Open',False

您可以使用concat将分解后的数据作为新列绑定到原始数据。

pd.concat([
df.drop(columns='tasks'),
df.tasks.explode().apply(pd.Series)
], axis=1)
#               name  order     description status  active
# 0  Sampleproject 1      1  Do something 1   Done   False
# 0  Sampleproject 1      3  Do something 3   Open   False
# 0  Sampleproject 1      2  Do something 2   Open    True
# 1  Sampleproject 2      1  Do something 1   Done   False
# 1  Sampleproject 2      1  Do something 3   Open    True
# 1  Sampleproject 2      1  Do something 2   Open    True
# 1  Sampleproject 2      2  Do something 4   Open   False

要处理嵌套数据,我将首先对数据进行规范化,然后再将其传递给Pandas,如下例所示:

def normalize(data): 
for elm in data: 
for k in elm.get('tasks', []): 
k.update({'name': elm.get('name')}) 
yield k
df = pd.dataFrame(normalize(data))
print(df)

输出:

order     description status  active             name
0      1  Do something 1   Done   False  Sampleproject 1
1      3  Do something 3   Open   False  Sampleproject 1
2      2  Do something 2   Open    True  Sampleproject 1
3      1  Do something 1   Done   False  Sampleproject 2
4      1  Do something 3   Open    True  Sampleproject 2
5      1  Do something 2   Open    True  Sampleproject 2
6      2  Do something 4   Open   False  Sampleproject 2

比较一下:

In [1]: %timeit pd.DataFrame(normalize(data))                             
661 µs ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [2]: %timeit pd.concat([df.drop(columns='tasks'),df.tasks.explode().apply(pd.Series)], axis=1)
4.53 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

正如您所看到的,在将数据传递给Pandas之前对其进行规范化要比使用concat+explode函数快得多。

最新更新