给出以下样本数据:
data = [
{
"name": "Sampleproject 1",
"tasks": [
{
"order": 1,
"description": "Do something 1",
"status": "Done",
"active": False
},
{
"order": 3,
"description": "Do something 3",
"status": "Open",
"active": False
},
{
"order": 2,
"description": "Do something 2",
"status": "Open",
"active": True
}
]
},
{
"name": "Sampleproject 2",
"tasks": [
{
"order": 1,
"description": "Do something 1",
"status": "Done",
"active": False
},
{
"order": 1,
"description": "Do something 3",
"status": "Open",
"active": True
},
{
"order": 1,
"description": "Do something 2",
"status": "Open",
"active": True
},
{
"order": 2,
"description": "Do something 4",
"status": "Open",
"active": False
}
]
}
]
我需要规范化这些数据,以便将其写入CSV文件。任务中每个任务的单行与.deexplode((配合良好。之后,我有一列任务与任务对象。下一步是使用.apply(pd.Series(将此任务对象转换为单列,但这不起作用。我仍然得到完整的对象作为一列。
我为这个问题找到了许多类似的案例和教程,但对我的示例数据没有任何作用。我很确定这个解决方案很简单,但我看不出错误在哪里?
到目前为止我的代码:
import pandas as pd
df = pd.DataFrame(data)
df = df.explode("tasks")
df["tasks"].apply(pd.Series)
df.to_csv('testfile.csv', index=False)
CSV内容:
name,tasks
Sampleproject 1,"{'order': 1, 'description': 'Do something 1', 'status': 'Done', 'active': False}"
Sampleproject 1,"{'order': 3, 'description': 'Do something 3', 'status': 'Open', 'active': False}"
Sampleproject 1,"{'order': 2, 'description': 'Do something 2', 'status': 'Open', 'active': True}"
Sampleproject 2,"{'order': 1, 'description': 'Do something 1', 'status': 'Done', 'active': False}"
Sampleproject 2,"{'order': 1, 'description': 'Do something 3', 'status': 'Open', 'active': True}"
Sampleproject 2,"{'order': 1, 'description': 'Do something 2', 'status': 'Open', 'active': True}"
Sampleproject 2,"{'order': 2, 'description': 'Do something 4', 'status': 'Open', 'active': False}"
编辑:
要求/预期CSV输出:
name,order,description,status,active
Sampleproject 1,1,'Do something 1','Done',False
Sampleproject 1,3,'Do something 3','Open',False
Sampleproject 1,2,'Do something 2','Open',True
Sampleproject 2,1,'Do something 1','Done',False
Sampleproject 2,1,'Do something 3','Open',True
Sampleproject 2,1,'Do something 2','Open',True
Sampleproject 2,2,'Do something 4','Open',False
您可以使用concat将分解后的数据作为新列绑定到原始数据。
pd.concat([
df.drop(columns='tasks'),
df.tasks.explode().apply(pd.Series)
], axis=1)
# name order description status active
# 0 Sampleproject 1 1 Do something 1 Done False
# 0 Sampleproject 1 3 Do something 3 Open False
# 0 Sampleproject 1 2 Do something 2 Open True
# 1 Sampleproject 2 1 Do something 1 Done False
# 1 Sampleproject 2 1 Do something 3 Open True
# 1 Sampleproject 2 1 Do something 2 Open True
# 1 Sampleproject 2 2 Do something 4 Open False
要处理嵌套数据,我将首先对数据进行规范化,然后再将其传递给Pandas,如下例所示:
def normalize(data):
for elm in data:
for k in elm.get('tasks', []):
k.update({'name': elm.get('name')})
yield k
df = pd.dataFrame(normalize(data))
print(df)
输出:
order description status active name
0 1 Do something 1 Done False Sampleproject 1
1 3 Do something 3 Open False Sampleproject 1
2 2 Do something 2 Open True Sampleproject 1
3 1 Do something 1 Done False Sampleproject 2
4 1 Do something 3 Open True Sampleproject 2
5 1 Do something 2 Open True Sampleproject 2
6 2 Do something 4 Open False Sampleproject 2
比较一下:
In [1]: %timeit pd.DataFrame(normalize(data))
661 µs ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [2]: %timeit pd.concat([df.drop(columns='tasks'),df.tasks.explode().apply(pd.Series)], axis=1)
4.53 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
正如您所看到的,在将数据传递给Pandas之前对其进行规范化要比使用concat
+explode
函数快得多。