我有一个根文件夹
/home/project/data
里面有多个文件夹,csv文件的最终路径:
/home/project/data/2020-12-05/John_Smith/data.csv
/home/project/data/2020-12-05/Robert_White/data.csv
/home/project/data/2020-12-06/John_Smith/data.csv
/home/project/data/2020-12-06/Sam_Walberg/data.csv
/home/project/data/2020-12-06/Garry_Oswald/data.csv
...
我设法创建了一个数据帧,其中包含使用以下代码连接的所有csv文件:
full_path = []
for subdir, dirs, files in os.walk(rootdir):
for file in files:
full_path.append(os.path.join(subdir, file))
dfs = [pd.read_csv(csv_path) for csv_path in full_path]
df = pd.concat(dfs)
结果:
df=
pr_id quantity
0 27 4
1 89 1
2 33 2
3 8 3
4 16 1
...
但我现在很难将相应的日期+名称添加到数据框中,所以它看起来像这样:
df=
pr_id quantity name date
0 27 4 John_Smith 2020-12-05
1 89 1 Robert_White 2020-12-05
2 33 2 John_Smith 2020-12-06
3 8 3 Sam_Walberg 2020-12-06
4 16 1 Garry_Oswald 2020-12-06
...
我该怎么做?
使用pathlib
,您可以执行1&2个目录,得到CCD_ 2和CCD_。由于这涉及到两件事,显式for循环可能比列表理解更可读:
from pathlib import Path
# ...above are the same
dfs = []
for csv_path in full_path:
# generate a `Path` object and get parents
p = Path(csv_path)
parents = p.parents
# get the desired values from "parent" dirs
name = parents[0].name
date = parents[1].name
# read in the CSV as is
frame = pd.read_csv(csv_path)
# assign the `name` and `date` columns
frame["name"] = name
frame["date"] = date
# store in the list
dfs.append(frame)
# lastly concating as you did
df = pd.concat(dfs)
或者等效地,列表理解对应物是:
dfs = [pd.read_csv(csv_path).assign(name=csv_path.parents[0].name,
date=csv_path.parents[1].name)
for csv_path in map(Path, full_path)]
df = pd.concat(dfs)
其中我们使用CCD_ 4将新列放置到每个数据帧。
这取决于你在明确的循环理解还是列表理解之间做出选择。
我会做一些类似。。。
from glob import glob
import pandas as pd
final_df = pd.DataFrame()
csvs = glob('/home/project/data/*/*/data.csv')
for csv in csvs:
date = csv.split('/')[-2]
name = csv.split('/')[-3]
df = pd.read_csv(csv)
df['name'] = name
df['date'] = date
final_df = pd.concat([final_df, df])