如何创建列名称作为csv文件路径一部分的Dataframe



我有一个根文件夹

/home/project/data

里面有多个文件夹,csv文件的最终路径:

/home/project/data/2020-12-05/John_Smith/data.csv
/home/project/data/2020-12-05/Robert_White/data.csv
/home/project/data/2020-12-06/John_Smith/data.csv
/home/project/data/2020-12-06/Sam_Walberg/data.csv
/home/project/data/2020-12-06/Garry_Oswald/data.csv
...

我设法创建了一个数据帧,其中包含使用以下代码连接的所有csv文件:

full_path = []
for subdir, dirs, files in os.walk(rootdir):
for file in files:
full_path.append(os.path.join(subdir, file))
dfs = [pd.read_csv(csv_path) for csv_path in full_path]
df = pd.concat(dfs)

结果:

df=
pr_id   quantity
0   27  4
1   89  1
2   33  2
3   8   3
4   16  1
...

但我现在很难将相应的日期+名称添加到数据框中,所以它看起来像这样:

df=
pr_id quantity name date
0   27  4 John_Smith 2020-12-05
1   89  1 Robert_White 2020-12-05
2   33  2 John_Smith 2020-12-06
3   8   3 Sam_Walberg 2020-12-06
4   16  1 Garry_Oswald 2020-12-06
...

我该怎么做?

使用pathlib,您可以执行1&2个目录,得到CCD_ 2和CCD_。由于这涉及到两件事,显式for循环可能比列表理解更可读:

from pathlib import Path
# ...above are the same
dfs = []
for csv_path in full_path:
# generate a `Path` object and get parents
p = Path(csv_path)
parents = p.parents
# get the desired values from "parent" dirs
name = parents[0].name
date = parents[1].name
# read in the CSV as is
frame = pd.read_csv(csv_path)

# assign the `name` and `date` columns
frame["name"] = name
frame["date"] = date
# store in the list
dfs.append(frame)
# lastly concating as you did
df = pd.concat(dfs)

或者等效地,列表理解对应物是:

dfs = [pd.read_csv(csv_path).assign(name=csv_path.parents[0].name,
date=csv_path.parents[1].name)
for csv_path in map(Path, full_path)]
df = pd.concat(dfs)

其中我们使用CCD_ 4将新列放置到每个数据帧。

这取决于你在明确的循环理解还是列表理解之间做出选择。

我会做一些类似。。。

from glob import glob
import pandas as pd
final_df = pd.DataFrame()
csvs = glob('/home/project/data/*/*/data.csv')
for csv in csvs:
date = csv.split('/')[-2]
name = csv.split('/')[-3]
df = pd.read_csv(csv)
df['name'] = name
df['date'] = date
final_df = pd.concat([final_df, df])