Track a file across a dataframe and put its location in another column with pandas



I have a dataframe like this:

file name           folder 1              folder 2                folder 3            
0     1.txt        2021-05-11 14:04:30    2021-05-11 15:04:30      2021-05-11 16:04:30
1     2.txt        2021-05-12 14:04:30    2021-05-12 15:04:30            NaN
2     3.txt        2021-05-13 14:04:30            NaN                    NaN

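So, I have some files, and each time one moves to another folder I record a timestamp. What I want is to add another column to this dataframe, named "Location of the file": whichever folder holds the file's last timestamp, put that folder's name in the new column. The ideal output looks like this: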

file name         folder 1              folder 2               folder 3        Location of the file
0   1.txt     2021-05-11 14:04:30   2021-05-11 15:04:30    2021-05-11 16:04:30    file is in folder 3
1   2.txt     2021-05-12 14:04:30   2021-05-12 15:04:30          NaN              file is in folder 2         
2   3.txt     2021-05-13 14:04:30          NaN                   NaN              file is in folder 1

Edit:

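Hey @Andrej, in my real scenario I have some columns between file name and folder 1, like the one I put there to tell the file type. If I use your solution, the location of file 4.xlsx is printed as in the following dataframe: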

file name  filetype   .. folder 1              ..  folder 3           Loc. of file
0   1.txt      txt      .. 2021-05-11 14:04:30   ..  2021-05-11 16:04:30 file is in folder 3
1   2.txt      txt      .. 2021-05-12 14:04:30   ..     NaN              file is in folder 2
2   3.txt      txt      .. 2021-05-13 14:04:30   ..     NaN              file is in folder 3
3   4.xlsx     xlsx     ..   NaN                 ..     NaN              file is in filetype

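Can I somehow get the location of the file from only the columns "folder 1", "folder 2" and "folder 3"? And if there is no timestamp in folder 1, leave NaN in the "Location of the file" column, as shown in the df below: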

file name  filetype   .. folder 1              ..  folder 3           Loc. of file
0   1.txt      txt      .. 2021-05-11 14:04:30   ..  2021-05-11 16:04:30 file is in folder 3
1   2.txt      txt      .. 2021-05-12 14:04:30   ..     NaN              file is in folder 2
2   3.txt      txt      .. 2021-05-13 14:04:30   ..     NaN              file is in folder 3
3   4.xlsx     xlsx     ..   NaN                 ..     NaN                NaN

# For each row, take the label of the last non-NaN column as the current folder.
df["Location of the file"] = df.apply(
    lambda x: "File is in {}".format(x.index[x.notna()][-1]),
    axis=1,
)
print(df)

Prints:

file name             folder 1             folder 2             folder 3 Location of the file
0     1.txt  2021-05-11 14:04:30  2021-05-11 15:04:30  2021-05-11 16:04:30  File is in folder 3
1     2.txt  2021-05-12 14:04:30  2021-05-12 15:04:30                  NaN  File is in folder 2
2     3.txt  2021-05-13 14:04:30                  NaN                  NaN  File is in folder 1

Edit: To handle rows where all values are NaN:

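import numpy as np

# Look only at the columns from "folder 1" onward; rows with no timestamp get NaN.
df["Location of the file"] = df.loc[:, "folder 1":].apply(
    lambda x: "File is in {}".format(x.index[x.notna()][-1])
    if not x.isna().all()
    else np.nan,
    axis=1,
)
print(df)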

Prints:

file name             folder 1             folder 2             folder 3 Location of the file
0     1.txt  2021-05-11 14:04:30  2021-05-11 15:04:30  2021-05-11 16:04:30  File is in folder 3
1     2.txt  2021-05-12 14:04:30  2021-05-12 15:04:30                  NaN  File is in folder 2
2     3.txt                  NaN                  NaN                  NaN                  NaN
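
The slice df.loc[:, "folder 1":] takes every column from folder 1 to the right, so it presumes nothing unrelated comes after it. For the layout in the question's edit, one alternative is to name the folder columns explicitly. This is only a minimal sketch: it assumes the columns are literally called folder 1, folder 2 and folder 3 and are listed in the order files move through them. The names folder_cols and last_folder are made up for illustration, and the sample frame is rebuilt from the question's data.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question, including the extra "filetype" column.
df = pd.DataFrame(
    {
        "file name": ["1.txt", "2.txt", "3.txt", "4.xlsx"],
        "filetype": ["txt", "txt", "txt", "xlsx"],
        "folder 1": ["2021-05-11 14:04:30", "2021-05-12 14:04:30", "2021-05-13 14:04:30", np.nan],
        "folder 2": ["2021-05-11 15:04:30", "2021-05-12 15:04:30", np.nan, np.nan],
        "folder 3": ["2021-05-11 16:04:30", np.nan, np.nan, np.nan],
    }
)

# Assumption: these are the only columns that hold folder timestamps.
folder_cols = ["folder 1", "folder 2", "folder 3"]


def last_folder(row):
    """Return a label for the last folder column that has a value, else NaN."""
    filled = row.dropna().index
    return "File is in {}".format(filled[-1]) if len(filled) else np.nan


df["Location of the file"] = df[folder_cols].apply(last_folder, axis=1)
print(df[["file name", "Location of the file"]])
```

Because only folder_cols is passed to apply, columns such as filetype can never end up in the result, wherever they sit in the frame.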

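You can use .idxmax() along axis=1 (the column index) to get the column label of the element with the largest datetime value in each row, then concatenate it with the text "file is in " to build the new column, as follows: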

df['Location of the file'] = 'file is in ' + df.iloc[:, 1:].idxmax(axis=1)

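Note that this solution does not assume that folder 1, folder 2 and folder 3 are filled in chronological order (folder 3 later than folder 2 and folder 1). As long as the 3 columns contain datetime values, they can be compared to find the latest time.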

Result:

print(df)
file name            folder 1            folder 2            folder 3 Location of the file
0     1.txt 2021-05-11 14:04:30 2021-05-11 15:04:30 2021-05-11 16:04:30  file is in folder 3
1     2.txt 2021-05-12 14:04:30 2021-05-12 15:04:30                 NaT  file is in folder 2
2     3.txt 2021-05-13 14:04:30                 NaT                 NaT  file is in folder 1
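
The .idxmax() approach can also be adapted to the frame from the question's edit, which has an extra filetype column and a file with no timestamps at all. What follows is a sketch under assumptions, not a definitive version: the folder columns are selected by name instead of df.iloc[:, 1:], the strings are converted with pd.to_datetime, and rows without any timestamp are skipped, because depending on the pandas version .idxmax() may warn or raise on a row that has no valid values. The names folder_cols, stamps, has_stamp and location are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question, including a file with no timestamps.
df = pd.DataFrame(
    {
        "file name": ["1.txt", "2.txt", "3.txt", "4.xlsx"],
        "filetype": ["txt", "txt", "txt", "xlsx"],
        "folder 1": ["2021-05-11 14:04:30", "2021-05-12 14:04:30", "2021-05-13 14:04:30", np.nan],
        "folder 2": ["2021-05-11 15:04:30", "2021-05-12 15:04:30", np.nan, np.nan],
        "folder 3": ["2021-05-11 16:04:30", np.nan, np.nan, np.nan],
    }
)

# Assumption: these are the only columns that hold folder timestamps.
folder_cols = ["folder 1", "folder 2", "folder 3"]

# Convert each folder column to datetime64 so the values compare as times.
stamps = df[folder_cols].apply(pd.to_datetime)

# True for rows that have at least one timestamp.
has_stamp = stamps.notna().any(axis=1)

# Start with NaN everywhere, then fill only the rows that have a timestamp.
location = pd.Series(np.nan, index=df.index, dtype=object)
location.loc[has_stamp] = "file is in " + stamps[has_stamp].idxmax(axis=1)

df["Location of the file"] = location
print(df[["file name", "Location of the file"]])
```

As in the answer above, the folder with the latest timestamp wins, whatever the column order.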
