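This is a simplified example of my problem. I have a DataFrame containing file names and modification dates. For each file name, I need to set a flag that marks the latest file: latest = 1, not latest = 0.
So far I have the following code: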
import pandas as pd
df = pd.DataFrame({
'FileName' : ['file1.txt', 'file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file3.txt'],
'FileModDate' : ['2022-02-22 10:28:18', '2022-02-22 11:28:18', '2022-02-22 12:28:18', '2022-02-22 14:28:18', '2022-02-22 08:28:18', '2022-02-22 15:28:18'],
'DataDate' : ['2024-02-22 10:28:18', '2021-02-22 11:28:18', '2021-02-22 12:28:18', '2021-02-22 14:28:18', '2021-02-22 08:28:18', '2021-02-22 15:28:18']
})
df=df.sort_values('FileModDate',ascending=False)
print (df)
grouped=df.groupby('FileName').first()
grouped['LatestFile']=1
print (grouped)
The result is:
FileModDate LatestFile
FileName
file1.txt 2022-02-22 14:28:18 1
file2.txt 2022-02-22 11:28:18 1
file3.txt 2022-02-22 15:28:18 1
file4.txt 2022-02-22 08:28:18 1
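I don't know whether this is the right approach. How can I get the rows that are not in groupby.first() back into this DataFrame?
So the result should look like this: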
FileModDate LatestFile
FileName
file1.txt 2022-02-22 14:28:18 1
file2.txt 2022-02-22 11:28:18 1
file3.txt 2022-02-22 15:28:18 1
file4.txt 2022-02-22 08:28:18 1
file3.txt 2022-02-22 12:28:18 0
file1.txt 2022-02-22 10:28:18 0
Best,
f.t.
Edit:
Based on the first flag, I need a second flag:
DataDateFlag should be 1 only where LatestFile = 1, so the output is:
FileName FileModDate DataDate LatestFile DataDateFlag
0 file1.txt 2022-02-22 10:28:18 2024-02-22 10:28:18 0 0
1 file2.txt 2022-02-22 11:28:18 2021-02-22 11:28:18 1 1
2 file3.txt 2022-02-22 12:28:18 2021-02-22 12:28:18 0 0
3 file1.txt 2022-02-22 14:28:18 2021-02-22 14:28:18 1 1
4 file4.txt 2022-02-22 08:28:18 2021-02-22 08:28:18 1 1
5 file3.txt 2022-02-22 15:28:18 2021-02-22 15:28:18 1 1
I tried this:
df["DataDateFlag"] = (
df
.groupby("FileName")["DataDate"]
.transform("max")
.eq(df["DataDate"])
.astype(int)
.filter(df["LatestFile"]==1)
)
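You can transform each group to its maximum date. This saves you from having to sort the DataFrame, and it lets you compare the result directly against the actual dates: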
import pandas as pd
df = pd.DataFrame({
'FileName' : ['file1.txt', 'file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file3.txt'],
'FileModDate' : ['2022-02-22 10:28:18', '2022-02-22 11:28:18', '2022-02-22 12:28:18', '2022-02-22 14:28:18', '2022-02-22 08:28:18', '2022-02-22 15:28:18']
})
df["LatestFile"] = (
df
.groupby("FileName")["FileModDate"]
.transform("max")
.eq(df["FileModDate"])
.astype(int)
)
Output (original order):
FileName FileModDate LatestFile
0 file1.txt 2022-02-22 10:28:18 0
1 file2.txt 2022-02-22 11:28:18 1
2 file3.txt 2022-02-22 12:28:18 0
3 file1.txt 2022-02-22 14:28:18 1
4 file4.txt 2022-02-22 08:28:18 1
5 file3.txt 2022-02-22 15:28:18 1
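The same transform pattern can probably be extended to the DataDateFlag from the question's edit. The sketch below is one reading of that requirement, not a confirmed solution: it assumes the flag should mark the maximum DataDate per FileName among only the rows where LatestFile is 1, and be 0 everywhere else. Series.filter selects by index labels rather than by a boolean condition, so the rows are selected with a boolean mask first. The names latest and flag are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'FileName': ['file1.txt', 'file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file3.txt'],
    'FileModDate': ['2022-02-22 10:28:18', '2022-02-22 11:28:18', '2022-02-22 12:28:18',
                    '2022-02-22 14:28:18', '2022-02-22 08:28:18', '2022-02-22 15:28:18'],
    'DataDate': ['2024-02-22 10:28:18', '2021-02-22 11:28:18', '2021-02-22 12:28:18',
                 '2021-02-22 14:28:18', '2021-02-22 08:28:18', '2021-02-22 15:28:18'],
})

# First flag: 1 where the row holds the newest FileModDate of its FileName.
df["LatestFile"] = (
    df.groupby("FileName")["FileModDate"]
    .transform("max")
    .eq(df["FileModDate"])
    .astype(int)
)

# Assumption: only rows already flagged as latest can receive the second flag.
latest = df[df["LatestFile"].eq(1)]

# Within those rows, mark the maximum DataDate per FileName.
flag = (
    latest.groupby("FileName")["DataDate"]
    .transform("max")
    .eq(latest["DataDate"])
)

# Rows that were filtered out are missing from `flag`; fill them with False.
df["DataDateFlag"] = flag.reindex(df.index, fill_value=False).astype(int)

print(df)
```

With the sample data this gives DataDateFlag values of 0, 1, 0, 1, 1, 1 in the original row order, which matches the output requested in the edit. The dates are compared as strings here, which only works because they are in a sortable ISO-style format; converting them with pd.to_datetime first would be the safer choice for other formats.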
You can use booleans instead of 0/1:
df['LatestFile'] = df['FileModDate'] == df.groupby('FileName')['FileModDate'].transform('max')
Output:
FileName FileModDate LatestFile
5 file3.txt 2022-02-22 15:28:18 True
3 file1.txt 2022-02-22 14:28:18 True
2 file3.txt 2022-02-22 12:28:18 False
1 file2.txt 2022-02-22 11:28:18 True
0 file1.txt 2022-02-22 10:28:18 False
4 file4.txt 2022-02-22 08:28:18 True
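One practical reason to prefer the boolean form is that the column can be used directly as a mask. The following is a small sketch of that idea, assuming the same sample data as the question; latest_only is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'FileName': ['file1.txt', 'file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file3.txt'],
    'FileModDate': ['2022-02-22 10:28:18', '2022-02-22 11:28:18', '2022-02-22 12:28:18',
                    '2022-02-22 14:28:18', '2022-02-22 08:28:18', '2022-02-22 15:28:18'],
})

# Boolean flag: True where the row holds the newest FileModDate of its FileName.
df['LatestFile'] = df['FileModDate'] == df.groupby('FileName')['FileModDate'].transform('max')

# A boolean column works as a row mask without any further comparison.
latest_only = df[df['LatestFile']]

# Converting to 0/1 later is still a one-liner if the integer form is needed.
as_int = df['LatestFile'].astype(int)

print(latest_only)
```

If two rows of the same file share the maximum date, both are flagged True with this comparison.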
If your data is already sorted by date (newest first), you can enumerate the items of each group with groupby.cumcount and select the first one (0):
df['LatestFile'] = df.groupby('FileName').cumcount().eq(0).astype(int)
Output:
FileName FileModDate LatestFile
5 file3.txt 2022-02-22 15:28:18 1
3 file1.txt 2022-02-22 14:28:18 1
2 file3.txt 2022-02-22 12:28:18 0
1 file2.txt 2022-02-22 11:28:18 1
0 file1.txt 2022-02-22 10:28:18 0
4 file4.txt 2022-02-22 08:28:18 1
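The cumcount approach depends on the sort order: on unsorted data it flags whichever row of each file happens to appear first. As a hedged sketch, the sort can be done on a temporary copy so that the original row order of df is kept; the assignment aligns on the index. The name ordered is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'FileName': ['file1.txt', 'file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file3.txt'],
    'FileModDate': ['2022-02-22 10:28:18', '2022-02-22 11:28:18', '2022-02-22 12:28:18',
                    '2022-02-22 14:28:18', '2022-02-22 08:28:18', '2022-02-22 15:28:18'],
})

# Sort a temporary view newest-first; df itself keeps its original order.
ordered = df.sort_values('FileModDate', ascending=False)

# cumcount numbers the rows within each FileName group starting at 0,
# so 0 identifies the newest row of each file in the sorted view.
# Assigning the result back to df aligns the values on the index labels.
df['LatestFile'] = ordered.groupby('FileName').cumcount().eq(0).astype(int)

print(df)
```

Unlike the comparison against the group maximum, this flags exactly one row per file even when two rows share the same date.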