我有以下数据帧:
ID Job Start Date
1 Driver 1951-01-01
1 Cleaner 2000-05-01
1 Staff 1951-01-01
2 Staff 2000-05-02
2 Staff2 2000-05-01
2 Cleaner 2000-04-01
5 Driver 1951-01-01
我需要创建一个名为";初级"其具有值"0";是";或";否";取决于的开始日期
对于每个ID,将列设置为";是";对于最早的";开始日期";如果有2行与相同的"ID"绑定;开始日期";选择一个设置为"0";是";(可能是第一个出现的(
所有其他行设置为"0";否";对于ID
在这种情况下,每个ID都有一行,其中有一个";是";(仅1个单行(和所有其它行为"1";否";产生这个数据帧:
ID Job Start Date Primary?
1 Driver 1951-01-01 Yes
1 Cleaner 2000-05-01 No
1 Staff 1951-01-01 No
2 Staff 2000-05-02 No
2 Staff2 2000-05-01 Yes
2 Cleaner 2000-04-01 Yes
5 Driver 1951-01-01 Yes
最好的方法是什么?
# set the primary start-date row as True/False, when its a min for the ID
df['Primary']= df['Start_Date'].eq(df.groupby(['ID'])['Start_Date'].transform(min))
# identify the duplicates start-dates
df.loc[df.duplicated(subset=['ID','Primary'], keep='first'),'Primary' ] = False
df
# Map True/False to Yes/No
df['Primary']=df['Primary'].map({True: 'Yes', False: 'No'})
df
+----+-----+----------+-------------+---------+
| | ID | Job | Start_Date | Primary |
+----+-----+----------+-------------+---------+
| 0 | 1 | Driver | 1951-01-01 | Yes |
| 1 | 1 | Cleaner | 2000-05-01 | No |
| 2 | 1 | Staff | 1951-01-01 | No |
| 3 | 2 | Staff | 2000-05-02 | No |
| 4 | 2 | Staff2 | 2000-05-01 | No |
| 5 | 2 | Cleaner | 2000-04-01 | Yes |
| 6 | 5 | Driver | 1951-01-01 | Yes |
+----+-----+----------+-------------+---------+
您可以使用transform
获取每个id出现的第一个日期,然后使用np.where
:
df['Primary'] = np.where(df['Start Date'] == df.groupby('ID')['Start Date'].
transform('first'), 'Yes', 'No')
或者,如果要将最短日期转换为Yes
,则可以设置transform('min')
。
此解决方案可处理以下工作:
(df
.assign(primary=lambda x: x.groupby("ID")["Start"].transform("min"))
.assign(keep=lambda x: x.groupby("ID")["Start"].transform(lambda x: x.duplicated("first")))
.assign(primary=lambda x: np.select([x.Start == x.primary],
["Yes"],
default="No")
)
.assign(primary=lambda x: np.select([(x.primary == "Yes") & ~(x.keep)],
["Yes"],
default="No")
)
)