Existing dataframe:
Unique_Id sms_sent_date
A 11-01-2022
A 11-01-2022
A 12-01-2022
A 13-01-2022
B 02-02-2022
B 03-02-2022
B 06-02-2022
B 06-02-2022
Expected dataframe:
Unique_Id sms_sent_date unique_sms_counter
A 11-01-2022 1
A 11-01-2022 1
A 12-01-2022 2
A 13-01-2022 3
B 02-02-2022 1
B 03-02-2022 2
B 06-02-2022 3
B 06-02-2022 3
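I am trying to add a unique_sms_counter column that counts the unique SMS dates sent to each Unique_Id. I tried df.groupby(['Unique_Id','sms_sent_date']).cumcount(), but I am stuck on applying it to the whole dataframe.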
You can use:
df['unique_sms_counter'] = (
    # get non duplicates
    (~df.duplicated(['Unique_Id', 'sms_sent_date']))
    # increment per group
    .groupby(df['Unique_Id']).cumsum()
)
Output:
Unique_Id sms_sent_date unique_sms_counter
0 A 11-01-2022 1
1 A 11-01-2022 1
2 A 12-01-2022 2
3 A 13-01-2022 3
4 B 02-02-2022 1
5 B 03-02-2022 2
6 B 06-02-2022 3
7 B 06-02-2022 3
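To try this end to end, here is a minimal self-contained sketch that rebuilds the sample frame from the question and applies the same idea. The way `df` is constructed and the `is_first` helper name are my own assumptions for illustration; the dates are left as plain strings. The explicit `astype(int)` is only there to make the integer result obvious.

```python
import pandas as pd

# Sample data from the question (construction is an assumption for illustration)
df = pd.DataFrame({
    "Unique_Id": list("AAAABBBB"),
    "sms_sent_date": [
        "11-01-2022", "11-01-2022", "12-01-2022", "13-01-2022",
        "02-02-2022", "03-02-2022", "06-02-2022", "06-02-2022",
    ],
})

# True on the first row of each (Unique_Id, sms_sent_date) pair, False on repeats
is_first = ~df.duplicated(["Unique_Id", "sms_sent_date"])

# Running count of first occurrences, restarted for every Unique_Id
df["unique_sms_counter"] = is_first.astype(int).groupby(df["Unique_Id"]).cumsum()

print(df)
```

Note that this counts the distinct dates seen so far within each Unique_Id, so it matches the expected numbering when rows with the same date sit next to each other, as they do in the sample.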
You can use factorize on the groupby:
df['unique_sms_counter'] = df.groupby(['Unique_Id'])['sms_sent_date'].transform(lambda x: x.factorize()[0]+1)
Alternatively, if your dates are an actual datetime type, you can use rank:
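# parse the day-first date strings into real datetimes
df['sms_sent_date'] = pd.to_datetime(df['sms_sent_date'], dayfirst=True)
# dense rank of the dates within each Unique_Id, assigned to the new column
df['unique_sms_counter'] = df.groupby('Unique_Id')['sms_sent_date'].rank(method='dense').astype(int)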
Output:
Unique_Id sms_sent_date unique_sms_counter
0 A 11-01-2022 1
1 A 11-01-2022 1
2 A 12-01-2022 2
3 A 13-01-2022 3
4 B 02-02-2022 1
5 B 03-02-2022 2
6 B 06-02-2022 3
7 B 06-02-2022 3
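One difference worth knowing: factorize numbers the dates in order of first appearance, while a dense rank numbers them in chronological order. On the sorted sample data both give the same result. The sketch below uses a hypothetical unsorted frame (not from the question) to show where they diverge; the names `by_appearance` and `by_date` are mine.

```python
import pandas as pd

# Hypothetical unsorted data for a single Unique_Id (an assumption for illustration)
df = pd.DataFrame({
    "Unique_Id": ["A", "A", "A", "A"],
    "sms_sent_date": ["13-01-2022", "11-01-2022", "13-01-2022", "12-01-2022"],
})

# factorize: codes follow the order in which each date first appears
by_appearance = df.groupby("Unique_Id")["sms_sent_date"].transform(
    lambda x: x.factorize()[0] + 1
)

# dense rank: codes follow chronological order, so parse the dates first
dates = pd.to_datetime(df["sms_sent_date"], dayfirst=True)
by_date = dates.groupby(df["Unique_Id"]).rank(method="dense").astype(int)

print(by_appearance.tolist())  # [1, 2, 1, 3]
print(by_date.tolist())        # [3, 1, 3, 2]
```

If the counter should follow the calendar regardless of row order, rank is probably the safer choice; if it should follow the order in which the rows arrive, factorize fits better.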