I have the following DataFrame, my_df:
name timestamp color
---------------------------
John 2017-01-01 blue
John 2017-01-02 blue
John 2017-01-03 blue
John 2017-01-04 yellow
John 2017-01-05 red
John 2017-01-06 red
Ann 2017-01-04 green
Ann 2017-01-05 orange
Ann 2017-01-06 orange
Ann 2017-01-07 red
Ann 2017-01-08 black
Dan 2017-01-11 blue
Dan 2017-01-12 blue
Dan 2017-01-13 green
Dan 2017-01-14 yellow
I then use the following code to collect each person's sequence of colors:
# Named aggregation; the dict-renaming form of .agg() was removed in pandas 1.0
new_df = my_df.groupby('name', as_index=False).agg(color_list=('color', list))
new_df then looks like this:
name color_list
-----------------------------------------------
John blue, blue, blue, yellow, red, red
Ann green, orange, orange, red, black
Dan blue, blue, green, yellow
However, how should I modify the code above if I want to create a color_seq (the colors with no consecutive duplicates) instead of a color_list? Thanks!
name color_seq
-----------------------------------------------
John blue, yellow, red
Ann green, orange, red, black
Dan blue, green, yellow
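If non-consecutive repeats are allowed (for example blue, red, blue), you cannot simply take the unique values; you have to filter out only the adjacent duplicates. One approach: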
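def drop_consecutive(colors):
    # Pad with a None sentinel so the last item always has a successor to compare with.
    # A copy is padded, so the caller's list is not mutated.
    padded = colors + [None]
    # Keep an item only when it differs from the one that follows it.
    return ','.join(x for i, x in enumerate(colors) if x != padded[i + 1])

out = my_df.groupby('name')['color'].apply(list).apply(drop_consecutive)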
name
Ann green,orange,red,black
Dan blue,green,yellow
John blue,yellow,red
Name: color, dtype: object
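
As an alternative sketch of the same filtering step, the standard library's itertools.groupby can collapse each run of adjacent duplicates without a sentinel. The helper name collapse_runs and the rows list are illustrative assumptions, not part of the original code; the data is copied from the question and is assumed to be already sorted by timestamp within each name.

```python
from itertools import groupby

def collapse_runs(colors):
    """Return colors with each run of consecutive duplicates reduced to one item."""
    # itertools.groupby starts a new group whenever the value changes,
    # so taking each group's key keeps exactly one entry per run.
    return [key for key, _ in groupby(colors)]

# Sample rows copied from the question (hypothetical stand-in for my_df),
# already ordered by timestamp within each name.
rows = [
    ("John", "blue"), ("John", "blue"), ("John", "blue"),
    ("John", "yellow"), ("John", "red"), ("John", "red"),
    ("Ann", "green"), ("Ann", "orange"), ("Ann", "orange"),
    ("Ann", "red"), ("Ann", "black"),
    ("Dan", "blue"), ("Dan", "blue"), ("Dan", "green"), ("Dan", "yellow"),
]

# Collect each person's colors in order of appearance.
colors_by_name = {}
for name, color in rows:
    colors_by_name.setdefault(name, []).append(color)

# Build the comma-separated color_seq string per person.
color_seq = {name: ", ".join(collapse_runs(colors))
             for name, colors in colors_by_name.items()}
print(color_seq)
```

With the DataFrame itself, the same helper could probably be applied per group, for instance my_df.groupby('name')['color'].apply(lambda s: ', '.join(collapse_runs(s))). Because only adjacent values are compared, a color that reappears later (blue, red, blue) is kept both times.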