I have a DataFrame with a column of mixed data types. I applied pd.to_datetime(df['DATE'],coerce=True) and got the following DataFrame:
CUSTOMER_name DATE
abc NaT
def NaT
abc 2010-04-15 19:09:08
def 2011-01-25 15:29:37
abc 2010-04-10 12:29:02
Now I want to apply some aggregation functions (here, I want to group by mailid and take the min() of DATE to find the date of each mailid's first transaction).
df['DATE'] = [x.date() for x in df['DATE']]
#Here the value goes to
CUSTOMER_name DATE
abc 0001-255-255 ####how??
def 0001-255-255 ###How??
abc 2010-04-15
def 2011-01-25
abc 2010-04-10
#Then when I do a groupby and apply min on DATE
df.groupby('CUSTOMER_name')['DATE'].min()
#CUSTOMER_name DATE
abc 0001-255-255 ####i want 2010-04-10
def 0001-255-255 ### i want 2011-01-25
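So, can anyone suggest how to handle this NaT when converting to date() and when doing the groupby and min(), that is, how to exclude NaT from the calculation?
If a customer_name has only NaT in the DATE field, then a NaN or null value is fine as the result of the groupby and min().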
Suppose you start with the following:
df = pd.DataFrame({
'CUSTOMER_name': ['abc', 'def', 'abc', 'def', 'abc', 'fff'],
'DATE': ['NaT', 'NaT', '2010-04-15 19:09:08', '2011-01-25 15:29:37', '2010-04-10 12:29:02', 'NaT']})
df.DATE = pd.to_datetime(df.DATE)
(Note that the only difference is the addition of fff, which maps to NaT.)
Then the following does what you ask:
>>> pd.to_datetime(df.DATE.groupby(df.CUSTOMER_name).min())
CUSTOMER_name
abc 2010-04-10 12:29:02
def 2011-01-25 15:29:37
fff NaT
Name: DATE, dtype: datetime64[ns]
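This works because groupby-min already excludes missing data where applicable (albeit changing the format of the result), and the final pd.to_datetime coerces the result back to datetime.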
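As a minimal sketch of the NaT-skipping behavior described above, the following rebuilds the same frame and checks the per-customer minimum. It assumes a reasonably recent pandas version; the variable name first_seen is illustrative, and the missing values are written as None rather than the string 'NaT' purely to keep parsing unambiguous.

```python
import pandas as pd

# Same data as above; None stands in for the missing dates (an assumption
# made for this sketch, the original frame used the string 'NaT').
df = pd.DataFrame({
    "CUSTOMER_name": ["abc", "def", "abc", "def", "abc", "fff"],
    "DATE": [None, None, "2010-04-15 19:09:08", "2011-01-25 15:29:37",
             "2010-04-10 12:29:02", None],
})
df["DATE"] = pd.to_datetime(df["DATE"])

# min() ignores NaT within each group; a group holding only NaT yields NaT.
first_seen = df.groupby("CUSTOMER_name")["DATE"].min()
print(first_seen)
```

On versions where the grouped result already has a datetime dtype, the outer pd.to_datetime call is probably a harmless no-op.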
To get the date part of the result (which I think is a separate question), use .dt.date:
>>> pd.to_datetime(df.DATE.groupby(df.CUSTOMER_name).min()).dt.date
Out[19]:
CUSTOMER_name
abc 2010-04-10
def 2011-01-25
fff NaN
Name: DATE, dtype: object
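The sketch below illustrates the same idea end to end: aggregate first while the column is still a datetime dtype, then take .dt.date on the small result, instead of calling x.date() on every row before grouping as in the question. The names first_seen and first_dates are illustrative, and how the missing entry is displayed (NaN or NaT) may differ between pandas versions.

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    "CUSTOMER_name": ["abc", "def", "abc", "def", "abc", "fff"],
    "DATE": [None, None, "2010-04-15 19:09:08", "2011-01-25 15:29:37",
             "2010-04-10 12:29:02", None],
})
df["DATE"] = pd.to_datetime(df["DATE"])

# Aggregate on the datetime column, where NaT is skipped ...
first_seen = df.groupby("CUSTOMER_name")["DATE"].min()

# ... and only then drop the time part. The result holds datetime.date
# objects (object dtype), with a missing marker for the all-NaT group.
first_dates = first_seen.dt.date
print(first_dates)
print(isinstance(first_dates["abc"], datetime.date))
```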
Here is an alternative solution:
Data:
In [96]: x
Out[96]:
CUSTOMER_name DATE
0 abc T
1 def N
2 abc 2010-04-15 19:09:08
3 def 2011-01-25 15:29:37
4 abc 2010-04-10 12:29:02
5 fff sa
Solution:
In [100]: (x.assign(D=pd.to_datetime(x.DATE, errors='coerce').values.astype('<M8[D]'))
.....: .groupby('CUSTOMER_name')['D']
.....: .min()
.....: .astype('datetime64[ns]')
.....: )
Out[100]:
CUSTOMER_name
abc 2010-04-10
def 2011-01-25
fff NaT
Name: D, dtype: datetime64[ns]
Explanation:
First, let's create a new virtual column D with the time part truncated:
In [97]: x.assign(D=pd.to_datetime(x.DATE, errors='coerce').values.astype('<M8[D]'))
Out[97]:
CUSTOMER_name DATE D
0 abc T NaT
1 def N NaT
2 abc 2010-04-15 19:09:08 2010-04-15
3 def 2011-01-25 15:29:37 2011-01-25
4 abc 2010-04-10 12:29:02 2010-04-10
5 fff sa NaT
Now we can group by CUSTOMER_name and compute the minimum D for each group:
In [101]: x.assign(D=pd.to_datetime(x.DATE, errors='coerce').values.astype('<M8[D]')).groupby('CUSTOMER_name')['D'].min()
Out[101]:
CUSTOMER_name
abc 1.270858e+18
def 1.295914e+18
fff NaN
Name: D, dtype: float64
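The float64 values in this output appear to be nanoseconds since the Unix epoch, which is why the final cast recovers the dates. A small sketch of that arithmetic follows; the names ns_abc and ns_def are illustrative, and the float result itself is likely specific to the pandas version this output came from.

```python
import pandas as pd

# Seconds since 1970-01-01 for midnight on each date, scaled to nanoseconds.
ns_abc = 1_270_857_600 * 10**9  # 2010-04-10, shown above as 1.270858e+18
ns_def = 1_295_913_600 * 10**9  # 2011-01-25, shown above as 1.295914e+18

# Interpreting the integers as nanosecond timestamps gives the dates back.
print(pd.to_datetime(ns_abc, unit="ns"))
print(pd.to_datetime(ns_def, unit="ns"))
```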
And finally convert the resulting column to the datetime64[ns] dtype:
In [102]: (x.assign(D=pd.to_datetime(x.DATE, errors='coerce').values.astype('<M8[D]'))
.....: .groupby('CUSTOMER_name')['D']
.....: .min()
.....: .astype('datetime64[ns]')
.....: )
Out[102]:
CUSTOMER_name
abc 2010-04-10
def 2011-01-25
fff NaT
Name: D, dtype: datetime64[ns]
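For comparison, here is a hedged variant of the same pipeline that swaps the NumPy cast for Series.dt.normalize(), which sets the time part to midnight and keeps a datetime dtype throughout, so the float round trip should not arise. This is a different technique from the one above, not a restatement of it. The explicit format string is an assumption about how the valid rows are written, and parsed and first_day are illustrative names.

```python
import pandas as pd

x = pd.DataFrame({
    "CUSTOMER_name": ["abc", "def", "abc", "def", "abc", "fff"],
    "DATE": ["T", "N", "2010-04-15 19:09:08", "2011-01-25 15:29:37",
             "2010-04-10 12:29:02", "sa"],
})

# Anything that does not match the assumed format becomes NaT.
parsed = pd.to_datetime(x["DATE"], format="%Y-%m-%d %H:%M:%S", errors="coerce")

# normalize() truncates each timestamp to midnight; NaT stays NaT.
first_day = (x.assign(D=parsed.dt.normalize())
               .groupby("CUSTOMER_name")["D"]
               .min())
print(first_day)
```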