Group by a column and select a value if it exists in any of the columns



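For the df below, how can I group by Group, filter the rows in which any value is "Date and Time :", and select the date value from that row (it can appear in any one of the columns)? I then want to assign that value as an additional column to the entire group, with the frame otherwise as it was before the groupby.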

df:

    Column0         Column1         Column2         Column3        Group
17  Date and Time : NaN             NaN             10/01/17,0900  A
18  NaN             NaN             NaN             NaN            A
19  Date and Time : 17/03/17,0800   NaN             NaN            B
18  NaN             NaN             02/06/17,1030   NaN            B
17  Date and Time : 02/06/17,1000   NaN             NaN            C
17  Date and Time : 05/07/17,1330   NaN             NaN            D
19  NaN             01/08/17,1400   Date and Time : NaN            D
...

Expected output (for group A):

    Column0         Column1         Column2         Column3        Group     Date
0                                   ...                            A         10/01/17,0900
1                                   ...                            A         10/01/17,0900
2                                   ...                            A         10/01/17,0900
3                                   ...                            A         10/01/17,0900
...                                 ...                            A         10/01/17,0900
16                                  ...                            A         10/01/17,0900
17  Date and Time : NaN             NaN             10/01/17,0900  A         10/01/17,0900
18                                  ...                            A         10/01/17,0900
19                                  ...                            A         10/01/17,0900

Reproducible example:

import pandas as pd

# Note: missing values are the string 'NaN' here, not real NaN values
data = [['Date and Time :', 'NaN', 'NaN', '10/01/17,0900', 'A'],
        ['NaN', 'NaN', 'NaN', 'NaN', 'A'],
        ['Date and Time :', '17/03/17,0800', 'NaN', 'NaN', 'B'],
        ['NaN', 'NaN', '02/06/17,1030', 'NaN', 'B'],
        ['Date and Time :', '02/06/17,1000', 'NaN', 'NaN', 'C'],
        ['Date and Time :', '05/07/17,1330', 'NaN', 'NaN', 'D'],
        ['NaN', '01/08/17,1400', 'Date and Time :', 'NaN', 'D']]
df = pd.DataFrame(data, columns=['Column0', 'Column1', 'Column2', 'Column3', 'Group'])
df

We'll group by Group and run a get_group_date function on each group:

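def get_group_date(g):
    # First row whose Column0 is the marker, limited to the columns in `cols` (defined below)
    row = g.loc[g['Column0'] == 'Date and Time :', cols].iloc[0]
    # First value in that row that is not the 'NaN' placeholder string
    return row[(row != 'NaN').values].values[0]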

This finds the one row that has the right string in Column0 and, within that row, the one column that is not NaN (after we drop Column0 and Group).
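The function above only looks for the marker in Column0, while in the question's data the marker can also sit in another column (the last row of group D has it in Column2). Below is a minimal sketch of a variant that checks every value column. It assumes missing values are the string 'NaN', as in the reproducible example; the names MARKER, value_cols and get_group_date_any are hypothetical and not part of the original answer.

```python
import pandas as pd

MARKER = 'Date and Time :'

def get_group_date_any(g, value_cols):
    """Return the first date in the first row of g that holds MARKER in any value column."""
    # True for rows where at least one value column equals the marker string
    has_marker = g[value_cols].eq(MARKER).any(axis=1)
    if not has_marker.any():
        return None  # this group has no marker row
    row = g.loc[has_marker, value_cols].iloc[0]
    # Keep cells that are neither the marker nor the 'NaN' placeholder string
    candidates = row[(row != MARKER) & (row != 'NaN')]
    return candidates.iloc[0] if len(candidates) else None

data = [[MARKER, 'NaN', 'NaN', '10/01/17,0900', 'A'],
        ['NaN', 'NaN', 'NaN', 'NaN', 'A'],
        [MARKER, '17/03/17,0800', 'NaN', 'NaN', 'B'],
        ['NaN', 'NaN', '02/06/17,1030', 'NaN', 'B'],
        [MARKER, '02/06/17,1000', 'NaN', 'NaN', 'C'],
        [MARKER, '05/07/17,1330', 'NaN', 'NaN', 'D'],
        ['NaN', '01/08/17,1400', MARKER, 'NaN', 'D']]
df = pd.DataFrame(data, columns=['Column0', 'Column1', 'Column2', 'Column3', 'Group'])
value_cols = ['Column0', 'Column1', 'Column2', 'Column3']

# Iterate over the groups explicitly and collect one date per group
dates = {name: get_group_date_any(g, value_cols) for name, g in df.groupby('Group')}
print(dates)
```

For group D this picks the date from the first marker row; if a later marker row should win instead, the `.iloc[0]` on the marker rows is the place to change.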

Group and apply the function:

cols = df.columns.drop(['Column0', 'Group'])
grp_dates = pd.DataFrame(df.groupby('Group').apply(get_group_date)).reset_index()
grp_dates.columns = ['Group', 'Date']

We now have the correct date identified for each group:

In [24]: grp_dates
Out[24]: 
  Group           Date
0     A  10/01/17,0900
1     B  17/03/17,0800
2     C  02/06/17,1000
3     D  05/07/17,1330

Now it's just a matter of doing an outer join to assign the date to the whole group:

In [25]: pd.merge(df, grp_dates, how='outer')
Out[25]: 
           Column0        Column1  ... Group           Date
0  Date and Time :            NaN  ...     A  10/01/17,0900
1              NaN            NaN  ...     A  10/01/17,0900
2  Date and Time :  17/03/17,0800  ...     B  17/03/17,0800
3              NaN            NaN  ...     B  17/03/17,0800
4  Date and Time :  02/06/17,1000  ...     C  02/06/17,1000
5  Date and Time :  05/07/17,1330  ...     D  05/07/17,1330
6              NaN  01/08/17,1400  ...     D  05/07/17,1330
[7 rows x 6 columns]
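Because grp_dates has exactly one row per group, the same assignment can also be written as a lookup with Series.map, which leaves the index and row order of df untouched. The sketch below is not part of the original answer; it uses a reduced stand-in for df (only Column0 and Group) and rebuilds grp_dates from the values shown above, and the name date_by_group is hypothetical.

```python
import pandas as pd

# Reduced stand-in for df: only the columns needed to show the assignment
df = pd.DataFrame({
    'Column0': ['Date and Time :', 'NaN', 'Date and Time :', 'NaN',
                'Date and Time :', 'Date and Time :', 'NaN'],
    'Group': ['A', 'A', 'B', 'B', 'C', 'D', 'D'],
})

# One date per group, copied from the grp_dates output above
grp_dates = pd.DataFrame({
    'Group': ['A', 'B', 'C', 'D'],
    'Date': ['10/01/17,0900', '17/03/17,0800', '02/06/17,1000', '05/07/17,1330'],
})

# Turn the two-column frame into a Group -> Date lookup and map it onto each row
date_by_group = grp_dates.set_index('Group')['Date']
df['Date'] = df['Group'].map(date_by_group)
print(df)
```

Groups that are missing from the lookup would simply get NaN in the Date column.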

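You can use pd.melt() and ignore "Date and Time :" altogether, since you can look directly for the only value in the group that is neither null nor "Date and Time :". This assumes that missing values are real NaN rather than the string 'NaN', and that each group contains exactly one such value. Groups B and D in the sample data contain two, which is why rows 3 to 6 of the output below do not show the date from their group's "Date and Time :" row: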

df_melted = df.melt(id_vars=['Group']).dropna(subset=['value'])
df_melted = df_melted[df_melted['value'] != 'Date and Time :']
df['Date'] = df.reset_index().merge(df_melted, on=['Group'], how='left')['value']

Output:

           Column0        Column1  ... Group           Date
0  Date and Time :            NaN  ...     A  10/01/17,0900
1              NaN            NaN  ...     A  10/01/17,0900
2  Date and Time :  17/03/17,0800  ...     B  17/03/17,0800
3              NaN            NaN  ...     B  02/06/17,1030
4  Date and Time :  02/06/17,1000  ...     C  17/03/17,0800
5  Date and Time :  05/07/17,1330  ...     D  02/06/17,1030
6              NaN  01/08/17,1400  ...     D  02/06/17,1000
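One way to see where the mismatched dates in rows 3 to 6 come from is to count rows. The sketch below assumes the frame holds real NaN values (np.nan) instead of 'NaN' strings, since dropna() only removes real missing values; the names M, per_group and merged are hypothetical. It shows that the left merge returns more rows than df, so assigning its 'value' column back to df lines the rows up by position rather than by group.

```python
import numpy as np
import pandas as pd

M = 'Date and Time :'
data = [[M, np.nan, np.nan, '10/01/17,0900', 'A'],
        [np.nan, np.nan, np.nan, np.nan, 'A'],
        [M, '17/03/17,0800', np.nan, np.nan, 'B'],
        [np.nan, np.nan, '02/06/17,1030', np.nan, 'B'],
        [M, '02/06/17,1000', np.nan, np.nan, 'C'],
        [M, '05/07/17,1330', np.nan, np.nan, 'D'],
        [np.nan, '01/08/17,1400', M, np.nan, 'D']]
df = pd.DataFrame(data, columns=['Column0', 'Column1', 'Column2', 'Column3', 'Group'])

df_melted = df.melt(id_vars=['Group']).dropna(subset=['value'])
df_melted = df_melted[df_melted['value'] != M]

# Candidate dates per group; any count above 1 duplicates rows in the merge
per_group = df_melted.groupby('Group').size()
merged = df.reset_index().merge(df_melted, on=['Group'], how='left')
print(per_group.to_dict())
print(len(df), len(merged))

# validate='many_to_one' makes pandas raise instead of silently duplicating rows
try:
    df.merge(df_melted, on=['Group'], how='left', validate='many_to_one')
except pd.errors.MergeError as exc:
    print('merge rejected:', exc)
```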

If, for some reason, the date has to come from a row that mentions "Date and Time :", you can use the following:

df_melted = df.reset_index().melt(id_vars=['Group', 'index']).dropna(subset=['value'])
m = df_melted['value'] == 'Date and Time :'
l = df_melted[m]['index'].tolist()
df_melted = df_melted[df_melted['index'].isin(l) & ~m]
df['Date'] = df.reset_index().merge(df_melted, on=['Group'], how='left')['value']
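The snippet above still merges on Group and assigns the result back by position, which only lines up when df_melted ends up with exactly one row per group. A hedged sketch of one way to guarantee that: keep a single date per group (here, the one from the earliest marker row, which is an assumption about what is wanted) and map it back by group. It starts from the 'NaN' strings of the reproducible example and converts them to real missing values first; the names MARKER, melted, marker_rows and first_date are hypothetical.

```python
import pandas as pd

MARKER = 'Date and Time :'
data = [[MARKER, 'NaN', 'NaN', '10/01/17,0900', 'A'],
        ['NaN', 'NaN', 'NaN', 'NaN', 'A'],
        [MARKER, '17/03/17,0800', 'NaN', 'NaN', 'B'],
        ['NaN', 'NaN', '02/06/17,1030', 'NaN', 'B'],
        [MARKER, '02/06/17,1000', 'NaN', 'NaN', 'C'],
        [MARKER, '05/07/17,1330', 'NaN', 'NaN', 'D'],
        ['NaN', '01/08/17,1400', MARKER, 'NaN', 'D']]
df = pd.DataFrame(data, columns=['Column0', 'Column1', 'Column2', 'Column3', 'Group'])

# Turn the 'NaN' placeholder strings into real missing values so dropna() works
df = df.mask(df.eq('NaN'))

melted = df.reset_index().melt(id_vars=['Group', 'index']).dropna(subset=['value'])
is_marker = melted['value'] == MARKER
marker_rows = melted.loc[is_marker, 'index']          # original rows that hold the marker
dates = melted[melted['index'].isin(marker_rows) & ~is_marker]

# Keep one date per group: the one from the earliest original row
first_date = (dates.sort_values('index', kind='stable')
                   .drop_duplicates('Group')
                   .set_index('Group')['value'])
df['Date'] = df['Group'].map(first_date)
print(df[['Group', 'Date']])
```

With the sample data, group D has two marker rows; this sketch assigns the date from the first of them to both rows.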
