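For the df below, how can I group by Group, find the row in which any value is "Date and Time :", and pick the date value from that row (it may sit in any one of the other columns)? I then want to assign that value to the whole group as an additional column, keeping all rows as they were before the groupby.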
df:
    Column0          Column1        Column2          Column3        Group
17  Date and Time :  NaN            NaN              10/01/17,0900  A
18  NaN              NaN            NaN              NaN            A
19  Date and Time :  17/03/17,0800  NaN              NaN            B
18  NaN              NaN            02/06/17,1030    NaN            B
17  Date and Time :  02/06/17,1000  NaN              NaN            C
17  Date and Time :  05/07/17,1330  NaN              NaN            D
19  NaN              01/08/17,1400  Date and Time :  NaN            D
...
Expected output (shown for group A):
Column0 Column1 Column2 Column3 Group Date
0 ... A 10/01/17,0900
1 ... A 10/01/17,0900
2 ... A 10/01/17,0900
3 ... A 10/01/17,0900
... A 10/01/17,0900
16 ... A 10/01/17,0900
17 Date and Time : NaN NaN 10/01/17,0900 A 10/01/17,0900
18 ... A 10/01/17,0900
19 ... A 10/01/17,0900
Reproducible example:
import pandas as pd

data = [['Date and Time :', 'NaN', 'NaN', '10/01/17,0900', 'A'],
['NaN','NaN','NaN','NaN','A'],
['Date and Time :','17/03/17,0800','NaN','NaN','B'],
['NaN', 'NaN', '02/06/17,1030','NaN','B'],
['Date and Time :', '02/06/17,1000','NaN','NaN','C'],
['Date and Time :','05/07/17,1330','NaN','NaN','D'],
['NaN', '01/08/17,1400', 'Date and Time :','NaN','D']
]
df = pd.DataFrame(data, columns = ['Column0', 'Column1', 'Column2', 'Column3', 'Group'])
df
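We will group by Group and run the get_group_date function on each group:
def get_group_date(g):
    # cols (defined below, before the function is called) lists the value columns
    # First row of the group whose Column0 holds the marker string
    row = g.loc[g['Column0'] == 'Date and Time :', cols].iloc[0]
    # First entry of that row that is not the placeholder string 'NaN'
    return row[(row != 'NaN').values].values[0]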
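This finds the row whose Column0 holds the marker string (the first one, if a group has several) and returns the first value in that row that is not 'NaN', looking only at the columns left after Column0 and Group are dropped.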
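To see the two steps of get_group_date in isolation, here is a minimal sketch run on group D alone. It assumes the string 'NaN' placeholders from the reproducible example; the names marker_rows, mask and group_date are illustrative and not part of the answer.

```python
import pandas as pd

# The two rows of group D from the reproducible example
# (missing cells are the string 'NaN', not real NaN)
g = pd.DataFrame(
    [['Date and Time :', '05/07/17,1330', 'NaN', 'NaN', 'D'],
     ['NaN', '01/08/17,1400', 'Date and Time :', 'NaN', 'D']],
    columns=['Column0', 'Column1', 'Column2', 'Column3', 'Group'])
cols = g.columns.drop(['Column0', 'Group'])

# Step 1: rows whose Column0 equals the marker, restricted to the value columns
marker_rows = g.loc[g['Column0'] == 'Date and Time :', cols]
row = marker_rows.iloc[0]  # first such row, as a Series

# Step 2: boolean mask of the entries that are not the placeholder string
mask = (row != 'NaN').values
group_date = row[mask].values[0]
print(group_date)
```

Only the first row of D passes the Column0 test; the second row carries the marker in Column2, so this function never looks at it.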
Group and apply our function:
cols = df.columns.drop(['Column0', 'Group'])
grp_dates = pd.DataFrame(df.groupby('Group').apply(get_group_date)).reset_index()
grp_dates.columns = ['Group', 'Date']
We have now identified the correct date for each group:
In [24]: grp_dates
Out[24]:
Group Date
0 A 10/01/17,0900
1 B 17/03/17,0800
2 C 02/06/17,1000
3 D 05/07/17,1330
Now a simple outer join (pd.merge joins on the shared Group column by default) assigns the date to the whole group:
In [25]: pd.merge(df, grp_dates, how='outer')
Out[25]:
Column0 Column1 ... Group Date
0 Date and Time : NaN ... A 10/01/17,0900
1 NaN NaN ... A 10/01/17,0900
2 Date and Time : 17/03/17,0800 ... B 17/03/17,0800
3 NaN NaN ... B 17/03/17,0800
4 Date and Time : 02/06/17,1000 ... C 02/06/17,1000
5 Date and Time : 05/07/17,1330 ... D 05/07/17,1330
6 NaN 01/08/17,1400 ... D 05/07/17,1330
[7 rows x 6 columns]
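The function above only tests Column0 for the marker. If the marker itself may appear in any column, one possible variation is sketched below. It is not the answer's method: the helper first_date_in_group and the constant MARKER are hypothetical names, every group is assumed to contain at least one marker row, and the first marker row is assumed to be the one that counts. It also replaces the merge with Series.map.

```python
import pandas as pd

MARKER = 'Date and Time :'  # hypothetical constant for the marker string

data = [['Date and Time :', 'NaN', 'NaN', '10/01/17,0900', 'A'],
        ['NaN', 'NaN', 'NaN', 'NaN', 'A'],
        ['Date and Time :', '17/03/17,0800', 'NaN', 'NaN', 'B'],
        ['NaN', 'NaN', '02/06/17,1030', 'NaN', 'B'],
        ['Date and Time :', '02/06/17,1000', 'NaN', 'NaN', 'C'],
        ['Date and Time :', '05/07/17,1330', 'NaN', 'NaN', 'D'],
        ['NaN', '01/08/17,1400', 'Date and Time :', 'NaN', 'D']]
df = pd.DataFrame(data, columns=['Column0', 'Column1', 'Column2', 'Column3', 'Group'])
value_cols = df.columns.drop('Group')

def first_date_in_group(g):
    # Rows that contain the marker in any value column
    has_marker = g[value_cols].eq(MARKER).any(axis=1)
    # Assumption: the group has at least one marker row; take the first
    row = g.loc[has_marker, value_cols].iloc[0]
    # Discard the marker, 'NaN' placeholder strings and real NaN
    candidates = row[~row.isin([MARKER, 'NaN'])].dropna()
    return candidates.iloc[0]

# One date per group, then broadcast to every row of that group
dates = {name: first_date_in_group(g) for name, g in df.groupby('Group')}
df['Date'] = df['Group'].map(dates)
print(df[['Group', 'Date']])
```

Mapping a dict of group dates onto the Group column keeps the original row order and index, so no join is needed.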
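You can use pd.melt() and ignore the "Date and Time :" marker, looking directly for the values that are neither null nor "Date and Time :". This assumes the missing cells are real NaN rather than the string 'NaN' (otherwise dropna removes nothing), and it relies on each group holding exactly one such value; groups B and D in the sample hold two, which is why some dates in the output below land on the wrong rows: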
df_melted = df.melt(id_vars=['Group']).dropna(subset=['value'])
df_melted = df_melted[df_melted['value'] != 'Date and Time :']
df['Date'] = df.reset_index().merge(df_melted, on=['Group'], how='left')['value']
Column0 Column1 ... Group Date
0 Date and Time : NaN ... A 10/01/17,0900
1 NaN NaN ... A 10/01/17,0900
2 Date and Time : 17/03/17,0800 ... B 17/03/17,0800
3 NaN NaN ... B 02/06/17,1030
4 Date and Time : 02/06/17,1000 ... C 17/03/17,0800
5 Date and Time : 05/07/17,1330 ... D 02/06/17,1030
6 NaN 01/08/17,1400 ... D 02/06/17,1000
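One way to make the melt approach yield a single date per group is to reduce the melted frame before mapping it back. The sketch below assumes real NaN for missing cells and keeps the first remaining value per group in melt order (column by column). That happens to be the marker row's date for this sample, but it is not guaranteed in general. The name first_per_group is illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
data = [['Date and Time :', nan, nan, '10/01/17,0900', 'A'],
        [nan, nan, nan, nan, 'A'],
        ['Date and Time :', '17/03/17,0800', nan, nan, 'B'],
        [nan, nan, '02/06/17,1030', nan, 'B'],
        ['Date and Time :', '02/06/17,1000', nan, nan, 'C'],
        ['Date and Time :', '05/07/17,1330', nan, nan, 'D'],
        [nan, '01/08/17,1400', 'Date and Time :', nan, 'D']]
df = pd.DataFrame(data, columns=['Column0', 'Column1', 'Column2', 'Column3', 'Group'])

# Long format: one row per (Group, column, value), without missing cells
df_melted = df.melt(id_vars=['Group']).dropna(subset=['value'])
df_melted = df_melted[df_melted['value'] != 'Date and Time :']

# Assumption: the first remaining value per group is the wanted date
first_per_group = (df_melted.drop_duplicates(subset='Group')
                            .set_index('Group')['value'])

# Map by group label, so the result aligns with df row by row
df['Date'] = df['Group'].map(first_per_group)
print(df[['Group', 'Date']])
```

Because the lookup Series has one entry per group, the mapped column has exactly one value per original row, which avoids the positional misalignment that a one-to-many merge introduces.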
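If, for some reason, you need to restrict the search to rows that contain "Date and Time :", you can use the following, which carries the original row index through the melt: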
df_melted = df.reset_index().melt(id_vars=['Group', 'index']).dropna(subset=['value'])
m = df_melted['value'] == 'Date and Time :'
l = df_melted[m]['index'].tolist()
df_melted = df_melted[df_melted['index'].isin(l) & ~m]
df['Date'] = df.reset_index().merge(df_melted, on=['Group'], how='left')['value']
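In the sample data group D has two rows containing the marker, so the merge on Group above still returns two candidate dates for each D row. A possible refinement, sketched under the assumption that the earliest marker row in each group should win and that missing cells are real NaN, is to sort by the original row index and keep one date per group before mapping. The names is_marker, marker_rows and dates are illustrative.

```python
import numpy as np
import pandas as pd

MARKER = 'Date and Time :'  # hypothetical constant for the marker string
nan = np.nan
data = [['Date and Time :', nan, nan, '10/01/17,0900', 'A'],
        [nan, nan, nan, nan, 'A'],
        ['Date and Time :', '17/03/17,0800', nan, nan, 'B'],
        [nan, nan, '02/06/17,1030', nan, 'B'],
        ['Date and Time :', '02/06/17,1000', nan, nan, 'C'],
        ['Date and Time :', '05/07/17,1330', nan, nan, 'D'],
        [nan, '01/08/17,1400', 'Date and Time :', nan, 'D']]
df = pd.DataFrame(data, columns=['Column0', 'Column1', 'Column2', 'Column3', 'Group'])

# Keep the original row number as an 'index' column through the melt
melted = (df.reset_index()
            .melt(id_vars=['Group', 'index'])
            .dropna(subset=['value']))
is_marker = melted['value'] == MARKER
marker_rows = melted.loc[is_marker, 'index']

# Values that share a row with the marker but are not the marker itself
dates = melted[melted['index'].isin(marker_rows) & ~is_marker]

# Assumption: the earliest marker row in each group supplies the date
dates = dates.sort_values('index', kind='stable').drop_duplicates(subset='Group')
df['Date'] = df['Group'].map(dates.set_index('Group')['value'])
print(df[['Group', 'Date']])
```

With this choice group D receives 05/07/17,1330 from its first marker row, matching the result of the groupby approach.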