pandas: resampling a multi-index DataFrame



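I have a DataFrame with a MultiIndex: "subject" and "datetime". Each row corresponds to one subject and one datetime, and the columns hold various measurements.

The range of days differs per subject, and a given subject can be missing some days (see the example). In addition, a subject can have one or several values within a day.

I want to resample the DataFrame so that:

  • there is only one row per subject per day (I don't care about the time of day)
  • each column value is the last non-NaN value of that day (NaN if there is no value that day)
  • days with no value in any column are neither created nor kept

For example, the following sample DataFrame: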

                                a       b
subject  datetime                        
patient1 2018-01-01 00:00:00  2.0    high
         2018-01-01 01:00:00  NaN  medium
         2018-01-01 02:00:00  6.0     NaN
         2018-01-01 03:00:00  NaN     NaN
         2018-01-02 00:00:00  4.3     low
patient2 2018-01-01 00:00:00  NaN  medium
         2018-01-01 02:00:00  NaN     NaN
         2018-01-01 03:00:00  5.0     NaN
         2018-01-03 00:00:00  9.0     NaN
         2018-01-04 02:00:00  NaN     NaN

should return:

                                a       b
subject  datetime                        
patient1 2018-01-01 00:00:00  6.0  medium
         2018-01-02 00:00:00  4.3     low
patient2 2018-01-01 00:00:00  5.0  medium
         2018-01-03 00:00:00  9.0     NaN

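I have spent far too much time trying to get this result with resample and the 'pad' option, but I always end up with either errors or a result that is not what I want. Can anyone help?

Note: here is the code that creates the sample DataFrame: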

import pandas as pd
import numpy as np

index = pd.MultiIndex.from_product([['patient1', 'patient2'],
                                    pd.date_range('20180101', periods=4, freq='h')])
df = pd.DataFrame({'a': [2, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 5],
                   'b': ['high', 'medium', np.nan, np.nan, 'medium', 'low', np.nan, np.nan]},
                  index=index)
df.index.names = ['subject', 'datetime']
df = df.drop(df.index[5])
df.at[('patient2', '2018-01-03 00:00:00'), 'a'] = 9
df.at[('patient2', '2018-01-04 02:00:00'), 'a'] = None
df.at[('patient1', '2018-01-02 00:00:00'), 'a'] = 4.3
df.at[('patient1', '2018-01-02 00:00:00'), 'b'] = 'low'
df = df.sort_index(level=['subject', 'datetime'])

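Floor the datetime level to daily frequency, group the DataFrame by subject plus the floored timestamp, aggregate with last (which returns the last non-NaN value of each column within a group), and finally drop the rows in which every column is NaN: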

i = pd.to_datetime(df.index.get_level_values(1)).floor('d')
df1 = df.groupby(['subject', i]).agg('last').dropna(how='all')

                       a       b
subject  datetime               
patient1 2018-01-01  6.0  medium
         2018-01-02  4.3     low
patient2 2018-01-01  5.0  medium
         2018-01-03  9.0     NaN
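
For reference, the floor / groupby / last / dropna steps can be sketched end to end as below. This is only a sketch: the names `rows`, `sample`, `days` and `daily` are illustrative, and the sample data is rebuilt from the table in the question with real Timestamps rather than with the question's own construction code.

```python
import numpy as np
import pandas as pd

# Illustrative rebuild of the question's sample table (names are assumptions).
rows = [
    ("patient1", "2018-01-01 00:00", 2.0, "high"),
    ("patient1", "2018-01-01 01:00", np.nan, "medium"),
    ("patient1", "2018-01-01 02:00", 6.0, np.nan),
    ("patient1", "2018-01-01 03:00", np.nan, np.nan),
    ("patient1", "2018-01-02 00:00", 4.3, "low"),
    ("patient2", "2018-01-01 00:00", np.nan, "medium"),
    ("patient2", "2018-01-01 02:00", np.nan, np.nan),
    ("patient2", "2018-01-01 03:00", 5.0, np.nan),
    ("patient2", "2018-01-03 00:00", 9.0, np.nan),
    ("patient2", "2018-01-04 02:00", np.nan, np.nan),
]
sample = pd.DataFrame(rows, columns=["subject", "datetime", "a", "b"])
sample["datetime"] = pd.to_datetime(sample["datetime"])
sample = sample.set_index(["subject", "datetime"])

# Truncate every timestamp to midnight of its day.
days = sample.index.get_level_values("datetime").floor("D")

# One row per (subject, day); last() skips NaN, so each cell is the last
# non-NaN value of that day. Days with no value at all are then dropped.
daily = sample.groupby(["subject", days]).last().dropna(how="all")
print(daily)
```

Because only days that already occur in the data form groups, no new days are created; the final dropna removes days such as patient2's 2018-01-04, where every column is NaN.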

# Drop the rows where both a and b are NaN; they are not needed
df = df.reset_index().dropna(subset=["a", "b"], how="all")
# Add a day column; it is needed to keep the last value of each day
df["dt_day"] = df["datetime"].dt.date
# d1 is the result DataFrame, to which a and b are added below
d1 = df.copy().drop_duplicates(subset=["subject", "dt_day"]).loc[:, ["subject", "datetime"]].reset_index(drop=True)
# Add a and b to the result DataFrame
# (values are matched to d1 by position after reset_index, not by subject/day)
for col in ["a", "b"]:
    d1.loc[:, col] = (df.copy()
                      .dropna(subset=[col])
                      .drop_duplicates(subset=["subject", "dt_day"], keep="last")[col]
                      .reset_index(drop=True))

Wall time: 24 ms
@Shubham Sharma's code => Wall time: 2.94 ms

    subject   datetime    a       b
0  patient1 2018-01-01  6.0  medium
1  patient1 2018-01-02  4.3     low
2  patient2 2018-01-01  5.0  medium
3  patient2 2018-01-03  9.0     NaN
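
One thing to watch with a drop_duplicates approach is how the per-column results are lined up. A possible variant, sketched here under the assumption of a flat frame with `subject`, `datetime`, `a` and `b` columns, indexes each per-column result by (subject, day) so the values are combined by key rather than by row position. The toy data and the names `flat`, `keys`, `per_column` and `daily` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption): p1 has no "b" value on its first day.
flat = pd.DataFrame({
    "subject": ["p1", "p1", "p1", "p2"],
    "datetime": pd.to_datetime([
        "2018-01-01 00:00", "2018-01-01 05:00",
        "2018-01-02 00:00", "2018-01-01 00:00",
    ]),
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, "low", "high"],
})

# keep="last" relies on chronological row order, so sort first.
flat = flat.sort_values(["subject", "datetime"])
flat["dt_day"] = flat["datetime"].dt.normalize()
keys = ["subject", "dt_day"]

# For each column: drop its NaN rows, keep the last row per (subject, day),
# and index the result by the keys so concat aligns on them.
per_column = [
    flat.dropna(subset=[col])
        .drop_duplicates(subset=keys, keep="last")
        .set_index(keys)[col]
    for col in ["a", "b"]
]
daily = pd.concat(per_column, axis=1).sort_index()
print(daily)
```

A day that has no value in any column never appears in any of the per-column series, so it is absent from the combined result without a separate dropna.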

Thanks for your question :)

This should do the job:

from datetime import datetime

def day_agg(series_):
    # Return the last non-NaN value of the group, or NaN if there is none
    try:
        return series_.dropna().iloc[-1]
    except IndexError:
        return float("nan")

df = df.reset_index().sort_values("datetime")
result = (
    df.groupby([df["subject"],
                df.datetime.map(lambda x: datetime(year=x.year, month=x.month, day=x.day))])
    .agg({"a": day_agg, "b": day_agg})
    .dropna(how="all")
)
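
The per-day truncation can probably also be left to pandas itself. The sketch below is not part of the answer above: it assumes the index levels are named `subject` and `datetime` and that the `datetime` level holds real Timestamps, and it uses `pd.Grouper` with a daily frequency together with the built-in `last` aggregation. The names `index`, `frame` and `daily` are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data: patient2's rows from the question's sample table.
index = pd.MultiIndex.from_arrays(
    [
        ["patient2"] * 4,
        pd.to_datetime([
            "2018-01-01 00:00", "2018-01-01 03:00",
            "2018-01-03 00:00", "2018-01-04 02:00",
        ]),
    ],
    names=["subject", "datetime"],
)
frame = pd.DataFrame(
    {"a": [np.nan, 5.0, 9.0, np.nan], "b": ["medium", np.nan, np.nan, np.nan]},
    index=index,
)

# Group by subject and by calendar day of the datetime level.
# last() skips NaN; dropna(how="all") removes any day left with no value.
daily = (
    frame.groupby([pd.Grouper(level="subject"),
                   pd.Grouper(level="datetime", freq="D")])
    .last()
    .dropna(how="all")
)
print(daily)
```

The trailing dropna matters here: it removes 2018-01-04, which has only NaN values, and would also remove any empty daily bin the grouper might introduce between observed days.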
