Pandas: compute the mean every 30 minutes over a ±10 minute window



I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame(
    {
        "observation_time": ["2021-11-24 10:10:03+00:00", "2021-11-24 10:20:02+00:00", "2021-11-24 10:30:03+00:00", "2021-11-24 10:40:02+00:00", "2021-11-24 10:50:02+00:00", "2021-11-24 11:00:05+00:00", "2021-11-24 11:10:03+00:00", "2021-11-24 11:20:02+00:00", "2021-11-24 11:30:03+00:00", "2021-11-24 11:40:02+00:00"],
        "temp": [7.22, 7.33, 7.44, 7.5, 7.5, 7.5, 7.44, 7.61, 7.67, 7.78]
    }
)
# parse the strings into timestamps so that time-based grouping works
df["observation_time"] = pd.to_datetime(df["observation_time"])
observation_time  temp
0 2021-11-24 10:10:03+00:00  7.22
1 2021-11-24 10:20:02+00:00  7.33
2 2021-11-24 10:30:03+00:00  7.44
3 2021-11-24 10:40:02+00:00  7.50
4 2021-11-24 10:50:02+00:00  7.50
5 2021-11-24 11:00:05+00:00  7.50
6 2021-11-24 11:10:03+00:00  7.44
7 2021-11-24 11:20:02+00:00  7.61
8 2021-11-24 11:30:03+00:00  7.67
9 2021-11-24 11:40:02+00:00  7.78

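This DataFrame is only an example. There is no guarantee that it has one data point every 10 minutes: I may have data every minute, or none at all for a long stretch.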

I want to compute the mean over a ±10 minute window every 30 minutes, starting from minute "00" of the hour, in this case "10:00:00".

I tried using Grouper:

df.groupby(pd.Grouper(key="observation_time", freq="30Min", offset="0m", label="right")).mean()

This gives me the following result:

temp
observation_time                   
2021-11-24 10:30:00+00:00  7.275000
2021-11-24 11:00:00+00:00  7.480000
2021-11-24 11:30:00+00:00  7.516667
2021-11-24 12:00:00+00:00  7.725000

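The timestamps are what I want, but of course this computes the mean over each full 30-minute bin.

Instead, I want to compute the mean over a ±10 minute window around each timestamp.

For example, for 2021-11-24 10:30:00+00:00 the mean should be computed over all temp values in the interval between 2021-11-24 10:20:00+00:00 and 2021-11-24 10:40:00+00:00. In this case those values are 7.33 and 7.44, so the mean is 7.385.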
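The single window described above can be checked directly. This is only a minimal sketch: the names `center`, `half_width` and `window` are illustrative and not part of the original question, and it assumes the timestamps have been parsed with `pd.to_datetime`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "observation_time": pd.to_datetime([
            "2021-11-24 10:10:03+00:00", "2021-11-24 10:20:02+00:00",
            "2021-11-24 10:30:03+00:00", "2021-11-24 10:40:02+00:00",
        ]),
        "temp": [7.22, 7.33, 7.44, 7.5],
    }
)

# hypothetical names: the window center and its half width
center = pd.Timestamp("2021-11-24 10:30:00+00:00")
half_width = pd.Timedelta(minutes=10)

# between() is inclusive on both ends by default
mask = df["observation_time"].between(center - half_width, center + half_width)
window = df.loc[mask, "temp"]
print(window.tolist(), window.mean())
```

Note that the 10:40:02 sample falls two seconds outside the window, which is why only two values contribute.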

The final result should look like this:

temp
observation_time                   
2021-11-24 10:30:00+00:00  7.385
2021-11-24 11:00:00+00:00  7.5
2021-11-24 11:30:00+00:00  7.64

Any ideas? Thanks.

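EDIT: The answer below was written under the assumption that each row corresponds to a 10-minute interval. If your data is not evenly spaced, we have to bin the dataset manually to get the desired output: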

import numpy as np
import pandas as pd

# the mean is computed over +/- 10 minutes around each bin
# (assumes df['observation_time'] is a datetime column)
sampling_interval = np.timedelta64(10, 'm')
# get the 30-minute bins
bins_interval = "30min"
bins = df['observation_time'].dt.floor(bins_interval).unique()
avg_values = []
for grouped_bin in bins:
    # select the rows within +/- sampling_interval of the bin timestamp
    subset = df[df['observation_time'].between(
        grouped_bin - sampling_interval,
        grouped_bin + sampling_interval
    )]

    # a bin with no rows inside its window gets NaN as its mean
    avg_values.append({
        'observation_time': grouped_bin,
        'temp': subset['temp'].mean()
    })
averaged_df = pd.DataFrame(avg_values)
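For larger inputs, the same binning can probably be done without an explicit loop. The sketch below is one possible approach, not the method from the answer above: shifting every timestamp forward by the half width and flooring it maps each row to its nearest bin, and rows whose offset exceeds the full window are dropped. The function name `window_means` and its parameters are made up for illustration, and it assumes the window is narrower than the bin spacing so that windows never overlap.

```python
import pandas as pd

def window_means(df, time_col="observation_time", value_col="temp",
                 freq="30min", half_width="10min"):
    """Mean of value_col within +/- half_width of every freq-aligned timestamp.

    Hypothetical helper; assumes 2 * half_width < freq.
    """
    half = pd.Timedelta(half_width)
    times = pd.to_datetime(df[time_col])
    # t in [c - half, c + half]  <=>  (t + half) in [c, c + 2 * half]
    shifted = times + half
    centers = shifted.dt.floor(freq)
    in_window = (shifted - centers) <= 2 * half
    return df.loc[in_window, value_col].groupby(centers[in_window]).mean()

df = pd.DataFrame(
    {
        "observation_time": [
            "2021-11-24 10:10:03+00:00", "2021-11-24 10:20:02+00:00",
            "2021-11-24 10:30:03+00:00", "2021-11-24 10:40:02+00:00",
            "2021-11-24 10:50:02+00:00", "2021-11-24 11:00:05+00:00",
            "2021-11-24 11:10:03+00:00", "2021-11-24 11:20:02+00:00",
            "2021-11-24 11:30:03+00:00", "2021-11-24 11:40:02+00:00",
        ],
        "temp": [7.22, 7.33, 7.44, 7.5, 7.5, 7.5, 7.44, 7.61, 7.67, 7.78],
    }
)

result = window_means(df)
print(result)
```

Unlike the loop, this variant only returns bins that contain at least one observation, so an empty window such as the one around 10:00:00 does not appear as a NaN row.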

I'm not sure this is the most "pythonic" way, but this is how I would approach the problem:

import numpy as np

# we create an empty list in which we'll store the computed averages
# to turn into a DataFrame later
avg_values = []
# we iterate over the DataFrame starting at index 1, stepping 3 rows at a time
for idx in range(1, len(df.index), 3):
    # store the observation time in a separate variable
    observation_time = df.loc[idx, 'observation_time']
    # compute the mean of the previous row, the current row,
    # and the next row (if there is one)
    avg_in_interval = np.nanmean([
        df.loc[idx-1, 'temp'] if idx > 0 else np.nan,
        df.loc[idx, 'temp'],
        df.loc[idx+1, 'temp'] if idx < len(df.index)-1 else np.nan
    ])
    # we append both values to the list as one record
    avg_values.append({'observation_time': observation_time, 'temp': avg_in_interval})
# new DataFrame
averaged_df = pd.DataFrame(avg_values)

Or, in a more compact and general way, so that you can configure the width of the averaging window:

interval_width = 3 # assuming it is an odd number
starting_idx = interval_width // 2
avg_values = []
for idx in range(starting_idx, len(df.index), interval_width):
    avg_values.append({
        'observation_time': df.loc[idx, 'observation_time'],
        # the slice end is exclusive, so add 1 to include the row after idx
        'temp': np.mean(df.iloc[idx-starting_idx:idx+starting_idx+1]['temp'])
    })
averaged_df = pd.DataFrame(avg_values)

You can also turn it into a function to keep the code clean:

def get_averaged_df(df, interval_width: int):
    if interval_width % 2 == 0:
        raise ValueError("interval_width must be an odd integer")
    starting_idx = interval_width // 2
    avg_values = []
    for idx in range(starting_idx, len(df.index), interval_width):
        avg_values.append({
            'observation_time': df.loc[idx, 'observation_time'],
            # the slice end is exclusive, so add 1 to include the row after idx
            'temp': np.mean(df.iloc[idx-starting_idx:idx+starting_idx+1]['temp'])
        })
    return pd.DataFrame(avg_values)

averaged_df = get_averaged_df(df, 3)
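For evenly spaced rows, a centered rolling mean should give the same kind of result as a row-based window of odd width. This is a sketch of an alternative, not part of the original answer; it assumes a default integer index, and the name `rolled` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "observation_time": pd.to_datetime([
            "2021-11-24 10:10:03+00:00", "2021-11-24 10:20:02+00:00",
            "2021-11-24 10:30:03+00:00", "2021-11-24 10:40:02+00:00",
            "2021-11-24 10:50:02+00:00", "2021-11-24 11:00:05+00:00",
            "2021-11-24 11:10:03+00:00", "2021-11-24 11:20:02+00:00",
            "2021-11-24 11:30:03+00:00", "2021-11-24 11:40:02+00:00",
        ]),
        "temp": [7.22, 7.33, 7.44, 7.5, 7.5, 7.5, 7.44, 7.61, 7.67, 7.78],
    }
)

interval_width = 3  # assumed to be odd
half = interval_width // 2

# centered window over rows; min_periods=1 averages whatever rows exist
# when the window is cut off at either end of the frame
rolled = df["temp"].rolling(interval_width, center=True, min_periods=1).mean()

# keep every interval_width-th row, starting at the first full window
averaged_df = pd.DataFrame({
    "observation_time": df["observation_time"].iloc[half::interval_width],
    "temp": rolled.iloc[half::interval_width],
}).reset_index(drop=True)
print(averaged_df)
```

Like the loop-based versions, this works on row positions rather than on time, so it only matches a ±10 minute window when the rows really are 10 minutes apart.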
