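I have some energy-consumption time series, and when the consumption stays within a certain range I can tell that someone is on holiday. I have this code to extract the holidays:
Fake data:
import pandas as pd
values = [0.8,0.8,0.7,0.6,0.7,0.5,0.8,0.4,0.3,0.5,0.7,0.5,0.7,0.15,0.11,0.1,0.13,0.16,0.17,0.1,0.13,0.3,0.4,0.5,0.6,0.7]
df = pd.DataFrame(values, columns = ["values"])
So df looks like this: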
values
0 0.80
1 0.80
2 0.70
3 0.60
4 0.70
5 0.50
6 0.80
7 0.40
8 0.30
9 0.50
10 0.70
11 0.50
12 0.70
13 0.15
14 0.11
15 0.10
16 0.13
17 0.16
18 0.17
19 0.10
20 0.13
21 0.30
22 0.40
23 0.50
24 0.60
25 0.70
Now, given the variables below, I want to detect every run of consecutive values that stays below value_threshold for at least 5 time steps:
value_threshold = 0.2
count_threshold = 5
Check which values are below the threshold:
is_under_val_threshold = df["values"] < value_threshold
This gives me:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 True
14 True
15 True
16 True
17 True
18 True
19 True
20 True
21 False
22 False
23 False
24 False
25 False
Now I can isolate the values below the threshold:
subset_thre = df.loc[is_under_val_threshold, "values"]
13 0.15
14 0.11
15 0.10
16 0.13
17 0.16
18 0.17
19 0.10
20 0.13
Since this can happen more than once, and the run is not always longer than 5 steps, I assign each "sequence" to a group:
thre_grouper = is_under_val_threshold.diff().ne(0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 2
14 2
15 2
16 2
17 2
18 2
19 2
20 2
21 3
22 3
23 3
24 3
25 3
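The run-labelling step above can also be written with shift instead of diff, which makes the "differs from the previous row" comparison explicit. This is only a minimal sketch on a shortened, made-up series; `label_runs` is a hypothetical helper name, not something from the original code:

```python
import pandas as pd

def label_runs(mask: pd.Series) -> pd.Series:
    """Give every run of identical consecutive values its own id (1, 2, 3, ...)."""
    # A new run starts wherever the value differs from the previous row.
    # The first row is compared with NaN, so it always starts run 1.
    starts = mask.ne(mask.shift())
    return starts.cumsum()

# Shortened illustrative data (an assumption, not the series from the question)
values = [0.8, 0.7, 0.15, 0.11, 0.1, 0.6, 0.1]
mask = pd.Series(values) < 0.2
run_ids = label_runs(mask)
print(run_ids.tolist())  # [1, 1, 2, 2, 2, 3, 4]
```

Rows that share a run id belong to the same uninterrupted stretch above or below the threshold.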
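Now I want to extract the groups that stay below the threshold for at least 5 steps and create new dataframes at the break points, so in this example I would end up with three dataframes.
I tried the following.
Identify where the group switches happen: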
identify_switch = thre_grouper.diff().to_frame()
index_of_switch = identify_switch.index[identify_switch['values'] == 1].tolist()
This gives the indices where the switches happen:
[13, 21]
For this example, at least, I can split the data the way I want:
holidays_1 = df[index_of_switch[0]:index_of_switch[1]]
split_df_1 = df[:index_of_switch[0]]
split_df_2 = df[index_of_switch[1]:]
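My question is: when looping over a series that may contain any number of holidays, how can I make sure that all the required splits are performed?
I added some values to make it easier to see how this answer works. The first few rows are below 0.2 but do not form a run of 5 or more consecutive values, so they are not "holidays". The same is true of rows 16-18, while rows 20-24 satisfy the condition. The output should therefore be "split_df_1" with rows 0-19, "holiday_1" with rows 20-24, and "split_df_2" with rows 25-32.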
import pandas as pd
values = [0.1,0.15,0.1,0.8,0.8,0.7,0.6,0.7,0.5,0.8,0.4,0.3,0.5,0.7,0.5,0.7,0.15,0.11,0.1,0.5,0.13,0.16,0.17,0.1,0.13,0.3,0.4,0.5,0.6,0.7,0.1,0.15,0.1]
df = pd.DataFrame(values, columns = ["values"])
df
# values
#0 0.10
#1 0.15
#2 0.10
#3 0.80
#4 0.80
#5 0.70
#6 0.60
#7 0.70
#8 0.50
#9 0.80
#10 0.40
#11 0.30
#12 0.50
#13 0.70
#14 0.50
#15 0.70
#16 0.15
#17 0.11
#18 0.10
#19 0.50
#20 0.13
#21 0.16
#22 0.17
#23 0.10
#24 0.13
#25 0.30
#26 0.40
#27 0.50
#28 0.60
#29 0.70
#30 0.10
#31 0.15
#32 0.10
The conditions and the other series you created:
# conditions
value_threshold = 0.2
count_threshold = 5
# under value_threshold bool
is_under_val_threshold = df["values"] < value_threshold
# grouped
thre_grouper = is_under_val_threshold.diff().ne(0).cumsum()
Count the size of each group in thre_grouper whose values are below value_threshold, and keep the groups with at least count_threshold rows:
# if the first value is less than value_threshold, then start from first group (index 0)
if (df["values"].iloc[0] < value_threshold):
    x = 0
# otherwise start from second (index 1)
else:
    x = 1
# potential holiday groups are every other group
holidays = thre_grouper[thre_grouper.isin(thre_grouper.unique()[x::2])].value_counts(sort=False)
# get group numbers of the runs with at least count_threshold rows (sorted so the loop sees them in order),
# then add 0 (before the first group) and max+1 (after the last group) as boundaries
is_holiday = [0] + sorted(holidays[holidays >= count_threshold].index) + [thre_grouper.max()+1]
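The counting step above can also be expressed per row, which avoids having to work out whether the below-threshold groups are the odd or the even ones. The following is a rough sketch on a shortened, made-up series, not the answer's own method; the names `is_under`, `run_ids`, `run_length` and `is_holiday_row` are introduced here for illustration only:

```python
import pandas as pd

# Shortened illustrative data (an assumption, not the series used above)
s = pd.Series([0.1, 0.15, 0.8, 0.7, 0.15, 0.11, 0.1, 0.13, 0.16, 0.5])
value_threshold = 0.2
count_threshold = 5

is_under = s < value_threshold
# Label each run of consecutive equal flags with its own id
run_ids = is_under.ne(is_under.shift()).cumsum()
# Length of the run each row belongs to
run_length = run_ids.map(run_ids.value_counts())
# A row is part of a holiday if it is below the threshold AND its run is long enough
is_holiday_row = is_under & (run_length >= count_threshold)

print(s.index[is_holiday_row].tolist())  # [4, 5, 6, 7, 8]
```

Rows 0-1 are below the threshold but form a run of only 2, so they are not flagged.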
Create the dataframes in a loop:
# dictionary to add dataframes to
d = {}
for i in range(1, len(is_holiday)):
    # split dataframes are those with group numbers between consecutive entries of the is_holiday list
    d["split_df_"+str(i)] = df.loc[thre_grouper[
        (thre_grouper > is_holiday[i-1]) &
        (thre_grouper < is_holiday[i])].index]
    # holiday dataframes are those that are in the is_holiday list but not the first or last
    if i not in (0, len(is_holiday)-1):
        d["holiday_"+str(i)] = df.loc[thre_grouper[
            thre_grouper == is_holiday[i]].index]
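For comparison, the whole procedure can be wrapped in one function that handles any number of holidays. This is a sketch under stated assumptions rather than a definitive implementation: `split_on_holidays` and its parameter names are hypothetical, the index is assumed to be unique, and holidays and non-holiday pieces are numbered independently in order of appearance.

```python
import pandas as pd

def split_on_holidays(df, column="values", value_threshold=0.2, count_threshold=5):
    """Split df into alternating non-holiday and holiday pieces (hypothetical helper)."""
    is_under = df[column] < value_threshold
    # Id for each run of consecutive rows on the same side of the threshold
    run_ids = is_under.ne(is_under.shift()).cumsum()
    run_length = run_ids.map(run_ids.value_counts())
    # Holiday rows: below the threshold and inside a long enough run
    is_holiday = is_under & (run_length >= count_threshold)
    # Consecutive rows with the same holiday flag form one output piece
    segment_ids = is_holiday.ne(is_holiday.shift()).cumsum()

    pieces = {}
    n_holiday = n_split = 0
    for _, segment in df.groupby(segment_ids):
        if is_holiday.loc[segment.index[0]]:
            n_holiday += 1
            pieces[f"holiday_{n_holiday}"] = segment
        else:
            n_split += 1
            pieces[f"split_df_{n_split}"] = segment
    return pieces

values = [0.1,0.15,0.1,0.8,0.8,0.7,0.6,0.7,0.5,0.8,0.4,0.3,0.5,0.7,0.5,0.7,
          0.15,0.11,0.1,0.5,0.13,0.16,0.17,0.1,0.13,0.3,0.4,0.5,0.6,0.7,0.1,0.15,0.1]
df = pd.DataFrame(values, columns=["values"])
pieces = split_on_holidays(df)
for name, piece in pieces.items():
    print(name, piece.index.min(), piece.index.max())
```

On the data from this answer it yields split_df_1 (rows 0-19), holiday_1 (rows 20-24) and split_df_2 (rows 25-32).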