How to extract specific ranges from a DataFrame, store them in another DataFrame, and remove them from the original | pandas



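I have some energy-consumption time series in which I can tell when someone is on holiday, because the consumption stays within a certain range. I have this code to extract the holidays:

Fake data:

import pandas as pd

values = [0.8,0.8,0.7,0.6,0.7,0.5,0.8,0.4,0.3,0.5,0.7,0.5,0.7,0.15,0.11,0.1,0.13,0.16,0.17,0.1,0.13,0.3,0.4,0.5,0.6,0.7]
df = pd.DataFrame(values, columns = ["values"])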

So df looks like this:

values
0     0.80
1     0.80
2     0.70
3     0.60
4     0.70
5     0.50
6     0.80
7     0.40
8     0.30
9     0.50
10    0.70
11    0.50
12    0.70
13    0.15
14    0.11
15    0.10
16    0.13
17    0.16
18    0.17
19    0.10
20    0.13
21    0.30
22    0.40
23    0.50
24    0.60
25    0.70

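Now, given the variables below, I want to detect every run of consecutive values that stays below value_threshold for at least count_threshold (here 5) time steps: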

value_threshold = 0.2
count_threshold  = 5

Check which values are below the threshold:

is_under_val_threshold = df["values"] < value_threshold

which gives me this:

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14     True
15     True
16     True
17     True
18     True
19     True
20     True
21    False
22    False
23    False
24    False
25    False
Now I can isolate the values below the threshold:

subset_thre = df.loc[is_under_val_threshold, "values"]

13    0.15
14    0.11
15    0.10
16    0.13
17    0.16
18    0.17
19    0.10
20    0.13

Since this can happen more than once, and the run is not always 5 steps or longer, I put each "sequence" into its own group:

thre_grouper = is_under_val_threshold.diff().ne(0).cumsum()

0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    2
14    2
15    2
16    2
17    2
18    2
19    2
20    2
21    3
22    3
23    3
24    3
25    3
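
The grouping step above can be sketched in a slightly different way to show where each run starts and ends. This is only a minimal sketch: it compares each flag with its shifted neighbour instead of calling diff(), and the names is_under, run_id and bounds are illustrative, not from the original code.

```python
import pandas as pd

values = [0.8, 0.8, 0.7, 0.6, 0.7, 0.5, 0.8, 0.4, 0.3, 0.5, 0.7, 0.5, 0.7,
          0.15, 0.11, 0.1, 0.13, 0.16, 0.17, 0.1, 0.13,
          0.3, 0.4, 0.5, 0.6, 0.7]
df = pd.DataFrame(values, columns=["values"])

is_under = df["values"] < 0.2

# Assumption: a new run starts wherever the flag differs from the previous row.
# The first row compares against NaN, so it always opens run 1.
run_id = is_under.ne(is_under.shift()).cumsum()

# First and last row label of every run (one line per run).
bounds = df.index.to_series().groupby(run_id).agg(["first", "last"])
print(bounds)
```

With the fake data above, the second run covers rows 13 to 20, which matches the block of True values shown earlier.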

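Now I want to extract the groups that stay below the threshold for at least 5 steps and create new DataFrames wherever the breaks are, so in this example I would end up with three DataFrames.

I tried:

Identify where the group switches happen: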

identify_switch = thre_grouper.diff().to_frame()
index_of_switch = identify_switch.index[identify_switch['values'] == 1].tolist()

This gives the indices where the switches happen:

[13, 21]

For this example, I can at least split the data the way I want:

holidays_1 = df[index_of_switch[0]:index_of_switch[1]]
split_df_1 = df[:index_of_switch[0]]
split_df_2 = df[index_of_switch[1]:]

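My question is: when looping over a series in which the number of holidays varies a lot, how do I make sure that all the required splits are performed?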

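I added some values to make it easier to see how this answer works. The first few rows are below 0.2, but not for 5 or more consecutive rows, so they are not "holidays". The same goes for rows 16-18, while rows 20-24 do meet the condition. The output should therefore be "split_df_1" with rows 0-19, "holiday_1" with rows 20-24, and "split_df_2" with rows 25-32.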

import pandas as pd
values = [0.1,0.15,0.1,0.8,0.8,0.7,0.6,0.7,0.5,0.8,0.4,0.3,0.5,0.7,0.5,0.7,0.15,0.11,0.1,0.5,0.13,0.16,0.17,0.1,0.13,0.3,0.4,0.5,0.6,0.7,0.1,0.15,0.1]
df = pd.DataFrame(values, columns = ["values"])
df
#    values
#0     0.10
#1     0.15
#2     0.10
#3     0.80
#4     0.80
#5     0.70
#6     0.60
#7     0.70
#8     0.50
#9     0.80
#10    0.40
#11    0.30
#12    0.50
#13    0.70
#14    0.50
#15    0.70
#16    0.15
#17    0.11
#18    0.10
#19    0.50
#20    0.13
#21    0.16
#22    0.17
#23    0.10
#24    0.13
#25    0.30
#26    0.40
#27    0.50
#28    0.60
#29    0.70
#30    0.10
#31    0.15
#32    0.10

The conditions and the other series you created:

# conditions
value_threshold = 0.2
count_threshold = 5
# under value_threshold bool
is_under_val_threshold = df["values"] < value_threshold
# grouped
thre_grouper = is_under_val_threshold.diff().ne(0).cumsum()

Count the rows of each group in thre_grouper that is below value_threshold, and keep the groups with at least count_threshold rows:

# if the first value is below value_threshold, the under-threshold groups start at position 0
if df["values"].iloc[0] < value_threshold:
    x = 0
# otherwise they start at position 1
else:
    x = 1
# potential holiday groups are every other group
holidays = thre_grouper[thre_grouper.isin(thre_grouper.unique()[x::2])].value_counts(sort=False)
# group numbers with at least count_threshold rows (sorted ascending), padded with 0 before the first group and max+1 after the last
is_holiday = [0] + sorted(holidays[holidays >= count_threshold].index) + [thre_grouper.max()+1]
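
As a cross-check on the every-other-group selection above, the runs can also be summarised one per row, with their length and whether they lie below the threshold. This is a sketch of an alternative, not the method used in this answer; the names run_id, run_length, run_is_under and holiday_ids are illustrative assumptions.

```python
import pandas as pd

values = [0.1, 0.15, 0.1, 0.8, 0.8, 0.7, 0.6, 0.7, 0.5, 0.8, 0.4, 0.3, 0.5,
          0.7, 0.5, 0.7, 0.15, 0.11, 0.1, 0.5, 0.13, 0.16, 0.17, 0.1, 0.13,
          0.3, 0.4, 0.5, 0.6, 0.7, 0.1, 0.15, 0.1]
df = pd.DataFrame(values, columns=["values"])

value_threshold = 0.2
count_threshold = 5

is_under = df["values"] < value_threshold
# Label consecutive runs of equal flags 1, 2, 3, ...
run_id = is_under.ne(is_under.shift()).cumsum()

# One entry per run: number of rows, and whether the run is below the threshold.
run_length = run_id.groupby(run_id).size()
run_is_under = is_under.groupby(run_id).all()

# A holiday is a run that is below the threshold and long enough.
mask = run_is_under & (run_length >= count_threshold)
holiday_ids = mask[mask].index.tolist()
print(holiday_ids)
```

Because every run carries its own under/over flag here, there is no need to work out whether the under-threshold runs sit at even or odd positions.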

Create the DataFrames in a loop:

# dictionary to add dataframes to
d = {}
for i in range(1, len(is_holiday)):
    # split dataframes are those with group numbers between consecutive entries of is_holiday
    d["split_df_"+str(i)] = df.loc[thre_grouper[
        (thre_grouper > is_holiday[i-1]) &
        (thre_grouper < is_holiday[i])].index]
    # holiday dataframes are the groups listed in is_holiday, excluding the padding entries at both ends
    if i not in (0, len(is_holiday)-1):
        d["holiday_"+str(i)] = df.loc[thre_grouper[
            thre_grouper == is_holiday[i]].index]
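
For a series with several holidays, the same idea can be wrapped in one function that flags holiday rows and then cuts the frame wherever that flag changes. This is a hedged sketch rather than a drop-in replacement: split_on_holidays is a hypothetical helper, the sample values are made up for illustration, and the index is assumed to be unique.

```python
import pandas as pd


def split_on_holidays(df, column, value_threshold, count_threshold):
    """Hypothetical helper: split df into alternating non-holiday and holiday pieces."""
    is_under = df[column] < value_threshold
    run_id = is_under.ne(is_under.shift()).cumsum()
    # Length of the run each row belongs to.
    run_length = run_id.groupby(run_id).transform("size")
    # A row is a holiday row if it is below the threshold in a long enough run.
    is_holiday = is_under & (run_length >= count_threshold)
    # Cut the frame wherever the holiday flag changes.
    segment_id = is_holiday.ne(is_holiday.shift()).cumsum()

    pieces = {}
    counters = {"holiday": 0, "split_df": 0}
    for _, segment in df.groupby(segment_id):
        kind = "holiday" if is_holiday.loc[segment.index[0]] else "split_df"
        counters[kind] += 1
        pieces[f"{kind}_{counters[kind]}"] = segment
    return pieces


# Made-up data with two holidays (rows 1-5 and 11-16) and one short dip (rows 8-9).
sample = pd.DataFrame(
    {"values": [0.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.6, 0.7, 0.1, 0.1, 0.9,
                0.1, 0.1, 0.1, 0.1, 0.1, 0.1]}
)
pieces = split_on_holidays(sample, "values", value_threshold=0.2, count_threshold=5)
for name, piece in pieces.items():
    print(name, piece.index.tolist())
```

Short dips such as rows 8-9 stay inside the surrounding non-holiday piece, so the number of pieces follows the number of holidays without any manual indexing.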
