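Suppose I have two columns of time-series data, "a" and "b", in a Pandas DataFrame. I want to create a third column indicating whether the difference between column "a" in the current period and column "b" in any of the next 5 periods increases by 8 or more before it decreases by 2 or more. Ideally I would do this with some form of df.rolling(5).apply() and without any loops, but I keep running into challenges.
For demonstration purposes I've written the logic out with a loop, but I'd really appreciate any guidance on how to do this more efficiently or elegantly. In practice, both the DataFrame and the window will be much larger.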
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'b': [1, 0, 9, 0, 15, 0, 20, 15, 23, 6]})
df['c'] = 0
window = 5
positive_thresh = 8
negative_thresh = -2
num_rows = df.shape[0]
for a_idx in range(num_rows):
    a_start = df.iloc[a_idx, 0]
    # Next `window` values of "b"; min (not max) keeps the slice to `window` rows
    b_roll = df.iloc[(a_idx + 1):min(a_idx + 1 + window, num_rows), 1]
    deltas = b_roll - a_start
    positives = deltas[deltas >= positive_thresh]
    negatives = deltas[deltas <= negative_thresh]
    # Index label of the first hit, or num_rows as a "never happens" sentinel
    first_pos_idx = positives.index[0] if len(positives) > 0 else num_rows
    first_neg_idx = negatives.index[0] if len(negatives) > 0 else num_rows
    if first_pos_idx < first_neg_idx:
        df.iloc[a_idx, 2] = 1
print(df)
    a   b  c
0   1   1  1
1   2   0  0
2   3   9  0
3   4   0  1
4   5  15  0
5   6   0  1
6   7  20  1
7   8  15  1
8   9  23  0
9  10   6  0
This is just a tricky mask to build, but here is one approach:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

window = 5
n_rows = df.shape[0]
# Pad with `window` NaN rows so that the sliding view has one full window per original row
dfa = df.reindex(np.arange(n_rows + window))
# Row i holds the `window` values of "b" that follow row i
b_roll = sliding_window_view(dfa["b"].to_numpy(), window)[1:]
diff = (b_roll.T - df["a"].to_numpy()).T  # next 5 "b" values minus the current "a"

pos = (diff >= 8)
pos_idx = pos.argmax(1)  # first position with a rise of 8 or more
pos_idx[pos.sum(1) == 0] = n_rows  # tell "first index is 0" apart from "no occurrence found"

neg = (diff <= -2)
neg_idx = window - neg[:, ::-1].argmax(1) - 1  # last occurrence within each row
neg_idx[neg.sum(1) == 0] = 0  # tell "index 0" apart from "no occurrence found"

df["c"] = (pos_idx < neg_idx).astype(int)
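To make the windowing step easier to follow, here is a small standalone sketch of what the NaN-padded sliding view contains for the sample "b" column. The names `padded` and `future` are mine, chosen for illustration, and `sliding_window_view` needs NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

b = np.array([1, 0, 9, 0, 15, 0, 20, 15, 23, 6], dtype=float)
window = 5

# Pad with NaN so that every row, including the last ones, gets a full-length window.
padded = np.concatenate([b, np.full(window, np.nan)])

# Dropping the first window shifts everything by one, so row i holds
# padded[i + 1 : i + 1 + window], i.e. the values that FOLLOW row i.
future = sliding_window_view(padded, window)[1:]

print(future.shape)  # one row per original element, `window` columns
print(future[0])     # the five values after b[0]
print(future[-1])    # only padding is left after the last element
```

Because comparisons against NaN evaluate to False, the padded cells never count as either a positive or a negative hit.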
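As you may notice, my suggested output doesn't quite match yours. I believe your snippet doesn't fully represent your description, but I may just be misunderstanding something in the logic.
Output: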
    a   b  c
0   1   1  0
1   2   0  1
2   3   9  1
3   4   0  1
4   5  15  0
5   6   0  0
6   7  20  0
7   8  15  1
8   9  23  0
9  10   6  0
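If what you actually want is the other reading (the first rise of 8 or more comes before the first drop of 2 or more within the next `window` rows, and a drop that never happens counts as "never"), a rough sketch along the same lines could look like this. The function name `first_hit_flags` and its parameters are hypothetical, and I have only checked it against the sample data from the question.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def first_hit_flags(df, window=5, positive_thresh=8, negative_thresh=-2):
    """Hypothetical helper: 1 where the first positive hit precedes the first negative hit."""
    a = df["a"].to_numpy(dtype=float)
    b = df["b"].to_numpy(dtype=float)

    # Row i of `future` holds the next `window` values of b after row i (NaN-padded).
    padded = np.concatenate([b, np.full(window, np.nan)])
    future = sliding_window_view(padded, window)[1:]

    deltas = future - a[:, None]
    pos = deltas >= positive_thresh  # NaN padding compares as False
    neg = deltas <= negative_thresh

    # Position of the first hit in each row; `window` is a sentinel for "no hit".
    first_pos = np.where(pos.any(axis=1), pos.argmax(axis=1), window)
    first_neg = np.where(neg.any(axis=1), neg.argmax(axis=1), window)
    return (first_pos < first_neg).astype(int)


df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "b": [1, 0, 9, 0, 15, 0, 20, 15, 23, 6]})
df["c"] = first_hit_flags(df)
print(df["c"].tolist())
```

On the sample data this gives the same "c" column as the loop in the question. The difference from my snippet above is that it compares first occurrences on both sides and uses the same sentinel (`window`) for a missing rise and a missing drop.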