I have a DataFrame of many tickets with open and close datetime64 columns. Here is a sample:
import pandas as pd

df = pd.DataFrame({
"open": ["03/27/14", "03/28/14", "03/31/14", "04/01/14", "04/02/14", "04/03/14", "04/04/14", "04/07/14", "04/08/14", "04/09/14", "05/10/14", "05/11/14", "05/14/14", "05/15/14", "05/16/14", "05/17/14", "05/21/14", "05/22/14", "05/25/14", "05/26/14"],
"close": ["04/01/14", "04/02/14", "04/03/14", "04/04/14", "04/07/14", "04/08/14", "04/09/14", "04/10/14", "04/11/14", "04/14/14", "05/15/14", "05/16/14", "05/17/14", "05/24/14", "05/24/14", "05/27/14", "05/27/14", "05/27/14", "05/28/14", "05/29/14"]
}).astype({
"open": "datetime64[ns]",
"close": "datetime64[ns]",
})
I need to know which tickets actually get opened, given that at most n tickets can be open at the same time.
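Note: tickets are only opened in the afternoon and only closed in the morning, so if a ticket's open date is the same as another ticket's close date, the two tickets are consecutive, not overlapping.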
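To make that rule concrete, here is a minimal sketch of the overlap test. The helper name `overlaps` is hypothetical (it is not part of the post); it only assumes that a ticket overlaps an earlier one when it opens strictly before the earlier one closes.

```python
import pandas as pd

def overlaps(open_date, earlier_close):
    """Hypothetical helper: does a ticket opening on open_date overlap
    an earlier ticket that closes on earlier_close?"""
    # Strict comparison: opening on the closing day is consecutive, not overlapping.
    return pd.Timestamp(open_date) < pd.Timestamp(earlier_close)

same_day = overlaps("2014-04-01", "2014-04-01")    # ticket 3 vs. ticket 0 in the sample
day_before = overlaps("2014-03-31", "2014-04-01")  # ticket 2 vs. ticket 0 in the sample
```

With the sample data, ticket 3 opens on the day ticket 0 closes, so the two do not overlap, while ticket 2 opens a day earlier and does.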
For example, using n=1, n=2, or n=3 should give the following results:
     en_date   ex_date    n=1    n=2    n=3
0   03/27/14  04/01/14   True   True   True
1   03/28/14  04/02/14  False   True   True
2   03/31/14  04/03/14  False  False   True
3   04/01/14  04/04/14   True   True   True
4   04/02/14  04/07/14  False   True   True
5   04/03/14  04/08/14  False  False   True
6   04/04/14  04/09/14   True   True   True
7   04/07/14  04/10/14  False   True   True
8   04/08/14  04/11/14  False  False   True
9   04/09/14  04/14/14   True   True   True
10  05/10/14  05/15/14   True   True   True
11  05/11/14  05/16/14  False   True   True
12  05/14/14  05/17/14  False  False   True
13  05/15/14  05/24/14   True   True   True
14  05/16/14  05/24/14  False   True   True
15  05/17/14  05/27/14  False  False   True
16  05/21/14  05/27/14  False  False  False
17  05/22/14  05/27/14  False  False  False
18  05/25/14  05/28/14   True   True   True
19  05/26/14  05/29/14  False   True   True
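I could write a for loop, but the DataFrames have millions of rows, so that would take a very long time.
What is the fastest way to compute this?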
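For reference, the plain-loop baseline mentioned above might look like the sketch below. It is only an illustration under stated assumptions: the function name `greedy_open` is hypothetical, rows are assumed to be sorted by open date, and a min-heap of close dates is one possible way to track the tickets that are still open.

```python
import heapq

import numpy as np

def greedy_open(opens, closes, n):
    """Loop baseline: accept a ticket only if fewer than n accepted tickets
    are still open when it opens. Assumes rows are sorted by open date."""
    opens = np.asarray(opens, dtype="datetime64[ns]").astype("int64")
    closes = np.asarray(closes, dtype="datetime64[ns]").astype("int64")
    accepted = np.zeros(len(opens), dtype=bool)
    active = []  # min-heap of close times of accepted tickets that are still open
    for i in range(len(opens)):
        # A ticket closing in the morning frees its slot for one opening that afternoon.
        while active and active[0] <= opens[i]:
            heapq.heappop(active)
        if len(active) < n:
            accepted[i] = True
            heapq.heappush(active, closes[i])
    return accepted

# First four rows of the sample, written as ISO dates.
demo_opens = ["2014-03-27", "2014-03-28", "2014-03-31", "2014-04-01"]
demo_closes = ["2014-04-01", "2014-04-02", "2014-04-03", "2014-04-04"]
demo_n1 = greedy_open(demo_opens, demo_closes, 1)
demo_n2 = greedy_open(demo_opens, demo_closes, 2)
```

On the full sample this would be called as `greedy_open(df["open"], df["close"], 2)`. Each row costs at most one push and one pop on a heap that never holds more than n entries, but the Python-level loop is still the part that is slow on millions of rows.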
from functools import reduce
import numpy as np

n = 2
# Number of previous tickets overlapping
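# NOTE: the source line is cut off after the second .str.replace(...); the `< 0`
# comparison, the tril offset of -1 and the row sum are an assumed reconstruction.
opens = df['open'].astype(str).str.replace('-', '').astype(int).values
closes = df['close'].astype(str).str.replace('-', '').astype(int).values
df['over'] = np.sum(np.tril(np.subtract.outer(opens, closes) < 0, -1), axis=1)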
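def reduce_it(acc, row):
    # acc = (array of accepted flags, index of the current ticket)
    i = acc[1]
    # Count accepted tickets among the `row` tickets immediately before ticket i
    prev_open = sum([acc[0][i-p] for p in range(1, row+1)])
    acc[0][i] = 1 if prev_open < n else 0
    return (acc[0], i+1)

# np.bool was removed in NumPy 1.24; the builtin bool is the equivalent dtype
df["n=2"], _ = reduce(reduce_it, df["over"].values, (np.empty(df.shape[0], dtype=bool), 0))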
Note: this is the fastest method I have found so far. If someone finds faster code, I will happily change the accepted answer.
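One possible way to avoid the quadratic np.subtract.outer matrix when counting the previous overlapping tickets is a binary search. The sketch below is not from the answer: the name `overlap_counts` is hypothetical, and it assumes that both the open and the close columns are non-decreasing (as they are in the sample) and that every ticket closes after it opens.

```python
import numpy as np

def overlap_counts(opens, closes):
    """For each ticket, count the earlier tickets that are still unclosed
    when it opens. Assumes opens and closes are both sorted ascending."""
    opens = np.asarray(opens, dtype="datetime64[ns]")
    closes = np.asarray(closes, dtype="datetime64[ns]")
    # Tickets closed on or before each open date (a same-day close counts as closed).
    closed_before = np.searchsorted(closes, opens, side="right")
    # With sorted closes, those are exactly the first closed_before rows,
    # so the remaining earlier rows are the overlapping ones.
    return np.arange(len(opens)) - closed_before

# First four rows of the sample, written as ISO dates.
demo_over = overlap_counts(
    ["2014-03-27", "2014-03-28", "2014-03-31", "2014-04-01"],
    ["2014-04-01", "2014-04-02", "2014-04-03", "2014-04-04"],
)
```

This needs O(N log N) time and O(N) memory instead of an N by N matrix, which matters at millions of rows. If the close column is not sorted, the shortcut does not hold and the counts would have to be computed differently.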