Check whether the date on one row is earlier than another date on the next row



I have the following code in Python:

import pandas as pd
import numpy as np
date_rng = pd.date_range(start='5/18/2019', end='7/22/2020', freq='S')
df = pd.DataFrame(date_rng, columns=['start_timestamp'])
df['end_timestamp'] = date_rng
df['start_timestamp'] = np.random.randint(1589760000,1595376000,size=(len(date_rng)))
df['end_timestamp'] = np.random.randint(1589760000,1595376000,size=(len(date_rng)))
df = df[(df.end_timestamp/df.start_timestamp<=1.000009)&(df.end_timestamp/df.start_timestamp>=1.000001)]
df = df.sort_values(by=['start_timestamp','end_timestamp'])
df['start_timestamp'] = pd.to_datetime(df['start_timestamp'],unit='s')
df['end_timestamp'] = pd.to_datetime(df['end_timestamp'],unit='s')

As a result, I have the following DataFrame:

start_timestamp     end_timestamp
2020-05-18 00:00:30 2020-05-18 00:54:07
2020-05-18 00:01:40 2020-05-18 03:50:39
2020-05-18 00:02:08 2020-05-18 02:39:41
2020-05-18 00:04:01 2020-05-18 00:47:25
2020-05-18 00:04:01 2020-05-18 02:26:50
2020-05-18 00:04:44 2020-05-18 02:17:53
.
.
.

What should I do to ensure that every end_timestamp in the dataset is a datetime earlier than the start_timestamp of the next row?

Implemented solution

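I basically converted the dataset to an array, flattened it into a single column, sorted it in ascending order, and then converted it back into a DataFrame. It may not be the most elegant solution, but it works correctly and produces consistent data for what I intend to use it for.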

import pandas as pd
import numpy as np
date_rng = pd.date_range(start='7/22/2019', end='7/22/2020', freq='S')
df = pd.DataFrame(date_rng, columns=['start_timestamp'])
df['end_timestamp'] = date_rng
df['start_timestamp'] = np.random.randint(1563753600,1595376000,size=(len(date_rng)))
df['end_timestamp'] = np.random.randint(1563753600,1595376000,size=(len(date_rng)))
df = df[(df.end_timestamp/df.start_timestamp<=1.0000009)&(df.end_timestamp/df.start_timestamp>=1.0000001)]
df = df.to_numpy()
df = df.reshape(df.shape[0]*2,1)
df = np.sort(df,axis=0)
df = df.reshape(int(df.shape[0]/2),2)
df = pd.DataFrame(df,columns=['start_timestamp','end_timestamp'])
df['start_timestamp'] = pd.to_datetime(df['start_timestamp'],unit='s')
df['end_timestamp'] = pd.to_datetime(df['end_timestamp'],unit='s')
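
To make the flatten, sort, and reshape step easier to follow, here is a minimal sketch on a toy array. The values in `pairs` are hypothetical and only stand in for epoch seconds; they are not taken from the dataset above.

```python
import numpy as np

# Hypothetical toy data: each row is (start, end) in epoch seconds
pairs = np.array([[50, 70],
                  [10, 95],
                  [30, 40]])

# Flatten to one dimension and sort all six values in ascending order
flat = np.sort(pairs.reshape(-1))

# Re-pair consecutive values: (1st, 2nd), (3rd, 4th), (5th, 6th)
ordered = flat.reshape(-1, 2)
print(ordered)
```

Because the flattened values are globally sorted, each end value is less than or equal to the start value of the next row. Note that this re-pairs the values, so the original (start, end) combinations are not preserved.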

I coded this following your logic, and it works fine. A few remarks:

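  1. freq='S' does not make sense: you generate as many rows as there are seconds between the start and end dates
  2. After randomising the start times, use the current and the next row's start as the bounds of the random function for the end time. This is done with a list comprehension
  3. Be a bit smarter about getting the UTC seconds for the start and end of the range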
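
To put a number on the first point, here is a rough sketch that computes the row count from the span of the range instead of materialising it. The variable names are illustrative only.

```python
import pandas as pd

start = pd.Timestamp('5/18/2019')
end = pd.Timestamp('7/22/2020')

# date_range includes both endpoints, so the row count is the span plus one
span_seconds = int((end - start).total_seconds())
rows_per_second = span_seconds + 1
rows_per_minute = span_seconds // 60 + 1

print(rows_per_second)  # 37238401
print(rows_per_minute)  # 620641
```

So a per-second range over these dates yields roughly 37 million rows before any filtering, which is why the code below uses a one-day range at minute frequency.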

import pandas as pd
import numpy as np
from datetime import datetime
# date_rng = pd.date_range(start='5/18/2019', end='7/22/2020', freq='S')
date_rng = pd.date_range(start='5/18/2019', end='5/19/2019', freq='min')
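# UTC epoch seconds of the first and last timestamp in the range,
# cast to int because np.random.randint expects integer bounds
sec = [int((date_rng.min() - datetime(1970, 1, 1)).total_seconds()),
       int((date_rng.max() - datetime(1970, 1, 1)).total_seconds())]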
df = pd.DataFrame(date_rng, columns=['start_timestamp'])
df['start_timestamp'] = np.random.randint(sec[0],sec[1],size=(len(date_rng)))
df = df.sort_values(by="start_timestamp")
# sorted random start times, plus the end of the range as upper bound for the last row
l = df["start_timestamp"].tolist() + [sec[1]]
# randomise each end time in [current start, next start);
# if two consecutive start times are equal, fall back to the start time itself
df['end_timestamp'] = [np.random.randint(l[i], l[i+1]) if l[i] < l[i+1] else l[i] for i in range(len(l) - 1)]
df['start_timestamp'] = pd.to_datetime(df['start_timestamp'],unit='s')
df['end_timestamp'] = pd.to_datetime(df['end_timestamp'],unit='s')
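
To check the property asked about in the question, one option is to compare each end_timestamp with the next row's start_timestamp via `Series.shift(-1)`. The helper name `ends_before_next_start` and the two small frames are hypothetical and only serve as an illustration; the timestamps in `bad` are the first two rows of the sample output in the question.

```python
import pandas as pd

def ends_before_next_start(df, strict=False):
    """Hypothetical helper: True if every end_timestamp is not later than
    (or, with strict=True, strictly earlier than) the next row's start_timestamp."""
    next_start = df['start_timestamp'].shift(-1)
    if strict:
        ok = df['end_timestamp'] < next_start
    else:
        ok = df['end_timestamp'] <= next_start
    # The last row has no successor (its shifted value is NaT), so skip it
    return bool(ok.iloc[:-1].all())

good = pd.DataFrame({
    'start_timestamp': pd.to_datetime(['2020-05-18 00:00:30', '2020-05-18 00:54:10']),
    'end_timestamp': pd.to_datetime(['2020-05-18 00:54:07', '2020-05-18 01:10:00']),
})
bad = pd.DataFrame({
    'start_timestamp': pd.to_datetime(['2020-05-18 00:00:30', '2020-05-18 00:01:40']),
    'end_timestamp': pd.to_datetime(['2020-05-18 00:54:07', '2020-05-18 03:50:39']),
})

print(ends_before_next_start(good))  # True
print(ends_before_next_start(bad))   # False
```

The non-strict comparison is the default here because duplicate start times can make an end time equal to the next start time; pass strict=True if equality should count as a violation.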
