Extracting the value 5 minutes after the current row



I have the following DataFrame:

import pandas as pd, numpy as np
df = pd.DataFrame({'Instrument':['A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M'], 
'Date':['2020-01-02','2020-01-02','2020-01-02','2020-01-02', '2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02'],
'Time':['09:59:50.024','09:59:50.623','10:00:15.124','10:00:15.148','10:00:42.552','10:04:49.782','10:05:14.896','10:05:41.719','10:05:45.050','10:08:19.170'], 
'midquote': [0.0,0.0,14.145,14.145,14.245,14.360,14.305,14.235,14.295,14.285]})

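I want to extract the midquote 5 minutes after each row's time. An exact 5-minute difference is not possible, so the closest time to 5 minutes (it must be less than 5 minutes) is preferred.

Desired output: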

pd.DataFrame({'Instrument':['A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M'], 
'Date':['2020-01-02','2020-01-02','2020-01-02','2020-01-02', '2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02'],
'Time':['09:59:50.024','09:59:50.623','10:00:15.124','10:00:15.148','10:00:42.552','10:04:49.782','10:05:14.896','10:05:41.719','10:05:45.050','10:08:19.170'], 
'midquote': [0.0,0.0,14.145,14.145,14.245,14.360,14.305,14.235,14.295,14.285], 
'timestamp':['2020-01-02 09:59:50.024000','2020-01-02 09:59:50.623000','2020-01-02 10:00:15.124000','2020-01-02 10:00:15.148000','2020-01-02 10:00:42.552000','2020-01-02 10:04:49.782000','2020-01-02 10:05:14.896000','2020-01-02 10:05:41.719000','2020-01-02 10:05:45.050000','2020-01-02 10:08:19.170000'],
'timestamp_5_min':['2020-01-02 10:04:49.782000','2020-01-02 10:04:49.782000','2020-01-02 10:05:14.896000','2020-01-02 10:05:14.896000','2020-01-02 10:05:41.719000','nan','nan','nan','nan','nan'],
'midquote_5_min':[14.360,14.305,14.235,14.295,14.285,'nan','nan','nan','nan','nan']})

I tried the following code, but it does not work:

df['timestamp'] = pd.to_datetime(df.Date + ' ' + df.Time)
df.sort_values(by=['timestamp'], inplace=True)
df['midquote'].fillna(0, inplace=True)
mid = df[['Instrument','timestamp','midquote']]
mid['5_min_timestamp'] = mid.timestamp
mid.rename(columns={'midquote':'5_min_midquote'}, inplace=True)
pd.merge_asof(df, mid, on='timestamp', by=['Instrument'], tolerance=pd.Timedelta("5min"), direction='forward', allow_exact_matches=False)

Also, if I do not call sort_values, it raises the error "left keys must be sorted".

Please help.

Thanks.

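The key is to use .searchsorted on column Time to look up Time + 5 minutes (called timestamp_5_min). searchsorted returns the position one past the closest earlier date/time, so subtracting 1 gives the index of that row. From there, I do some messy merging, column renaming and column dropping, which could be cleaner but gets the job done.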
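To make the index arithmetic concrete, here is a minimal sketch of what searchsorted returns on four timestamps taken from the sample data. The names times, targets and pos are only illustrative, and the series is assumed to be sorted already.

```python
import pandas as pd

# Four sorted timestamps from the question's sample data.
times = pd.Series(pd.to_datetime([
    "2020-01-02 09:59:50.024",
    "2020-01-02 10:00:42.552",
    "2020-01-02 10:04:49.782",
    "2020-01-02 10:05:14.896",
]))

# The moment we want to look up for each row.
targets = times + pd.Timedelta(minutes=5)

# With the default side="left", searchsorted returns the position of the
# first element that is >= the target. One step back is therefore the
# last element that is strictly earlier than the target.
pos = times.searchsorted(targets.to_numpy()) - 1
print(pos)
```

Rows whose target lies beyond the last timestamp all land on the final position, which is why those rows need to be masked afterwards.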

Input:

import pandas as pd, numpy as np, datetime as dt
df = pd.DataFrame({'Instrument':['A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M'], 
'Date':['2020-01-02','2020-01-02','2020-01-02','2020-01-02', '2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02'],
'Time':['09:59:50.024','09:59:50.623','10:00:15.124','10:00:15.148','10:00:42.552','10:04:49.782','10:05:14.896','10:05:41.719','10:05:45.050','10:08:19.170'], 
'midquote': [0.0,0.0,14.145,14.145,14.245,14.360,14.305,14.235,14.295,14.285]})

Code:

df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'])
df['timestamp_5_min'] = pd.to_datetime(df['Time'] + pd.Timedelta(minutes=5))
df['timestamp_5_min_new'] = df['Time'].searchsorted(df['timestamp_5_min'])-1
df1 = df[['Time', 'midquote']]
df2 = pd.merge(df,df1,how='left',left_on='timestamp_5_min_new', right_index=True)
df2.loc[df2['timestamp_5_min_new'] == max(df2['timestamp_5_min_new']), ['timestamp_5_min_new', 'Time_y', 'midquote_y']] = np.nan
df2 = df2.drop(['timestamp_5_min_new','timestamp_5_min', 'midquote_x'], axis=1).rename({'Time_x': 'Time', 'Time_y': 'timestamp_5_min', 'midquote_y' : 'midquote'}, axis=1)
df2

Output:

Instrument Date Time                    timestamp_5_min         midquote
0   A2M 2020-01-02  2020-07-16 09:59:50.024 2020-07-16 10:04:49.782 14.360
1   A2M 2020-01-02  2020-07-16 09:59:50.623 2020-07-16 10:04:49.782 14.360
2   A2M 2020-01-02  2020-07-16 10:00:15.124 2020-07-16 10:05:14.896 14.305
3   A2M 2020-01-02  2020-07-16 10:00:15.148 2020-07-16 10:05:14.896 14.305
4   A2M 2020-01-02  2020-07-16 10:00:42.552 2020-07-16 10:05:41.719 14.235
5   A2M 2020-01-02  2020-07-16 10:04:49.782 NaT                     NaN
6   A2M 2020-01-02  2020-07-16 10:05:14.896 NaT                     NaN
7   A2M 2020-01-02  2020-07-16 10:05:41.719 NaT                     NaN
8   A2M 2020-01-02  2020-07-16 10:05:45.050 NaT                     NaN
9   A2M 2020-01-02  2020-07-16 10:08:19.170 NaT                     NaN
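For comparison, here is a sketch of the same lookup written with pd.merge_asof from the question. It assumes a match only counts if it falls within the last minute before timestamp + 5 minutes. That 1-minute tolerance is my assumption, chosen because it reproduces the NaN rows above, and is not something the question specifies. The helper column target is also made up for the example. Building timestamp from Date and Time together keeps the real date, whereas pd.to_datetime on the time-only strings fills in the current date, as seen in the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "Instrument": ["A2M"] * 10,
    "Date": ["2020-01-02"] * 10,
    "Time": ["09:59:50.024", "09:59:50.623", "10:00:15.124", "10:00:15.148",
             "10:00:42.552", "10:04:49.782", "10:05:14.896", "10:05:41.719",
             "10:05:45.050", "10:08:19.170"],
    "midquote": [0.0, 0.0, 14.145, 14.145, 14.245,
                 14.360, 14.305, 14.235, 14.295, 14.285],
})
df["timestamp"] = pd.to_datetime(df["Date"] + " " + df["Time"])
df = df.sort_values("timestamp").reset_index(drop=True)

# Left side: each row asks for the quote at timestamp + 5 minutes.
left = df.assign(target=df["timestamp"] + pd.Timedelta(minutes=5))

# Right side: the candidate quotes, renamed so the columns do not collide.
right = df[["Instrument", "timestamp", "midquote"]].rename(
    columns={"timestamp": "timestamp_5_min", "midquote": "midquote_5_min"}
)

out = pd.merge_asof(
    left, right,
    left_on="target", right_on="timestamp_5_min",
    by="Instrument",
    direction="backward",               # last quote before the target
    allow_exact_matches=False,          # strictly earlier than the target
    tolerance=pd.Timedelta(minutes=1),  # assumed window, see the note above
).drop(columns="target")
print(out[["timestamp", "timestamp_5_min", "midquote_5_min"]])
```

Because by='Instrument' is passed, the same call should also handle several instruments, provided the frame is sorted by timestamp.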

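Answer #2, grouped by Instrument. This makes it more challenging, but the following gets the job done.

I have changed some of the sample data so that there are now two categories, and I have changed the time-difference requirement to 25 seconds.

Step #1 - Use .groupby, .transform with lambda x, and searchsorted together to find the index of the time value closest to 25 seconds later (25 seconds being the argument passed to pd.Timedelta).

Step #2 - The line that creates the dictionary a finds the maximum index for each Instrument group, which are 3 and 9 respectively.

Step #3 - From there, create a temp column that holds each instrument's maximum index and compare it with the index found in step #1. Where the found index is greater than the group's maximum index, there is no row in that group 25 seconds or more later, so those rows become NaN in the np.where() statements. From there, just clean up the columns and column names.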
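The three steps can also be sketched as one searchsorted call per group, which avoids comparing against a maximum index. This is an illustrative variant, not the code from this answer. lookup_after, timestamp_after and midquote_after are made-up names, and Time is assumed to be sorted within each Instrument. Unlike the answer's output, timestamp_after holds the matched row's actual Time rather than Time + 25 seconds.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Instrument": ["B3M"] * 4 + ["A2M"] * 6,
    "Time": pd.to_datetime([
        "2020-01-02 09:59:50.024", "2020-01-02 09:59:50.623",
        "2020-01-02 10:00:15.124", "2020-01-02 10:00:15.148",
        "2020-01-02 10:00:42.552", "2020-01-02 10:04:49.782",
        "2020-01-02 10:05:14.896", "2020-01-02 10:05:41.719",
        "2020-01-02 10:05:45.050", "2020-01-02 10:08:19.170",
    ]),
    "midquote": [0.0, 0.0, 14.145, 14.145, 14.245,
                 14.360, 14.305, 14.235, 14.295, 14.285],
})

def lookup_after(group, delta):
    """Illustrative helper: first row of `group` with Time >= Time + delta."""
    times = group["Time"]
    # Positions are local to the group, so they never point into another one.
    pos = times.searchsorted((times + delta).to_numpy())
    found = pos < len(group)                      # False: no later row exists
    picked = group.iloc[np.where(found, pos, 0)]  # row 0 is only a placeholder
    return pd.DataFrame({
        "timestamp_after": pd.Series(picked["Time"].to_numpy(),
                                     index=group.index).where(found),
        "midquote_after": pd.Series(picked["midquote"].to_numpy(),
                                    index=group.index).where(found),
    })

parts = [lookup_after(g, pd.Timedelta(seconds=25))
         for _, g in df.groupby("Instrument", sort=False)]
result = df.join(pd.concat(parts))
print(result)
```

Looping over the groups keeps the positions returned by searchsorted relative to each instrument, so a row can never pick up a quote from a different instrument.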

Input:

import pandas as pd, numpy as np, datetime as dt
df = pd.DataFrame({'Instrument':
['B3M','B3M','B3M','B3M','A2M','A2M','A2M','A2M','A2M','A2M'],
'Date':['2020-01-02','2020-01-02','2020-01-02','2020-01-02', '2020-01-02',
'2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02'],
'Time':['09:59:50.024','09:59:50.623','10:00:15.124','10:00:15.148','10:00:42.552',
'10:04:49.782','10:05:14.896','10:05:41.719','10:05:45.050','10:08:19.170'],
'midquote': [0.0,0.0,14.145,14.145,14.245,14.360,14.305,14.235,14.295,14.285]})
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'])

Code:

df['timestamp_5_min'] = pd.to_datetime(df['Time'] + pd.Timedelta(seconds=25))
df['timestamp_5_min_new'] = df.groupby(['Instrument'])['Time'].transform(lambda x: x).searchsorted(df.groupby(['Instrument'])['timestamp_5_min'].transform(lambda x: x))
a = {v: k for k, v in df['Instrument'].drop_duplicates(keep='last').to_dict().items()}
df['temp'] = df['Instrument'].map(a)
df1 = df[['Time', 'midquote']]
df2 = pd.merge(df,df1,how='left',left_on='timestamp_5_min_new', right_index=True)
df2['Time_y'] = np.where((df2['timestamp_5_min_new'] > df2['temp']), np.datetime64('NaT'), df2['timestamp_5_min'])
df2['midquote_y'] = np.where((df2['timestamp_5_min_new'] > df2['temp']), np.nan, df2['midquote_y'])
df2 = df2.drop(['timestamp_5_min_new','timestamp_5_min', 'midquote_x', 'temp'], axis=1).rename({'Time_x': 'Time', 'Time_y': 'timestamp_5_min', 'midquote_y' : 'midquote'}, axis=1)
df2

Output:

Instrument  Date    Time                    timestamp_5_min       midquote
0   B3M     2020-01-02  2020-07-17 09:59:50.024 2020-07-17 10:00:15.024 14.145
1   B3M     2020-01-02  2020-07-17 09:59:50.623 NaT                 NaN
2   B3M     2020-01-02  2020-07-17 10:00:15.124 NaT                 NaN
3   B3M     2020-01-02  2020-07-17 10:00:15.148 NaT                 NaN
4   A2M     2020-01-02  2020-07-17 10:00:42.552 2020-07-17 10:01:07.552 14.360
5   A2M     2020-01-02  2020-07-17 10:04:49.782 2020-07-17 10:05:14.782 14.305
6   A2M     2020-01-02  2020-07-17 10:05:14.896 2020-07-17 10:05:39.896 14.235
7   A2M     2020-01-02  2020-07-17 10:05:41.719 2020-07-17 10:06:06.719 14.285
8   A2M     2020-01-02  2020-07-17 10:05:45.050 2020-07-17 10:06:10.050 14.285
9   A2M     2020-01-02  2020-07-17 10:08:19.170 NaT                 NaN
