I have the following DataFrame:
import pandas as pd, numpy as np
df = pd.DataFrame({'Instrument':['A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M'],
'Date':['2020-01-02','2020-01-02','2020-01-02','2020-01-02', '2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02'],
'Time':['09:59:50.024','09:59:50.623','10:00:15.124','10:00:15.148','10:00:42.552','10:04:49.782','10:05:14.896','10:05:41.719','10:05:45.050','10:08:19.170'],
'midquote': [0.0,0.0,14.145,14.145,14.245,14.360,14.305,14.235,14.295,14.285]})
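I want to extract the midquote 5 minutes after the time of each row. An exact 5-minute difference will usually not exist, so I want the closest time to the 5-minute mark, and it must be less than 5 minutes.
Desired output: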
pd.DataFrame({'Instrument':['A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M'],
'Date':['2020-01-02','2020-01-02','2020-01-02','2020-01-02', '2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02'],
'Time':['09:59:50.024','09:59:50.623','10:00:15.124','10:00:15.148','10:00:42.552','10:04:49.782','10:05:14.896','10:05:41.719','10:05:45.050','10:08:19.170'],
'midquote': [0.0,0.0,14.145,14.145,14.245,14.360,14.305,14.235,14.295,14.285],
'timestamp':['2020-01-02 09:59:50.024000','2020-01-02 09:59:50.623000','2020-01-02 10:00:15.124000','2020-01-02 10:00:15.148000','2020-01-02 10:00:42.552000','2020-01-02 10:04:49.782000','2020-01-02 10:05:14.896000','2020-01-02 10:05:41.719000','2020-01-02 10:05:45.050000','2020-01-02 10:08:19.170000'],
'timestamp_5_min':['2020-01-02 10:04:49.782000','2020-01-02 10:04:49.782000','2020-01-02 10:05:14.896000','2020-01-02 10:05:14.896000','2020-01-02 10:05:41.719000','nan','nan','nan','nan','nan'],
'midquote_5_min':[14.360,14.305,14.235,14.295,14.285,'nan','nan','nan','nan','nan']})
I tried the following code, but it does not work:
df['timestamp'] = pd.to_datetime(df.Date + ' ' + df.Time)
df.sort_values(by=['timestamp'], inplace=True)
df['midquote'].fillna(0, inplace=True)
mid = df[['Instrument','timestamp','midquote']]
mid['5_min_timestamp'] = mid.timestamp
mid.rename(columns={'midquote':'5_min_midquote'}, inplace=True)
pd.merge_asof(df, mid, on='timestamp', by=['Instrument'], tolerance=pd.Timedelta("5min"), direction='forward', allow_exact_matches=False)
Also, if I don't call sort_values, it raises the error "left keys must be sorted".
Please help.
Thanks.
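The crux of the problem is using .searchsorted to find, for each Time + 5 minutes value (called timestamp_5_min), its insertion position in the sorted Time column. Subtracting 1 from that position gives the index of the closest row that is earlier than the 5-minute mark. From there, I did some messy merging, column renaming and column dropping, which could probably be cleaner, but gets the job done.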
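To make the index arithmetic concrete, here is a minimal sketch of what searchsorted returns on a handful of timestamps taken from the question's data. The variable names (times, targets, pos, idx) are illustrative and not part of the answer's code.

```python
import pandas as pd

# A sorted subset of the timestamps from the question (illustrative only).
times = pd.Series(pd.to_datetime([
    '2020-01-02 09:59:50.024',
    '2020-01-02 10:00:15.124',
    '2020-01-02 10:04:49.782',
    '2020-01-02 10:05:14.896',
    '2020-01-02 10:08:19.170',
]))

# The moment 5 minutes after each row.
targets = times + pd.Timedelta(minutes=5)

# With the default side='left', searchsorted returns how many rows are
# strictly earlier than each target (its insertion position).
pos = times.searchsorted(targets.to_numpy())

# Subtracting 1 therefore points at the last row earlier than the target.
idx = pos - 1
print(list(idx))  # [2, 3, 4, 4, 4]
```

Note that every target later than the final row also maps to the last index, which is why the code below has to blank out those rows separately.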
Input:
import pandas as pd, numpy as np, datetime as dt
df = pd.DataFrame({'Instrument':['A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M'],
'Date':['2020-01-02','2020-01-02','2020-01-02','2020-01-02', '2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02'],
'Time':['09:59:50.024','09:59:50.623','10:00:15.124','10:00:15.148','10:00:42.552','10:04:49.782','10:05:14.896','10:05:41.719','10:05:45.050','10:08:19.170'],
'midquote': [0.0,0.0,14.145,14.145,14.245,14.360,14.305,14.235,14.295,14.285]})
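Code:
df['Date'] = pd.to_datetime(df['Date'])
# Time has no date part, so pd.to_datetime fills in the current date (2020-07-16 in the output below)
df['Time'] = pd.to_datetime(df['Time'])
# Target time to look up for each row
df['timestamp_5_min'] = df['Time'] + pd.Timedelta(minutes=5)
# Insertion position of each target in the sorted Time column, minus 1 = last row before the target
df['timestamp_5_min_new'] = df['Time'].searchsorted(df['timestamp_5_min'])-1
df1 = df[['Time', 'midquote']]
df2 = pd.merge(df,df1,how='left',left_on='timestamp_5_min_new', right_index=True)
# Rows pointing at the last index are treated as having no data 5 minutes ahead, so blank them out
df2.loc[df2['timestamp_5_min_new'] == max(df2['timestamp_5_min_new']), ['timestamp_5_min_new', 'Time_y', 'midquote_y']] = np.nan
# Drop the helper columns and restore the original column names
df2 = df2.drop(['timestamp_5_min_new','timestamp_5_min', 'midquote_x'], axis=1).rename({'Time_x': 'Time', 'Time_y': 'timestamp_5_min', 'midquote_y' : 'midquote'}, axis=1)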
df2
Output:
  Instrument       Date                    Time         timestamp_5_min midquote
0        A2M 2020-01-02 2020-07-16 09:59:50.024 2020-07-16 10:04:49.782   14.360
1        A2M 2020-01-02 2020-07-16 09:59:50.623 2020-07-16 10:04:49.782   14.360
2        A2M 2020-01-02 2020-07-16 10:00:15.124 2020-07-16 10:05:14.896   14.305
3        A2M 2020-01-02 2020-07-16 10:00:15.148 2020-07-16 10:05:14.896   14.305
4        A2M 2020-01-02 2020-07-16 10:00:42.552 2020-07-16 10:05:41.719   14.235
5        A2M 2020-01-02 2020-07-16 10:04:49.782                     NaT      NaN
6        A2M 2020-01-02 2020-07-16 10:05:14.896                     NaT      NaN
7        A2M 2020-01-02 2020-07-16 10:05:41.719                     NaT      NaN
8        A2M 2020-01-02 2020-07-16 10:05:45.050                     NaT      NaN
9        A2M 2020-01-02 2020-07-16 10:08:19.170                     NaT      NaN
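Answer #2, grouped by Instrument. This makes it more challenging, but the following gets the job done.
I have changed some of the sample data so that there are now two categories, and I have changed the time-difference requirement to 25 seconds.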
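Step #1 - Use .groupby, .transform, lambda x and searchsorted together to find the index of the time value closest to 25 seconds later (25 seconds being the argument passed to pd.Timedelta).
Step #2 - The line where I create the dictionary a is basically there to find the max index per group of Instrument, which are 3 and 9 respectively.
Step #3 - From there, create a temp column that holds the max index for each instrument, and check whether the timestamp_5_min_new column is greater than that max index. If it is, there is no row in that group at or beyond 25 seconds later, so those rows become NaN in the np.where() statement. From there, just clean up columns and column names.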
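The dictionary from step #2 can be inspected on its own. The sketch below rebuilds only the Instrument column from the sample data; the names instruments, last_rows and temp are illustrative. It assumes a default 0..n-1 index and that each instrument's rows are contiguous, as in the sample.

```python
import pandas as pd

# Instrument column of the sample data (default RangeIndex 0..9).
instruments = pd.Series(['B3M', 'B3M', 'B3M', 'B3M',
                         'A2M', 'A2M', 'A2M', 'A2M', 'A2M', 'A2M'])

# Keeping only the last occurrence of each value leaves one row per
# instrument, labelled with the index of that instrument's final row.
last_rows = instruments.drop_duplicates(keep='last')

# Invert {index: instrument} into {instrument: index}.
a = {v: k for k, v in last_rows.to_dict().items()}
print(a)  # {'B3M': 3, 'A2M': 9}

# Mapping the dictionary back gives every row its group's last index.
temp = instruments.map(a)
print(temp.tolist())  # [3, 3, 3, 3, 9, 9, 9, 9, 9, 9]
```

Any lookup position greater than a row's temp value falls outside that row's instrument, which is the condition step #3 tests.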
Input:
import pandas as pd, numpy as np, datetime as dt
df = pd.DataFrame({'Instrument':
['B3M','B3M','B3M','B3M','A2M','A2M','A2M','A2M','A2M','A2M'],
'Date':['2020-01-02','2020-01-02','2020-01-02','2020-01-02', '2020-01-02',
'2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02'],
'Time':['09:59:50.024','09:59:50.623','10:00:15.124','10:00:15.148','10:00:42.552',
'10:04:49.782','10:05:14.896','10:05:41.719','10:05:45.050','10:08:19.170'],
'midquote': [0.0,0.0,14.145,14.145,14.245,14.360,14.305,14.235,14.295,14.285]})
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'])
Code:
# Target time: 25 seconds after each row
df['timestamp_5_min'] = pd.to_datetime(df['Time'] + pd.Timedelta(seconds=25))
# Position of the first row at or after each target (the identity transforms leave the columns unchanged)
df['timestamp_5_min_new'] = df.groupby(['Instrument'])['Time'].transform(lambda x: x).searchsorted(df.groupby(['Instrument'])['timestamp_5_min'].transform(lambda x: x))
# Last index of each instrument: {'B3M': 3, 'A2M': 9}
a = {v: k for k, v in df['Instrument'].drop_duplicates(keep='last').to_dict().items()}
df['temp'] = df['Instrument'].map(a)
df1 = df[['Time', 'midquote']]
df2 = pd.merge(df,df1,how='left',left_on='timestamp_5_min_new', right_index=True)
# A position past the group's last index means there is no match within the same instrument
df2['Time_y'] = np.where((df2['timestamp_5_min_new'] > df2['temp']), np.datetime64('NaT'), df2['timestamp_5_min'])
df2['midquote_y'] = np.where((df2['timestamp_5_min_new'] > df2['temp']), np.nan, df2['midquote_y'])
# Drop the helper columns and restore the original column names
df2 = df2.drop(['timestamp_5_min_new','timestamp_5_min', 'midquote_x', 'temp'], axis=1).rename({'Time_x': 'Time', 'Time_y': 'timestamp_5_min', 'midquote_y' : 'midquote'}, axis=1)
df2
Output:
  Instrument       Date                    Time         timestamp_5_min midquote
0        B3M 2020-01-02 2020-07-17 09:59:50.024 2020-07-17 10:00:15.024   14.145
1        B3M 2020-01-02 2020-07-17 09:59:50.623                     NaT      NaN
2        B3M 2020-01-02 2020-07-17 10:00:15.124                     NaT      NaN
3        B3M 2020-01-02 2020-07-17 10:00:15.148                     NaT      NaN
4        A2M 2020-01-02 2020-07-17 10:00:42.552 2020-07-17 10:01:07.552   14.360
5        A2M 2020-01-02 2020-07-17 10:04:49.782 2020-07-17 10:05:14.782   14.305
6        A2M 2020-01-02 2020-07-17 10:05:14.896 2020-07-17 10:05:39.896   14.235
7        A2M 2020-01-02 2020-07-17 10:05:41.719 2020-07-17 10:06:06.719   14.285
8        A2M 2020-01-02 2020-07-17 10:05:45.050 2020-07-17 10:06:10.050   14.285
9        A2M 2020-01-02 2020-07-17 10:08:19.170                     NaT      NaN
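As an alternative sketch, the merge_asof attempt from the question can probably be adapted by matching on a shifted key instead of the raw timestamp. This is not the approach used in the answers above. The target column and the 1-minute tolerance are assumptions made for illustration: the tolerance decides how far before the 5-minute mark a quote may be and still count as a match, and 1 minute happens to reproduce the NaN rows of the desired output on the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    'Instrument': ['A2M'] * 10,
    'Date': ['2020-01-02'] * 10,
    'Time': ['09:59:50.024', '09:59:50.623', '10:00:15.124', '10:00:15.148',
             '10:00:42.552', '10:04:49.782', '10:05:14.896', '10:05:41.719',
             '10:05:45.050', '10:08:19.170'],
    'midquote': [0.0, 0.0, 14.145, 14.145, 14.245,
                 14.360, 14.305, 14.235, 14.295, 14.285],
})

# Build a full timestamp from Date and Time, then sort (merge_asof needs sorted keys).
df['timestamp'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df = df.sort_values('timestamp').reset_index(drop=True)

# Hypothetical helper column: the moment 5 minutes after each row.
df['target'] = df['timestamp'] + pd.Timedelta(minutes=5)

# Right-hand side: the same quotes under the output column names.
right = df[['Instrument', 'timestamp', 'midquote']].rename(
    columns={'timestamp': 'timestamp_5_min', 'midquote': 'midquote_5_min'})

out = pd.merge_asof(
    df, right,
    left_on='target', right_on='timestamp_5_min',
    by='Instrument',                   # match only within the same instrument
    direction='backward',              # last quote before the target
    allow_exact_matches=False,         # strictly less than 5 minutes later
    tolerance=pd.Timedelta('1min'),    # assumed cut-off, adjust to taste
)
print(out[['timestamp', 'timestamp_5_min', 'midquote_5_min']])
```

On this data the first five rows pick up 14.360, 14.360, 14.305, 14.305 and 14.235, and the remaining rows are left empty because no quote falls within one minute before their 5-minute mark.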