Python Pandas矢量化从不同大小中查找匹配的日期时间



因此,我认为在Python Pandas中解决的一个简单问题的解决方案非常缓慢,原因很明显。

想象两个数据帧:

stt       = datetime.strptime(args.stt_str, '%Y-%m-%d')
stp       = datetime.strptime(args.stp_str, '%Y-%m-%d')
time_loc  = pd.date_range(stt, stp, freq='T')
col_names = ['time_loc'     ,
'time_utc'     ,
'ObsType'      ,
'Station'      ,
'WindDir'      ,
'WindSpd'      ,
'WindGst'      ,
'T'            ,
'Td'           ,
'MSLP'         ,
'PrecipRate'   ,
'PrecipAccm9am',
'PrecipAccm24h',
'CldAWS'       ,
'VisAWS'       ,
'VisObs'       ]
DF             = pd.DataFrame(columns=col_names)
DF['time_loc'] = time_loc

这是第一个数据帧。它比下一个数据帧大得多。这个大数据帧有一分钟的时间步长。下面的数据帧";df";,具有与大数据帧相同的列名;DF";,但包含以不同间隔采样的数据,并且大小总是小于大的。大多数时候采样率较小;df";,数据帧是每30分钟一次,但有时它可能小于30分钟,但永远不会小于一分钟。好吧,希望你能明白这是怎么回事。

下面的代码是我目前正在做的工作,以在大数据帧中找到匹配的日期;DF";,对于小数据帧;df";。

df = aus_metar.parseMETAR(f,pathIn)
for l1, row in df.iterrows():
df_idx = df.iloc[l1]['time_loc']
DF_idx = DF['time_loc'].index[DF['time_loc'] == df_idx]
DF.iloc[DF_idx, DF.columns.get_loc('time_utc')]      = df.iloc[l1]['time_utc']
DF.iloc[DF_idx, DF.columns.get_loc('ObsType')]       = df.iloc[l1]['ObsType']
DF.iloc[DF_idx, DF.columns.get_loc('Station')]       = df.iloc[l1]['Station']
DF.iloc[DF_idx, DF.columns.get_loc('WindDir')]       = df.iloc[l1]['WindDir']
DF.iloc[DF_idx, DF.columns.get_loc('WindSpd')]       = df.iloc[l1]['WindSpd']
DF.iloc[DF_idx, DF.columns.get_loc('WindGst')]       = df.iloc[l1]['WindGst']
DF.iloc[DF_idx, DF.columns.get_loc('T')]             = df.iloc[l1]['T']
DF.iloc[DF_idx, DF.columns.get_loc('Td')]            = df.iloc[l1]['Td']
DF.iloc[DF_idx, DF.columns.get_loc('MSLP')]          = df.iloc[l1]['MSLP']
DF.iloc[DF_idx, DF.columns.get_loc('PrecipRate')]    = df.iloc[l1]['PrecipRate']
DF.iloc[DF_idx, DF.columns.get_loc('PrecipAccm9am')] = df.iloc[l1]['PrecipAccm9am']
DF.iloc[DF_idx, DF.columns.get_loc('PrecipAccm24h')] = df.iloc[l1]['PrecipAccm24h']
DF.iloc[DF_idx, DF.columns.get_loc('CldAWS')]        = df.iloc[l1]['CldAWS']
DF.iloc[DF_idx, DF.columns.get_loc('VisAWS')]        = df.iloc[l1]['VisAWS']
DF.iloc[DF_idx, DF.columns.get_loc('VisObs')]        = df.iloc[l1]['VisObs']

这种解决方案是缓慢的。真的,真的很慢!

有人对如何加速甚至矢量化有什么好主意吗?

提前谢谢。

更新代码

stt       = datetime.strptime('2018-12-06 11:30', '%Y-%m-%d %H:%M')
stp       = datetime.strptime('2020-10-31 23:30', '%Y-%m-%d %H:%M')
col_names = ['time_loc'     ,
'time_utc'     ,
'ObsType'      ,
'Station'      ,
'WindDir'      ,
'WindSpd'      ,
'WindGst'      ,
'T'            ,
'Td'           ,
'MSLP'         ,
'PrecipRate'   ,
'PrecipAccm9am',
'PrecipAccm24h',
'CldAWS'       ,
'VisAWS'       ,
'VisObs'       ]
DF = []    
for f in files:
df = pd.read_pickle(os.path.join(pklPath,f))
# df is matrix of 15 columns and N rows with datetimeindex
# max sampling rate (frequency) of N is no less than 1 minute
DF.append(df)
DF2 = pd.DataFrame(columns=col_names, index=idx)
DF2.loc[DF[0].index] = DF[0].values

结果:

***键错误:"[时间戳('2018-12-04 09:30:00'(,时间戳('2018-12-01 12:30:00'(,…时间戳(‘2018-12-02 13:00:00'(]不在索引中";

以下是我要做的,将日期时间设置为两个数据帧的索引,并将较小数据帧的值设置为较大数据帧的切片。这意味着列的排序方式相同。如果它们有相同的名称,你可以简单地:

big_df = big_df.reindex(columns=small_df.columns)

既然这是不可能的,下面是我的解决方案和一个例子。

import pandas as pd
import numpy as np
rng = np.random.default_rng()
n = 100
N = 600
vals_5min = rng.integers(-100, -20, n)
colors_5min = rng.choice(["blue", "red", "yellow"], n)
vals = rng.integers(1, 100, N)
colors = rng.choice(["green", "white", "brown"], N)
idx_5min = pd.date_range("2010-01-01", periods=n, freq='5min')
idx = pd.date_range("2010-01-01", periods=N, freq='1min')
df_5min = pd.DataFrame({
"vals_5min": vals_5min,
"color": colors_5min
},
index=idx_5min)
df = pd.DataFrame({"vals": vals, "color": colors}, index=idx)
df.loc[df_5min.index] = df_5min.values

逻辑发生在最后,使用时间索引,您只能在大df与小df具有相同索引的地方对其进行切片。然后将该切片设置为小df的值。在我的示例中,只有小df具有负值,并且颜色不匹配,在最后一次操作之后,头看起来是这样的。

vals   color
2010-01-01 00:00:00   -49  yellow
2010-01-01 00:01:00    95   white
2010-01-01 00:02:00    67   brown
2010-01-01 00:03:00    43   brown
2010-01-01 00:04:00    81   white
2010-01-01 00:05:00   -80    blue
2010-01-01 00:06:00    88   green
2010-01-01 00:07:00    41   white
2010-01-01 00:08:00    54   green
2010-01-01 00:09:00    91   brown
2010-01-01 00:10:00   -66     red
2010-01-01 00:11:00    73   white
2010-01-01 00:12:00    78   brown
2010-01-01 00:13:00    18   white
2010-01-01 00:14:00    69   white
2010-01-01 00:15:00   -53     red
2010-01-01 00:16:00    76   white
2010-01-01 00:17:00    16   green
2010-01-01 00:18:00    54   brown
2010-01-01 00:19:00    12   green
2010-01-01 00:20:00   -37  yellow

最新更新