因此,我认为在Python Pandas中解决的一个简单问题的解决方案非常缓慢,原因很明显。
想象两个数据帧:
stt = datetime.strptime(args.stt_str, '%Y-%m-%d')
stp = datetime.strptime(args.stp_str, '%Y-%m-%d')
time_loc = pd.date_range(stt, stp, freq='T')
col_names = ['time_loc' ,
'time_utc' ,
'ObsType' ,
'Station' ,
'WindDir' ,
'WindSpd' ,
'WindGst' ,
'T' ,
'Td' ,
'MSLP' ,
'PrecipRate' ,
'PrecipAccm9am',
'PrecipAccm24h',
'CldAWS' ,
'VisAWS' ,
'VisObs' ]
DF = pd.DataFrame(columns=col_names)
DF['time_loc'] = time_loc
这是第一个数据帧。它比下一个数据帧大得多。这个大数据帧有一分钟的时间步长。下面的数据帧";df";,具有与大数据帧相同的列名;DF";,但包含以不同间隔采样的数据,并且大小总是小于大的。大多数时候采样率较小;df";,数据帧是每30分钟一次,但有时它可能小于30分钟,但永远不会小于一分钟。好吧,希望你能明白这是怎么回事。
下面的代码是我目前正在做的工作,以在大数据帧中找到匹配的日期;DF";,对于小数据帧;df";。
df = aus_metar.parseMETAR(f,pathIn)
for l1, row in df.iterrows():
df_idx = df.iloc[l1]['time_loc']
DF_idx = DF['time_loc'].index[DF['time_loc'] == df_idx]
DF.iloc[DF_idx, DF.columns.get_loc('time_utc')] = df.iloc[l1]['time_utc']
DF.iloc[DF_idx, DF.columns.get_loc('ObsType')] = df.iloc[l1]['ObsType']
DF.iloc[DF_idx, DF.columns.get_loc('Station')] = df.iloc[l1]['Station']
DF.iloc[DF_idx, DF.columns.get_loc('WindDir')] = df.iloc[l1]['WindDir']
DF.iloc[DF_idx, DF.columns.get_loc('WindSpd')] = df.iloc[l1]['WindSpd']
DF.iloc[DF_idx, DF.columns.get_loc('WindGst')] = df.iloc[l1]['WindGst']
DF.iloc[DF_idx, DF.columns.get_loc('T')] = df.iloc[l1]['T']
DF.iloc[DF_idx, DF.columns.get_loc('Td')] = df.iloc[l1]['Td']
DF.iloc[DF_idx, DF.columns.get_loc('MSLP')] = df.iloc[l1]['MSLP']
DF.iloc[DF_idx, DF.columns.get_loc('PrecipRate')] = df.iloc[l1]['PrecipRate']
DF.iloc[DF_idx, DF.columns.get_loc('PrecipAccm9am')] = df.iloc[l1]['PrecipAccm9am']
DF.iloc[DF_idx, DF.columns.get_loc('PrecipAccm24h')] = df.iloc[l1]['PrecipAccm24h']
DF.iloc[DF_idx, DF.columns.get_loc('CldAWS')] = df.iloc[l1]['CldAWS']
DF.iloc[DF_idx, DF.columns.get_loc('VisAWS')] = df.iloc[l1]['VisAWS']
DF.iloc[DF_idx, DF.columns.get_loc('VisObs')] = df.iloc[l1]['VisObs']
这种解决方案是缓慢的。真的,真的很慢!
有人对如何加速甚至矢量化有什么好主意吗?
提前谢谢。
更新代码
stt = datetime.strptime('2018-12-06 11:30', '%Y-%m-%d %H:%M')
stp = datetime.strptime('2020-10-31 23:30', '%Y-%m-%d %H:%M')
col_names = ['time_loc' ,
'time_utc' ,
'ObsType' ,
'Station' ,
'WindDir' ,
'WindSpd' ,
'WindGst' ,
'T' ,
'Td' ,
'MSLP' ,
'PrecipRate' ,
'PrecipAccm9am',
'PrecipAccm24h',
'CldAWS' ,
'VisAWS' ,
'VisObs' ]
DF = []
for f in files:
df = pd.read_pickle(os.path.join(pklPath,f))
# df is matrix of 15 columns and N rows with datetimeindex
# max sampling rate (frequency) of N is no less than 1 minute
DF.append(df)
DF2 = pd.DataFrame(columns=col_names, index=idx)
DF2.loc[DF[0].index] = DF[0].values
结果:
***键错误:"[时间戳('2018-12-04 09:30:00'(,时间戳('2018-12-01 12:30:00'(,…时间戳(‘2018-12-02 13:00:00'(]不在索引中";
以下是我要做的,将日期时间设置为两个数据帧的索引,并将较小数据帧的值设置为较大数据帧的切片。这意味着列的排序方式相同。如果它们有相同的名称,你可以简单地:
big_df = big_df.reindex(columns=small_df.columns)
既然这是不可能的,下面是我的解决方案和一个例子。
import pandas as pd
import numpy as np
rng = np.random.default_rng()
n = 100
N = 600
vals_5min = rng.integers(-100, -20, n)
colors_5min = rng.choice(["blue", "red", "yellow"], n)
vals = rng.integers(1, 100, N)
colors = rng.choice(["green", "white", "brown"], N)
idx_5min = pd.date_range("2010-01-01", periods=n, freq='5min')
idx = pd.date_range("2010-01-01", periods=N, freq='1min')
df_5min = pd.DataFrame({
"vals_5min": vals_5min,
"color": colors_5min
},
index=idx_5min)
df = pd.DataFrame({"vals": vals, "color": colors}, index=idx)
df.loc[df_5min.index] = df_5min.values
逻辑发生在最后,使用时间索引,您只能在大df与小df具有相同索引的地方对其进行切片。然后将该切片设置为小df的值。在我的示例中,只有小df具有负值,并且颜色不匹配,在最后一次操作之后,头看起来是这样的。
vals color
2010-01-01 00:00:00 -49 yellow
2010-01-01 00:01:00 95 white
2010-01-01 00:02:00 67 brown
2010-01-01 00:03:00 43 brown
2010-01-01 00:04:00 81 white
2010-01-01 00:05:00 -80 blue
2010-01-01 00:06:00 88 green
2010-01-01 00:07:00 41 white
2010-01-01 00:08:00 54 green
2010-01-01 00:09:00 91 brown
2010-01-01 00:10:00 -66 red
2010-01-01 00:11:00 73 white
2010-01-01 00:12:00 78 brown
2010-01-01 00:13:00 18 white
2010-01-01 00:14:00 69 white
2010-01-01 00:15:00 -53 red
2010-01-01 00:16:00 76 white
2010-01-01 00:17:00 16 green
2010-01-01 00:18:00 54 brown
2010-01-01 00:19:00 12 green
2010-01-01 00:20:00 -37 yellow