How to speed up xarray resampling (much slower than pandas resampling)



Here is an MWE for resampling a time series in xarray vs. pandas. The 10Min resampling takes 6.8 s in xarray and 0.003 s in pandas. Is there any way to get pandas speed in xarray? Pandas resampling appears to be independent of the input period, while xarray's runtime scales with it.

import numpy as np
import xarray as xr
import pandas as pd
import time

def make_ds(freq):
    size = 100000
    times = pd.date_range('2000-01-01', periods=size, freq=freq)
    ds = xr.Dataset({
        'foo': xr.DataArray(
            data   = np.random.random(size),
            dims   = ['time'],
            coords = {'time': times}
        )})
    return ds

for f in ["1s", "1Min", "10Min"]:
    ds = make_ds(f)
    start = time.time()
    ds_r = ds.resample({'time': "1H"}).mean()
    print(f, 'xr', str(time.time() - start))
    start = time.time()
    ds_r = ds.to_dataframe().resample("1H").mean()
    print(f, 'pd', str(time.time() - start))
1s xr 0.040313720703125
1s pd 0.0033435821533203125
1Min xr 0.5757267475128174
1Min pd 0.0025794506072998047
10Min xr 6.798743486404419
10Min pd 0.0029947757720947266

According to an xarray GitHub issue, this is an implementation problem; the workaround is to do the resampling (which is really a GroupBy) in other code. My solution is to use the fast pandas resampling and then rebuild the xarray Dataset:

df_h = ds.to_dataframe().resample("1H").mean()  # what we want (quickly), but in Pandas form
vals = [xr.DataArray(data=df_h[c], dims=['time'], coords={'time':df_h.index}, attrs=ds[c].attrs) for c in df_h.columns]
ds_h = xr.Dataset(dict(zip(df_h.columns,vals)), attrs=ds.attrs)
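The same rebuild can also be done with `xr.Dataset.from_dataframe`, which reconstructs the coordinates and dimensions from the DataFrame index; attrs still have to be copied over by hand. A minimal sketch (the helper name `fast_resample` is chosen here for illustration, it is not part of xarray):

```python
import numpy as np
import pandas as pd
import xarray as xr

def fast_resample(ds: xr.Dataset, freq: str) -> xr.Dataset:
    """Resample a time-indexed Dataset via pandas, then rebuild the
    Dataset, copying global and per-variable attrs back in."""
    df_r = ds.to_dataframe().resample(freq).mean()
    ds_r = xr.Dataset.from_dataframe(df_r)       # rebuild dims/coords from the index
    ds_r.attrs = dict(ds.attrs)                  # copy global attrs
    for name in ds_r.data_vars:
        ds_r[name].attrs = dict(ds[name].attrs)  # copy per-variable attrs
    return ds_r

# usage: 600 one-minute samples starting at midnight -> 10 hourly bins
times = pd.date_range("2000-01-01", periods=600, freq="1Min")
ds = xr.Dataset({"foo": ("time", np.random.random(600))},
                coords={"time": times})
ds.foo.attrs["units"] = "arbitrary"
hourly = fast_resample(ds, "1H")
print(hourly.sizes["time"])  # 10
```

This keeps the fast pandas GroupBy path while returning a Dataset with the original metadata attached.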
