Efficiently applying a function to two pandas DataFrames



I have two DatetimeIndexed DataFrames with identical indexes and column names, each roughly 8.26 million rows by 44 columns. I join the DataFrames, group the result into roughly 6,884 groups using 10-minute time intervals, and then iterate over the matching column pairs, returning one value per group and column pair.

The solution below works and takes 34 minutes on a Xeon E5-2697 v3, with all the DataFrames fitting in memory.

I think there should be a more efficient way to compute this function over the two DataFrames, perhaps using Dask, although it is not clear to me how to do a time-based groupby on a Dask DataFrame.
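Something along these lines is what I have in mind for Dask, but it is untested and I don't know whether it would actually beat pandas here (the bucket column and npartitions=16 are just guesses):

import dask.dataframe as dd

# Untested sketch: Dask has no TimeGrouper, so materialize the
# 10-minute bucket as an ordinary column and group on that instead.
# npartitions=16 is a placeholder value.
joined = burst_veldirection.join(burst_velspeed, rsuffix='velspeed')
joined['bucket'] = joined.index.floor('10Min')
ddf = dd.from_pandas(joined, npartitions=16)
result = (ddf.groupby('bucket')
             .apply(circ_mean, meta=('circ_mean', 'object'))
             .compute())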

import math

import pandas as pd

def circular_mean(burst_veldirection, burst_velspeed):
    # Speed-weighted circular mean of the direction angles, in degrees.
    x = y = 0.
    for angle, weight in zip(burst_veldirection.values, burst_velspeed.values):
        x += math.cos(math.radians(angle)) * weight
        y += math.sin(math.radians(angle)) * weight
    mean = math.degrees(math.atan2(y, x))
    if mean < 0:
        mean = 360 + mean
    return mean

def circ_mean(df):
    # One circular mean per direction/velspeed column pair.
    results = []
    for x in range(0, 45):
        results.append(circular_mean(df[str(x)], df[str(x) + 'velspeed']))
    return results

burst_veldirection_velspeed = burst_veldirection.join(burst_velspeed, rsuffix='velspeed')
# Note: pd.TimeGrouper was later deprecated in favour of pd.Grouper(freq='10Min').
result = burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean)
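For reference, circular_mean computes the speed-weighted circular mean of each group's direction angles $\theta_i$, with the weights $w_i$ taken from the matching velspeed column:

$$\bar{\theta} = \operatorname{atan2}\Big(\sum_i w_i \sin\theta_i,\ \sum_i w_i \cos\theta_i\Big) \bmod 360^\circ$$

with negative results wrapped into [0, 360).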

Sample short HDF file containing the first 10,000 records, covering 23 minutes.

This doesn't get you out of the groupby, but simply moving from doing everything element-wise to NumPy functions was roughly an 8x speed-up for me.

import numpy as np

def circ_mean2(df):
    # Relies on column order after the join: the first 45 columns are
    # the direction angles, the rest are the matching velspeed weights.
    df2 = df.iloc[:, 45:].copy()
    df1 = df.iloc[:, :45].copy()
    x = np.sum(np.cos(np.radians(df1.values)) * df2.values, axis=0)
    y = np.sum(np.sin(np.radians(df1.values)) * df2.values, axis=0)
    arctan = np.degrees(np.arctan2(y, x))
    # Use >= so an angle of exactly 0 stays 0, matching circ_mean.
    return np.where(arctan >= 0, arctan, arctan + 360).tolist()

Comparison on 100 rows of random data:

burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean)
Out[546]: 
    2017-01-01 00:00:00    [107.1417250368678, 256.8946560151866, 213.146...
    2017-01-01 00:10:00    [26.33395947005812, 27.786466256197127, 94.898...
    2017-01-01 00:20:00    [212.56183600787307, 284.77924347375733, 241.7...
    2017-01-01 00:30:00    [302.1659401891579, 91.1768853178421, 194.9664...
    2017-01-01 00:40:00    [90.29680187822757, 337.4345622590224, 302.219...
    2017-01-01 00:50:00    [94.88722975883893, 319.5580499260627, 204.511...
    2017-01-01 01:00:00    [133.4980653288851, 55.16669017531442, 20.7527...
    2017-01-01 01:10:00    [356.67045637546113, 151.25258425458003, 200.1...
    2017-01-01 01:20:00    [350.2489907863962, 33.284286840600046, 145.66...
    2017-01-01 01:30:00    [135.74199444105565, 62.66259615135012, 257.80...
    Freq: 10T, dtype: object
burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean2)
Out[547]: 
    2017-01-01 00:00:00    [107.1417236328125, 256.8946533203125, 213.146...
    2017-01-01 00:10:00    [26.333953857421875, 27.78646469116211, 94.898...
    2017-01-01 00:20:00    [212.5618438720703, 284.77923583984375, 241.72...
    2017-01-01 00:30:00    [302.16595458984375, 91.1768798828125, 194.966...
    2017-01-01 00:40:00    [90.29680633544922, 337.4345703125, 302.219909...
    2017-01-01 00:50:00    [94.88722229003906, 319.55804443359375, 204.51...
    2017-01-01 01:00:00    [133.498046875, 55.166690826416016, 20.7527561...
    2017-01-01 01:10:00    [356.6704406738281, 151.25257873535156, 200.13...
    2017-01-01 01:20:00    [350.2489929199219, 33.2842903137207, 145.6609...
    2017-01-01 01:30:00    [135.7419891357422, 62.66258239746094, 257.807...
    Freq: 10T, dtype: object
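The two outputs agree except in the trailing digits; I'd put that down to precision: the NumPy version works in the arrays' native dtype (float32, by the look of the printed values), whereas math.cos and friends always compute in float64.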

%timeit burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean)
10 loops, best of 3: 80.3 ms per loop
%timeit burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean2)
10 loops, best of 3: 10.4 ms per loop

On 10,000 rows:

%timeit burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean) 
1 loop, best of 3: 6.65 s per loop
%timeit burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean2)
1 loop, best of 3: 709 ms per loop
