对于以下问题,我希望用更快的方法取代apply
方法:
具有day_of_week
和closest_date
列,我需要找到指定day_of_week
的found_dates
,它们在closest_date
之前(向后(最接近,从而允许等于closest_date
的结果。
初始df
:
closest_date day_of_week
0 2009-06-01 6
1 2014-09-02 0
2 2014-10-11 4
3 2015-01-02 3
4 2015-07-11 4
我需要加快以下工作代码:
from pandas.tseries.offsets import Week
def find_nearset_day_to_dayofweek(row):
return row['closest_date'] - Week(weekday=row['day_of_week'])
df['date'] = df.apply(find_nearset_day_to_dayofweek, axis=1)
以下只是修复found_date
应该等于closest_date
的位置,但在一周前返回。将numpy导入为np
df['closest_date_dayofweek'] = df['closest_date'].dt.dayofweek
df['found_date'] = np.where(df['closest_date_dayofweek']==df['day_of_week'],
df['closest_date'],
df['found_date'])
df = df.drop(['closest_date_dayofweek'], axis=1)
返回以下df
closest_date day_of_week found_date
0 2009-06-01 6 2009-05-31
1 2014-09-02 0 2014-09-01
2 2014-10-11 4 2014-10-10
3 2015-01-02 3 2015-01-01
4 2015-07-11 4 2015-07-10
5 2015-08-08 4 2015-08-07
上面代码的问题是apply
方法,它很慢。有什么加速的想法吗?
谢谢!
因为可能只有7
值,所以您可以使用循环,只过滤另一列匹配的行:
for i in range(7):
m = df['day_of_week'].eq(i)
df.loc[m, 'date'] = df.loc[m, 'closest_date'] - Week(weekday=i)
然后不需要新的列,使用:
df['date'] = np.where(df['closest_date'].dt.dayofweek==df['day_of_week'],
df['closest_date'], df['date'])
5000
行的性能:
from pandas.tseries.offsets import Week
def find_nearset_day_to_dayofweek(row):
return row.closest_date - Week(weekday=row['day_of_week'])
df = pd.concat([df] * 1000, ignore_index=True)
In [137]: %timeit df['date'] = df.apply(find_nearset_day_to_dayofweek, axis=1)
550 ms ± 77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [138]: %%timeit
...: for i in range(7):
...: m = df['day_of_week'].eq(i)
...: df.loc[m, 'date1'] = df.loc[m, 'closest_date'] - Week(weekday=i)
...:
38.1 ms ± 883 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)