DataFrame in which each cell is the date difference between the index value and the column name



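I have a DataFrame with a single column, called maturity_dates. I also have a DatetimeIndex called simulationdates.

I want to create a DataFrame in which each cell is the difference between the maturity date (MAT_DATE) and the simulationdates value, i.e. the time to maturity. The number should be floored at 0 and expressed in years.

The code below does the job, but it is very slow on large DataFrames. Is there a faster way that avoids the for loop?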

import numpy as np
import pandas as pd
import time

maturity_dates_raw = pd.DataFrame({'year': [2015, 2016, 2017, 2018, 2019, 2020, 2021]*40,
                                   'month': [2, 3, 3, 3, 3, 3, 3]*40,
                                   'day': [4, 5, 5, 5, 5, 5, 5]*40})
maturity_dates = pd.to_datetime(maturity_dates_raw)
date = pd.to_datetime("4th of July, 2015")
simulationdates = date + pd.to_timedelta(np.arange(5000), 'D')

t0 = time.time()
trade_m = pd.DataFrame(index=maturity_dates.index, columns=simulationdates)
mat_date = pd.to_datetime(maturity_dates)
dates = pd.DatetimeIndex.to_series(simulationdates)
# Fill one column per simulation date: years to maturity, floored at 0.
for i in range(trade_m.shape[1]):
    trade_m.iloc[:, i] = np.maximum(
        (mat_date - dates[i]).astype('timedelta64[D]') / 365.0, 0.0)
t1 = time.time()
print('Time to maturity done in {} seconds.'.format(np.round(t1 - t0, 4)))
print(trade_m)
Time to maturity done in 0.018 seconds.
   2015-07-04  2015-07-05  2015-07-06  2015-07-07  2015-07-08
0    0.000000    0.000000    0.000000    0.000000    0.000000
1    0.671233    0.668493    0.665753    0.663014    0.660274
2    1.671233    1.668493    1.665753    1.663014    1.660274
3    2.671233    2.668493    2.665753    2.663014    2.660274
4    3.671233    3.668493    3.665753    3.663014    3.660274
5    4.673973    4.671233    4.668493    4.665753    4.663014
6    5.673973    5.671233    5.668493    5.665753    5.663014
# Setup.
maturity_dates_raw = pd.DataFrame(
    {'year': [2015, 2016, 2017, 2018, 2019, 2020, 2021],
     'month': [2, 3, 3, 3, 3, 3, 3],
     'day': [4, 5, 5, 5, 5, 5, 5]}
)
n = 40  # Adjusts size of data (i.e. rows = n * 7).
simulation_date_count = 5000  # Adjusts number of simulation dates (i.e. columns).
maturity_dates = pd.to_datetime(pd.concat([maturity_dates_raw] * n, ignore_index=True))
date = pd.to_datetime("4th of July, 2015")
simulationdates = date + pd.to_timedelta(np.arange(simulation_date_count), 'D')

# OP Result.
trade_m = pd.DataFrame(index=maturity_dates.index, columns=simulationdates)
mat_date = pd.to_datetime(maturity_dates)
dates = pd.DatetimeIndex.to_series(simulationdates)
for i in range(trade_m.shape[1]):
    trade_m.iloc[:, i] = np.maximum(
        (mat_date - dates[i]).astype('timedelta64[D]') / 365.0, 0.0)
result_op = trade_m

We can use a dictionary comprehension to compute the difference between each maturity date and each simulation date.

# Method 1.
result_1 = pd.DataFrame(
    {sim_date: [(maturity - sim_date).days / 365
                for maturity in maturity_dates]
     for sim_date in simulationdates}
).clip(lower=0)
>>> result_1.iloc[:7, :5]
   2015-07-04  2015-07-05  2015-07-06  2015-07-07  2015-07-08
0    0.000000    0.000000    0.000000    0.000000    0.000000
1    0.671233    0.668493    0.665753    0.663014    0.660274
2    1.671233    1.668493    1.665753    1.663014    1.660274
3    2.671233    2.668493    2.665753    2.663014    2.660274
4    3.671233    3.668493    3.665753    3.663014    3.660274
5    4.673973    4.671233    4.668493    4.665753    4.663014
6    5.673973    5.671233    5.668493    5.665753    5.663014

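We can also compute the years to maturity for the first column only, and then subtract the number of days elapsed since the first simulation date, divided by 365. All the other columns are then derived with plain float arithmetic instead of date arithmetic, which gives a significant performance benefit.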

# Method 2.
day_deltas = np.concatenate(
    ([0], np.array((simulationdates[1:] - simulationdates[:-1]).days).cumsum())) / 365
years_to_maturity = (maturity_dates - simulationdates[0]).dt.days / 365
result_2 = pd.DataFrame(
    {sim_date: years_to_maturity - day_delta
     for sim_date, day_delta in zip(simulationdates, day_deltas)}
).clip(lower=0)
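
The same idea (first column minus elapsed years) can probably be pushed one step further with NumPy broadcasting, so that no Python-level loop over the columns remains. This is only a sketch on small stand-in data, not something I have timed against the methods above; the names years_at_start, years_elapsed, grid and result_broadcast are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in data (assumed for illustration; not the 280 x 5000 case).
maturity_dates = pd.Series(pd.to_datetime(["2015-02-04", "2016-03-05", "2017-03-05"]))
simulationdates = pd.date_range("2015-07-04", periods=4, freq="D")

# Years to maturity as of the first simulation date (one value per row).
years_at_start = ((maturity_dates - simulationdates[0]).dt.days / 365).to_numpy()
# Years elapsed since the first simulation date (one value per column).
years_elapsed = np.asarray((simulationdates - simulationdates[0]).days) / 365

# Broadcasting a (rows, 1) array against a (1, columns) array fills the grid at once;
# np.clip floors negative values (already matured) at 0.
grid = np.clip(years_at_start[:, None] - years_elapsed[None, :], 0, None)
result_broadcast = pd.DataFrame(grid, index=maturity_dates.index, columns=simulationdates)
```

On the full data the same lines should apply unchanged to maturity_dates and simulationdates from the setup block.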

Timings and equality checks

# OP Method: 1min 2s ± 2.74 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Method 1: 27.7 s ± 2.74 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Method 2: 852 ms ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> result_op.shape, result_1.shape, result_2.shape
((280, 5000), (280, 5000), (280, 5000))
>>> result_op.sub(result_1).abs().sum().sum()
0.0
>>> result_1.sub(result_2).abs().sum().sum()
5.7462090641280383e-11
>>> ((result_op.index == result_1.index) & (result_1.index == result_2.index)).all()
True
>>> ((result_op.columns == result_1.columns) & (result_1.columns == result_2.columns)).all()
True
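
The tiny non-zero difference between result_1 and result_2 is floating-point rounding: Method 2 subtracts two already-divided floats, while Method 1 divides an integer day count once. A tolerance-based comparison is therefore more appropriate than exact equality. Below is a minimal sketch on toy inputs; the names direct, incremental, same_values and same_labels, and the 1e-9 tolerance, are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Toy inputs, assumed only for this illustration.
maturity_dates = pd.Series(pd.to_datetime(["2015-02-04", "2016-03-05"]))
simulationdates = pd.date_range("2015-07-04", periods=3, freq="D")

# Per-cell computation (as in Method 1).
direct = pd.DataFrame(
    {sim_date: [(maturity - sim_date).days / 365 for maturity in maturity_dates]
     for sim_date in simulationdates}
).clip(lower=0)

# First column minus elapsed years (as in Method 2).
years_to_maturity = (maturity_dates - simulationdates[0]).dt.days / 365
incremental = pd.DataFrame(
    {sim_date: years_to_maturity - (sim_date - simulationdates[0]).days / 365
     for sim_date in simulationdates}
).clip(lower=0)

# Compare values with a tolerance, and labels exactly.
same_values = np.allclose(direct.to_numpy(), incremental.to_numpy(), atol=1e-9)
same_labels = (direct.index.equals(incremental.index)
               and direct.columns.equals(incremental.columns))
print(same_values, same_labels)
```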

Another approach, using itertools.product (the Cartesian product):

import itertools

import numpy as np
import pandas as pd

# Build every (simulation date, maturity date) pair in long form.
df = pd.DataFrame(itertools.product(simulationdates, maturity_dates)).rename(
    columns={0: 'simulationdates', 1: 'maturity_dates'})
# Years to maturity floored at 0, pivoted to maturity dates x simulation dates.
df = df.assign(
    dif=np.maximum((df.maturity_dates - df.simulationdates).dt.days / 365, 0)
).pivot_table(index='maturity_dates', columns='simulationdates', values='dif')
# Map the unique maturity dates back onto the original rows.
df = (
    pd.merge(maturity_dates.to_frame("maturity_dates"), df,
             left_on="maturity_dates", right_index=True)
    .sort_index()
    .drop(columns="maturity_dates")
)

Note that the final merge is necessary because maturity_dates is not unique.
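
To see why, here is a small sketch with made-up data in which one maturity date appears twice. pivot_table aggregates duplicate index values (mean by default), so the pivoted frame has one row per unique maturity date, and the merge is what restores one row per original entry. The names long_form, wide and restored are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the maturity date 2016-03-05 appears twice.
maturity_dates = pd.Series(pd.to_datetime(["2016-03-05", "2017-03-05", "2016-03-05"]))
long_form = pd.DataFrame({
    "maturity_dates": list(maturity_dates) * 2,
    "simulationdates": [pd.Timestamp("2015-07-04")] * 3 + [pd.Timestamp("2015-07-05")] * 3,
})
long_form["dif"] = (long_form.maturity_dates - long_form.simulationdates).dt.days / 365

# Duplicates are collapsed: 2 unique maturity dates -> 2 rows.
wide = long_form.pivot_table(index="maturity_dates", columns="simulationdates", values="dif")
print(wide.shape)

# Merging against the original Series brings back all 3 rows in their original order.
restored = (
    pd.merge(maturity_dates.to_frame("maturity_dates"), wide,
             left_on="maturity_dates", right_index=True)
    .sort_index()
    .drop(columns="maturity_dates")
)
print(restored.shape)
```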
