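I would like to speed up the code below. The dataset is a list of trades that I want to stress test by simulating a range of parameters, storing all the results in a single table.
The way I do this is to define the range of each parameter, then iterate over the values: for each combination I make a copy of the dataset, assign the parameter values to new columns, and concatenate everything into one huge DataFrame.
Does anyone have a good idea for avoiding the three nested for loops when building the DataFrame?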
```python
import pandas as pd

# `pos` is the initial dataset of trades (a DataFrame), loaded elsewhere

# Defining the range of parameters to simulate
volchange = range(-1, 2)
spreadchange = range(-10, 11)
flatchange = range(-10, 11)

# The df where I store all the results
final_result = pd.DataFrame()

# Iterating over the range of parameters
for vol in volchange:
    for spread in spreadchange:
        for flat in flatchange:
            # Creating a copy of the initial dataset, assigning the simulated values to three
            # new columns and concatenating it with the rest, resulting in a dataframe which is
            # several times the initial dataset, with every possible triplet of parameters
            inter_pos = pos.copy()
            inter_pos['vol_change[pts]'] = vol
            inter_pos['spread_change[%]'] = spread
            inter_pos['spot_change[%]'] = flat
            final_result = pd.concat([final_result, inter_pos], axis=0)

# Performing computation at dataframe level
final_result['sim_vol'] = final_result['vol_change[pts]'] + final_result['ImpliedVolatility']
final_result['spread_change'] = final_result['spread'].multiply(final_result['spread_change[%]']) / 100
final_result['sim_spread'] = final_result['spread'] + final_result['spread_change']
final_result['spot_change'] = final_result['spot'] * final_result['spot_change[%]'] / 100
final_result['sim_spot'] = final_result['spot'] + final_result['spot_change']
final_result['sim_price'] = final_result['sim_spot'] - final_result['sim_spread']
```
Thank you very much for your help!
Have a great week!
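Repeatedly concatenating a pandas DataFrame onto another DataFrame takes a long time, because each call copies everything accumulated so far. It is better to build a list of DataFrames and concatenate them all at once with `pd.concat`.
You can test this yourself like so: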
```python
import pandas as pd
import numpy as np
from time import time

columns = [f"{i:02d}" for i in range(100)]

# Variant 1: collect the frames in a list, then concatenate once
dfs = []
time_start = time()
for i in range(100):
    data = np.random.random((10000, 100))
    df = pd.DataFrame(columns=columns, data=data)
    dfs.append(df)
new_df = pd.concat(dfs)
time_end = time()
print(f"Time elapsed: {time_end-time_start}")
# Time elapsed: 1.851675271987915

# Variant 2: concatenate onto the growing frame on every iteration
new_df = pd.DataFrame(columns=columns)
time_start = time()
for i in range(100):
    data = np.random.random((10000, 100))
    df = pd.DataFrame(columns=columns, data=data)
    new_df = pd.concat([new_df, df])
time_end = time()
print(f"Time elapsed: {time_end-time_start}")
# Time elapsed: 12.258363008499146
```
You can also use `itertools.product` to get rid of the nested for loops.
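As a rough sketch of how the two suggestions above might fit together, the three loops can be collapsed into a single pass over `itertools.product`, with the copies collected in a list and concatenated once. The small `pos` frame here is a made-up stand-in for the real trade dataset, and using `DataFrame.assign` instead of `copy()` plus three assignments is my own choice, not something from the question.

```python
import itertools

import pandas as pd

# Hypothetical stand-in for the asker's `pos` dataset (values are invented)
pos = pd.DataFrame(
    {
        "spot": [100.0, 50.0],
        "spread": [2.0, 1.0],
        "ImpliedVolatility": [20.0, 30.0],
    }
)

volchange = range(-1, 2)
spreadchange = range(-10, 11)
flatchange = range(-10, 11)

# One loop over every (vol, spread, flat) triplet; assign() returns a new
# DataFrame, so no explicit copy() is needed
frames = [
    pos.assign(
        **{
            "vol_change[pts]": vol,
            "spread_change[%]": spread,
            "spot_change[%]": flat,
        }
    )
    for vol, spread, flat in itertools.product(volchange, spreadchange, flatchange)
]

# A single concatenation at the end instead of one per iteration
final_result = pd.concat(frames, ignore_index=True)
```

The result holds the same rows as the original triple loop, in the same order, only with a fresh integer index because of `ignore_index=True`.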
Also, as suggested by @Ahmed AEK: you can pass `data=itertools.product(volchange, spreadchange, flatchange)` directly to `pd.DataFrame`, which avoids building the list yourself and is a more efficient and faster approach.
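One possible way to use that parameter table, sketched here under assumptions: build the grid of triplets as its own DataFrame, then cross join it with the trades so that no Python-level loop over copies is needed. The cross join via `merge(how="cross")` is my addition rather than part of the suggestion, and it needs pandas 1.2 or later. The `pos` frame is again an invented stand-in.

```python
import itertools

import pandas as pd

# Hypothetical stand-in for the asker's `pos` dataset (values are invented)
pos = pd.DataFrame(
    {
        "spot": [100.0, 50.0],
        "spread": [2.0, 1.0],
        "ImpliedVolatility": [20.0, 30.0],
    }
)

volchange = range(-1, 2)
spreadchange = range(-10, 11)
flatchange = range(-10, 11)

# One row per (vol, spread, flat) triplet, built straight from the iterator
params = pd.DataFrame(
    data=itertools.product(volchange, spreadchange, flatchange),
    columns=["vol_change[pts]", "spread_change[%]", "spot_change[%]"],
)

# Cross join: every trade paired with every parameter triplet
final_result = pos.merge(params, how="cross")

# Same column-level computations as in the question
final_result["sim_vol"] = final_result["vol_change[pts]"] + final_result["ImpliedVolatility"]
final_result["spread_change"] = final_result["spread"] * final_result["spread_change[%]"] / 100
final_result["sim_spread"] = final_result["spread"] + final_result["spread_change"]
final_result["spot_change"] = final_result["spot"] * final_result["spot_change[%]"] / 100
final_result["sim_spot"] = final_result["spot"] + final_result["spot_change"]
final_result["sim_price"] = final_result["sim_spot"] - final_result["sim_spread"]
```

Note that the row order differs from the loop version: here all triplets for the first trade come first, then all triplets for the second trade, and so on. Sort afterwards if the order matters.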