How to speed up loading data from Oracle SQL into a pandas DataFrame



My code looks like this. I use pd.DataFrame.from_records to fill the data into a DataFrame, but processing the query and loading a 22-million-row SQL table into the df takes Wall time: 1h 40min 30s

# I skipped some of the code, since there is no problem with the query itself; it's fast
cur = con.cursor()

def db_select(query):  # takes the query text and returns a DataFrame
    cur.execute(query)
    col = [column[0].lower() for column in cur.description]  # parse column headers
    df = pd.DataFrame.from_records(cur, columns=col)  # fill the data into the dataframe
    return df

Then I pass the SQL query to the function:

frame = db_select("select * from table")

How can I optimize the code to speed up the process?

Setting cur.arraysize to an appropriate value may help tune fetch performance; it controls how many rows are fetched per network round trip. You need to determine the value that works best for your data. The default is 100. You can run the code with different array sizes to find that value, e.g.:

from datetime import datetime

arr = [100, 1000, 10000, 100000, 1000000]
for size in arr:
    try:
        cur.prefetchrows = 0  # disable prefetch so only arraysize is measured
        cur.arraysize = size  # rows fetched per network round trip
        start = datetime.now()
        cur.execute("SELECT * FROM mytable").fetchall()
        elapsed = datetime.now() - start
        print("Process duration for arraysize", size, "is", elapsed)
    except Exception as err:
        print("Memory Error", err, "for arraysize", size)

Then set cur.arraysize = 10000 (or whichever value wins the benchmark) and call db_select from your original code.
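Putting the two pieces together, a minimal sketch of the tuned db_select might look like the following. An in-memory SQLite table stands in for the Oracle connection purely to make the snippet runnable; with Oracle, arraysize reduces the number of network round trips, and the prefetchrows attribute (specific to cx_Oracle/python-oracledb, hence the hasattr guard) can be raised alongside it:

```python
import sqlite3
import pandas as pd

ARRAYSIZE = 10_000  # value found by the benchmark loop above

def db_select(con, query):
    cur = con.cursor()
    cur.arraysize = ARRAYSIZE  # DB-API: rows fetched per round trip
    if hasattr(cur, "prefetchrows"):  # Oracle-driver-specific attribute
        cur.prefetchrows = ARRAYSIZE + 1
    cur.execute(query)
    col = [column[0].lower() for column in cur.description]
    return pd.DataFrame.from_records(cur, columns=col)

# Demo on a small in-memory SQLite table (stand-in for the Oracle DB)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, name TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(i, f"row{i}") for i in range(5)])
frame = db_select(con, "SELECT * FROM t")
print(len(frame), list(frame.columns))  # → 5 ['id', 'name']
```

Passing the connection in (rather than using a module-level cursor) also makes the function easier to reuse and test; the fetch logic itself is unchanged from the original.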
