Suppose I have a very large table and I want to use pd.read_sql with chunksize=10000.
What I currently do is:
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('dialect://user:pass@host:port/schema')

with engine.connect() as conn:
    for df in pd.read_sql('SELECT * FROM VERY_LARGE_TABLE', con=conn, chunksize=10000):
        # do stuff
My problem here (with Snowflake as the data source) is that the connection times out during the "do stuff" part.
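To make the failure mode concrete: with chunksize, a single connection stays checked out for the entire loop, so a slow body between chunks can outlive the server-side session timeout. A minimal sketch, using an in-memory SQLite database as a stand-in for the real warehouse:

```python
import pandas as pd
from sqlalchemy import create_engine

# SQLite in-memory stands in for the real (e.g. Snowflake) source here.
engine = create_engine('sqlite://')
pd.DataFrame({'id': range(25)}).to_sql('very_large_table', engine, index=False)

sizes = []
with engine.connect() as conn:
    # One connection stays open for the whole iteration -- which is why a
    # slow "do stuff" body between chunks can hit the session timeout.
    for df in pd.read_sql('SELECT * FROM very_large_table', con=conn, chunksize=10):
        sizes.append(len(df))

print(sizes)  # [10, 10, 5]
```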
Is it possible to do something like this instead, opening a fresh connection per chunk:
engine = create_engine('dialect://user:pass@host:port/schema', echo=False)

# chunk 1
with engine.connect() as conn:
    df = pd.read_sql('SELECT * FROM VERY_LARGE_TABLE', con=conn)
# do stuff

# chunk 2
with engine.connect() as conn:
    df = pd.read_sql('SELECT * FROM VERY_LARGE_TABLE', con=conn)
# do stuff
The alternative I am currently exploring is setting connect_args={"client_session_keep_alive": True} on the engine.
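For reference, that option is passed through to the Snowflake connector when the engine is created. A configuration sketch (the URL below is a hypothetical placeholder, not real credentials):

```python
from sqlalchemy import create_engine

# Hypothetical Snowflake URL. client_session_keep_alive asks the Snowflake
# driver to refresh the session token in the background, so the session
# survives long pauses between queries (e.g. during "do stuff").
engine = create_engine(
    'snowflake://user:pass@account/database/schema?warehouse=wh',
    connect_args={'client_session_keep_alive': True},
)
```

Note this keeps the session alive but still holds one connection open across the whole chunked read.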
Could you do something like this:
start = 0
chunk = 5000
while True:
    with engine.connect() as conn:
        # Snowflake uses LIMIT <count> OFFSET <start>,
        # not MySQL's LIMIT start, count
        query = f'SELECT * FROM VERY_LARGE_TABLE LIMIT {chunk} OFFSET {start}'
        df = pd.read_sql(query, con=conn)
    # stop querying once the result set is exhausted
    if df.empty:
        break
    else:
        # increase start for the next iteration
        start += chunk
        # do stuff
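The loop above can be exercised end to end. A runnable sketch, again with SQLite standing in for Snowflake (both accept LIMIT <count> OFFSET <start>), using a temporary database file so each chunk really gets a brand-new connection:

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine

# A file-backed SQLite database stands in for Snowflake here, so data
# persists across the separate connections opened per chunk.
db_path = os.path.join(tempfile.mkdtemp(), 'demo.db')
engine = create_engine(f'sqlite:///{db_path}')
pd.DataFrame({'id': range(12)}).to_sql('very_large_table', engine, index=False)

start, chunk, total = 0, 5, 0
while True:
    with engine.connect() as conn:  # a fresh connection for every chunk
        df = pd.read_sql(
            f'SELECT * FROM very_large_table LIMIT {chunk} OFFSET {start}',
            con=conn,
        )
    if df.empty:
        break
    total += len(df)  # stand-in for "do stuff"
    start += chunk

print(total)  # all 12 rows seen across chunks of 5, 5, 2
```

One caveat worth noting: without an ORDER BY, the database is free to return rows in any order, so paginated chunks can overlap or miss rows; in practice you would add ORDER BY on a stable key column.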