我试图在jupyter笔记本中使用Python查询AWS Redshift集群上的表。我想从结果集中随机抽样行,并使用TABLESAMPLE
。然而,我一直运行到Key Error
import pandas as pd
import redshift_connector
conn = redshift_connector.connect(
host=os.environ['host'],
database=os.environ['database'],
port=int(os.environ['port']),
user=os.environ['user'],
password=os.environ['password']
)
print(conn)
cursor = conn.cursor()
print(cursor)
cursor.execute(
"""
SELECT *
FROM tbl1
WHERE type = 'xxx'
AND created_at >= '2022-01-01 00:00:00'
AND created_at <= '2022-12-31 00:00:00'
TABLESAMPLE BERNOULLI (10);
"""
)
df = DataFrame(cursor.fetchall())
field_names = [i[0] for i in cursor.description]
df.columns = field_names`
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/anaconda3/lib/python3.10/site-packages/redshift_connector/core.py:1631, in Connection.execute(self, cursor, operation, vals)
1630 try:
-> 1631 ps = cache["ps"][key]
1632 cursor.ps = ps
KeyError: ("n SELECT * n FROM pmf n WHERE event_type = 'load'n AND created_at >= '2022-01-01 00:00:00' n AND created_at <= '2022-12-31 00:00:00'n TABLESAMPLE BERNOULLI (10);n ", ())
File ~/anaconda3/lib/python3.10/site-packages/redshift_connector/core.py:1701, in Connection.execute(self, cursor, operation, vals)
1698 else:
1699 raise e
-> 1701 self.handle_messages(cursor)
1703 # We've got row_desc that allows us to identify what we're
1704 # going to get back from this statement.
1705 output_fc = tuple(self.redshift_types[f["type_oid"]][0] for f in ps["row_desc"])
File ~/anaconda3/lib/python3.10/site-packages/redshift_connector/core.py:1969, in Connection.handle_messages(self, cursor)
1966 self.message_types[code](self._read(data_len - 4), cursor)
1968 if self.error is not None:
-> 1969 raise self.error
ProgrammingError: {'S': 'ERROR', 'C': '42601', 'M': 'syntax error at or near "TABLESAMPLE"', 'P': '220', 'F': '/home/ec2-user/padb/src/pg/src/backend/parser/parser_scan.l', 'L': '732', 'R': 'yyerror'}
cursor.execute(
"""
SELECT *
FROM tbl1
WHERE type = 'xxx'
AND created_at >= '2022-01-01 00:00:00'
AND created_at <= '2022-12-31 00:00:00'
TABLESAMPLE BERNOULLI (10);
"""
)
问题似乎与TABLESAMPLE
的工作方式有关,当使用关键字时,WHERE条件需要在TABLESAMPLE
子句之后。
SELECT *
FROM tbl1
TABLESAMPLE BERNOULLI (10)
WHERE type = 'xxx'
AND created_at >= '2022-01-01 00:00:00'
AND created_at <= '2022-12-31 00:00:00';
奇怪的是红移文档页面上没有提到这一点。从Postgres文档中更容易理解,因为Redshift是Postgres的一种风格。
似乎红移目前不支持TABLESAMPLE
,你可以尝试下面的替代
select * from (
select *, random() as sample
from tbl1
WHERE type = 'xxx'
AND created_at >= '2022-01-01 00:00:00'
AND created_at <= '2022-12-31 00:00:00') as samp
where sample < .01; -- return 1% of rows