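Is it possible to run queries repeatedly on Google BigQuery using a Python script?
I want to query a dataset on the Google BigQuery platform to get one week of data at a time, and I want to do this for a whole year. Running the query 52 times by hand is too tedious, so I would rather write a Python script (I know Python).
I hope someone can point me in the right direction.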
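BigQuery provides client libraries for several languages (see https://cloud.google.com/bigquery/client-libraries), Python in particular, with docs at https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/python/latest/?_ga=1.176926572.834714677.1415848949 (you will need to follow the hyperlinks to make sense of the docs).
https://cloud.google.com/bigquery/bigquery-api-quickstart gives an example of a command-line program, in Java or Python, that uses the Google BigQuery API to run a query on one of the available sample datasets and display the result. After the imports and setting a few constants, the Python script boils down to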
    # Load stored credentials; fall back to the OAuth2 flow if they are missing or invalid.
    storage = Storage('bigquery_credentials.dat')
    credentials = storage.get()
    if credentials is None or credentials.invalid:
        # Run oauth2 flow with default arguments.
        credentials = tools.run_flow(FLOW, storage, tools.argparser.parse_args([]))

    # Wrap an HTTP client so every request carries the credentials.
    http = httplib2.Http()
    http = credentials.authorize(http)

    bigquery_service = build('bigquery', 'v2', http=http)

    try:
        query_request = bigquery_service.jobs()
        query_data = {'query': 'SELECT TOP( title, 10) as title, COUNT(*) as revision_count FROM [publicdata:samples.wikipedia] WHERE wp_namespace = 0;'}
        query_response = query_request.query(projectId=PROJECT_NUMBER,
                                             body=query_data).execute()
        print('Query Results:')
        for row in query_response['rows']:
            result_row = []
            for field in row['f']:
                result_row.append(field['v'])
            # One tab-separated line per result row.
            print('\t'.join(result_row))

    except HttpError as err:
        print('Error:')
        pprint.pprint(err.content)

    except AccessTokenRefreshError:
        print("Credentials have been revoked or expired, please re-run "
              "the application to re-authorize")
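As you can see, it is only about 30 lines, mostly concerned with getting and checking authorization and handling errors. Leaving those considerations aside, the "core" part is really just half of that: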
    bigquery_service = build('bigquery', 'v2', http=http)
    query_request = bigquery_service.jobs()
    query_data = {'query': 'SELECT TOP( title, 10) as title, COUNT(*) as revision_count FROM [publicdata:samples.wikipedia] WHERE wp_namespace = 0;'}
    query_response = query_request.query(projectId=PROJECT_NUMBER,
                                         body=query_data).execute()
    print('Query Results:')
    for row in query_response['rows']:
        result_row = []
        for field in row['f']:
            result_row.append(field['v'])
        # One tab-separated line per result row.
        print('\t'.join(result_row))
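If you are going to run the query many times, it may help to pull the row-unpacking loop into a small helper. The following is only a sketch: `rows_to_lists` and `format_tsv` are names made up here, not part of any Google library. It assumes the response is a plain dict shaped like the one above (`rows` → `f` → `v`), that `rows` may be absent when a query returns nothing, and that NULL cells arrive as `None`.

```python
def rows_to_lists(query_response):
    """Flatten a query response dict into a list of rows of cell values.

    Assumes the {'rows': [{'f': [{'v': ...}, ...]}, ...]} shape used above;
    a response without a 'rows' key is treated as an empty result.
    """
    return [
        [field["v"] for field in row["f"]]
        for row in query_response.get("rows", [])
    ]


def format_tsv(rows):
    """Render rows as tab-separated lines; None cells become empty strings."""
    return "\n".join(
        "\t".join("" if value is None else str(value) for value in row)
        for row in rows
    )
```

With these, the printing part of the script shrinks to `print(format_tsv(rows_to_lists(query_response)))`, and the same rows can just as easily be written to a file per week.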
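You could use Google Dataflow for Python; if it is a one-off, you can run it from your terminal or equivalent. Alternatively, you could use a shell script in App Engine cron that loops 52 times in code to fetch the data. See also: Google Dataflow scheduling.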
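For the "loop 52 times" part, here is a rough sketch of how the weekly windows and query strings could be generated in plain Python. Everything named here is an assumption for illustration: the table `[mydataset.mytable]`, the timestamp column `created_at`, and the helper names are hypothetical, and `run_query` stands for whatever function you write around the API call (for example one that passes the SQL as `body={'query': sql}` to the `jobs().query(...)` call shown earlier).

```python
from datetime import date, timedelta


def week_ranges(first_day, weeks=52):
    """Return (start, end) date pairs for consecutive 7-day windows.

    The end date is exclusive, so each window's end is the next window's start.
    """
    ranges = []
    for i in range(weeks):
        start = first_day + timedelta(weeks=i)
        ranges.append((start, start + timedelta(weeks=1)))
    return ranges


def build_weekly_query(table, ts_column, start, end):
    """Build a query string limited to one week (table and column are placeholders)."""
    return (
        "SELECT * FROM {table} "
        "WHERE {col} >= TIMESTAMP('{start}') AND {col} < TIMESTAMP('{end}')"
    ).format(table=table, col=ts_column,
             start=start.isoformat(), end=end.isoformat())


def run_for_each_week(run_query, table, ts_column, first_day, weeks=52):
    """Call run_query(sql) once per week; return (week_start, result) pairs in order."""
    results = []
    for start, end in week_ranges(first_day, weeks):
        sql = build_weekly_query(table, ts_column, start, end)
        results.append((start, run_query(sql)))
    return results


if __name__ == "__main__":
    # Stub run: print the generated SQL instead of calling BigQuery.
    run_for_each_week(print, "[mydataset.mytable]", "created_at",
                      date(2014, 1, 1), weeks=2)
```

Passing the query runner in as a function keeps the date arithmetic testable without credentials; you can check the generated SQL with `print` first and only then swap in the real API call.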