当我尝试通过BigQuery导出JSON对象时,当有一个字段带有"null"值,它将从结果的下载中消失。
一个下载查询的例子:
EXPORT DATA OPTIONS(
uri='gs://analytics-export/_*',
format='JSON',
overwrite=true) AS
SELECT NULL AS field1
实际结果为:{}
当预期结果为:{field1: null}
如何使用空值强制导出,就像我在预期结果上显示的那样?
对于这个OP,您可以使用:
Select TO_JSON_STRING(NULL) as field1
Select 'null' as field1
在导出数据文档中,没有提到在输出中包含空值的选项,所以我认为您可以转到功能请求报告页面并为此创建一个请求。此外,在其他项目和点上也有类似的观察,目前还不支持,详见此处。
有很多方法可以解决这个问题,让我给你展示两个选项,见下面:
选项1:直接从python调用bigquery客户端库
from google.cloud import bigquery
import json
client = bigquery.Client()
query = "select null as field1, null as field2"
query_job = client.query(query)
json_list = {}
for row in query_job:
json_row = {'field1':row[0],'field2':row[1]}
json_list.update(json_row)
with open('test.json','w+') as file:
file.write(json.dumps(json_list))
选项2:使用apache beam数据流与python和BigQuery生成所需的输出
import argparse
import re
import json
import apache_beam as beam
from apache_beam.io import BigQuerySource
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
def add_null_field(row, field):
if field!='skip':
row.update({field: row.get(field, None)})
return row
def run(argv=None, save_main_session=True):
parser = argparse.ArgumentParser()
parser.add_argument(
'--output',
dest='output',
required=True,
help='Output file to write results to.')
known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
with beam.Pipeline(options=pipeline_options) as p:
(p
| beam.io.Read(beam.io.BigQuerySource(query='SELECT null as field1, null as field2'))
| beam.Map(add_null_field, field='skip')
| beam.Map(json.dumps)
| beam.io.Write(beam.io.WriteToText(known_args.output, file_name_suffix='.json')))
if __name__ == '__main__':
run()
运行:
python -m export --output gs://my_bucket_id/output/
--runner DataflowRunner
--project my_project_id
--region my_region
--temp_location gs://my_bucket_id/tmp/
Note: Just replace my_project_id,my_bucket_id and my_region with the appropriate values. Look on your cloud storage bucket for output file.
这两个选项都会输出你想要的结果:
{"field1": null, "field2": null}
如果它对你有帮助,并给你想要达到的结果,请告诉我。