如何在BIGQUERY上导出空字段



当我尝试通过BigQuery导出JSON对象时,当有一个字段带有"null"值,它将从结果的下载中消失。

一个下载查询的例子:

EXPORT DATA OPTIONS(
uri='gs://analytics-export/_*',
format='JSON',
overwrite=true) AS

SELECT NULL AS field1  

实际结果为:{}

当预期结果为:{field1: null}

如何使用空值强制导出,就像我在预期结果上显示的那样?

对于这个OP,您可以使用:

Select TO_JSON_STRING(NULL) as field1
Select 'null' as field1

在导出数据文档中,没有提到在输出中包含空值的选项,所以我认为您可以转到功能请求报告页面并为此创建一个请求。此外,在其他项目和点上也有类似的观察,目前还不支持,详见此处。

有很多方法可以解决这个问题,让我给你展示两个选项,见下面:

选项1:直接从python调用bigquery客户端库

from google.cloud import bigquery
import json
client = bigquery.Client()
query = "select null as field1, null as field2"
query_job = client.query(query)
json_list = {}
for row in query_job:
json_row = {'field1':row[0],'field2':row[1]}
json_list.update(json_row)

with open('test.json','w+') as file:
file.write(json.dumps(json_list))

选项2:使用apache beam数据流与python和BigQuery生成所需的输出

import argparse
import re
import json
import apache_beam as beam
from apache_beam.io import BigQuerySource
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

def add_null_field(row, field):
if field!='skip':
row.update({field: row.get(field, None)})
return row

def run(argv=None, save_main_session=True):
parser = argparse.ArgumentParser()
parser.add_argument(
'--output',
dest='output',
required=True,
help='Output file to write results to.')
known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
with beam.Pipeline(options=pipeline_options) as p:
(p
| beam.io.Read(beam.io.BigQuerySource(query='SELECT null as field1, null as field2'))
| beam.Map(add_null_field, field='skip')
| beam.Map(json.dumps)
| beam.io.Write(beam.io.WriteToText(known_args.output, file_name_suffix='.json')))
if __name__ == '__main__': 
run()

运行:

python -m export --output gs://my_bucket_id/output/ 
--runner DataflowRunner 
--project my_project_id 
--region my_region 
--temp_location gs://my_bucket_id/tmp/
Note: Just replace my_project_id,my_bucket_id and my_region with the appropriate values. Look on your cloud storage bucket for output file.

这两个选项都会输出你想要的结果:

{"field1": null, "field2": null}

如果它对你有帮助,并给你想要达到的结果,请告诉我。

相关内容

  • 没有找到相关文章

最新更新