Datalab: BigQuery data to Dataproc Hadoop word count



I currently have some Reddit data on Google BigQuery, and I want to do a word count over all the comments in selected subreddits. The query is around 90 GB, so it isn't feasible to load it directly into Datalab and convert it to a dataframe. It was suggested that I use a Hadoop or Spark job on Dataproc to produce the word count, with a connector to feed the BigQuery data into Dataproc so that Dataproc can do the counting. How do I run this from Datalab?

Below is a PySpark word count example that uses the public BigQuery Shakespeare dataset:

#!/usr/bin/env python
"""BigQuery I/O PySpark example."""
import sys

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('spark-bigquery-demo') \
    .getOrCreate()

# Use the Cloud Storage bucket for temporary BigQuery export data used
# by the connector. The bucket name is passed as the first job argument.
bucket = sys.argv[1]
spark.conf.set('temporaryGcsBucket', bucket)

# Load data from BigQuery.
words = spark.read.format('bigquery') \
    .option('table', 'bigquery-public-data:samples.shakespeare') \
    .load()
words.createOrReplaceTempView('words')

# Perform the word count.
word_count = spark.sql(
    'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')
word_count.show()
word_count.printSchema()

# Save the results back to BigQuery.
word_count.write.format('bigquery') \
    .option('table', 'wordcount_dataset.wordcount_output') \
    .save()
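
Note that the output dataset referenced by the script (wordcount_dataset here) must already exist in BigQuery before the job writes to it. A minimal sketch of creating it with the bq CLI, assuming you want the dataset in your default project:

# Create the target dataset once before running the job.
bq mk wordcount_dataset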

You can save the script locally or in a GCS bucket and then submit it to your Dataproc cluster:

gcloud dataproc jobs submit pyspark --cluster=<cluster> <pyspark-script> -- <bucket>
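
Since the question asks how to run this from Datalab: one option is to drive the same submission from a notebook cell by shelling out to gcloud. The script name, cluster name, and bucket below are placeholders, and the --jars flag assumes the spark-bigquery connector jar from the public spark-lib bucket is not already installed on the cluster (omit it if it is):

# Run in a Datalab notebook cell; replace my-cluster / my-bucket / wordcount.py with your own names.
# Copy the script to GCS so the cluster can read it.
!gsutil cp wordcount.py gs://my-bucket/wordcount.py
# Submit the PySpark job; the bucket name after "--" becomes sys.argv[1] in the script.
!gcloud dataproc jobs submit pyspark \
    --cluster=my-cluster \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    gs://my-bucket/wordcount.py \
    -- my-bucket

Depending on your gcloud version, you may also need to pass --region with the cluster's region.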

For more information, see this documentation.