I currently have some Reddit data in Google BigQuery, and I want to do a word count over all comments in selected subreddits. The query is around 90 GB, so it isn't feasible to load it directly into Datalab and convert it to a DataFrame. Someone suggested I use a Hadoop or Spark job in Dataproc to build the word count, with a connector feeding the BigQuery data into Dataproc so that Dataproc can do the counting. How do I run this from Datalab?
Below is a PySpark example of a word count using the public BigQuery Shakespeare dataset:
#!/usr/bin/env python
"""BigQuery I/O PySpark example."""
import sys

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('spark-bigquery-demo') \
    .getOrCreate()

# Use the Cloud Storage bucket for temporary BigQuery export data used
# by the connector.
bucket = sys.argv[1]
spark.conf.set('temporaryGcsBucket', bucket)

# Load data from BigQuery.
words = spark.read.format('bigquery') \
    .option('table', 'bigquery-public-data:samples.shakespeare') \
    .load()
words.createOrReplaceTempView('words')

# Perform word count.
word_count = spark.sql(
    'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')
word_count.show()
word_count.printSchema()

# Save the results back to BigQuery.
word_count.write.format('bigquery') \
    .option('table', 'wordcount_dataset.wordcount_output') \
    .save()
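To adapt this to your Reddit data you would read your comments table instead and tokenize the comment text yourself, since those rows are raw comments rather than pre-counted words like the Shakespeare sample. Here is a minimal sketch; the table name 'fh-bigquery:reddit_comments.2019_08', the column names 'subreddit' and 'body', and the subreddit filter are assumptions you would replace with your own dataset:

# Sketch: word count over one subreddit's comments (assumed table and columns).
from pyspark.sql import functions as F

comments = spark.read.format('bigquery') \
    .option('table', 'fh-bigquery:reddit_comments.2019_08') \
    .load() \
    .filter(F.col('subreddit') == 'AskReddit') \
    .select('body')

# Lowercase, split on non-word characters, explode into one row per word,
# drop empty tokens, then count occurrences of each word.
word_count = comments \
    .select(F.explode(F.split(F.lower(F.col('body')), r'\W+')).alias('word')) \
    .filter(F.col('word') != '') \
    .groupBy('word') \
    .count() \
    .orderBy(F.desc('count'))

word_count.show(20)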
You can save the script locally or in a GCS bucket, then submit it to a Dataproc cluster:
gcloud dataproc jobs submit pyspark --cluster=<cluster> <pyspark-script> -- <bucket>
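As for running it from Datalab: Datalab notebooks are Jupyter-based, so you can issue the same gcloud command from a notebook cell with the ! shell escape. A hedged sketch, assuming a cluster named my-cluster, a script at gs://my-bucket/wordcount.py, and a bucket named my-bucket (all placeholder names), and passing the BigQuery connector jar via --jars in case it is not already installed on the cluster:

# In a Datalab (Jupyter) cell; names below are placeholders.
!gcloud dataproc jobs submit pyspark gs://my-bucket/wordcount.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    -- my-bucket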
For more information, see this documentation.