读取 Spark DataFrames 中 json 行的 LZO 文件

我在HDFS中有一个大的索引lzo文件，我想在Spark数据帧中读取。该文件包含 json 文档行。

posts_dir='/data/2016/01'

posts_dir具有以下功能：

/data/2016/01/posts.lzo
/data/2016/01/posts.lzo.index

以下内容有效，但不使用索引，因此需要很长时间，因为它只使用一个映射器。

posts = spark.read.json(posts_dir)

有没有办法让它利用索引？

我首先创建了一个识别索引的RDD，然后使用from_json函数将每一行转换为StructType，有效地产生了与spark.read.json(...)类似的结果

posts_rdd = sc.newAPIHadoopFile(posts_dir,
                                'com.hadoop.mapreduce.LzoTextInputFormat',
                                'org.apache.hadoop.io.LongWritable',
                                'org.apache.hadoop.io.Text')
posts_df = posts_rdd.map(lambda x:Row(x[1]))
                    .toDF(['raw'])
                    .select(F.from_json('raw', schema=posts_schema).alias('json')).select('json.*')

我不知道有更好或更直接的方法。

相关内容

最新更新

热门标签：