
How to generate a large data set using Hive / Spark-SQL

Updated: 2023-09-10

For example, generate 1G (1,000,000,000) records holding the sequential numbers from 1 to 1G.

Create a partitioned seed table

create table seed (i int)
partitioned by (p int);

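Populate the seed table with 1K records holding the consecutive numbers from 0 to 999.
Each record is inserted into a different partition, so it lands in a different HDFS directory and, more importantly, in a different file.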

P.S.

The following settings are required:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.hadoop.supports.splittable.combineinputformat=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

insert into table seed partition (p)
select  i,i
from    (select 1) x lateral view posexplode (split (space (999),' ')) e as i,x;
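The insert above relies on an idiom that may not be obvious at first glance: space(n) returns a string of n spaces, splitting it on a single space yields n+1 empty strings, and posexplode emits one row per element together with its position (0 to n). A scaled-down sketch, with 4 standing in for 999 and the illustrative aliases d, pos and val (not taken from the original answer), followed by a check of the populated seed table:

```sql
-- space(4) is a string of 4 spaces; splitting it on ' ' gives 5 empty strings,
-- so posexplode should emit the positions 0, 1, 2, 3, 4.
select  e.pos
from    (select 1) d lateral view posexplode (split (space (4),' ')) e as pos,val;

-- After the insert, the seed table is expected to hold 1000 rows
-- spread over 1000 distinct partitions (one row per partition).
select  count(*)          as row_cnt
       ,count(distinct p) as partition_cnt
from    seed;
```

If the two counts differ from 1000, the dynamic-partition settings listed earlier are the first thing worth re-checking.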

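Generate a table with 1G records.
Each of the 1K records in the seed table sits in a different file and is read by a different container.
Each container generates 1M records.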

create table t1g
as
select  s.i*1000000 + e.i + 1  as n
from    seed s lateral view posexplode (split (space (1000000-1),' ')) e as i,x;
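Once the job finishes, a quick sanity check can confirm the result. Each seed value s.i (0 to 999) is combined with positions e.i (0 to 999,999), so n should range from 0*1000000 + 0 + 1 = 1 up to 999*1000000 + 999999 + 1 = 1,000,000,000, which still fits in an int. A minimal sketch, assuming the statements above ran unmodified (the column aliases are illustrative):

```sql
-- Expected, if every seed row produced its 1M records:
-- row_cnt = 1000000000, min_n = 1, max_n = 1000000000
select  count(*)  as row_cnt
       ,min(n)    as min_n
       ,max(n)    as max_n
from    t1g;
```

Matching counts and bounds do not strictly prove that every number appears exactly once, but together with the construction (each seed row covers a disjoint block of 1M values) they are a reasonable indication.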
