Hive script fails with a heap space error when processing too many partitions

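My script fails with a heap space error because it processes too many partitions. To avoid the problem, I am trying to insert all the partitions into a single partition, but I get the following error: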

FAILED: SemanticException [Error 10044]: Line 1:23 Cannot insert into target table because column number/type are different ''2021-01-16'': Table insclause-0 has 78 columns, but query has 79 columns.

```
set hive.exec.dynamic.partition=true;
set mapreduce.reduce.memory.mb=6144;
set mapreduce.reduce.java.opts=-Xmx5g;
set hive.exec.dynamic.partition=true;
insert overwrite table db_temp.travel_history_denorm partition (start_date='2021-01-16')
select * from db_temp.travel_history_denorm_temp_bq
distribute by start_date;
```

Can someone please suggest what the issue is? I checked the schemas of both tables and they are the same.

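You are inserting into a static partition (the partition value is specified in the partition clause of the target table). In that case the select must not contain the partition column. select * returns the partition column as well (it is the last one), so the query has 79 columns while the target expects 78, and that is why it fails. The select should leave the partition column out:

Static partition insert: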

insert overwrite table db_temp.travel_history_denorm partition (start_date='2021-01-16')
select col1, col2, col3 ... --All columns except start_date partition column
from ...
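
As a minimal sketch of the static form applied to the tables in the question: the column names col1, col2, col3 are placeholders for the 78 non-partition columns, and the where filter is an assumption that only that day's rows should land in the partition. Without a filter, every row of the source table would be written into the single 2021-01-16 partition.

```sql
-- Sketch only: col1, col2, col3 are hypothetical names standing in for
-- the full list of 78 non-partition columns of the target table.
insert overwrite table db_temp.travel_history_denorm partition (start_date='2021-01-16')
select col1, col2, col3
from db_temp.travel_history_denorm_temp_bq
-- Assumption: load only the rows that belong to this partition.
where start_date = '2021-01-16';
```

No distribute by is needed here, since all rows go to one partition.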

Dynamic partition insert:

insert overwrite table db_temp.travel_history_denorm partition (start_date)
select * --All columns in the same order, including partition
from ...

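Adding distribute by triggers an additional reduce step. Records are grouped by the distribute by key, so all rows of a given partition go to the same reducer and each reducer writes only the partition(s) assigned to it. This helps with OOM errors when many dynamic partitions are being loaded. Without distribute by, each reducer creates files in every partition and keeps too many buffers open at the same time.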

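In addition to distribute by, you can set the maximum number of bytes per reducer. This setting limits the amount of data processed by a single reducer and may also help with OOM: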

set hive.exec.reducers.bytes.per.reducer=16777216; --adjust for optimal performance

If this number is too small, it will trigger too many reducers; if it is too large, each reducer will process too much data. Adjust accordingly.
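
To get a feel for the trade-off, here is a rough worked estimate. It assumes Hive sizes the reduce stage as roughly the input size divided by bytes.per.reducer, capped by hive.exec.reducers.max; the 10 GiB input size and the cap of 200 are made-up numbers for illustration only.

```sql
-- Rough estimate (assumption): reducers ~= ceil(input bytes / bytes.per.reducer),
-- capped by hive.exec.reducers.max.
-- Hypothetical input of 10 GiB = 10737418240 bytes:
--   16777216  bytes (16 MiB)  per reducer -> 10737418240 / 16777216  = 640 reducers
--   268435456 bytes (256 MiB) per reducer -> 10737418240 / 268435456 = 40 reducers
set hive.exec.reducers.bytes.per.reducer=268435456;
-- Hypothetical upper bound on the number of reducers.
set hive.exec.reducers.max=200;
```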

Also try this setting for dynamic partition loads:

set hive.optimize.sort.dynamic.partition=true;

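When enabled, the dynamic partition column is globally sorted. That way only one record writer needs to be kept open per partition value in a reducer, which reduces the memory pressure on the reducers.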

You can combine all of these methods for a dynamic partition load: distribute by the partition key, bytes.per.reducer, and sort.dynamic.partition.
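
Put together, a combined load might look like the sketch below. It assumes that start_date is the last column of the source table (as the error message in the question suggests), that nonstrict dynamic partition mode is required because no static partition value is given, and that 64 MB per reducer is only a starting point to tune.

```sql
-- Sketch only: combines the three approaches for a dynamic partition load.
set hive.exec.dynamic.partition=true;
-- Assumption: nonstrict mode is needed because every partition column is dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.optimize.sort.dynamic.partition=true;
-- 64 MB per reducer, an arbitrary starting value; adjust for your data volume.
set hive.exec.reducers.bytes.per.reducer=67108864;

insert overwrite table db_temp.travel_history_denorm partition (start_date)
select *  -- assumes start_date is the last column of the source table
from db_temp.travel_history_denorm_temp_bq
distribute by start_date;
```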

The exception message can also help you understand exactly where the OOM happens and fix it accordingly.
