Use case:
I am running a Hive query over roughly 500 GB of .gz-compressed data:
select count(distinct c1), c2 from t1 group by c2;
The query results in ~2800 map tasks and ~400 reduce tasks.
On a Hadoop cluster of 20 instances with 160 GB of storage each, the job stalls at 97% map / 21% reduce progress, then falls back to 94% map / 19% reduce progress, and after that makes no progress at all. I suspect this is because HDFS has hit its disk space limit; I may be able to provide an exception message later today.
However: is there a way to roughly pre-calculate the required HDFS disk space from the input size of the data being processed? Keep in mind that the input data is stored in .gz format.
Does anyone know why my MapReduce job uses only the nodes' local storage, and not DFS?
DFS usage overview: http://img27.imageshack.us/img27/5805/dfsusageoverview.png
DFS usage detail: http://img542.imageshack.us/img542/5026/dfsusagedetail.png
Exception from the mapper:
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Spill failed
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:304)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:959)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.flush(GroupByOperator.java:926)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:779)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:722)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:533)
... 9 more
Caused by: java.io.IOException: Spill failed
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1045)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:599)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:289)
... 24 more
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill15.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:121)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1408)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:869)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1360)
Here are some notes taken from the Cloudera blog:
The default replication factor for each file is 3, and you need to leave roughly 25% of disk space for intermediate shuffle files. So you need about 4x the raw size of the data you will store in HDFS. However, files are rarely stored uncompressed; depending on the file contents and the compression algorithm, we have seen average compression ratios of 10-20x for text files stored in HDFS. So the actual raw disk space required is only about 30-50% of the original uncompressed size.
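Applying that rule of thumb to the case above, and making two explicit assumptions (a ~10x compression ratio for the .gz text and the default replication factor of 3), a rough back-of-the-envelope estimate looks like this:

500 GB .gz input * ~10 (assumed compression ratio) ~ 5 TB of uncompressed text
500 GB .gz input * 3 replicas                      ~ 1.5 TB actually occupied in HDFS (the input stays compressed there)
20 nodes * 160 GB                                  = 3.2 TB of total cluster storage

So the compressed input fits into HDFS with room to spare, but the intermediate map output, which is written uncompressed to local disk by default, can grow toward the ~5 TB uncompressed size (minus whatever the map-side hash aggregation removes) and so exceed the 3.2 TB the cluster has in total. That is consistent with the "Spill failed" / "Could not find any valid local directory" error above.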
If I may add something: if space really is tight, you should consider compressing the intermediate output (between the mapper and the reducer) to shrink the intermediate shuffle files. You can do this, for example with Gzip compression, as follows:
// compress the map output before it is spilled and shuffled
conf.set("mapred.compress.map.output", "true");
// compress in blocks rather than per record
conf.set("mapred.output.compression.type", "BLOCK");
// codec used for the compressed map output
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec");
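Since the query here is launched through Hive rather than a hand-written MapReduce job, the same knobs can also be set per session from the Hive CLI. This is only a sketch using the pre-YARN property names that match the conf.set calls above; hive.exec.compress.intermediate additionally compresses the data Hive passes between its own MapReduce stages:

set hive.exec.compress.intermediate=true;
set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

If it is available on the cluster, a faster codec such as org.apache.hadoop.io.compress.SnappyCodec is often preferred over Gzip for intermediate data, since it trades some compression ratio for much lower CPU cost.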