Hive job在cassandra集群上不能正常运行,reducer卡住



我有6个节点datastax cassandra集群(3个cassandra- 3个分析)。我正在使用hive生成报告。

问题是当我运行一个带有count(*)或group by query的hive job时,映射器完成了,但是reducer被卡住了一个特定的百分比。它们永远不会改变。

(所有节点有60GB的数据。我运行hive的列族在每个节点上大约有40GB,有15列,每个列的数据大小为50,60个字符。我使用默认的hive设置)

还有它们的信息是按时间和hive作业的数据大小(正常查询)提供的。我可以在网上找到。

查询非常简单——select count(*) from table;它有300k行。
Starting Job = job_201305210728_0005, Tracking URL = http://something:50030/jobdetails.jsp?jobid=job_201305210728_0005
Kill Command = ./bin/dse hadoop job  -Dmapred.job.tracker=10.70.41.62:8012 -kill job_201305210728_0005
Hadoop job information for Stage-1: number of mappers: 7; number of reducers: 1
2013-05-22 03:52:58,811 Stage-1 map = 0%,  reduce = 0%
2013-05-22 03:53:02,841 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 1.87 sec
2013-05-22 03:53:03,863 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 1.87 sec
2013-05-22 03:53:04,874 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 2.59 sec
2013-05-22 03:53:05,882 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 2.59 sec
2013-05-22 03:53:06,894 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU 3.42 sec
2013-05-22 03:53:07,901 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU 3.42 sec
2013-05-22 03:53:08,908 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 5.13 sec
2013-05-22 03:53:09,919 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 5.13 sec
2013-05-22 03:53:10,929 Stage-1 map = 71%,  reduce = 0%, Cumulative CPU 5.51 sec
2013-05-22 03:53:11,936 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 7.79 sec
2013-05-22 03:53:12,943 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 7.79 sec
2013-05-22 03:53:13,951 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 7.79 sec
2013-05-22 03:53:14,958 Stage-1 map = 100%,  reduce = 19%, Cumulative CPU 10.25 sec
2013-05-22 03:53:15,965 Stage-1 map = 100%,  reduce = 19%, Cumulative CPU 10.25 sec
2013-05-22 03:53:16,972 Stage-1 map = 100%,  reduce = 19%, Cumulative CPU 10.25 sec
2013-05-22 03:53:17,979 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:18,986 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:19,993 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:20,999 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:22,007 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:23,014 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:24,021 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:25,028 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:26,035 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:27,042 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:28,049 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:29,055 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:30,063 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:31,070 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:32,077 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:33,084 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:34,091 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:35,097 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:36,105 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:37,112 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:38,118 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:39,125 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:40,132 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:41,139 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:42,145 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:43,153 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:44,160 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:45,166 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:46,172 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:47,180 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:48,186 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:49,193 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:50,201 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:51,208 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:52,214 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:53,221 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:54,227 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:55,235 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:56,241 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:57,249 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:58,256 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
and continues....

我正在附加hive日志

 Counters plan="{"queryId":"root_20130523051919_5e968691-9119-47fa-9b45-f61a90e86452","queryType":null,"queryAttributes":{"queryString":"select count(*) from suborder_issue"},"queryCounters":"null","stageGraph":{"nodeType":"STAGE","roots":"null","adjacencyList":"]"},"stageList":[{"stageId":"Stage-1","stageType":"MAPRED","stageAttributes":"null","stageCounters":{"CNTR_NAME_Stage-1_REDUCE_PROGRESS":"14","CNTR_NAME_Stage-1_MAP_PROGRESS":"100"},"taskList":[{"taskId":"Stage-1_MAP","taskType":"MAP","taskAttributes":"null","taskCounters":"null","operatorGraph":{"nodeType":"OPERATOR","roots":"null","adjacencyList":[{"node":"TS_0","children":["SEL_1"],"adjacencyType":"CONJUNCTIVE"},{"node":"SEL_1","children":["GBY_2"],"adjacencyType":"CONJUNCTIVE"},{"node":"GBY_2","children":["RS_3"],"adjacencyType":"CONJUNCTIVE"}]},"operatorList":[{"operatorId":"TS_0","operatorType":"TABLESCAN","operatorAttributes":"null","operatorCounters":"}","done":"true","started":"true"},{"operatorId":"SEL_1","operatorType":"SELECT","operatorAttributes":"null","operatorCounters":"null","done":"true","started":"true"},{"operatorId":"GBY_2","operatorType":"GROUPBY","operatorAttributes":"null","operatorCounters":"null","done":"true","started":"true"},{"operatorId":"RS_3","operatorType":"REDUCESINK","operatorAttributes":"null","operatorCounters":"null","done":"true","started":"true"}],"done":"true","started":"true"},{"taskId":"Stage-1_REDUCE","taskType":"REDUCE","taskAttributes":"null","taskCounters":"null","operatorGraph":{"nodeType":"OPERATOR","roots":"null","adjacencyList":[{"node":"GBY_4","children":["SEL_5"],"adjacencyType":"CONJUNCTIVE"},{"node":"SEL_5","children":["FS_6"],"adjacencyType":"CONJUNCTIVE"}]},"operatorList":[{"operatorId":"GBY_4","operatorType":"GROUPBY","operatorAttributes":"null","operatorCounters":"}","done":"false","started":"true"},{"operatorId":"SEL_5","operatorType":"SELECT","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"true"},{"operatorId":"FS_6","operatorType":"FILESINK","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"true"}],"done":"false","started":"true"}],"done":"false","started":"true"}],"done":"false","started":"true"}" TIME="1369303703813"
TaskProgress TASK_HADOOP_PROGRESS="2013-05-23 06:08:24,818 Stage-1 map = 100%,  reduce = 14%, Cumulative CPU 23.04 sec" TASK_NUM_REDUCERS="1" TASK_NAME="org.apache.hadoop.hive.ql.exec.MapRedTask" TASK_NUM_MAPPERS="7" TASK_COUNTERS="org.apache.hadoop.hive.ql.exec.Operator$ProgressCounter.CREATED_FILES:0,Job Counters .Launched reduce tasks:1,Job Counters .SLOTS_MILLIS_MAPS:41820,Job Counters .Rack-local map tasks:8,Job Counters .Launched map tasks:8,File Input Format Counters .Bytes Read:0,FileSystemCounters.FILE_BYTES_WRITTEN:327828,FileSystemCounters.CFS_BYTES_READ:3527,org.apache.hadoop.hive.ql.exec.MapOperator$Counter.DESERIALIZE_ERRORS:0,Map-Reduce Framework.Map output materialized bytes:203,Map-Reduce Framework.Map input records:351783,Map-Reduce Framework.Reduce shuffle bytes:78,Map-Reduce Framework.Spilled Records:6,Map-Reduce Framework.Map output bytes:79,Map-Reduce Framework.Total committed heap usage (bytes):2057568256,Map-Reduce Framework.CPU time spent (ms):23040,Map-Reduce Framework.Map input bytes:351783,Map-Reduce Framework.SPLIT_RAW_BYTES:3527,Map-Reduce Framework.Combine input records:0,Map-Reduce Framework.Reduce input records:0,Map-Reduce Framework.Reduce input groups:0,Map-Reduce Framework.Combine output records:0,Map-Reduce Framework.Physical memory (bytes) snapshot:2163638272,Map-Reduce Framework.Reduce output records:0,Map-Reduce Framework.Virtual memory (bytes) snapshot:7556251648,Map-Reduce Framework.Map output records:6" TASK_ID="Stage-1" QUERY_ID="root_20130523051919_5e968691-9119-47fa-9b45-f61a90e86452" TASK_HADOOP_ID="job_201305220529_0009" TIME="1369303704818"
Counters plan="{"queryId":"root_20130523051919_5e968691-9119-47fa-9b45-f61a90e86452","queryType":null,"queryAttributes":{"queryString":"select count(*) from suborder_issue"},"queryCounters":"null","stageGraph":{"nodeType":"STAGE","roots":"null","adjacencyList":"]"},"stageList":[{"stageId":"Stage-1","stageType":"MAPRED","stageAttributes":"null","stageCounters":{"CNTR_NAME_Stage-1_REDUCE_PROGRESS":"14","CNTR_NAME_Stage-1_MAP_PROGRESS":"100"},"taskList":[{"taskId":"Stage-1_MAP","taskType":"MAP","taskAttributes":"null","taskCounters":"null","operatorGraph":{"nodeType":"OPERATOR","roots":"null","adjacencyList":[{"node":"TS_0","children":["SEL_1"],"adjacencyType":"CONJUNCTIVE"},{"node":"SEL_1","children":["GBY_2"],"adjacencyType":"CONJUNCTIVE"},{"node":"GBY_2","children":["RS_3"],"adjacencyType":"CONJUNCTIVE"}]},"operatorList":[{"operatorId":"TS_0","operatorType":"TABLESCAN","operatorAttributes":"null","operatorCounters":"}","done":"true","started":"true"},{"operatorId":"SEL_1","operatorType":"SELECT","operatorAttributes":"null","operatorCounters":"null","done":"true","started":"true"},{"operatorId":"GBY_2","operatorType":"GROUPBY","operatorAttributes":"null","operatorCounters":"null","done":"true","started":"true"},{"operatorId":"RS_3","operatorType":"REDUCESINK","operatorAttributes":"null","operatorCounters":"null","done":"true","started":"true"}],"done":"true","started":"true"},{"taskId":"Stage-1_REDUCE","taskType":"REDUCE","taskAttributes":"null","taskCounters":"null","operatorGraph":{"nodeType":"OPERATOR","roots":"null","adjacencyList":[{"node":"GBY_4","children":["SEL_5"],"adjacencyType":"CONJUNCTIVE"},{"node":"SEL_5","children":["FS_6"],"adjacencyType":"CONJUNCTIVE"}]},"operatorList":[{"operatorId":"GBY_4","operatorType":"GROUPBY","operatorAttributes":"null","operatorCounters":"}","done":"false","started":"true"},{"operatorId":"SEL_5","operatorType":"SELECT","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"true"},{"operatorId":"FS_6","operatorType":"FILESINK","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"true"}],"done":"false","started":"true"}],"done":"false","started":"true"}],"done":"false","started":"true"}" TIME="1369303704818"

我可以想到两件事。要么是内存不足,需要增加堆大小,要么是数据节点上的磁盘空间不足。检查日志分区,确保它们没有被填满。

唯一100%的建议是仔细查看reducer日志(您应该在集群上有它们,可能JobTracker web UI是最好的起点)。可能的原因太多了

最新更新