Giraph不能设置稍大的superstep值

当我将superstep设置为20时，效果很好。但是当我将superstep设置为200时，它不起作用。

hadoop jar Test-jar-with-dependencies.jar org.apache.giraph.GiraphRunner test.Test -mc test.TestMC -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /input/test.txt -w 1 -ca mapred.job.tracker=s1 -ca mapreduce.job.counters.limit=1000

，最后的结果是:

16/10/20 08:56:08 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 2 mappers
16/10/20 08:56:38 INFO job.HaltApplicationUtils$DefaultHaltInstructionsWriter: writeHaltInstructions: To halt after next superstep execute: 'bin/halt-application --zkServer s3:22181 --zkNode /_hadoopBsp/job_1476868823433_0017/_haltComputation'
16/10/20 08:56:38 INFO mapreduce.Job: Running job: job_1476868823433_0017
16/10/20 08:56:39 INFO mapreduce.Job: Job job_1476868823433_0017 running in uber mode : false
16/10/20 08:56:39 INFO mapreduce.Job:  map 50% reduce 0%
16/10/20 08:56:47 INFO mapreduce.Job:  map 100% reduce 0%
16/10/20 08:56:47 INFO mapreduce.Job: Job job_1476868823433_0017 failed with state FAILED due to: Task failed task_1476868823433_0017_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
16/10/20 08:56:47 INFO mapreduce.Job: Counters: 34
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=97529
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=76
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters 
        Failed map tasks=1
        Launched map tasks=2
        Other local map tasks=2
        Total time spent by all maps in occupied slots (ms)=33269
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=33269
        Total vcore-seconds taken by all map tasks=33269
        Total megabyte-seconds taken by all map tasks=34067456
    Map-Reduce Framework
        Map input records=1
        Map output records=0
        Input split bytes=44
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=130
        CPU time spent (ms)=7280
        Physical memory (bytes) snapshot=186077184
        Virtual memory (bytes) snapshot=823398400
        Total committed heap usage (bytes)=200802304
    Zookeeper base path
        /_hadoopBsp/job_1476868823433_0017=0
    Zookeeper halt node
        /_hadoopBsp/job_1476868823433_0017/_haltComputation=0
    Zookeeper server:port
        s3:22181=0
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=0

我的测试代码是:顶点计算

public class Test extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable>{
    @Override
    public void compute(
            Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
            Iterable<DoubleWritable> messages) throws IOException {
        // TODO Auto-generated method stub
    }
}

掌握计算

public class TestMC extends DefaultMasterCompute {
    @Override
    public void compute() {
        // TODO Auto-generated method
        if (getSuperstep() == 200) {
            haltComputation();
        }
    }
}

似乎计数器太小(120)，但我将其设置为1000。如何解决这个问题?

错误日志为:

2016-10-20 08:56:38,569 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 199 took 0.016 seconds ended with state ALL_SUPERSTEPS_DONE and is now on superstep 200
2016-10-20 08:56:38,573 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: setJobState: {"_stateKey":"FINISHED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep 200
2016-10-20 08:56:38,574 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: setJobState: {"_stateKey":"FINISHED","_applicationAttemptKey":-1,"_superstepKey":-1}
2016-10-20 08:56:38,574 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x157df96ea710000 type:create cxid:0x236f zxid:0x143b txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1476868823433_0017/_cleanedUpDir Error:KeeperErrorCode = NoNode for /_hadoopBsp/job_1476868823433_0017/_cleanedUpDir
2016-10-20 08:56:38,574 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x157df96ea710001 type:create cxid:0xd8f zxid:0x143c txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1476868823433_0017/_masterJobState Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1476868823433_0017/_masterJobState
2016-10-20 08:56:38,575 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: cleanup: Notifying master its okay to cleanup with /_hadoopBsp/job_1476868823433_0017/_cleanedUpDir/0_master
2016-10-20 08:56:38,575 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x157df96ea710000 type:create cxid:0x2375 zxid:0x143f txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1476868823433_0017/_cleanedUpDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1476868823433_0017/_cleanedUpDir
2016-10-20 08:56:38,575 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: cleanUpZooKeeper: Node /_hadoopBsp/job_1476868823433_0017/_cleanedUpDir already exists, no need to create.
2016-10-20 08:56:38,576 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: cleanUpZooKeeper: Got 1 of 2 desired children from /_hadoopBsp/job_1476868823433_0017/_cleanedUpDir
2016-10-20 08:56:38,576 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/job_1476868823433_0017/_cleanedUpDir to change since only got 1 nodes.
2016-10-20 08:56:40,710 INFO [communication thread] org.apache.hadoop.mapred.Task: Communication exception: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): IPC server unable to read call parameters: Too many counters: 121 max=120
    at org.apache.hadoop.ipc.Client.call(Client.java:1411)
    at org.apache.hadoop.ipc.Client.call(Client.java:1364)
    at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
    at com.sun.proxy.$Proxy7.statusUpdate(Unknown Source)
    at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:737)
    at java.lang.Thread.run(Thread.java:745)
2016-10-20 08:56:40,879 INFO [main-EventThread] org.apache.giraph.bsp.BspService: process: cleanedUpChildrenChanged signaled
2016-10-20 08:56:40,880 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: cleanUpZooKeeper: Got 2 of 2 desired children from /_hadoopBsp/job_1476868823433_0017/_cleanedUpDir
2016-10-20 08:56:40,880 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x157df96ea710001
2016-10-20 08:56:40,882 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /219.223.239.57:49390 which had sessionid 0x157df96ea710001
2016-10-20 08:56:40,888 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: cleanup: Removed HDFS checkpoint directory (_bsp/_checkpoints//job_1476868823433_0017) with return = false since the job Giraph: cost.Test succeeded 
2016-10-20 08:56:40,888 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.comm.netty.NettyClient: stop: Halting netty client
2016-10-20 08:56:40,890 INFO [netty-client-worker-0] org.apache.giraph.comm.netty.NettyClient: stop: reached wait threshold, 1 connections closed, releasing resources now.
2016-10-20 08:56:43,095 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.comm.netty.NettyClient: stop: Netty client halted
2016-10-20 08:56:43,095 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.comm.netty.NettyServer: stop: Halting netty server
2016-10-20 08:56:43,106 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.comm.netty.NettyServer: stop: Start releasing resources
2016-10-20 08:56:43,780 INFO [communication thread] org.apache.hadoop.mapred.Task: Communication exception: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): IPC server unable to read call parameters: Too many counters: 121 max=120
    at org.apache.hadoop.ipc.Client.call(Client.java:1411)
    at org.apache.hadoop.ipc.Client.call(Client.java:1364)
    at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
    at com.sun.proxy.$Proxy7.statusUpdate(Unknown Source)
    at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:737)
    at java.lang.Thread.run(Thread.java:745)
2016-10-20 08:56:43,793 INFO [communication thread] org.apache.hadoop.mapred.Task: Process Thread Dump: Communication exception
46 active threads
Thread 56 (netty-server-worker-15):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 1
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:596)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:306)
    io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
    java.lang.Thread.run(Thread.java:745)
Thread 55 (netty-server-worker-14):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 1

现在可以工作了!在命令行中设置属性mapreduce.job.counters.limit似乎没有帮助。但是，将它添加到$HADOOP_HOME/conf/mapred-site.xml后，它工作了。

<property>
    <name>mapreduce.job.counters.limit</name>
    <value>20000</value>
    <description>Limit on the number of counters allowed per job. The default value is 200.</description>
</property>

相关内容

最新更新

热门标签：