I am trying to run a Giraph-based application on a Hadoop cluster via YARN. The command I use is:
yarn jar solver-1.0-SNAPSHOT.jar edu.agh.iga.adi.giraph.IgaSolverTool
First, I need to copy the JAR into one of the directories reported by issuing yarn classpath. Just to be sure, I also changed the file permissions to 777.
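Concretely, that setup step looks something like the sketch below. The directory is only an example - substitute one actually reported by `yarn classpath` on your cluster:

```shell
# Example directory; substitute one reported by `yarn classpath`:
CP_DIR="/usr/lib/hadoop/lib"
# The actual copy on the cluster node would be:
#   cp solver-1.0-SNAPSHOT.jar "${CP_DIR}/"
#   chmod 777 "${CP_DIR}/solver-1.0-SNAPSHOT.jar"
echo "target: ${CP_DIR}/solver-1.0-SNAPSHOT.jar"
```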
I obviously need to ship that JAR to the workers, so I do this:
conf.setYarnLibJars(currentJar());
where, in the code, currentJar() is:
private static String currentJar() {
  // Resolve the location of the JAR containing this class and keep only
  // its file name, e.g. "solver-1.0-SNAPSHOT.jar"
  return new File(IgaGiraphJobFactory.class.getProtectionDomain()
      .getCodeSource()
      .getLocation()
      .getPath()).getName();
}
This uses the JAR name, which seems right, since the application no longer crashes quickly (it does if I use anything else). Instead, it takes around 10 minutes before it reports a failure. The logs contain this error:
LogType:gam-stderr.log
LogLastModifiedTime:Sat Sep 14 13:24:52 +0000 2019
LogLength:2122
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/kbhit/appcache/application_1568451681492_0016/filecache/11/solver-1.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "pool-6-thread-2" java.lang.IllegalStateException: Could not configure the containerlaunch context for GiraphYarnTasks.
at org.apache.giraph.yarn.GiraphApplicationMaster.getTaskResourceMap(GiraphApplicationMaster.java:391)
at org.apache.giraph.yarn.GiraphApplicationMaster.access$500(GiraphApplicationMaster.java:78)
at org.apache.giraph.yarn.GiraphApplicationMaster$LaunchContainerRunnable.buildContainerLaunchContext(GiraphApplicationMaster.java:522)
at org.apache.giraph.yarn.GiraphApplicationMaster$LaunchContainerRunnable.run(GiraphApplicationMaster.java:479)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://iga-adi-m/user/yarn/giraph_yarn_jar_cache/application_1568451681492_0016/solver-1.0-SNAPSHOT.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1533)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1526)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1541)
at org.apache.giraph.yarn.YarnUtils.addFileToResourceMap(YarnUtils.java:153)
at org.apache.giraph.yarn.YarnUtils.addFsResourcesToMap(YarnUtils.java:77)
at org.apache.giraph.yarn.GiraphApplicationMaster.getTaskResourceMap(GiraphApplicationMaster.java:387)
... 6 more
End of LogType:gam-stderr.log.This log file belongs to a running container (container_1568451681492_0016_01_000001) and so may not be complete.
This leads to class-not-found errors in the worker containers (GiraphYarnTask).
It seems that for some reason the JAR is not transferred to HDFS along with the configuration (which does get transferred). What could be the reason?
Also, the JAR does seem to be getting sent:
1492_0021/solver-1.0-SNAPSHOT.jar, packetSize=65016, chunksPerPacket=126, bytesCurBlock=73672704
2019-09-14 14:08:26,252 DEBUG [DFSOutputStream] - enqueue full packet seqno: 1142 offsetInBlock: 73672704 lastPacketInBlock: false lastByteOffsetInBlock: 73737216, src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, bytesCurBlock=73737216, blockSize=134217728, appendChunk=false, blk_1073741905_1081@[DatanodeInfoWithStorage[10.164.0.6:9866,DS-2d8f815f-1e64-4a7f-bbf6-0c91ebc613d7,DISK], DatanodeInfoWithStorage[10.164.0.7:9866,DS-6a606f45-ffb7-449f-ab8b-57d5950d5172,DISK]]
2019-09-14 14:08:26,252 DEBUG [DataStreamer] - Queued packet 1142
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1142 offsetInBlock: 73672704 lastPacketInBlock: false lastByteOffsetInBlock: 73737216
2019-09-14 14:08:26,253 DEBUG [DFSClient] - computePacketChunkSize: src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, chunkSize=516, chunksPerPacket=126, packetSize=65016
2019-09-14 14:08:26,253 DEBUG [DFSClient] - DFSClient writeChunk allocating new packet seqno=1143, src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, packetSize=65016, chunksPerPacket=126, bytesCurBlock=73737216
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DFSClient seqno: 1141 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 323347 flag: 0 flag: 0
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DFSClient seqno: 1142 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 326916 flag: 0 flag: 0
2019-09-14 14:08:26,254 DEBUG [DataStreamer] - Queued packet 1143
2019-09-14 14:08:26,256 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1143 offsetInBlock: 73737216 lastPacketInBlock: false lastByteOffsetInBlock: 73771432
2019-09-14 14:08:26,256 DEBUG [DataStreamer] - Queued packet 1144
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - Waiting for ack for: 1144
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - DFSClient seqno: 1143 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 497613 flag: 0 flag: 0
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1144 offsetInBlock: 73771432 lastPacketInBlock: true lastByteOffsetInBlock: 73771432
2019-09-14 14:08:26,263 DEBUG [DataStreamer] - DFSClient seqno: 1144 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 2406978 flag: 0 flag: 0
2019-09-14 14:08:26,263 DEBUG [DataStreamer] - Closing old block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081
2019-09-14 14:08:26,264 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit sending #12 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete
2019-09-14 14:08:26,266 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit got value #12
2019-09-14 14:08:26,267 DEBUG [ProtobufRpcEngine] - Call: complete took 4ms
2019-09-14 14:08:26,267 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit sending #13 org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo
2019-09-14 14:08:26,268 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit got value #13
2019-09-14 14:08:26,268 DEBUG [ProtobufRpcEngine] - Call: getFileInfo took 1ms
2019-09-14 14:08:26,269 INFO [YarnUtils] - Registered file in LocalResources :: hdfs://iga-adi-m/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar
But once I inspect the contents of that directory, the JAR is not there:
2019-09-14 14:16:42,795 DEBUG [ProtobufRpcEngine] - Call: getListing took 6ms
Found 1 items
-rw-r--r-- 2 yarn hadoop 187800 2019-09-14 14:08 hdfs://iga-adi-m/user/yarn/giraph_yarn_jar_cache/application_1568451681492_0021/giraph-conf.xml
Meanwhile, if I just manually copy the jar into that directory (predicting its name), everything works as expected. What is going on?
I think it might be related to GIRAPH-859.
It seems that even though the Giraph maintainers claim it runs in YARN mode, it really doesn't. There are a number of bugs that make it hard to use unless you know what the root cause is, as in this case.
The cause here is that when Giraph sends the jars to HDFS, from where the workers are supposed to fetch them, it uploads to one location and downloads from another, so the workers cannot find the file. This happens if the application is launched as a user other than yarn - which is probably a fairly common scenario.
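The mismatch is visible directly in the logs above: the client uploads under the submitting user's HDFS home, while the application master resolves the path under the yarn user's home. A small sketch of the two paths (user name and application id taken from the logs):

```shell
SUBMIT_USER="kbhit"                       # user who submitted the job (from the logs)
APP_ID="application_1568451681492_0021"   # application id (from the logs)
# Where the client actually uploads the JAR:
UPLOAD="/user/${SUBMIT_USER}/giraph_yarn_jar_cache/${APP_ID}"
# Where the application master later looks for it:
DOWNLOAD="/user/yarn/giraph_yarn_jar_cache/${APP_ID}"
echo "uploaded to:     ${UPLOAD}"
echo "searched for in: ${DOWNLOAD}"
```

Because the two paths differ whenever the submitting user is not yarn, the application master's lookup fails with the FileNotFoundException shown above.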
There are 3 workarounds, none of which is ideal (and some may not apply to your case):
- just run the application as the yarn user
- manually upload the JAR before each computation (note that you have to make sure you upload it into a fresh directory - just increment the job number - and remember that you have to create that directory first)
- apply this patch and build against that version of Giraph

I tested all three and they all work.
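For the second workaround, the manual upload can be sketched as follows. The application id here is hypothetical - you have to predict the next one by incrementing the number of the last submitted job:

```shell
APP_ID="application_1568451681492_0022"   # hypothetical: last job's number, incremented
JAR="solver-1.0-SNAPSHOT.jar"
# The staging directory must live under the *yarn* user's HDFS home,
# because that is where the application master looks:
DEST="/user/yarn/giraph_yarn_jar_cache/${APP_ID}"
# On a live cluster you would run:
#   hdfs dfs -mkdir -p "${DEST}"       # the directory must exist first
#   hdfs dfs -put "${JAR}" "${DEST}/"
echo "upload target: ${DEST}/${JAR}"
```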
I ran into a similar error:
20/03/04 09:40:10 ERROR yarn.GiraphYarnTask: GiraphYarnTask threw a top-level exception, failing task
java.lang.RuntimeException: run() caught an unrecoverable IOException.
at org.apache.giraph.yarn.GiraphYarnTask.run(GiraphYarnTask.java:97)
at org.apache.giraph.yarn.GiraphYarnTask.main(GiraphYarnTask.java:183)
Caused by: java.io.FileNotFoundException: File hdfs://localhost:9000/user/schramml/_bsp/_defaultZkManagerDir/giraph_yarn_application_1583310839052_0001 does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:993)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:118)
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1053)
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1050)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1050)
at org.apache.giraph.zk.ZooKeeperManager.getServerListFile(ZooKeeperManager.java:346)
at org.apache.giraph.zk.ZooKeeperManager.getZooKeeperServerList(ZooKeeperManager.java:376)
at org.apache.giraph.zk.ZooKeeperManager.setup(ZooKeeperManager.java:190)
at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java:449)
at org.apache.giraph.graph.GraphTaskManager.setup(GraphTaskManager.java:251)
at org.apache.giraph.yarn.GiraphYarnTask.run(GiraphYarnTask.java:91)
... 1 more
But in my case the cause was that I was using an aggregator writer and had to delete the writer's file from the previous run. There was also a file already exist error in another container, but at first I only found this question, so maybe this information helps someone else.
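In that situation the fix is simply to remove the leftover output before resubmitting; a minimal sketch, with a hypothetical path - substitute wherever your aggregator writer actually writes:

```shell
AGG_OUT="/user/schramml/aggregator-output"   # hypothetical path; use your writer's real output location
# On a live cluster:
#   hdfs dfs -rm -r -f "${AGG_OUT}"
echo "would remove: ${AGG_OUT}"
```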