Unable to initialize a session in Spark. How do I debug "User capacity has reached its maximum limit."?



I am trying to create a session in Apache Spark using the Livy REST API. It fails with the diagnostic "User capacity has reached its maximum limit."

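The user is already running another Spark job. I don't understand which capacity has reached its maximum, or how to tune the Spark configuration parameters to fix it. Here is the log output I believe is relevant. I reformatted it to make it clearer: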

22/05/30 19:18:51 INFO Client: Submitting application application_1653913029140_0247 to ResourceManager
22/05/30 19:18:51 INFO YarnClientImpl: Submitted application application_1653913029140_0247
22/05/30 19:18:51 INFO Client: Application report for application_1653913029140_0247 (state: ACCEPTED)
22/05/30 19:18:51 INFO Client: 
client token: N/A
diagnostics: [Mon May 30 19:18:51 -0300 2022] 
Application is Activated, waiting for resources to be assigned for AM. User capacity has reached its maximum limit. 
Details : AM Partition = <DEFAULT_PARTITION> ; 
Partition Resource = <memory:2662400, vCores:234> ; 
Queue's Absolute capacity = 32.0 % ; 
Queue's Absolute used capacity = 40.76923 % ; 
Queue's Absolute max capacity = 100.0 % ; 
Queue's capacity (absolute resource) = <memory:851967, vCores:74> ; 
Queue's used capacity (absolute resource) = <memory:1085440, vCores:106> ; 
Queue's max capacity (absolute resource) = <memory:2662400, vCores:234> ; "
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1653949131433
final status: UNDEFINED
tracking URL: http://vrt1557.bndes.net:8088/proxy/application_1653913029140_0247/
user: s-dtl-p01
22/05/30 19:18:51 INFO ShutdownHookManager: Shutdown hook called

The other job, which is already running, has some Spark parameters configured for high performance:

conf = {
    'spark.yarn.appMasterEnv.PYSPARK_PYTHON': 'python3',
    'spark.cores.max': 50,
    'spark.executor.memory': '10g',
    'spark.executor.instances': 100,
    'spark.driver.memory': '10g',
}

The job that fails to start has no Spark parameters configured and uses the cluster defaults.

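Of course, I could tune the Spark parameters of the running job so that it does not block resource allocation for the new job, but I want to understand what is happening. The queue configuration also has many parameters, and those presumably interact with the application's settings.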

Which resource is exhausted? How can I find that out from the log above?
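As a starting point, the percentages in the diagnostic can be cross-checked against the absolute figures it reports. The following is a minimal Python sketch with the values copied from the log above; the helper name `share` is purely illustrative and not part of YARN or Spark.

```python
# Values copied from the YARN diagnostic in the log (memory as YARN reports it).
partition = {"memory": 2662400, "vcores": 234}       # Partition Resource
queue_capacity = {"memory": 851967, "vcores": 74}    # Queue's capacity (absolute resource)
queue_used = {"memory": 1085440, "vcores": 106}      # Queue's used capacity (absolute resource)


def share(part, whole):
    """Return each resource in `part` as a percentage of the same resource in `whole`."""
    return {name: 100.0 * part[name] / whole[name] for name in part}


capacity_pct = share(queue_capacity, partition)      # queue capacity vs. partition
used_pct = share(queue_used, partition)              # queue usage vs. partition
used_vs_capacity = share(queue_used, queue_capacity) # queue usage vs. its own capacity

for label, values in [
    ("capacity / partition", capacity_pct),
    ("used / partition", used_pct),
    ("used / capacity", used_vs_capacity),
]:
    print(label, {name: round(pct, 2) for name, pct in values.items()})
```

The memory figures reproduce the 32.0 % and 40.76923 % shown in the diagnostic. They also show that the queue is already using more than its configured capacity while staying far below its 100 % maximum, so the queue itself is not full and the limit being hit must be narrower than the queue.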

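The YARN Capacity Scheduler produces this diagnostic when it determines that the resource allocation requested by the application would violate the preset per-user limit. Here is the relevant part of LeafQueue.java: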

  :
if (!userAssignable) {
  application.updateAMContainerDiagnostics(AMState.ACTIVATED,
      "User capacity has reached its maximum limit.");
  ActivitiesLogger.APP.recordRejectedAppActivityFromLeafQueue(
      activitiesManager, node, application, application.getPriority(),
      ActivityDiagnosticConstant.QUEUE_HIT_USER_MAX_CAPACITY_LIMIT);
  continue;
}
  :

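So the queue-level metrics you quoted may not be enough to determine which capacity limit was exceeded. You could enable DEBUG logging for the scheduler and then look for one of the messages generated by the LeafQueue.canAssignToUser() method.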
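For orientation while the DEBUG output is not yet available, here is a rough Python sketch of the upper bound that the Capacity Scheduler documentation describes for a single user: the queue's configured capacity multiplied by `yarn.scheduler.capacity.<queue-path>.user-limit-factor`. It rests on two assumptions that the log does not confirm: that the factor is at its documented default of 1, and that all of the queue's current usage belongs to the same user. The real per-user limit also depends on `minimum-user-limit-percent` and on the number of active users, so treat this as an approximation and not as what canAssignToUser() actually computes. The function names are illustrative.

```python
def user_cap_upper_bound(queue_capacity_mb, user_limit_factor=1.0):
    """Approximate ceiling for one user's memory in a leaf queue.

    Assumption: the ceiling is queue capacity times user-limit-factor;
    the scheduler's real computation has additional terms.
    """
    return int(queue_capacity_mb * user_limit_factor)


def factor_needed(user_used_mb, queue_capacity_mb):
    """Smallest user-limit-factor under which `user_used_mb` stays within the ceiling."""
    return user_used_mb / queue_capacity_mb


# Figures from the diagnostic in the question.
queue_capacity_mb = 851967   # Queue's capacity (absolute resource)
queue_used_mb = 1085440      # Queue's used capacity; ASSUMED to be one user's usage

cap = user_cap_upper_bound(queue_capacity_mb)         # assumes default factor of 1
over = queue_used_mb > cap                            # already above the ceiling?
needed_factor = factor_needed(queue_used_mb, queue_capacity_mb)

print(f"approximate per-user ceiling: {cap} MB")
print(f"usage above ceiling: {over}")
print(f"factor needed to cover current usage: {needed_factor:.2f}")
```

If both assumptions hold, the running job alone already sits above the per-user ceiling, which would be consistent with the new application's AM waiting even though the queue is far from its maximum capacity. The DEBUG messages remain the way to confirm which limit was actually applied.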