Beeline hangs on the second command when run through GNU parallel



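We are trying to run a fixed number of Hive queries using GNU parallel. Even with the parallelism set to 1 via -j1 (i.e. sequential execution), the first execution works but the second one hangs: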

$ parallel -j1 --eta --verbose beeline -e '"SELECT "{}";"' ::: a b c
beeline -e "SELECT "a";"
Computers / CPU cores / Max jobs to run
1:local / 40 / 1
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 0s Left: 3 AVG: 0.00s  local:1/0/100%/0.0s
+------+
| _c0  |
+------+
| a    |
+------+
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console. Set system property 'log4j2.debug' to show Log4j2 internal initialization logging.
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://en02.example.cloud:2181,mn01.example.cloud:2181,mn02.example.cloud:2181/default;principal=hive/_HOST@example.cloud;serviceDiscoveryMode=zooKeeper;ssl=true;zooKeeperNamespace=hiveserver2
21/11/16 10:41:39 [main]: INFO jdbc.HiveConnection: Connected to mn01.example.cloud:10000
Connected to: Apache Hive (version 3.1.3000.7.1.6.0-297)
Driver: Hive JDBC (version 3.1.3000.7.1.6.0-297)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=hive_20211116104139_7be8d8ee-f58d-4572-88a4-43533846160b): SELECT "a"
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20211116104139_7be8d8ee-f58d-4572-88a4-43533846160b); Time taken: 0.102 seconds
INFO  : Executing command(queryId=hive_20211116104139_7be8d8ee-f58d-4572-88a4-43533846160b): SELECT "a"
INFO  : Completed executing command(queryId=hive_20211116104139_7be8d8ee-f58d-4572-88a4-43533846160b); Time taken: 0.006 seconds
INFO  : OK
1 row selected (0.191 seconds)
Beeline version 3.1.3000.7.1.6.0-297 by Apache Hive
Closing: 0: jdbc:hive2://en02.example.cloud:2181,mn01.example.cloud:2181,mn02.example.cloud:2181/default;principal=hive/_HOST@example.cloud;serviceDiscoveryMode=zooKeeper;ssl=true;zooKeeperNamespace=hiveserver2
beeline -e "SELECT "b";"
ETA: 79s Left: 2 AVG: 42.00s  local:1/1/100%/47.0s

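Simplifying this further, even a call to beeline --help through parallel hangs on the second run in the same way, so it does not appear to be related to the connection to the Hive DB.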

The solutions that finally made it work for us are

parallel -j1 --eta --verbose beeline -e '"SELECT "{}";"' < /dev/null ::: a b c

and (thanks @OleTange!)

parallel -j1 --eta --verbose --tty beeline -e '"SELECT "{}";"' ::: a b c

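How we found this: we added a set -x to the beeline bash script and to some of the scripts it calls, logged the output of each parallel run to a separate file, and diffed the files. We saw that part of the log was about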

[ -p /dev/stdin ]

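and about some environment variables that were set in the first parallel execution but not in the second. We then went through various ways of giving beeline a stdin, and the /dev/null version finally worked.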
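
For anyone who wants to repeat the trace-and-diff approach on something smaller, here is a minimal sketch. The wrapper.sh below is a hypothetical stand-in written for this example, not the real beeline launcher; the only thing it shares with it is a branch on [ -p /dev/stdin ]. The file names and the mode variable are likewise made up for illustration.

```shell
# Scratch directory for the stand-in script and the trace files.
workdir=$(mktemp -d)

# Hypothetical stand-in for a launcher script that branches on its stdin type.
cat > "$workdir/wrapper.sh" <<'EOF'
if [ -p /dev/stdin ]; then
    mode=piped
else
    mode=not-piped
fi
echo "mode=$mode"
EOF

# Run it twice under tracing (bash -x), once with a pipe on stdin and once
# with /dev/null, sending each trace (stderr) to its own file.
echo | bash -x "$workdir/wrapper.sh" 2> "$workdir/run1.trace" > "$workdir/run1.out"
bash -x "$workdir/wrapper.sh" < /dev/null 2> "$workdir/run2.trace" > "$workdir/run2.out"

# Compare the traces. diff exits non-zero when the files differ,
# so "|| true" keeps a calling script with "set -e" from aborting here.
diff "$workdir/run1.trace" "$workdir/run2.trace" || true
```

The two traces share the [ -p /dev/stdin ] test line and then diverge on the variable that gets set, which is the kind of difference we were looking for in the real beeline traces.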
