我从运行 Spark 的 AWS EMR 集群连接到另一个运行 presto 的 AWS EMR 集群时遇到问题。
代码 - 用python编写 - 是:
jdbcDF = spark.read
.format("jdbc")
.option("driver", "com.facebook.presto.jdbc.PrestoDriver")
.option("url", "jdbc:presto://ec2-xxxxxxxxxxxx.ap-southeast-2.compute.amazonaws.com:8889/hive/data-lake")
.option("user", "hadoop")
.option("dbtable", "customer")
.load()
通过 AWSemr add-steps
部署,选项--packages,'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.0,org.postgresql:postgresql:42.2.9,com.facebook.presto:presto-jdbc:0.60',
部署时会引发以下错误
线程"main"中的异常 java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862( at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64( at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:237( at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:330( at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala( 由以下原因引起:org.apache.spark.SparkException:在等待结果中抛出异常: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226( at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75( at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101( at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:250( at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65( at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64( at java.security.AccessController.doPrivileged(Native Method( at javax.security.auth.Subject.doAs(Subject.java:422( at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844( ...4 更多 由以下原因引起:java.io.IOException:无法连接到 ip-xxxx-xxx.ap-southeast-2.compute.internal/xxx-xxxx:41885 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245( at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187( at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198( at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194( at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190( at java.util.concurrent.FutureTask.run(FutureTask.java:266( at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149( at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624( at java.lang.Thread.run(Thread.java:748( 由以下原因引起:io.netty.channel.AbstractChannel$AnnotatedConnectException: 连接被拒绝: ip-xxxxx.ap-southeast-2.compute.internal/xxxxxx:41885 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method( at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717( at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323( at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340( at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633( at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580( at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497( at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459( at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858( at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138( ...1 更多 原因:java.net.Connect异常:连接被拒绝 ...11 更多 日志结束类型:标准
虽然我已经编辑了上面的 IP 地址(安全第一(,但它与 Spark 从属实例上的内部 IP 地址相同。它似乎正在连接到自身并存在连接问题。
我已经打开了 AWS EC2 安全组中的端口,允许从 Spark 主/从对 presto 实例的访问。
如果有帮助,则为测试连接性而编写的快速节点脚本可以正常工作
var client = new presto.Client({
host: prestoEndpoint,
user: 'hadoop',
port: 8889,
});
client.execute({
query: 'select * from customer',
catalog: 'hive',
schema: 'data-lake',
source: 'nodejs-client',
state: function(error, query_id, stats) {
console.log({ message: 'status changed', id: query_id, stats: stats });
},
columns: function(error, data) {
console.log({ resultColumns: data });
},
data: function(error, data, columns, stats) {
console.log({data, columns});
},
success: function(error, stats) {
console.log(error);
console.log(JSON.stringify(stats, null,2));
},
error: function(error) {
console.log(error);
},
});
错误消息的关键部分似乎是
由以下原因引起:io.netty.channel.AbstractChannel$AnnotatedConnectException: 连接被拒绝: ip-xxxxx.ap-southeast-2.compute.internal/xxxxxx:41885
问题是 prest-jdbc 驱动程序的版本号
我将其从com.facebook.presto:presto-jdbc:0.60
更新为com.facebook.presto:presto-jdbc:0.225
所以完整的包参数是
--packages,'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.0,org.postgresql:postgresql:42.2.9,com.facebook.presto:presto-jdbc:0.255',
感谢@Lamanus发现那个