Reading Vertica data into pySpark throws "Failed to find data source"



I have Spark 3.2 and Vertica 9.2.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Ukraine").master("local[*]") \
    .config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-jdbc-9.2.1-0.jar') \
    .config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-spark-3.2.1.jar') \
    .getOrCreate()

table = "test"
db = "myDB"
user = "myUser"
password = "myPassword"
host = "myVerticaHost"
part = "12"

opt = {"host": host, "table": table, "db": db, "numPartitions": part, "user": user, "password": password}

df = spark.read.format("com.vertica.spark.datasource.DefaultSource").options(**opt).load()

Py4JJavaError: An error occurred while calling o77.load.
: java.lang.ClassNotFoundException: 
Failed to find data source: com.vertica.spark.datasource.DefaultSource. Please find packages at
http://spark.apache.org/third-party-projects.html
~/shivamenv/venv/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
109     def deco(*a, **kw):
110         try:
--> 111             return f(*a, **kw)
112         except py4j.protocol.Py4JJavaError as e:
113             converted = convert_exception(e.java_exception)
~/shivamenv/venv/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326                 raise Py4JJavaError(
327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
329             else:
330                 raise Py4JError(

Before this step, I had downloaded (wget) the two jars referenced in the SparkSession config into the Spark jars folder.

I got them from

https://libraries.io/maven/com.vertica.spark:vertica-spark
https://www.vertica.com/download/vertica/client-drivers/

Not sure what I am doing wrong here. Is there an alternative to the spark.jars option?
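For reference, spark.jars takes a single comma-separated list of jar paths, so calling .config("spark.jars", ...) twice only keeps the last value; the same list can also be passed with the --jars flag when launching pyspark or spark-submit. A minimal sketch of the builder with one combined spark.jars entry, assuming the jar paths from the snippet above:

from pyspark.sql import SparkSession

# Sketch only: both jars in a single comma-separated spark.jars value
# (paths assumed from the question; a second .config("spark.jars", ...) call would overwrite this one).
jars = ",".join([
    "/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-jdbc-9.2.1-0.jar",
    "/home/shivamanand/spark-3.2.1-bin-hadoop3.2/jars/vertica-spark-3.2.1.jar",
])

spark = (
    SparkSession.builder
    .appName("Ukraine")
    .master("local[*]")
    .config("spark.jars", jars)
    .getOrCreate()
)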

In the link below -

https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SparkConnector/GettingTheSparkConnector.htm?tocpath=Integrating%20with%20Apache%20Spark%7C_____1

they mention:

Both of these libraries are installed with the Vertica server and are available on all nodes of the Vertica cluster in the following locations:

The Spark Connector files are located in /opt/vertica/packages/SparkConnector/lib. The JDBC client library is /opt/vertica/java/vertica-jdbc.jar.

Should I try replacing the jars in my local folder with these?

There is no need to replace the jars in your local folder. After copying them to the Spark cluster, you run the spark-shell command with the options below. Please find a sample below. By the way, Vertica officially supports only Spark 2.x with Vertica 9.2. I hope this helps.

https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SupportedPlatforms/SparkIntegration.htm

spark-shell --jars vertica-spark2.1_scala2.11.jar,vertica-jdbc-9.2.1-11.jar

18:26:35 WARN: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://dholmes14:4040
Spark context available as 'sc' (master = local[*], app id = local-1597170403068).
Spark session available as 'spark'.
Welcome to Spark version 2.4.6

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.SparkSession

import org.apache.spark.storage._

val df1 = spark.read.format("com.vertica.spark.datasource.DefaultSource").option("host", "").option("port", 5433).option("db", "").option("user", "dbadmin").option("dbschema", "").option("table", "").option("numPartitions", 3).option("LogLevel", "DEBUG").load()

val df2 = df1.filter("column_name between 800055 and 8000126").groupBy("column1", "column2")

spark.time(df2.show())
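Since the question is about pySpark, a rough Python equivalent of the same read may help. This is a minimal sketch, assuming the connector and JDBC jars are already on the classpath and an existing SparkSession named spark (e.g. from the pyspark shell); the host, db, dbschema, and table values below are placeholders, not values from the answer:

import time

# Sketch only: placeholder connection values; replace with your own.
opts = {
    "host": "myVerticaHost",
    "port": "5433",
    "db": "myDB",
    "user": "dbadmin",
    "dbschema": "public",
    "table": "test",
    "numPartitions": "3",
    "LogLevel": "DEBUG",
}

df1 = spark.read.format("com.vertica.spark.datasource.DefaultSource").options(**opts).load()

# groupBy needs an aggregation (e.g. count) before show() can display the result.
df2 = df1.filter("column_name between 800055 and 8000126").groupBy("column1", "column2").count()

# spark.time() is Scala-only; a plain timer is a rough stand-in in Python.
start = time.time()
df2.show()
print("elapsed: %.2fs" % (time.time() - start))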
