JDBC not working in PySpark



I am using Spark 1.3.0, which ships with the CDH 5.4.0 VM.

I am trying to run a PySpark query over JDBC. I cannot connect using either of the following:

1) pyspark --driver-class-path /usr/share/java/mysql-connector-java.jar
2) os.environ['SPARK_CLASSPATH'] = "usr/share/java/mysql-connector-java.jar"

In both cases, when I run this statement:

dept1 = sqlContext.load(source="jdbc", url="jdbc_url", dbtable="departments")

I get the error:

Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib/spark/python/pyspark/sql/context.py", line 482, in load
     df = self._ssql_ctx.load(source, joptions)
   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
   line 538, in __call__
   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
   line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
 : java.lang.StringIndexOutOfBoundsException: String index out of range: 10
    at java.lang.String.substring(String.java:1907)
    at org.apache.spark.sql.jdbc.DriverQuirks$.get(DriverQuirks.scala:52)
    at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:93)
    at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:125)
    at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:114)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
    at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
    at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

Try adding the --jars /usr/share/java/mysql-connector-java.jar switch. With --driver-class-path alone you only put the jar on the driver, not on the workers.
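
A minimal sketch of the whole sequence, assuming the host, database, user, and password below are placeholders you replace with your own; the driver option is included in case the driver class is not picked up automatically:

# Start the shell with the connector on both the driver and the executors:
#   pyspark --driver-class-path /usr/share/java/mysql-connector-java.jar \
#           --jars /usr/share/java/mysql-connector-java.jar
# Then, inside pyspark (Spark 1.3 load API):
dept1 = sqlContext.load(
    source="jdbc",
    url="jdbc:mysql://<host>:3306/<database>?user=<user>&password=<password>",
    dbtable="departments",
    driver="com.mysql.jdbc.Driver")
dept1.show()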

The error is caused by a missing dependency. Have you considered using spark-redshift instead?

To connect through spark-redshift, check that you have these jar files in the Spark home directory:

  1. spark-redshift_2.10-3.0.0-preview1.jar
  2. RedshiftJDBC41-1.1.10.1010.jar
  3. hadoop-aws-2.7.1.jar
  4. aws-java-sdk-1.7.4.jar
  5. aws-java-sdk-s3-1.11.60.jar (newer version, but not all functions work with it)

Put these jar files into the $SPARK_HOME/jars/ directory, then start Spark:

pyspark --jars $SPARK_HOME/jars/spark-redshift_2.10-3.0.0-preview1.jar,$SPARK_HOME/jars/RedshiftJDBC41-1.1.10.1010.jar,$SPARK_HOME/jars/hadoop-aws-2.7.1.jar,$SPARK_HOME/jars/aws-java-sdk-s3-1.11.60.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar

(SPARK_HOME should be "/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec")

This runs Spark with all the necessary dependencies. Note that you also need to specify the authentication type 'forward_spark_s3_credentials'=True if you are using awsAccessKeys.

from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("forward_spark_s3_credentials", True) \
    .option("tempdir", "s3n://bucket") \
    .load()
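
A quick sanity check after the load (a hedged sketch; df is the DataFrame from the snippet above):

df.printSchema()   # inspect the columns resolved from Redshift
df.show(5)         # unloads data from Redshift to the S3 tempdir and reads it back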

Common errors after this are:

  • Redshift connection error: "SSL off"
    • Solution: .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory")
  • S3 error: when unloading the data, e.g. after df.show(), you get the message: "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."
    • Solution: the bucket and the cluster must run in the same region

You can use this command:

pyspark --driver-class-path mysql-connector-java.jar --jars "/usr/share/java/mysql-connector-java.jar"

Or you can copy the jar file into the Spark jars folder. Now you can use the driver.
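
For completeness, a hedged sketch of the same query using the DataFrameReader API available from Spark 1.4 onwards (on 1.3 stick to the sqlContext.load(...) form shown earlier); host, database, user, and password are placeholders:

jdbc_df = sqlContext.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://<host>:3306/<database>?user=<user>&password=<password>") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", "departments") \
    .load()
jdbc_df.show()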
