I am using Spark 1.3.0, which comes with the CDH 5.4.0 VM. I am trying to run a pyspark code snippet that queries a database over JDBC, but I cannot connect with either of these:
1) pyspark --driver-class-path /usr/share/java/mysql-connector-java.jar
2) os.environ['SPARK_CLASSPATH'] = "usr/share/java/mysql-connector-java.jar"
In both cases, when I run this statement:
dept1 = sqlContext.load(source="jdbc", url="jdbc_url", dbtable="departments")
I get the error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 482, in load
df = self._ssql_ctx.load(source, joptions)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
: java.lang.StringIndexOutOfBoundsException: String index out of range: 10
at java.lang.String.substring(String.java:1907)
at org.apache.spark.sql.jdbc.DriverQuirks$.get(DriverQuirks.scala:52)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:93)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:125)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:114)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Try adding the --jars /usr/share/java/mysql-connector-java.jar switch. Why use only --driver-class-path? That puts the jar on the driver only, not on the workers.
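For illustration, a minimal sketch of the full round trip on Spark 1.3: launch pyspark with both flags so the connector jar reaches the driver and the workers, then call load with an explicit driver class. The MySQL host, database, and credentials below are placeholders, and the driver option is my assumption about the Spark SQL JDBC data source, not something taken from the question:

# Launch first (shell), shipping the jar to driver and workers:
#   pyspark --driver-class-path /usr/share/java/mysql-connector-java.jar \
#           --jars /usr/share/java/mysql-connector-java.jar
# Then, inside the pyspark shell (Spark 1.3 API; sqlContext is predefined there):
dept1 = sqlContext.load(
    source="jdbc",
    url="jdbc:mysql://<host>:3306/<database>?user=<user>&password=<pwd>",  # placeholder URL
    dbtable="departments",
    driver="com.mysql.jdbc.Driver"  # assumed option naming the MySQL driver class
)
dept1.show()  # prints the first rows if the connection works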
The error is due to missing dependencies. Have you thought about using spark-redshift instead?
To connect through spark-redshift, check that you have these jar files in the Spark home directory:
- spark-redshift_2.10-3.0.0-preview1.jar
- RedshiftJDBC41-1.1.10.1010.jar
- hadoop-aws-2.7.1.jar
- aws-java-sdk-1.7.4.jar
- aws-java-sdk-s3-1.11.60.jar (newer version, but not everything works with it)
Put these jar files into the $SPARK_HOME/jars/ directory, then start Spark:
pyspark --jars $SPARK_HOME/jars/spark-redshift_2.10-3.0.0-preview1.jar,$SPARK_HOME/jars/RedshiftJDBC41-1.1.10.1010.jar,$SPARK_HOME/jars/hadoop-aws-2.7.1.jar,$SPARK_HOME/jars/aws-java-sdk-s3-1.11.60.jar,$SPARK_HOME/jars/aws-java-sdk-1.7.4.jar
(SPARK_HOME should be "/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec")
This will run Spark with all the necessary dependencies. Note that you also need to specify the authentication type, 'forward_spark_s3_credentials'=True, if you are using awsAccessKeys.
from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)

# S3 credentials for the temporary directory used by spark-redshift
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESSID>)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <ACCESSKEY>)

df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option("forward_spark_s3_credentials", True) \
    .option("tempdir", "s3n://bucket") \
    .load()
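As a quick sanity check (a sketch, assuming the cluster, table, and bucket above are real), you can confirm the read before running heavier queries:

df.printSchema()  # schema is resolved over JDBC, no data is moved yet
df.show()         # triggers the UNLOAD to the S3 tempdir and prints a sample of rows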
Common errors that come up afterwards are:
- Redshift connection error: "SSL off"
- Solution:
  .option("url", "jdbc:redshift://example.coyf2i236wts.eu-central-1.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory")
- S3 error: when unloading the data, e.g. after df.show(), you get the message: "The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."
- Solution: the bucket & cluster must be run within the same region
You can use this command:
pyspark --driver-class-path /usr/share/java/mysql-connector-java.jar --jars /usr/share/java/mysql-connector-java.jar
Or you can copy the jar file into the spark/jars folder. Then you can use the driver, as in the sketch below.
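Once the jar is visible to Spark by either route and the table has been loaded as in the earlier sketch, the resulting DataFrame can be queried like any other. A brief hedged follow-up on the Spark 1.3 API (dept1 is the DataFrame loaded earlier; the temp table name is illustrative):

dept1.registerTempTable("departments")              # expose the JDBC-backed table to Spark SQL
sqlContext.sql("SELECT * FROM departments").show()  # query it like a regular table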