Spark expects an HDFS location instead of a local directory



I am trying to run Spark streaming but am running into this issue. Please help.

from pyspark.sql import SparkSession

if __name__ == "__main__":
    print("Application started")

    spark = SparkSession \
        .builder \
        .appName("Socker streaming demo") \
        .master("local[*]") \
        .getOrCreate()

    # The socket stream is returned as an unbounded table
    stream_df = spark \
        .readStream \
        .format("socket") \
        .option("host", "localhost") \
        .option("port", "1100") \
        .load()

    print(stream_df.isStreaming)
    stream_df.printSchema()

    write_query = stream_df \
        .writeStream \
        .format("console") \
        .start()

    # This call blocks, turning the streaming application into a never-ending job
    write_query.awaitTermination()

    print("Application Completed")

The error I am getting:

22/07/31 00:13:16 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: C:\Users\786000702\AppData\Local\Temp\temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
Traceback (most recent call last):
File "D:PySparkProjectpySparkStreamsocker_streaming.py", line 23, in <module>
write_query = stream_df 
File "D:PySparkProjectvenvlibsite-packagespysparksqlstreaming.py", line 1202, in start
return self._sq(self._jwrite.start())
File "D:PySparkProjectvenvlibsite-packagespy4jjava_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "D:PySparkProjectvenvlibsite-packagespysparksqlutils.py", line 111, in deco
return f(*a, **kw)
File "D:PySparkProjectvenvlibsite-packagespy4jprotocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o36.start.

: org.apache.hadoop.fs.InvalidPathException: Invalid path name Path part /C:/Users/786000702/AppData/Local/Temp/temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54 from URI hdfs://0.0.0.0:19000/C:/Users/786000702/AppData/Local/Temp/temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54 is not a valid filename.
at org.apache.hadoop.fs.AbstractFileSystem.getUriPath(AbstractFileSystem.java:427)
at org.apache.hadoop.fs.Hdfs.mkdir(Hdfs.java:366)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:809)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:805)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:812)
at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createCheckpointDirectory(CheckpointFileManager.scala:368)
at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$.resolveCheckpointLocation(ResolveWriteToStream.scala:121)
at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$$anonfun$apply$1.applyOrElse(ResolveWriteToStream.scala:42)
at ...

You can change Spark's default filesystem by editing fs.defaultFS in the core-site.xml file in your Spark or Hadoop conf directory.
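If you would rather not edit core-site.xml, the same property can usually be overridden per application through Spark's spark.hadoop.* config passthrough. A minimal sketch, assuming a local-only setup (the file:/// value here is an assumption, not something from your post):

from pyspark.sql import SparkSession

# spark.hadoop.* settings are forwarded to the underlying Hadoop
# Configuration, so this application resolves paths against the local
# filesystem instead of hdfs://0.0.0.0:19000/.
spark = SparkSession \
    .builder \
    .appName("Socker streaming demo") \
    .master("local[*]") \
    .config("spark.hadoop.fs.defaultFS", "file:///") \
    .getOrCreate()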

Judging from the error, you appear to have it set to hdfs://0.0.0.0:19000/ rather than to some file:// URI path.
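Alternatively, you can sidestep the default filesystem entirely by giving the query an explicit checkpointLocation with a file:// URI instead of letting Spark create a temporary one. A sketch, reusing stream_df from your script; C:/tmp/spark-checkpoints is just an example directory that the Spark process must be able to write to:

# Explicit local checkpoint directory, so Spark does not try to create
# a temporary checkpoint under the cluster's default FS (HDFS here).
write_query = stream_df \
    .writeStream \
    .format("console") \
    .option("checkpointLocation", "file:///C:/tmp/spark-checkpoints") \
    .start()

write_query.awaitTermination()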
