PySpark does not pick up custom schema

I am testing this code.

from pyspark.sql.functions import input_file_name
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)

customSchema = StructType([ 
StructField("id", StringType(), True), 
StructField("date", StringType(), True), 
etc., etc., etc.
StructField("filename", StringType(), True)])

fullPath = "path_and_credentials_here"
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false', schema = customSchema, delimiter='|').load(fullPath).withColumn("filename",input_file_name())
df.show()

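My data is pipe-delimited, and the first row of the file contains some metadata that is also pipe-delimited. Oddly, the custom schema is ignored entirely: the metadata in the first row determines the schema instead of the custom schema I supplied, which is clearly wrong. This is the output I see.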

+------------------+----------+------------+---------+--------------------+
|               _c0|       _c1|         _c2|      _c3|            filename|
+------------------+----------+------------+---------+--------------------+
|                CP|  20190628|    22:41:58|   001586|   abfss://rawdat...|
|          asset_id|price_date|price_source|bid_value|   abfss://rawdat...|
|             2e58f|  20190628|         CPN|  108.375|   abfss://rawdat...|
|             2e58f|  20190628|         FNR|     null|   abfss://rawdat...|
etc., etc., etc.

How can I apply the custom schema?

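The problem you are seeing is because you are using the older (no longer maintained) CSV reader, com.databricks.spark.csv. See the disclaimer below the package title.

If you try the newer built-in reader, it works: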

In [33]: !cat /tmp/data.csv
CP|12|12:13
a|b|c
10|12|13
In [34]: spark.read.csv(fullPath, header='false', schema = customSchema, sep='|').show()
+----+---+-----+
|name|foo|  bar|
+----+---+-----+
|  CP| 12|12:13|
|   a|  b|    c|
|  10| 12|   13|
+----+---+-----+
