In PySpark (v1.6.2), when converting an RDD to a DataFrame with a specified schema, a field whose value type does not match the type declared in the schema is converted to null.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, DoubleType

sc = SparkContext()
sqlContext = SQLContext(sc)

# Declare "foo" as a double, but feed it a Python int.
schema = StructType([
    StructField("foo", DoubleType(), nullable=False)
])
rdd = sc.parallelize([{"foo": 1}])
df = sqlContext.createDataFrame(rdd, schema=schema)
df.show()
+----+
| foo|
+----+
|null|
+----+
Is this a PySpark bug, or just very surprising but intentional behavior? I expected either a TypeError to be raised, or the int to be converted to a float compatible with DoubleType.
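For contrast, a minimal check (reusing the sc, sqlContext, and schema defined above) shows that a value which already matches the declared type survives the conversion:

# The value is already a Python float, so it matches DoubleType
# and is preserved instead of being replaced with null.
rdd_ok = sc.parallelize([{"foo": 1.0}])
sqlContext.createDataFrame(rdd_ok, schema=schema).show()
# +---+
# |foo|
# +---+
# |1.0|
# +---+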
This is expected behavior. In particular, see the comment on the relevant part of the source:
// all other unexpected type should be null, or we will have runtime exception
// TODO(davies): we could improve this by try to cast the object to expected type
case (c, _) => null
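In practice, the usual workaround is to perform the cast yourself before applying the schema. A minimal sketch against the snippet above (rdd_cast is a hypothetical name):

# Coerce each value to the type the schema declares up front,
# since Spark 1.6 does not attempt this cast for you (see the TODO above).
rdd_cast = rdd.map(lambda row: {"foo": float(row["foo"])})
sqlContext.createDataFrame(rdd_cast, schema=schema).show()
# +---+
# |foo|
# +---+
# |1.0|
# +---+

Alternatively, you can let Spark infer the schema and convert the column afterwards with Column.cast, which does perform the conversion rather than nulling the value.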