When I read an Excel file, it has a column like this:
Col1
----
aaa
123
true
235
321
23.23
xxx
I need to determine how many data types occur in this column. When the data is large, the processing time is also large. Is there any option in pyspark?

Regards, 灰
Spark has no built-in function that returns a value's data type, so you can implement a UDF that returns it. You could use a regexp inside the UDF; the option below parses the value instead:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()

def get_data_type(val):
    # Try to parse as a number first; fall back to boolean/string checks.
    try:
        num = float(val)
    except ValueError:
        if val.lower() in ("yes", "no", "true", "false"):
            return "boolean"
        return "string"
    # Whole-valued numbers count as int, everything else as float.
    return "int" if num.is_integer() else "float"

get_data_type_udf = F.udf(get_data_type, T.StringType())

df = spark.createDataFrame(['aaa', '123', 'true', '235', '321', '23.23'],
                           T.StringType()).toDF("col1")
df = df.select(get_data_type_udf(F.col("col1")).alias("data_type")) \
       .groupBy("data_type").count()
df.show()
Result:
+---------+-----+
|data_type|count|
+---------+-----+
| int| 3|
| boolean| 1|
| string| 1|
| float| 1|
+---------+-----+
You have to load the data into a string column first; then you can play around a bit, applying some logic to identify the types. Below is an example of how to start by distinguishing text from numeric values. From there it should be easy to tell whether a number is a float or an integer, and so on.
from pyspark.sql import functions as F

df = spark.createDataFrame([
    (1, "a"),
    (2, "123"),
    (3, "22.12"),
    (4, "c"),
    (5, "True")
], ("ID", "mixed_data"))

casted = df.select(
    "ID", "mixed_data",
    F.when(F.col("mixed_data").cast('float').isNull(), "text")
     .otherwise("some kind of number").alias("guessed_type")
)
casted.show()
+---+----------+-------------------+
| ID|mixed_data| guessed_type|
+---+----------+-------------------+
| 1| a| text|
| 2| 123|some kind of number|
| 3| 22.12|some kind of number|
| 4| c| text|
| 5| True| text|
+---+----------+-------------------+