How to cast a large number of columns from long to integer type in PySpark

I have a DataFrame with roughly 50+ columns, all of type "long". I want to bulk-convert 40 of them to "integer".

Do I have to keep repeating the following?


from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df = df \
    .withColumn('colA', col('colA').cast(IntegerType())) \
    .withColumn('colB', col('colB').cast(IntegerType())) \
    .withColumn('colC', col('colC').cast(IntegerType())) \
    ....

The above feels very manual to me. I'm new to PySpark, so I'm not sure whether I can put all the columns in a list and call cast just once (the way I would in plain Python).

Any help is much appreciated!

Use the following (if you want to cast all columns at once):

from pyspark.sql.functions import col
df.select(*(col(c).cast("integer").alias(c) for c in df.columns))
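
To convert only some of the columns (as in the question, 40 out of about 50) and pass the rest through unchanged, the same single-projection idea can be driven by a list. The sketch below is one possible way to do that by building SQL expression strings for `df.selectExpr`. The helper name `build_cast_exprs` and the sample column names are hypothetical, and it assumes column names contain no backticks.

```python
def build_cast_exprs(all_columns, cols_to_cast, target_type="int"):
    """Return one SQL expression string per column, preserving column order.

    Columns named in cols_to_cast are wrapped in CAST(... AS target_type);
    every other column is passed through unchanged.
    """
    to_cast = set(cols_to_cast)
    exprs = []
    for name in all_columns:
        quoted = f"`{name}`"  # backticks protect names with spaces or dots
        if name in to_cast:
            exprs.append(f"CAST({quoted} AS {target_type}) AS {quoted}")
        else:
            exprs.append(quoted)
    return exprs


# Hypothetical column names, for illustration only.
all_columns = ["colA", "colB", "colC", "id"]
exprs = build_cast_exprs(all_columns, ["colA", "colB"])

# With a real DataFrame this would be applied as:
#     df = df.selectExpr(*build_cast_exprs(df.columns, my_cols))
```

Because everything happens in one projection, the column order of the original DataFrame is kept and no chain of `withColumn` calls is needed.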

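In this case I would probably use reduce: in Python 3 it is a thin wrapper around a C implementation, so it is fast. One caveat: if you cast columns selectively, this may be more expensive to compute, because it will scan the whole DataFrame before casting the specified columns. You can explore this by running .explain().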

from functools import reduce

out = reduce(
    lambda df, c: df.withColumn(c, df[c].cast('integer')),
    df.columns,
    df
)#.explain()

from pyspark.sql import functions as F

my_cols = ['col1', 'col2']
for c in my_cols:
    df = df.withColumn(c, F.col(c).cast('integer'))
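
When the columns to convert are simply every column currently typed as long, a list like `my_cols` does not have to be typed by hand. A minimal sketch, assuming the input has the shape of `df.dtypes` (a list of `(name, type)` pairs, in which Spark reports long columns as `'bigint'`). The helper name `long_columns` and the sample data are hypothetical.

```python
def long_columns(dtypes, exclude=()):
    """Pick the names of all long ('bigint') columns from df.dtypes-style pairs."""
    skip = set(exclude)
    return [name for name, dtype in dtypes if dtype == "bigint" and name not in skip]


# Stand-in for df.dtypes; hypothetical columns for illustration only.
sample_dtypes = [
    ("id", "bigint"),
    ("colA", "bigint"),
    ("name", "string"),
    ("colB", "bigint"),
]
my_cols = long_columns(sample_dtypes, exclude=["id"])

# With a real DataFrame: my_cols = long_columns(df.dtypes, exclude=["id"])
```

The `exclude` argument is there for long columns that should stay long, such as identifiers whose values may not fit in a 32-bit integer.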
