pyspark: generalizing the "for" statement inside a lambda
I am computing several features of a single column ("text") in a DataFrame, e.g. the number of numeric characters, the number of non-alphanumeric characters, etc.

Currently I have:

from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, IntegerType

def query_features(df):
    my_fx = sf.udf(
        lambda x: [
            sum(c.isdigit() for c in x),
            sum(not c.isalnum() and c != " " for c in x),
        ],
        ArrayType(IntegerType()),
    )
    df = (df.withColumn("numeric", my_fx("text")[0])
            .withColumn("non_numeric", my_fx("text")[1]))
    return df

Since I want to iterate over the characters several times to compute different features, is it possible to generalize the "for" statement (`for c in x`) inside the lambda function? Or is this already a reasonable solution?

If you just want to run a for loop, use a conventional `for` construct inside a named function (not a lambda; your current use case does not require one):

from pyspark.sql import functions as F, types as T

@F.udf(T.ArrayType(T.IntegerType()))
def my_udf(input_col):
    isdigit = 0
    isalnum = 0
    for c in input_col:
        isdigit += c.isdigit()
        isalnum += not c.isalnum() and c != " "
    return [isdigit, isalnum]
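If the goal is to count many character-level features in a single pass, the loop body itself can be generalized by driving it from a list of per-character predicates. This is a plain-Python sketch of that idea; the `count_features` helper and the `predicates` list are hypothetical names, not part of the PySpark API:

```python
def count_features(text, predicates):
    # For each predicate, count how many characters of `text` satisfy it,
    # walking the string only once.
    counts = [0] * len(predicates)
    for c in text:
        for i, pred in enumerate(predicates):
            counts[i] += pred(c)  # bool counts as 0 or 1
    return counts

# The same two features as in the UDF above: digits, and
# non-alphanumeric characters other than the space.
predicates = [
    str.isdigit,
    lambda c: not c.isalnum() and c != " ",
]

print(count_features("abc 123!", predicates))  # → [3, 1]
```

The helper can then be wrapped as a Spark UDF in the usual way, e.g. `F.udf(lambda s: count_features(s, predicates), T.ArrayType(T.IntegerType()))`, and adding a new feature becomes a one-line change to the predicate list rather than another pass over the column.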
