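I am computing several features of a single column ("text") in a DataFrame, namely the number of digit characters, the number of alphanumeric characters…
Currently I have: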
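# Assumed imports: the snippet uses the alias `sf` and the bare type names.
from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, IntegerType

def query_features(df):
    # One UDF returns both counts: [digits, non-alphanumeric non-space characters]
    my_fx = sf.udf(
        lambda x: [sum(c.isdigit() for c in x),
                   sum(not c.isalnum() and c != " " for c in x)],
        ArrayType(IntegerType()))
    # Parentheses allow the chained call to span two lines
    df = (df.withColumn("numeric", my_fx("text")[0])
            .withColumn("non_numeric", my_fx("text")[1]))
    return df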
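Since I want to iterate over the characters several times to compute different features, is it possible to generalize the "for" statement (for c in x) inside the lambda function? Or is this already an ideal solution?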
If you only want a single for loop, use a traditional for construct inside a defined function (not a lambda; your current use case does not need one):
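from pyspark.sql import functions as F, types as T

@F.udf(T.ArrayType(T.IntegerType()))
def my_udf(input_col):
    isdigit = 0    # number of digit characters
    non_alnum = 0  # characters that are neither alphanumeric nor a space
    for c in input_col:
        # bool is a subclass of int, so True adds 1 and False adds 0
        isdigit += c.isdigit()
        # compare strings with != rather than "is not", which tests identity
        non_alnum += (not c.isalnum() and c != " ")
    return [isdigit, non_alnum]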
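
If more features get added later, the single loop above could be generalized by keeping the per-character tests in a table. The following is only a sketch in plain Python: the names `FEATURES` and `count_features` and the extra "upper" feature are hypothetical and not part of the original question, and the `None` check assumes that null column values should yield zero counts.

```python
# Hypothetical feature table: feature name -> test applied to each character.
# "upper" is an illustrative extra feature, not one from the question.
FEATURES = {
    "numeric": str.isdigit,
    "non_alnum": lambda c: not c.isalnum() and c != " ",
    "upper": str.isupper,
}


def count_features(text):
    """Return one count per entry in FEATURES, scanning the text once."""
    counts = [0] * len(FEATURES)
    if text is None:  # assumption: a null value produces all-zero counts
        return counts
    predicates = list(FEATURES.values())
    for c in text:
        for i, pred in enumerate(predicates):
            # bool(...) is 0 or 1, so the counts stay plain integers
            counts[i] += bool(pred(c))
    return counts
```

Because the function has no Spark dependency, it can be checked locally before wrapping it, for example with F.udf(count_features, T.ArrayType(T.IntegerType())), and each feature can then be selected by its index in the returned array, as in the original withColumn calls.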