Behavior of the 'Column' object inside a Spark function



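I am writing code that replaces every character matching the pattern [^\w | ] with ''. The problem is that when I use the DataFrame 'sentenceDF' inside my function 'removePunctuation', I get the error "'Column' object is not callable".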

    from pyspark.sql.functions import regexp_replace, trim, col, lower

    def removePunctuation(column):
        cleanString = column
        cleanString = cleanString.select(regexp_replace(sentenceDF['sentence'],'[^\w | ]','').alias('sentence'))
        cleanString = cleanString.select(regexp_replace(cleanString['sentence'],'_','').alias('sentence'))
        cleanString = cleanString.select(lower(cleanString['sentence']))
        return cleanString

    sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                             (' No under_score!',),
                                             (' *      Remove punctuation then spaces  * ',)], ['sentence'])
    result = sentenceDF.select(removePunctuation(col('sentence')))
    result.show()

Traceback:

    TypeError: 'Column' object is not callable
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-50-aa978fac8bae> in <module>()
         15 (' * Remove punctuation then spaces * ',)], ['sentence'])
         16
    ---> 17 result = sentenceDF.select(removePunctuation(col('sentence')))
         18 result.show()

    <ipython-input-50-aa978fac8bae> in removePunctuation(column)
          4 def removePunctuation(column):
          5     cleanString = column
    ----> 6     cleanString = cleanString.select(regexp_replace(sentenceDF['sentence'],'[^\w | ]','').alias('sentence'))
          7     cleanString = cleanString.select(regexp_replace(cleanString['sentence'],'_','').alias('sentence'))
          8     cleanString = cleanString.select(lower(cleanString['sentence']))

    TypeError: 'Column' object is not callable
    Command took 0.09 seconds -- by andres.velez.e@gmail.com at 10/30/2016, 2:48:17 PM on My Cluster (6 GB)

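Just run this and you will get the same error:

    col('sentence').select()

select is a DataFrame method, not a Column method. Inside removePunctuation, cleanString is the Column you passed in, so calling cleanString.select(...) fails.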
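
As for why the message says "not callable" rather than "has no attribute 'select'": as far as I know, PySpark's Column defines __getattr__ so that an unknown attribute name is treated as a nested field and comes back as another Column. The expression col('sentence').select is therefore itself a Column, and the parentheses then try to call it. The sketch below mimics that behavior in plain Python. FakeColumn and try_select are hypothetical names made up for this illustration and are not part of PySpark, so it runs without a cluster.

```python
class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column; illustration only."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Python calls this only when normal attribute lookup fails.
        # Like Column, treat the unknown name as a nested field and
        # return a new column object instead of raising AttributeError.
        if item.startswith("__"):
            raise AttributeError(item)
        return FakeColumn(f"{self.name}.{item}")


def try_select(column):
    """Call column.select() and return the resulting error message, if any."""
    try:
        column.select()
    except TypeError as exc:
        return str(exc)
    return None


message = try_select(FakeColumn("sentence"))
print(message)
```

The attribute lookup succeeds and the call is what fails, which is why the traceback points at the line containing .select(...) but complains about calling an object.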

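A suggestion: get the code working inline first, and only then refactor it into a function.

In any case, I think this is what you want. The function takes the DataFrame and the name of the column to clean: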
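    from pyspark.sql.functions import regexp_replace, trim, col, lower

    def removePunctuation(df, column):
        # Lower-case and trim the column first.
        cleanString = df.select(trim(lower(col(column))).alias(column))
        # Then drop non-word characters, whitespace, and underscores.
        cleanString = cleanString.select(regexp_replace(column, r'[^\w]|\s+|_', '').alias(column))
        return cleanString

    result = removePunctuation(sentenceDF, 'sentence')
    result.show()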
+--------------------+
|            sentence|
+--------------------+
|               hiyou|
|        nounderscore|
|removepunctuation...|
+--------------------+
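
To check what the pattern does without starting Spark, the same three sentences can be run through Python's re module. This is only an approximation: Spark's regexp_replace uses Java regular expressions, and the assumption here is that \w and \s behave the same way for these ASCII samples. The helper name clean is made up for this sketch.

```python
import re

# Same pattern as in the answer: non-word characters,
# runs of whitespace, and underscores.
PATTERN = re.compile(r"[^\w]|\s+|_")


def clean(sentence):
    """Approximate trim(lower(...)) followed by regexp_replace(..., PATTERN, '')."""
    return PATTERN.sub("", sentence.strip().lower())


samples = [
    "Hi, you!",
    " No under_score!",
    " *      Remove punctuation then spaces  * ",
]
cleaned = [clean(s) for s in samples]
print(cleaned)
```

The last cleaned value is longer than 20 characters. show() truncates such cells by default, which is why the third row above ends in "..."; result.show(truncate=False) prints the full string.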
