nltk wordnet lemmatization with POS tag on pyspark dataframe



I am working with text data in a pyspark dataframe. So far I have managed to tokenize the data into an array column and produce the table below:

print(df.schema)
StructType(List(StructField(_c0,IntegerType,true),StructField(pageid,IntegerType,true),StructField(title,StringType,true),StructField(text,ArrayType(StringType,true),true)))
df.show(5)
+---+------+-------------------+--------------------+
|_c0|pageid|              title|                text|
+---+------+-------------------+--------------------+
|  0|137277|    Sutton, Vermont|[sutton, is, town...|
|  1|137278|    Walden, Vermont|[walden, is, town...|
|  2|137279| Waterford, Vermont|[waterford, is, t...|
|  3|137280|West Burke, Vermont|[west, burke, is,...|
|  4|137281|  Wheelock, Vermont|[wheelock, is, to...|
+---+------+-------------------+--------------------+
only showing top 5 rows
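For context, a minimal sketch of how such a tokenized column can be produced (the question does not show this step, so the RegexTokenizer choice and the `text_raw` column name are assumptions):

from pyspark.ml.feature import RegexTokenizer

# hypothetical: assumes the raw article text lives in a column named "text_raw"
tokenizer = RegexTokenizer(inputCol="text_raw", outputCol="text", pattern="\\W+")
df = tokenizer.transform(df).drop("text_raw")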

Then I tried to lemmatize it using udf functions:


from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """
    Map a Treebank POS tag to the WordNet POS tags (a, n, r, v)
    expected by WordNet lemmatization.
    """
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # the default pos in lemmatization is noun
        return wordnet.NOUN

def postagger(p):
    import nltk
    return list(nltk.pos_tag(p))

sparkPosTagger = udf(lambda z: postagger(z), ArrayType(StringType()))

def lemmer(postags):
    import nltk
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, get_wordnet_pos(pos_tag))
            for word, pos_tag in nltk.pos_tag(postags)]

sparkLemmer = udf(lambda z: lemmer(z), ArrayType(StringType()))

#df = df.select('_c0','pageid','title','text', sparkPosTagger("text").alias('lemm'))
df = df.select('_c0','pageid','title','text', sparkLemmer("text").alias('lems'))

which returns this error:

PicklingError: args[0] from __newobj__ args has the wrong class

I believe the error mainly stems from an incompatibility with the object that nltk.pos_tag(postags) generates. Normally, when given a list of tokens, nltk.pos_tag() produces a list of tuples.
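For illustration, this is roughly what nltk.pos_tag() returns when run locally (the exact tags are indicative; the tagger model must be downloaded first):

import nltk
# nltk.download('averaged_perceptron_tagger')  # one-time download

print(nltk.pos_tag(["sutton", "is", "a", "town"]))
# [('sutton', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('town', 'NN')]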

However, I am stuck on finding a workaround. As you can see from the code, I tried splitting the process up by pos_tagging separately beforehand, only to receive the same error.

Is there a way to do this?

Contrary to my suspicion, the problem was actually caused by the initial function:

def get_wordnet_pos(treebank_tag):
    """
    Map a Treebank POS tag to the WordNet POS tags (a, n, r, v)
    expected by WordNet lemmatization.
    """
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # the default pos in lemmatization is noun
        return wordnet.NOUN

This works fine in regular Python. In pyspark, however, importing nltk is troublesome, so referencing wordnet inside the function is what breaks. Someone else hit a similar issue when trying to import stopwords:

pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python

While I haven't resolved the root cause, I reworked code I had seen online into a practical workaround that removes the references to WordNet (which were unnecessary anyway):

def get_wordnet_pos(treebank_tag):
    """
    Map a Treebank POS tag to the WordNet POS letters (a, n, r, v)
    expected by WordNet lemmatization.
    """
    if treebank_tag.startswith('J'):
        return 'a'
    elif treebank_tag.startswith('V'):
        return 'v'
    elif treebank_tag.startswith('N'):
        return 'n'
    elif treebank_tag.startswith('R'):
        return 'r'
    else:
        # the default pos in lemmatization is noun
        return 'n'
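A quick sanity check of the mapping (tags chosen arbitrarily for illustration):

print(get_wordnet_pos('JJ'))   # 'a' (adjective)
print(get_wordnet_pos('VBD'))  # 'v' (verb)
print(get_wordnet_pos('FW'))   # 'n' (anything unmatched falls back to noun)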

def lemmatize1(data_str):
    # expects a list of tokens; for a raw string, split it first
    import nltk
    from nltk.stem import WordNetLemmatizer
    list_pos = 0
    cleaned_str = ''
    lmtzr = WordNetLemmatizer()
    #text = data_str.split()
    tagged_words = nltk.pos_tag(data_str)
    # rebuild the text one lemma at a time, separated by spaces
    for word in tagged_words:
        lemma = lmtzr.lemmatize(word[0], get_wordnet_pos(word[1]))
        if list_pos == 0:
            cleaned_str = lemma
        else:
            cleaned_str = cleaned_str + ' ' + lemma
        list_pos += 1
    return cleaned_str

sparkLemmer1 = udf(lambda x: lemmatize1(x), StringType())
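Applied to the dataframe from the question, usage would look something like this (column names taken from the question; illustrative):

df = df.select('_c0', 'pageid', 'title', 'text',
               sparkLemmer1("text").alias('lems'))
df.show(5)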

Salim Khan's answer is great! I would just add that it is better to have the lemmatization output in array format:

sparkLemmer1 = udf(lambda x: lemmatize1(x), ArrayType(StringType()))

instead of:

sparkLemmer1 = udf(lambda x: lemmatize1(x), StringType())

so that you can, for example, build ngrams and do further preprocessing in pySpark.
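Note that for the ArrayType return type to be valid, lemmatize1 must also return a list of lemmas rather than a space-joined string. A minimal sketch of that variant (my adaptation, not part of the original answer):

def lemmatize2(tokens):
    import nltk
    from nltk.stem import WordNetLemmatizer
    lmtzr = WordNetLemmatizer()
    # return a list so it matches the ArrayType(StringType()) return type
    return [lmtzr.lemmatize(word, get_wordnet_pos(tag))
            for word, tag in nltk.pos_tag(tokens)]

sparkLemmer2 = udf(lambda x: lemmatize2(x), ArrayType(StringType()))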
