Pyspark - Counting specific words in sentences



I have a PySpark dataframe with a column that contains text content.

I am trying to count the number of sentences that contain an exclamation mark '!' together with the word "like" or "want".

For example, one row of the column contains the following sentences:

I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food. 
you don't want to!
what does he want?

The desired output I am hoping for looks like this (counting only sentences that contain "like" or "want" and "!"):

+----+-----+
|word|count|
+----+-----+
|like|   2 |
|want|   2 |
+----+-----+

Can someone help me write a UDF that does this? Here is what I have so far, but I can't seem to get it to work.

from nltk.tokenize import sent_tokenize

def convert_a_sentence(a_string):
    # Lowercase each sentence returned by the tokenizer
    return [s.lower() for s in sent_tokenize(a_string)]

df = df.withColumn('a_sentence', convert_a_sentence(df['text']))
df.select(explode('a_sentence').alias('found')).filter(df['a_sentence'].isin('like', 'want', '!')).groupBy('found').count().collect()

If all you want are uni-grams (i.e. single tokens), you can split the sentence on spaces, then explode, group by, count, and finally filter down to the words you care about:

from pyspark.sql import functions as F

(df
    .withColumn('words', F.split('sentence', ' '))
    .withColumn('word', F.explode('words'))
    .groupBy('word')
    .agg(F.count('*').alias('word_cnt'))
    .where(F.col('word').isin(['like', 'want']))
    .show()
)
# Output
# +----+--------+
# |word|word_cnt|
# +----+--------+
# |want|       2|
# |like|       3|
# +----+--------+

Note #1: You can apply the filter before the groupBy, using the contains function.
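
For instance, a minimal sketch of that sentence-level filter (assuming, as above, that the text column is named sentence), which produces the counts asked for in the question:

from pyspark.sql import functions as F

(df
    .filter(F.col('sentence').contains('!'))            # keep only '!'-sentences first
    .withColumn('word', F.explode(F.split('sentence', ' ')))
    .groupBy('word')
    .agg(F.count('*').alias('count'))
    .where(F.col('word').isin(['like', 'want']))
    .show()
)
# Expected: like -> 2, want -> 2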

Note #2: If you want n-grams instead of "hacking" it like the above, you can consider using the Spark ML package with Tokenizer:

from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol='sentence', outputCol="words")
tokenized = tokenizer.transform(df)
# Output
# +----------------------+----------------------------+
# |sentence              |words                       |
# +----------------------+----------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |
# |I like to go shopping!|[i, like, to, go, shopping!]|
# |I want to go home!    |[i, want, to, go, home!]    |
# |I like fast food.     |[i, like, fast, food.]      |
# |you don't want to!    |[you, don't, want, to!]     |
# |what does he want?    |[what, does, he, want?]     |
# +----------------------+----------------------------+

Or NGram:

from pyspark.ml.feature import NGram
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramed = ngram.transform(tokenized)
# Output
# +----------------------+----------------------------+----------------------------------------+
# |sentence              |words                       |ngrams                                  |
# +----------------------+----------------------------+----------------------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |[i don't, don't like, like to, to sing!]|
# |I like to go shopping!|[i, like, to, go, shopping!]|[i like, like to, to go, go shopping!]  |
# |I want to go home!    |[i, want, to, go, home!]    |[i want, want to, to go, go home!]      |
# |I like fast food.     |[i, like, fast, food.]      |[i like, like fast, fast food.]         |
# |you don't want to!    |[you, don't, want, to!]     |[you don't, don't want, want to!]       |
# |what does he want?    |[what, does, he, want?]     |[what does, does he, he want?]          |
# +----------------------+----------------------------+----------------------------------------+
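
From there (a sketch, not part of the original answer), you could explode the ngrams column and filter for the bigrams you care about:

from pyspark.sql import functions as F

# Count occurrences of specific bigrams in the exploded ngrams column
(ngramed
    .withColumn('ngram', F.explode('ngrams'))
    .where(F.col('ngram').isin(['like to', 'want to!']))  # hypothetical target bigrams
    .groupBy('ngram')
    .count()
    .show()
)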

I'm not sure whether you are using the pandas or the pyspark approach, but with nltk's sent_tokenize function this can be done easily:

from nltk.tokenize import sent_tokenize
t = """
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food. 
you don't want to!
what does he want?
"""
sentences = [s.lower() for s in sent_tokenize(t)]
for sentence in sentences:
    if "!" in sentence and "like" in sentence:
        print(f"found in {sentence}")

From there, you should be able to figure out how to count the matches and put them into a table…
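
For completeness, a minimal sketch of that counting step using collections.Counter (the target words are taken from the question):

from collections import Counter
from nltk.tokenize import sent_tokenize

counts = Counter()
for sentence in (s.lower() for s in sent_tokenize(t)):
    if "!" in sentence:
        for word in ("like", "want"):  # target words from the question
            if word in sentence:
                counts[word] += 1

print(counts)  # Counter({'like': 2, 'want': 2})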
