I have a PySpark DataFrame where one column contains text. I want to count the number of sentences that contain an exclamation mark '!' together with the word "like" or "want".
For example, one row's column contains the following sentences:
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?
The desired output would look like this (counting only the sentences that contain "like" or "want" and "!"):
+----+-----+
|word|count|
+----+-----+
|like|    2|
|want|    2|
+----+-----+
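To pin down the counting rule, here is a plain-Python sketch of the same logic (no Spark; it assumes one sentence per line, which holds for the example above):

```python
text = """I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?"""

counts = {"like": 0, "want": 0}
for sentence in text.lower().splitlines():
    # only sentences ending with an exclamation mark qualify
    if "!" not in sentence:
        continue
    for word in counts:
        # whole-word match, so punctuation-glued tokens like "want?" don't count
        if word in sentence.split():
            counts[word] += 1

print(counts)  # {'like': 2, 'want': 2}
```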
Can anyone help me write a UDF that does this? This is what I have so far, but I can't get it to work:
from nltk.tokenize import sent_tokenize
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, StringType

def convert_a_sentence(a_string):
    return [s.lower() for s in sent_tokenize(a_string)]

sentence_udf = udf(convert_a_sentence, ArrayType(StringType()))
df = df.withColumn('a_sentence', sentence_udf(df['text']))
(df.select(explode('a_sentence').alias('found'))
 .filter(df['a_sentence'].isin('like', 'want', '!'))
 .groupBy('found').count().collect())
If what you want is uni-grams (i.e. single tokens), you can split the sentence on spaces, then explode, group by, count, and finally filter down to the words you want:
from pyspark.sql import functions as F

(df
 .withColumn('words', F.split('sentence', ' '))
 .withColumn('word', F.explode('words'))
 .groupBy('word')
 .agg(
     F.count('*').alias('word_cnt')
 )
 .where(F.col('word').isin(['like', 'want']))
 .show()
)
# Output
# +----+--------+
# |word|word_cnt|
# +----+--------+
# |want| 2|
# |like| 3|
# +----+--------+
Note #1: you can apply the filter before the groupBy, using the `contains` function.
Note #2: if you want n-grams instead of "hacking" it like above, you can consider using the Spark ML package and its `Tokenizer`:
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol='sentence', outputCol='words')
tokenized = tokenizer.transform(df)
# Output
# +----------------------+----------------------------+
# |sentence |words |
# +----------------------+----------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |
# |I like to go shopping!|[i, like, to, go, shopping!]|
# |I want to go home! |[i, want, to, go, home!] |
# |I like fast food. |[i, like, fast, food.] |
# |you don't want to! |[you, don't, want, to!] |
# |what does he want? |[what, does, he, want?] |
# +----------------------+----------------------------+
or `NGram`:
from pyspark.ml.feature import NGram
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramed = ngram.transform(tokenized)
# Output
# +----------------------+----------------------------+----------------------------------------+
# |sentence              |words                       |ngrams                                  |
# +----------------------+----------------------------+----------------------------------------+
# |I don't like to sing! |[i, don't, like, to, sing!] |[i don't, don't like, like to, to sing!]|
# |I like to go shopping!|[i, like, to, go, shopping!]|[i like, like to, to go, go shopping!] |
# |I want to go home! |[i, want, to, go, home!] |[i want, want to, to go, go home!] |
# |I like fast food. |[i, like, fast, food.] |[i like, like fast, fast food.] |
# |you don't want to! |[you, don't, want, to!] |[you don't, don't want, want to!] |
# |what does he want? |[what, does, he, want?] |[what does, does he, he want?] |
# +----------------------+----------------------------+----------------------------------------+
I'm not sure whether you want a pandas or a PySpark approach, but you can do this quite easily with nltk's `sent_tokenize` function:
from nltk.tokenize import sent_tokenize
t = """
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?
"""
sentences = [s.lower() for s in sent_tokenize(t)]
for sentence in sentences:
    if "!" in sentence and "like" in sentence:
        print(f"found in {sentence}")
From there you should be able to figure out how to do the counting / put it into a table…
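For that counting step, here is one sketch; it swaps in a simple regex split for `sent_tokenize` so the snippet is self-contained, and uses whole-word matching via `split()` (an assumption) so that a token like "want?" is not miscounted:

```python
import re
from collections import Counter

t = """
I don't like to sing!
I like to go shopping!
I want to go home!
I like fast food.
you don't want to!
what does he want?
"""

# split after sentence-ending punctuation; stands in for nltk's sent_tokenize
sentences = [s.strip().lower() for s in re.split(r'(?<=[.!?])\s+', t.strip())]

counts = Counter(
    word
    for sentence in sentences
    if "!" in sentence                 # sentence must contain '!'
    for word in ("like", "want")
    if word in sentence.split()        # whole-word match only
)
print(dict(counts))  # {'like': 2, 'want': 2}
```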