How to expand common English contractions in text in a PySpark dataframe



I have a dataframe containing text. It has contracted words such as isn't, can't, etc. that need to be expanded.

For example:

I'd -> I would
I'd -> I had

Here is the dataframe:

temp = spark.createDataFrame([
(0, "Julia isn't awesome"),
(1, "I wish Java-DL couldn't use case-classes"),
(2, "Data-science wasn't my subject"),
(3, "Machine")
], ["id", "words"])
+---+----------------------------------------+
|id |words                                   |
+---+----------------------------------------+
|0  |Julia isn't awesome                     |
|1  |I wish Java-DL couldn't use case-classes|
|2  |Data-science wasn't my subject          |
|3  |Machine                                 |
+---+----------------------------------------+

I tried searching for a PySpark library for this but couldn't find one. How can I achieve this?

Expected output:

+---+-----------------------------------------+
|id |words                                    |
+---+-----------------------------------------+
|0  |Julia is not awesome                     |
|1  |I wish Java-DL could not use case-classes|
|2  |Data-science was not my subject          |
|3  |Machine                                  |
+---+-----------------------------------------+

There is probably no PySpark library that does this out of the box, but you can use any Python library. For example, with the pycontractions library you can write a function and apply it to the dataframe column via a UDF (Spark columns have no pandas-style apply()).

from pycontractions import Contractions
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Load your favorite word2vec model - needs to be downloaded first,
# see the link in the pycontractions docs
cont = Contractions('GoogleNews-vectors-negative300.bin')
# optional, prevents loading on the first expand_texts call
cont.load_models()

def expand_contractions(text):
    out = list(cont.expand_texts([text], precise=True))
    return out[0]

# wrap the function in a UDF and apply it to the column
expand_udf = udf(expand_contractions, StringType())
temp = temp.withColumn('expanded_words', expand_udf(temp['words']))
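If you don't want to download a word2vec model, a plain dictionary-plus-regex replacement covers the unambiguous cases in your example. This is a minimal sketch with a hypothetical, partial mapping (you would need to extend `CONTRACTIONS` for real data, and it cannot resolve ambiguous cases like "I'd"); the function can be wrapped in a UDF exactly like `expand_contractions` above:

```python
import re

# Illustrative, partial mapping - extend as needed for your data
CONTRACTIONS = {
    "isn't": "is not",
    "wasn't": "was not",
    "can't": "can not",
    "couldn't": "could not",
}

# one alternation pattern over all keys, matched case-insensitively
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_simple(text):
    # look up each matched contraction (lowercased) in the dictionary
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_simple("Julia isn't awesome"))  # Julia is not awesome
```

The trade-off versus pycontractions is that this runs with no model download and no heavy dependency, but it only expands contractions you have listed explicitly and picks a single expansion per contraction regardless of context.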
