I have a DataFrame containing text. Some words are contractions, such as isn't, can't, etc., and they need to be expanded.
For example:
I'd -> I would
I'd -> I had
Here is the DataFrame:
temp = spark.createDataFrame([
(0, "Julia isn't awesome"),
(1, "I wish Java-DL couldn't use case-classes"),
(2, "Data-science wasn't my subject"),
(3, "Machine")
], ["id", "words"])
+---+----------------------------------------+
|id |words |
+---+----------------------------------------+
|0 |Julia isn't awesome |
|1 |I wish Java-DL couldn't use case-classes|
|2 |Data-science wasn't my subject |
|3 |Machine |
+---+----------------------------------------+
I searched for a library in PySpark that does this, but didn't find one. How can I achieve this?
Expected output:
+---+-----------------------------------------+
|id |words |
+---+-----------------------------------------+
|0 |Julia is not awesome |
|1 |I wish Java-DL could not use case-classes|
|2 |Data-science was not my subject |
|3 |Machine |
+---+-----------------------------------------+
There may not be a PySpark library for this, but you can use any Python library. Here is one solution: with the pycontractions library, you can write a function and apply it to the DataFrame column.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pycontractions import Contractions

# Load your favorite word2vec model - needs to be downloaded first;
# see the link in the pycontractions documentation
cont = Contractions('GoogleNews-vectors-negative300.bin')
# Optional, prevents loading on the first expand_texts call
cont.load_models()

def expand_contractions(text):
    return list(cont.expand_texts([text], precise=True))[0]

# Spark columns have no apply(); wrap the function in a UDF instead
expand_udf = udf(expand_contractions, StringType())
temp = temp.withColumn('expanded_words', expand_udf(temp['words']))
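If you'd rather avoid downloading a large word2vec model, a minimal rule-based sketch can handle unambiguous contractions with a plain dictionary and a regex. Note the mapping below is a hypothetical, incomplete example I made up for illustration; ambiguous forms like "I'd" (would/had) can't be resolved this way, which is exactly what pycontractions uses the embedding model for.

```python
import re

# Hypothetical minimal mapping - extend as needed for your data
CONTRACTIONS = {
    "isn't": "is not",
    "wasn't": "was not",
    "couldn't": "could not",
    "can't": "cannot",
    "won't": "will not",
}

# One alternation over all known contractions, matched case-insensitively
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions_simple(text):
    """Replace each known contraction with its expansion."""
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

# To use it on the Spark DataFrame, wrap it in a UDF as above:
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# temp = temp.withColumn('expanded_words',
#                        udf(expand_contractions_simple, StringType())(temp['words']))
```

This keeps the heavy model dependency out of the Spark workers, at the cost of only handling contractions you list explicitly.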