How to compute TF-IDF per group of a DataFrame using PySpark



My question is similar to this one, but I am using PySpark, and that question offers no solution for it.

My dataframe df looks as follows, where id_2 denotes a document ID and id_1 denotes the corpus that document belongs to:

+------+-------+--------------------+
|  id_1|   id_2|              tokens|
+------+-------+--------------------+
|122720| 139936|[front, offic, op...|
|122720| 139935|[front, offic, op...|
|122720| 126854|[great, pitch, lo...|
|122720| 139934|[front, offic, op...|
|122720| 126895|[front, offic, op...|
|122726| 139943|[challeng, custom...|
|122726| 139944|[custom, servic, ...|
|122726| 139946|[empowerment, chapt...|
|122726| 139945|[problem, solv, c...|
|122726| 761272|[deliv, excel, gu...|
|122728| 131068|[assign, mytholog...|
|122728| 982610|[trim, compar,...|
|122779| 226646|[compar, face, to...|
|122963|1019657|[rock, tekno...|
|122964| 134344|[market, chapter,...|
|122964| 134343|[market, chapter,...|
|122965|1554436|[human, resourc, ...|
|122965|1109173|[solut, hrm...|
|122965|2328172|[right, set...|
|122965|1236259|[hrm, chapter, st...|
+------+-------+--------------------+

How can I compute TF-IDF for the documents of each corpus separately?

hashingTF = HashingTF(inputCol="tokens", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="tfidf")

tf = hashingTF.transform(df)
idfModel = idf.fit(tf)
tfidf = idfModel.transform(tf)

For this scenario, tf should work fine since it is computed per document, but an idf fitted this way takes every document in the dataframe into account, rather than only the documents belonging to a single corpus.
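To make the problem concrete, here is a minimal sketch of the smoothed IDF formula that Spark MLlib uses (the function name spark_idf is mine, for illustration only):

import math

def spark_idf(num_docs, doc_freq):
    # MLlib's smoothed IDF: log((m + 1) / (d(t) + 1)). A token that appears
    # in documents of OTHER corpora still raises doc_freq, which skews the
    # weight when the fit spans more than one corpus.
    return math.log((num_docs + 1) / (doc_freq + 1))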

I came up with the working but inefficient solution below. Any ideas on improving it further would be much appreciated.

from functools import reduce

from pyspark.sql import DataFrame
import pyspark.sql.functions as sf
from pyspark.ml.feature import HashingTF, IDF
from pyspark.ml import Pipeline

hashingTF = HashingTF(inputCol="tokens", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="tfidf")
pipeline = Pipeline(stages=[hashingTF, idf])

def compute_idf_in_group(df):
    # fit the TF-IDF pipeline on the documents of a single corpus only
    model = pipeline.fit(df)
    return model.transform(df)

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

resolved_groups = []
grouped_ids = [row.id_1 for row in df.select('id_1').distinct().collect()]
for group_id in grouped_ids:
    sub_df = df.filter(sf.col('id_1') == group_id)
    resolved_df = compute_idf_in_group(sub_df)
    resolved_groups.append(resolved_df)

final_df = unionAll(*resolved_groups)  # note the *: unionAll expects unpacked dataframes
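One direction I am considering (a sketch only, not benchmarked; it assumes raw token counts are acceptable in place of MLlib's hashed vectors): compute the term frequency and the per-corpus document frequency with plain DataFrame aggregations, so Spark processes all corpora in a single job instead of one pipeline fit per group.

import pyspark.sql.functions as sf

# one row per (corpus, document, token occurrence)
exploded = df.select('id_1', 'id_2', sf.explode('tokens').alias('token'))

# term frequency: occurrences of a token within a document
tf = exploded.groupBy('id_1', 'id_2', 'token').agg(sf.count('*').alias('tf'))

# number of documents per corpus
n_docs = df.groupBy('id_1').agg(sf.countDistinct('id_2').alias('n_docs'))

# document frequency of a token within its own corpus
doc_freq = (exploded.select('id_1', 'id_2', 'token').distinct()
            .groupBy('id_1', 'token').agg(sf.count('*').alias('df')))

tfidf = (tf.join(doc_freq, ['id_1', 'token'])
           .join(n_docs, 'id_1')
           .withColumn('idf', sf.log((sf.col('n_docs') + 1) / (sf.col('df') + 1)))
           .withColumn('tf_idf', sf.col('tf') * sf.col('idf')))

This yields one row per (id_1, id_2, token) with its tf_idf score, which could then be collected back into vectors per document if needed.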
