Trying to use the SparseVector output in pyspark to find the word or bigram with the largest tf-idf



I have been implementing the TF-IDF approach described here with pyspark.ml.feature in Python/PySpark. I have a set of 6 text documents, and the code below computes the tf-idf for each bigram, but since the output is a sparse vector I cannot find the bigram with the highest tf-idf in each book. In other words, what I want to do is find the maximum tf-idf value and then, from that value, the corresponding word or bigram. Any helpful suggestions?

from pyspark import SparkConf,SparkContext
from operator import add
from pyspark.sql import SparkSession
import re
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, NGram
from pyspark.sql.functions import *
conf = SparkConf()
conf.setAppName("wordCount")
conf.set("spark.executor.memory","1g")
def removePunctuation(text):
    # keep only lowercase letters and spaces
    return re.sub('[^a-z ]', '', text.strip().lower())
def wholeFile(x):
    # maps a (path, content) pair to (word, book-name) pairs; not used in the pipeline below
    name = x[0]
    name = name.split('_')[1].split('/')[2]
    words = re.sub('[^a-z0-9]+', ' ', x[1].lower()).split()
    return [(word, name) for word in list(words)]

sc=SparkContext(conf = conf)
text=sc.wholeTextFiles("/cosc6339_s17/books-shortlist/*")
# extract the book name from the file path and strip punctuation from the text
text = text.map(lambda x: (x[0].split('_')[1].split('/')[2], removePunctuation(x[1])))
spark = SparkSession(sc)
hasattr(text, "toDF")  # check that toDF is available on the RDD (requires the active SparkSession above)
wordDataFrame=text.toDF(["title","book"])
tokenizer = Tokenizer(inputCol="book", outputCol="words")
wordsData = tokenizer.transform(wordDataFrame)
ngram = NGram(n=2,inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordsData)
# NOTE: inputCol="words" hashes single words; use inputCol="ngrams" to get tf-idf per bigram
hashingTF = HashingTF(inputCol="words", outputCol="tf")
featurizedData = hashingTF.transform(ngramDataFrame)

idf = IDF(inputCol="tf", outputCol="idf")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

Part of my output looks like this:

 (u'30240', SparseVector(262144, {14: 0.3365, 509: 0.8473, 619: 0.5596, 1889: 
 0.8473, 2325: 0.1542, 2624: 0.8473, 2710: 0.5596, 2937: 1.2528, 3091: 1.2528, 
 3193: 1.2528, 3483: 1.2528, 3575: 1.2528, 3910: 1.2528, 3924: 0.6729, 4081: 
 0.6729, 4200: 0.0, 4378: 1.2528, 4774: 1.2528, 4783: 1.2528, 4868: 1.2528, 
 4869: 2.5055, 5213: 1.2528, 5232: 1.1192, 5381: 0.0, 5595: 0.8473, 5758: 
 1.2528, 5823: 1.2528, 6183: 5.5962, 6267: 1.2528, 6355: 0.8473, 6383: 1.2528, 
 6981: 0.3365, 7289: 1.2528, 8023: 1.2528, 8073: 0.8473, 8449: 0.0, 8733: 
 5.0111, 8804: 0.5596, 8854: 1.2528, 9001: 1.2528, 9129: 0.0, 9287: 1.2528, 
 9639: 0.0, 9988: 1.6946, 10409: 0.8473, 11104: 1.0094, 11501: 1.2528, 11951: 
 0.5596, 12247: 0.8473, 12312: 1.2528, 12399: 0.0, 12526: 1.2528, 12888: 
 1.2528, 12925: 0.8473, 13142: 0.6729, 

When you use the HashingTF transformer, your text input is hashed with a hash function. The problem with hashing is that the original input cannot be recovered from the hashed features.

It also suffers from potential hash collisions, where different original features can be mapped to the same term after hashing. See the Spark documentation.
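
As a toy illustration (hypothetical one-row data, not from the question, and reusing the SparkSession created above), HashingTF only yields index-to-value pairs; there is no reverse lookup from an index back to the original term:

from pyspark.ml.feature import HashingTF

# hypothetical one-row DataFrame, just to show what HashingTF produces
toy = spark.createDataFrame([(["spark", "hashes", "terms"],)], ["words"])
toyTF = HashingTF(inputCol="words", outputCol="tf", numFeatures=16)
toyTF.transform(toy).show(truncate=False)
# the result is a SparseVector keyed by hashed indices; two different words can collide on the same index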

So you are better off using a CountVectorizer instead of the hashing TF. The CountVectorizer counts term occurrences (term frequency) without hashing the terms. The original vocabulary is kept and can be extracted like this:

from pyspark.ml.feature import CountVectorizer

countVect = CountVectorizer(inputCol="words", outputCol="tf", minDF=2.0)
model = countVect.fit(wordsData)
result = model.transform(wordsData)
model.vocabulary

Then you can compute the IDF from the count vectors:

idf = IDF(inputCol="tf", outputCol="idf")
idfModel = idf.fit(result)
rescaledData = idfModel.transform(result)
rescaledData.select("title", "idf").show()

I am not sure whether this is the best way, but it works :) Convert the DataFrame to pandas, then take the tf-idf features of one row and combine them with the model's vocabulary:

rescaled_pd = rescaledData.toPandas()
rescaled_pd

Now select the top 100 terms by tf-idf value or count:

import pandas as pd

inputrow = rescaled_pd.iloc[0]  # pick one book (one row) from the pandas DataFrame
# pair each tf-idf value with the corresponding term from the CountVectorizer vocabulary
tf_idf_per_word = pd.DataFrame({'tf_idf': inputrow['idf'].toArray(),
                                'vocabulary': model.vocabulary}).sort_values('tf_idf', ascending=False)
tf_idf_per_word = tf_idf_per_word[tf_idf_per_word.tf_idf > 0.1]
tf_idf_per_word = tf_idf_per_word[0:100]
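
If you prefer to stay in Spark rather than pandas, here is a hedged sketch of the same idea, one row per book: take the argmax of each tf-idf vector and look it up in model.vocabulary. It assumes the rescaledData, model, and the column names "title" and "idf" from the snippets above; the helper top_term is mine. To rank bigrams rather than single words, the CountVectorizer would have to be fit on the "ngrams" column instead of "words".

import numpy as np
from pyspark.sql import Row

vocab = model.vocabulary  # plain Python list, shipped to the workers with the closure

def top_term(row):
    # argmax over the tf-idf vector of one book, then look the index up in the vocabulary
    values = row["idf"].toArray()
    i = int(np.argmax(values))
    return Row(title=row["title"], term=vocab[i], tf_idf=float(values[i]))

top_per_book = rescaledData.rdd.map(top_term).toDF()
top_per_book.show(truncate=False)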
