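Hi, I want to cluster movies based only on their titles. My function works very well on my data, but I have one big problem: my sample is large (150,000 movies) and the function is very slow. It actually takes 3 days to cluster all the movies.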
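Process:
Sort the movie titles by length
Transform the movies with a CountVectorizer and compute the similarity for each movie (for every clustered movie I fit the vectorizer each time and then transform the target movie)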
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Cache of fitted vectorizers, keyed by the cleaned clustered-movie title
fitted_vectorizer = {}

def product_similarity(clustered_movie, target_movie):
    '''
    Calculates the title similarity of 2 movies based on their titles
    '''
    # Replace digits, underscores and non-word characters (except ') with spaces
    clustered_movie = re.sub(r"[0-9]|[^\w']|_", " ", clustered_movie)
    # fitted_vectorizer is a dictionary of already fitted movies; if we have
    # not fitted a vectorizer to this movie yet, fit one and save it
    if clustered_movie not in fitted_vectorizer:
        vectorizer = CountVectorizer(stop_words=None)
        fitted_vectorizer[clustered_movie] = vectorizer.fit([clustered_movie])
    vectorizer = fitted_vectorizer[clustered_movie]
    a = vectorizer.transform([clustered_movie]).toarray()
    b = vectorizer.transform([target_movie]).toarray()
    similarity = cosine_similarity(a, b)
    return similarity[0][0]
Fit the CountVectorizer once on all titles. Save the model. Then use the fitted model to transform.
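A minimal sketch of what that could look like, assuming the titles fit in memory as a list of strings. The helper names `clean_title`, `fit_title_vectorizer` and `similarity_to_all` are made up for illustration, and the cleaning regex is my reading of the one in the question. Note that with one shared vocabulary the cosine scores will not be identical to the ones from the per-title vectorizers, so any similarity threshold may need retuning.

```python
import pickle
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def clean_title(title):
    # Assumed cleaning step: digits, underscores and non-word characters
    # (except apostrophes) are replaced with spaces.
    return re.sub(r"[0-9]|[^\w']|_", " ", title)


def fit_title_vectorizer(titles):
    # Fit ONE vectorizer on every title and keep the sparse count matrix,
    # instead of fitting a new vectorizer per clustered movie.
    vectorizer = CountVectorizer(stop_words=None)
    matrix = vectorizer.fit_transform([clean_title(t) for t in titles])
    return vectorizer, matrix


def similarity_to_all(matrix, index):
    # Cosine similarity of one title (row `index`) against all titles.
    # The slice keeps the row two-dimensional and sparse.
    return cosine_similarity(matrix[index:index + 1], matrix).ravel()


titles = ["The Matrix", "The Matrix Reloaded", "Toy Story", "Toy Story 2"]
vectorizer, matrix = fit_title_vectorizer(titles)

# "Save the model": the fitted vectorizer can be pickled and reloaded later.
restored = pickle.loads(pickle.dumps(vectorizer))
new_row = restored.transform([clean_title("Matrix Revolutions")])

print(similarity_to_all(matrix, 0))
print(cosine_similarity(new_row, matrix).ravel())
```

The matrix stays sparse the whole time, so there is no `.toarray()` call per pair. For 150,000 titles the full pairwise result is too large to hold as a dense array, so it is probably safer to compare one row (or a batch of rows) against the matrix at a time, as `similarity_to_all` does.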