I have the following dataset:
test_set = ("The sun in the sky", "The sun in the light", "Do not blame it on moonlight", "Do not blame it on sunshine")
Now I create a tf-idf matrix with the following code:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit_transform(test_set)
smatrix = vectorizer.transform(test_set)
smatrix.todense()
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(smatrix)
tf_idf_matrix = tfidf.transform(smatrix)
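What I would like to do now is "feed" this matrix into a k-means clustering algorithm, for example like this:
import pandas as pd
from sklearn import cluster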
df = pd.DataFrame([[0.2, 0.3, 0.4], [0.2, 0.3, 0.41], [0.2, 0.1, 0.05], [0.1, 0.1, 0.08]], columns=('column1', 'column2', 'column3'))
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(df)
print(k_means.labels_)
However, I can't seem to convert the matrix into a DataFrame. If I do this:
df = pd.DataFrame(tf_idf_matrix)
I get:
Traceback (most recent call last):
File "/Users/marcvanderpeet/PycharmProjects/untitled/test.py", line 47, in <module>
df = pd.DataFrame(tf_idf_matrix)
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 345, in __init__
raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!
Any ideas on how to convert it?
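tf_idf_matrix has the type scipy.sparse.csr.csr_matrix. You can check this by typing type(tf_idf_matrix). According to the pandas documentation, the pd.DataFrame constructor accepts only a numpy ndarray (structured or homogeneous), a dict, or another DataFrame. To convert tf_idf_matrix to a numpy representation, you can do: tf_idf_matrix = tf_idf_matrix.todense(). This line converts the scipy.sparse.csr.csr_matrix into a numpy.matrixlib.defmatrix.matrix, which pd.DataFrame can handle. After that you can build df and pass it to the k_means.fit() method.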
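The conversion described above can be sketched end to end as follows. This is a minimal sketch, not the only way to do it: it uses toarray() instead of todense() (an assumption on my part, chosen because it returns a plain ndarray rather than a numpy matrix), and the column names, n_init and random_state values are illustrative choices that do not appear in the question.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

test_set = ("The sun in the sky", "The sun in the light",
            "Do not blame it on moonlight", "Do not blame it on sunshine")

# Same two-step tf-idf computation as in the question.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(test_set)
tf_idf_matrix = TfidfTransformer(norm="l2").fit_transform(counts)

# Recover column names from the fitted vocabulary (term -> column index).
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# toarray() densifies the sparse matrix into a plain ndarray,
# which the DataFrame constructor accepts.
df = pd.DataFrame(tf_idf_matrix.toarray(), columns=terms)

# random_state is fixed only to make repeated runs comparable.
k_means = KMeans(n_clusters=2, n_init=10, random_state=0)
k_means.fit(df)
print(k_means.labels_)
```

If the DataFrame is not needed for anything else, KMeans.fit() also accepts the scipy sparse matrix directly, which avoids densifying a large vocabulary.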
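Note: as of pandas 0.20 you can create a SparseDataFrame directly from a scipy sparse matrix (SparseDataFrame was later removed in pandas 1.0, where pd.DataFrame.sparse.from_spmatrix serves the same purpose):
from scipy.sparse import csr_matrix
sp_arr = csr_matrix(arr)  # arr: any 2-D numpy array
sdf = pd.SparseDataFrame(sp_arr)  # pandas 0.20 to 0.25 only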
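On pandas 1.0 and later, where SparseDataFrame is no longer available, the same idea can be sketched with pd.DataFrame.sparse.from_spmatrix. The small arr below is made-up illustration data, not something from the question.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical example data: a small, mostly-zero 2-D array.
arr = np.array([[0.0, 0.5, 0.0],
                [0.3, 0.0, 0.0]])
sp_arr = csr_matrix(arr)

# Build a DataFrame whose columns are backed by sparse arrays.
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
print(sdf.dtypes)

# Convert back to an ordinary dense DataFrame when needed.
dense = sdf.sparse.to_dense()
print(dense)
```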
We can also use a sklearn pipeline:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.cluster import KMeans
test_set = ["The sun in the sky", "The sun in the light", "Do not blame it on moonlight", "Do not blame it on sunshine"]
df = pd.DataFrame(test_set, columns=['sent'])
print(df)
sent
0 The sun in the sky
1 The sun in the light
2 Do not blame it on moonlight
3 Do not blame it on sunshine
model = Pipeline([('vectorizer',CountVectorizer()), ('tf_trans',TfidfTransformer()),('k_means', KMeans(n_clusters=2))])
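# Now we can pass the text column directly to the model.
# (Passing the whole DataFrame would make CountVectorizer iterate over column names.)
model.fit(df['sent'])
# To assign a new comment to a cluster, just pass it to predict
print(model.predict(['enjoy sunshine ']))
# Output: [0] (the cluster index depends on the random initialization, so it may be 0 or 1)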
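As a variation on the pipeline above, the sketch below uses TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in a single step, and fixes random_state so that repeated runs are comparable. The step names, n_init and random_state values are illustrative assumptions rather than part of the original answer.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

test_set = ["The sun in the sky", "The sun in the light",
            "Do not blame it on moonlight", "Do not blame it on sunshine"]

model = Pipeline([
    # Counts and tf-idf weighting in one step.
    ("tfidf", TfidfVectorizer()),
    # random_state is fixed only to make runs comparable.
    ("k_means", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# The pipeline takes the raw strings; no DataFrame conversion is needed.
model.fit(test_set)

# Cluster label assigned to each training sentence.
labels = model.named_steps["k_means"].labels_
print(labels)

# Cluster assignment for an unseen sentence.
pred = model.predict(["enjoy sunshine"])
print(pred)
```

The tf-idf matrix stays sparse throughout the pipeline, so this form also sidesteps the DataFrame conversion problem from the question.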