Converting a tf-idf matrix into a pandas DataFrame



I have the following data set:

test_set = ("The sun in the sky", "The sun in the light", "Do not blame it on moonlight", "Do not blame it on sunshine")

Now I create a tf-idf matrix with the following code:

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit_transform(test_set)
smatrix = vectorizer.transform(test_set)
smatrix.todense()
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(smatrix)
tf_idf_matrix = tfidf.transform(smatrix)

What I want to do now is "feed" this matrix into a k-means clustering algorithm. For example, like this:

import pandas as pd
from sklearn import cluster
df = pd.DataFrame([[0.2, 0.3, 0.4], [0.2, 0.3, 0.41], [0.2, 0.1, 0.05], [0.1, 0.1, 0.08]], columns=('column1', 'column2', 'column3'))
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(df)
print(k_means.labels_)

However, I can't seem to convert the matrix into a df. If I do this:

df = pd.DataFrame(tf_idf_matrix)

I get

Traceback (most recent call last):
  File "/Users/marcvanderpeet/PycharmProjects/untitled/test.py", line 47, in <module>
    df = pd.DataFrame(tf_idf_matrix)
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 345, in __init__
    raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!

Any ideas on how to convert it?

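tf_idf_matrix has the type scipy.sparse.csr.csr_matrix. You can check this by typing type(tf_idf_matrix). The pandas documentation for pd.DataFrame says that the DataFrame class can be constructed only from a numpy ndarray (structured or homogeneous), a dict, or another DataFrame. To convert tf_idf_matrix to a numpy representation, you can do: tf_idf_matrix = tf_idf_matrix.todense(). This line converts the scipy.sparse.csr.csr_matrix into a numpy.matrixlib.defmatrix.matrix, and pd.DataFrame can handle that type of data. After that you can build the df and pass it to the k_means.fit() method.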
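The conversion described above can be sketched end to end as follows. This is a minimal sketch, assuming the question's test_set. It uses .toarray(), which returns a plain numpy.ndarray, in place of .todense(). The column labels built from vectorizer.vocabulary_ and the n_init and random_state arguments are illustrative additions, not part of the original post.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

test_set = ("The sun in the sky", "The sun in the light",
            "Do not blame it on moonlight", "Do not blame it on sunshine")

# Term counts, then tf-idf weighting (both results are scipy sparse matrices).
vectorizer = CountVectorizer()
smatrix = vectorizer.fit_transform(test_set)
tf_idf_matrix = TfidfTransformer(norm="l2").fit_transform(smatrix)

# Column labels: vocabulary terms ordered by their column index (illustrative).
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Densify the sparse matrix before handing it to pandas.
df = pd.DataFrame(tf_idf_matrix.toarray(), columns=terms)

# n_init and random_state are assumptions added for reproducibility.
k_means = KMeans(n_clusters=2, n_init=10, random_state=0)
k_means.fit(df)
print(df.shape)
print(k_means.labels_)
```

Densifying is fine for a corpus this small. KMeans.fit also accepts the scipy sparse matrix directly, so the DataFrame is only needed when labelled columns are useful.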

Note that as of pandas version 0.20 you can create a pandas SparseDataFrame directly from a scipy sparse matrix:

import pandas as pd
from scipy.sparse import csr_matrix
sp_arr = csr_matrix(arr)  # arr: any 2-D numpy array
sdf = pd.SparseDataFrame(sp_arr)
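One caveat for newer installations: SparseDataFrame was removed in pandas 1.0. Since pandas 0.25 the same idea is available through pd.DataFrame.sparse.from_spmatrix. A minimal sketch, where the small array and the column names are made up purely for illustration:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Illustrative data, not from the original post.
arr = np.array([[0.0, 1.0, 0.0],
                [2.0, 0.0, 0.0]])
sp_arr = csr_matrix(arr)

# Build a DataFrame whose columns are backed by sparse arrays.
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr, columns=["a", "b", "c"])

print(sdf.dtypes)          # each column has a Sparse dtype
print(sdf.sparse.density)  # fraction of entries that are actually stored
```

The resulting frame keeps the data sparse in memory, and sdf.sparse.to_dense() converts it to an ordinary DataFrame when needed.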

We can also use an sklearn Pipeline:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.cluster import KMeans
test_set = ["The sun in the sky", "The sun in the light", "Do not blame it on moonlight", "Do not blame it on sunshine"]
df = pd.DataFrame(test_set, columns=['sent'])
print(df)
#                            sent
# 0            The sun in the sky
# 1          The sun in the light
# 2  Do not blame it on moonlight
# 3   Do not blame it on sunshine
model = Pipeline([('vectorizer', CountVectorizer()), ('tf_trans', TfidfTransformer()), ('k_means', KMeans(n_clusters=2))])

# Now we can pass the text directly to the model.
# Pass the column, not the whole DataFrame: iterating over a DataFrame yields its column names.
model.fit(df['sent'])

# To predict the cluster of a new comment, just pass it to the model.
print(model.predict(['enjoy sunshine ']))
# o/p --> [0]  (the cluster number, 0 or 1, can vary between runs)
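If the tf-idf values themselves are still wanted as a DataFrame, they can be recovered from the fitted pipeline. A minimal sketch, assuming the same step names as in the pipeline above ('vectorizer', 'tf_trans', 'k_means'). The n_init and random_state arguments, the row index and the extra "cluster" column are illustrative additions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

test_set = ["The sun in the sky", "The sun in the light",
            "Do not blame it on moonlight", "Do not blame it on sunshine"]

model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tf_trans", TfidfTransformer()),
    # n_init and random_state are assumptions added for reproducibility.
    ("k_means", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
model.fit(test_set)

# Re-run only the two feature steps to get the tf-idf matrix back.
vec = model.named_steps["vectorizer"]
counts = vec.transform(test_set)
tfidf = model.named_steps["tf_trans"].transform(counts)

# One row per sentence, one column per vocabulary term.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=terms, index=test_set)

# Attach the cluster label that k-means assigned to each sentence.
tfidf_df["cluster"] = model.named_steps["k_means"].labels_
print(tfidf_df.round(2))
```

This keeps the pipeline as the single fitted object while still giving a labelled table for inspection.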
