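I want to build a classification model on text, using the words plus some additional features (for example, whether the text contains a link).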

tweets = ['this tweet has a link htt://link','this one does not','this one does http://link.net']

I use sklearn to get a sparse matrix of the text data:

tfidf_vectorizer = TfidfVectorizer(max_df=0.90, max_features=200000, min_df=0.1, stop_words='english', use_idf=True, ntlk.tokenize,ngram_range=(1,2))

tfidf_matrix = tfidf_vectorizer.fit_transform(tweets)

I want to add columns to it to support other features of the text data. I tried:

import scipy as sc

all_data = sc.hstack((tfidf_matrix, [1,0,1]))

which gives me data that looks like this:

array([ <3x8 sparse matrix of type '<type 'numpy.float64'>' with 10 stored elements in Compressed Sparse Row format>, 1, 1, 0], dtype=object)

When I feed this data to a model:

`from sklearn.naive_bayes import MultinomialNB
 clf = MultinomialNB().fit(all_data, y)`

I get a traceback:

`Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
    execfile(filename, namespace)
  File "C:/Users/c/Desktop/features.py", line 157, in <module>
    clf = MultinomialNB().fit(all_data, y)
  File "C:\Anaconda\lib\site-packages\sklearn\naive_bayes.py", line 302, in fit
    _, n_features = X.shape
ValueError: need more than 1 value to unpack`

Edit: shapes of the data

`tfidf_matrix.shape
 (100, 2)
 all_data.shape
 (100L,)`

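Is it possible to append columns directly to a sparse matrix? If not, how should I convert the data into a format that supports this? I am worried that anything other than a sparse matrix will increase the memory footprint.

"Can I append columns directly to a sparse matrix?" Yes. And you probably should, because densifying the matrix (with todense or toarray) can easily blow up memory on a large corpus.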

Use scipy.sparse.hstack:

import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
tweets = ['this tweet has a link htt://link','this one does not','this one does http://link.net']
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(tweets)
print(tfidf_matrix.shape)

(3, 10)

# the extra feature must be a 2-D column with one row per document
new_column = np.array([[1],[0],[1]])
print(new_column.shape)

(3, 1)

# hstack returns COO by default; ask for CSR, which is what sklearn estimators work with
final = sparse.hstack((tfidf_matrix, new_column), format='csr')
print(final.shape)

(3, 11)
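To connect this back to the question, here is a minimal sketch of the full path from raw tweets to a fitted model. The labels `y` and the `LINK_PATTERN` regex are assumptions made up for illustration (they are not in the question); the pattern is deliberately loose so that the malformed `htt://link` in the first tweet still counts as a link, matching the `[1, 0, 1]` column used above.

```python
import re

import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

tweets = ['this tweet has a link htt://link',
          'this one does not',
          'this one does http://link.net']
y = np.array([1, 0, 1])  # hypothetical labels, only here so fit() has a target

tfidf_matrix = TfidfVectorizer().fit_transform(tweets)

# Hypothetical link detector: any "scheme://something" token counts as a link
LINK_PATTERN = re.compile(r'\w+://\S+')
has_link = np.array([[1.0 if LINK_PATTERN.search(t) else 0.0] for t in tweets])

# Append the column without ever densifying the TF-IDF matrix
features = sparse.hstack((tfidf_matrix, has_link), format='csr')

clf = MultinomialNB().fit(features, y)
print(features.shape)
print(sparse.issparse(features))
```

MultinomialNB expects non-negative features, so a 0/1 indicator column sits fine next to TF-IDF weights; a feature that can go negative would need a different estimator or some rescaling first.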

Converting the sparse matrix to a dense matrix

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
tweets = ['this tweet has a link htt://link','this one does not','this one does http://link.net']
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(tweets)
# toarray() returns a plain ndarray; todense() would return the older np.matrix type
dense = tfidf_matrix.toarray()
print(dense.shape)
newCol = [[1],[0],[1]]
# axis=1 appends newCol as an extra column
allData = np.append(dense, newCol, 1)
print(allData.shape)

(3, 10)

(3, 11)
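The dense route works on three tweets, but the memory concern raised in the question is real at scale. A rough sketch of the difference, using a randomly generated matrix as a hypothetical stand-in for a TF-IDF matrix (the 1,000 x 2,000 size and 0.1% density are made-up numbers, not measurements from the question):

```python
from scipy import sparse

# Hypothetical stand-in for a TF-IDF matrix: 1,000 documents, 2,000 terms,
# with 0.1% of the entries non-zero
X = sparse.random(1000, 2000, density=0.001, format='csr')

# CSR stores only the non-zero values plus two index arrays
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

# The dense copy stores every entry, zero or not
dense_bytes = X.toarray().nbytes

print(sparse_bytes, dense_bytes)
```

The dense copy always costs rows x columns x 8 bytes for float64, whatever the sparsity, so the gap grows with the vocabulary size.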

This is the correct form:

from scipy import sparse
all_data = sparse.hstack([tfidf_matrix, sparse.csr_matrix([1,0,1]).T], format='csr')
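The transpose in that line matters. A short sketch of why, using a 3 x 3 identity matrix as a hypothetical stand-in for `tfidf_matrix` so the snippet runs on its own:

```python
from scipy import sparse

# Hypothetical stand-in for tfidf_matrix: any sparse matrix with 3 rows
base = sparse.identity(3, format='csr')

# A flat list becomes a single ROW of shape (1, 3) ...
flat = sparse.csr_matrix([1, 0, 1])
print(flat.shape)

# ... so it has to be transposed into a (3, 1) column first
column = flat.T
print(column.shape)

combined = sparse.hstack([base, column], format='csr')
print(combined.shape)

# Without the transpose the row counts disagree and hstack refuses
try:
    sparse.hstack([base, flat])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```

hstack needs every block to have the same number of rows, which is why the extra feature has to arrive as a column with one entry per document.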
