How to add the following feature to a TF-IDF matrix

Hello, I have a list called list_cluster that looks like this:

list_cluster=["hello,this","this is a test","the car is red",...]

I am using TfidfVectorizer to produce a model, as follows:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Load the previously fitted vectorizer from disk
with open('vectorizerTFIDF.pickle', 'rb') as infile:
    tdf = pickle.load(infile)
tfidf2 = tdf.transform(list_cluster)

Then I want to add a new feature to the matrix called tfidf2. I have a list:

dates=['010000000000', '001000000000', '001000000000', '000000000001', '001000000000', '000000000010',...]

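This list has the same length as list_cluster and encodes the date: each string has 12 positions, with a 1 at the position of the corresponding month of the year.

For example, '010000000000' represents February.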
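
To make the encoding concrete, here is a minimal sketch that decodes strings in this format into a numeric array with one column per month. The name `sample_dates` and its values are made up for illustration; they are not the real data.

```python
import numpy as np

# Hypothetical sample in the same format as the `dates` list.
sample_dates = ['010000000000', '001000000000', '000000000001']

# Split each 12-character string into 12 integer columns (one per month).
month_matrix = np.array([[int(ch) for ch in s] for s in sample_dates])

# Position of the 1 in each row, shifted to a 1-based month number.
month_numbers = (month_matrix.argmax(axis=1) + 1).tolist()

print(month_matrix.shape)  # (3, 12)
print(month_numbers)       # [2, 3, 12]
```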

To use it as a feature, I first tried:

import numpy as np
dates=np.array(listMonth)
dates=np.transpose(dates)

to get a NumPy array and then transpose it, so that I can concatenate it with the first matrix, tfidf2.

print("shape tfidf2: "+str(tfidf2.shape),"shape dates: "+str(dates.shape))

To concatenate my vector with the matrix:

tfidf2=np.hstack((tfidf2,dates[:,None]))

But this is the output:

shape tfidf2: (11159, 1927) shape dates: (11159,)
Traceback (most recent call last):
  File "Main.py", line 230, in <module>
    tfidf2=np.hstack((tfidf2,dates[:,None]))
  File "/usr/local/lib/python3.5/dist-packages/numpy/core/shape_base.py", line 278, in hstack
    return _nx.concatenate(arrs, 0)
ValueError: all the input arrays must have same number of dimensions

The shapes look fine, but I am not sure what is failing. I would appreciate help concatenating this feature with my tfidf2 matrix. Thanks in advance.

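You need to convert all the strings to numbers for sklearn. One way is to use the LabelBinarizer class in sklearn's preprocessing module. It creates a new binary column for each unique value in your original column.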
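
As a small illustration of what LabelBinarizer produces, the sketch below runs it on a made-up list (`sample_dates` is hypothetical) containing three distinct strings, so the result has three columns.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical sample with three distinct month strings.
sample_dates = ['010000000000', '001000000000', '000000000001', '001000000000']

lb = LabelBinarizer()
b_dates = lb.fit_transform(sample_dates)

# classes_ holds the unique strings in sorted order; column i of b_dates
# is 1 where the input equals classes_[i].
print(list(lb.classes_))
print(b_dates.shape)  # (4, 3)
```

Note that when the input contains only two distinct values, LabelBinarizer returns a single column rather than two.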

If dates has the same number of rows as tfidf2, then I think this will work.

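from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import LabelBinarizer

# create tfidf2 (transform returns a SciPy sparse matrix)
tfidf2 = tdf.transform(list_cluster)
# create dates
dates=['010000000000', '001000000000', '001000000000', '000000000001', '001000000000', '000000000010',...]
# binarize dates
lb = LabelBinarizer()
b_dates = lb.fit_transform(dates)
# tfidf2 is sparse, so stack with scipy.sparse.hstack instead of np.concatenate
new_tfidf = hstack((tfidf2, csr_matrix(b_dates)))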
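
For a self-contained check of this approach, the sketch below fits a fresh TfidfVectorizer on a tiny made-up corpus, since the pickled vectorizer from the question is not available here. The names `docs` and `sample_dates` are hypothetical. It stacks the columns with `scipy.sparse.hstack`, because `TfidfVectorizer` returns a SciPy sparse matrix, which NumPy's `hstack` and `concatenate` do not treat as an ordinary 2-D array.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

# Hypothetical stand-ins for list_cluster and dates.
docs = ["hello,this", "this is a test", "the car is red"]
sample_dates = ['010000000000', '001000000000', '000000000001']

# TF-IDF features: a SciPy sparse matrix of shape (3, n_terms).
tfidf = TfidfVectorizer().fit_transform(docs)

# One binary column per distinct date string (3 here).
b_dates = LabelBinarizer().fit_transform(sample_dates)

# Convert the dense block to sparse, then stack column-wise.
combined = sparse.hstack([tfidf, sparse.csr_matrix(b_dates)]).tocsr()

print(tfidf.shape, b_dates.shape, combined.shape)
```

The combined matrix keeps the TF-IDF columns first and appends the date columns at the end, so it has `tfidf.shape[1] + b_dates.shape[1]` columns.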
