Adding the sparse matrix from CountVectorizer to a DataFrame with complementary information for a classifier - keeping it in sparse format



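I have the following problem. I am building a classifier system that takes text and some additional complementary information as input. I store the complementary information in a pandas DataFrame. I transform the text with CountVectorizer and get a sparse matrix. To train the classifier, I need both inputs in the same DataFrame. The problem is that when I merge the DataFrame with the CountVectorizer output I get a dense matrix, which means I run out of memory very quickly. Is there any way to avoid this and merge the two inputs properly without getting a dense matrix?

Sample code: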

import datetime

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# number of most frequent words to keep
n_features = 5000
# pd.DataFrame.from_csv was removed from pandas; pd.read_csv is the replacement
df = pd.read_csv('DataWithSentimentAndTopics.csv', index_col=None)
# vectorizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')
# getting the TF matrix (scipy sparse)
tf = tf_vectorizer.fit_transform(df['reviewText'])
# tf.toarray() converts the sparse matrix to a dense one; this is the step that uses up memory
df = pd.concat([df.drop(['reviewText', 'Summary'], axis=1), pd.DataFrame(tf.toarray())], axis=1)
# binning target variable into 4 bins
df['helpful'] = pd.cut(df['helpful'], [-1, 0, 10, 50, 100000], labels=[0, 1, 2, 3])

#creating X and Y variables
train = df.drop(['helpful'], axis=1)
Y = df['helpful']
#splitting into train and test
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.1)

# creating the gradient boosting classifier
gbc = GradientBoostingClassifier(max_depth = 7, n_estimators=1500, min_samples_leaf=10)
print('Training GBC')
print(datetime.datetime.now())
#fit classifier, look for best
gbc.fit(X_train, y_train)

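As you can see, I set my CountVectorizer to keep 5000 words. I have only 50,000 rows in my original DataFrame, but I already get a matrix of 50,000 x 5,000 cells, which is 250 million cells. That already requires a lot of memory.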
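To put rough numbers on this, here is a back-of-the-envelope sketch comparing the dense footprint with a CSR sparse layout of the same shape. The 1% density and the 4-byte index size are assumptions for illustration, not measurements from the dataset in the question.

```python
import numpy as np

n_rows, n_cols = 50_000, 5_000
value_size = np.dtype(np.int64).itemsize  # 8 bytes per stored count
index_size = np.dtype(np.int32).itemsize  # 4 bytes per CSR index (assumed)

# Dense: every cell is stored, whether it is zero or not.
dense_bytes = n_rows * n_cols * value_size

# CSR: one value and one column index per non-zero cell, plus one row pointer per row.
density = 0.01  # ASSUMPTION: about 1% of the cells are non-zero
nnz = int(n_rows * n_cols * density)
csr_bytes = nnz * (value_size + index_size) + (n_rows + 1) * index_size

print(f"dense: {dense_bytes / 1e9:.1f} GB")
print(f"CSR:   {csr_bytes / 1e6:.1f} MB")
```

The exact ratio depends on how sparse the real count matrix is, but bag-of-words matrices are usually far below 1% dense.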

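You don't need to use a DataFrame.

Convert the numerical features from the DataFrame to a numpy array and stack them next to the sparse matrix: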

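from scipy import sparse

# cols: list of the numeric column names in df
num_feats = df[cols].values
# count_vectorizer_features: the sparse matrix returned by CountVectorizer
training_data = sparse.hstack((count_vectorizer_features, num_feats))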
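As a self-contained sketch of that stacking step, the example below uses a tiny made-up DataFrame (the column names and values are hypothetical). It wraps the numeric block in a CSR matrix explicitly and asks `hstack` for CSR output, since the default result is COO, which does not support row slicing.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the real data (hypothetical values).
df = pd.DataFrame({
    "reviewText": ["good book", "bad book", "good story", "slow story"],
    "price": [8.5, 3.0, 12.0, 7.25],
    "overall": [5, 1, 4, 2],
})

# Sparse word counts, one column per vocabulary word.
text_features = CountVectorizer().fit_transform(df["reviewText"])

# Numeric columns as a plain float array.
num_feats = df[["price", "overall"]].to_numpy(dtype=np.float64)

# Stack side by side; the result stays sparse.
X = sparse.hstack([text_features, sparse.csr_matrix(num_feats)], format="csr")

print(type(X).__name__, X.shape)
```

The row order of `X` matches the row order of `df`, so the target column can be taken from `df` directly.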

Then you can use any scikit-learn algorithm that supports sparse data.

For GBM, you can use xgboost, which supports sparse input.
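A minimal sketch of training on a stacked sparse matrix without ever densifying it. It uses scikit-learn's own `GradientBoostingClassifier`, whose `fit` accepts SciPy sparse input, rather than xgboost. The data is synthetic and the hyperparameters are arbitrary small values chosen so the example runs quickly.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 rows, 50 sparse "word count" columns, 3 numeric columns.
text_features = sparse.random(200, 50, density=0.05, format="csr", random_state=0)
num_feats = rng.normal(size=(200, 3))
y = (num_feats[:, 0] > 0).astype(int)  # made-up binary target

X = sparse.hstack([text_features, sparse.csr_matrix(num_feats)], format="csr")

# train_test_split slices sparse matrices row-wise and keeps them sparse.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

gbc = GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=0)
gbc.fit(X_train, y_train)
accuracy = gbc.score(X_test, y_test)
print(sparse.issparse(X_train), accuracy)
```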

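As @abhishekthakur has already said, you don't have to put your one-hot-encoded data into the DataFrame.

But if you want to do so, you can add each feature as a sparse column (a pandas.SparseSeries in the pandas version this answer was written for):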

# vectorizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')
#getting the TF matrix
tf = tf_vectorizer.fit_transform(df.pop('reviewText'))
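# adding "features" columns as sparse columns
# (pd.SparseSeries was removed in pandas 1.0; pd.arrays.SparseArray replaces it.
#  On scikit-learn < 1.0 use get_feature_names() instead of get_feature_names_out().)
for i, col in enumerate(tf_vectorizer.get_feature_names_out()):
    df[col] = pd.arrays.SparseArray(tf[:, i].toarray().ravel(), fill_value=0)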

Result:

In [107]: df.head(3)
Out[107]:
        asin  price      reviewerID  LenReview                  Summary  LenSummary  overall  helpful  reviewSentiment         0  
0  151972036   8.48  A14NU55NQZXML2        199  really a difficult read          23        3        2          -0.7203  0.002632
1  151972036   8.48  A1CSBLAPMYV8Y0         77                      wha           3        4        0          -0.1260  0.005556
2  151972036   8.48  A1DDECXCGHDYZK        114       wordy and drags on          18        1        4           0.5707  0.004545
   ...    think  thought  trailers  trying  wanted  words  worth  wouldn  writing  young
0  ...        0        0         0       0       1      0      0       0        0      0
1  ...        0        0         0       1       0      0      0       0        0      0
2  ...        0        0         0       0       1      0      1       0        0      0
[3 rows x 78 columns]

Pay attention to the memory usage:

In [108]: df.memory_usage()
Out[108]:
Index               80
asin               112
price              112
reviewerID         112
LenReview          112
Summary            112
LenSummary         112
overall            112
helpful            112
reviewSentiment    112
0                  112
1                  112
2                  112
3                  112
4                  112
5                  112
6                  112
7                  112
8                  112
9                  112
10                 112
11                 112
12                 112
13                 112
14                 112
                  ...
parts               16   # memory used: # of ones multiplied by 8 (np.int64)
peter               16
picked              16
point               16
quick               16
rating              16
reader              16
reading             24
really              24
reviews             16
stars               16
start               16
story               32
tedious             16
things              16
think               16
thought             16
trailers            16
trying              16
wanted              24
words               16
worth               16
wouldn              16
writing             24
young               16
dtype: int64
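The same comparison can be reproduced on a single column. This is a small sketch with made-up data: one mostly-zero column stored once as a regular int64 Series and once as a sparse column. The exact byte count of the sparse column depends on the pandas version, since it may also include the positions of the stored values.

```python
import numpy as np
import pandas as pd

n_rows = 10_000
values = np.zeros(n_rows, dtype=np.int64)
values[[10, 500, 9_000]] = 1  # only three non-zero counts

dense_col = pd.Series(values)
sparse_col = pd.Series(pd.arrays.SparseArray(values, fill_value=0))

# index=False leaves the index out, so only the column data is counted.
dense_bytes = dense_col.memory_usage(index=False)
sparse_bytes = sparse_col.memory_usage(index=False)

print(dense_bytes, sparse_bytes, sparse_col.sparse.density)
```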

pandas also supports importing a sparse matrix directly and storing it with its SparseDtype:
import pandas as pd
import scipy.sparse

# Your_Sparse_Data: any scipy.sparse matrix, e.g. the CountVectorizer output
pd.DataFrame.sparse.from_spmatrix(Your_Sparse_Data)

You can then concatenate it with the rest of your DataFrame.
