I have the following problem. I'm building a classifier system that will use both text and some additional information as input. I store the additional information in a pandas DataFrame. I transform the text with CountVectorizer and get a sparse matrix. Now, in order to train the classifier, I need to have both inputs together in the same dataframe. The problem is that when I merge the dataframe with the output of CountVectorizer, I get a dense matrix, which means I run out of memory really fast. Is there any way to avoid this and merge the two inputs properly without ending up with a dense matrix?
Example code:
import datetime

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# how many most popular words we consider
n_features = 5000

df = pd.read_csv('DataWithSentimentAndTopics.csv', index_col=None)

# vectorizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')

# getting the TF matrix
tf = tf_vectorizer.fit_transform(df['reviewText'])

# tf.A densifies the sparse matrix -- this is where memory blows up
df = pd.concat([df.drop(['reviewText', 'Summary'], axis=1), pd.DataFrame(tf.A)], axis=1)

# binning target variable into 4 bins
df['helpful'] = pd.cut(df['helpful'], [-1, 0, 10, 50, 100000], labels=[0, 1, 2, 3])

# creating X and Y variables
train = df.drop(['helpful'], axis=1)
Y = df['helpful']

# splitting into train and test
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.1)

# creating GBC
gbc = GradientBoostingClassifier(max_depth=7, n_estimators=1500, min_samples_leaf=10)

print('Training GBC')
print(datetime.datetime.now())

# fit classifier
gbc.fit(X_train, y_train)
As you can see, I set up my CountVectorizer with 5000 words. I have just 50,000 rows in my original dataframe, but I already get a matrix of 50,000 x 5,000 cells, which is 250 million entries. It already needs a lot of memory.
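A quick back-of-the-envelope check makes the blow-up concrete (a sketch: int64 is the dtype tf.A produces for counts, and the ~50 nonzeros per row is only an assumed average):

import numpy as np

rows, cols = 50000, 5000
print(rows * cols * 8 / 1e9)   # ~2.0 GB for the dense int64 matrix

nnz = rows * 50                # assume ~50 distinct vocabulary words per review
print(nnz * 12 / 1e6)          # ~30 MB in CSR form (8-byte data + 4-byte indices)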
You don't need to use a dataframe.

Convert the numerical features from the dataframe to a numpy array:

num_feats = df[[cols]].values

from scipy import sparse
training_data = sparse.hstack((count_vectorizer_features, num_feats))

Then you can use a scikit-learn algorithm that supports sparse data.

For GBMs, you can use xgboost, which supports sparse input.
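Putting the two snippets together with the question's code, a minimal end-to-end sketch could look like this (the numeric column names and the XGBoost parameters are assumptions; adjust them to your data):

from scipy import sparse
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes xgboost is installed

# hypothetical numeric columns from the original dataframe
numeric_cols = ['price', 'LenReview', 'LenSummary', 'overall', 'reviewSentiment']
num_feats = df[numeric_cols].values

# hstack returns a COO matrix; convert to CSR so train_test_split can slice rows
X = sparse.hstack((tf, num_feats)).tocsr()
y = df['helpful'].astype(int)  # pd.cut yields categoricals; xgboost wants ints

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

clf = XGBClassifier(n_estimators=200, max_depth=7)  # accepts scipy sparse input
clf.fit(X_train, y_train)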
As @AbhishekThakur has already said, you don't have to put your one-hot-encoded data into the dataframe.

But if you want to do so, you can add pandas.SparseSeries as columns:
# vectorizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')

# getting the TF matrix
tf = tf_vectorizer.fit_transform(df.pop('reviewText'))

# adding "features" columns as SparseSeries
# NOTE: pd.SparseSeries and get_feature_names() were removed in newer pandas /
# scikit-learn; there, use pd.arrays.SparseArray and get_feature_names_out()
for i, col in enumerate(tf_vectorizer.get_feature_names()):
    df[col] = pd.SparseSeries(tf[:, i].toarray().ravel(), fill_value=0)
Result:
In [107]: df.head(3)
Out[107]:
asin price reviewerID LenReview Summary LenSummary overall helpful reviewSentiment 0
0 151972036 8.48 A14NU55NQZXML2 199 really a difficult read 23 3 2 -0.7203 0.002632
1 151972036 8.48 A1CSBLAPMYV8Y0 77 wha 3 4 0 -0.1260 0.005556
2 151972036 8.48 A1DDECXCGHDYZK 114 wordy and drags on 18 1 4 0.5707 0.004545
... think thought trailers trying wanted words worth wouldn writing young
0 ... 0 0 0 0 1 0 0 0 0 0
1 ... 0 0 0 1 0 0 0 0 0 0
2 ... 0 0 0 0 1 0 1 0 0 0
[3 rows x 78 columns]
Note the memory usage:
In [108]: df.memory_usage()
Out[108]:
Index 80
asin 112
price 112
reviewerID 112
LenReview 112
Summary 112
LenSummary 112
overall 112
helpful 112
reviewSentiment 112
0 112
1 112
2 112
3 112
4 112
5 112
6 112
7 112
8 112
9 112
10 112
11 112
12 112
13 112
14 112
...
parts 16 # memory used: # of ones multiplied by 8 (np.int64)
peter 16
picked 16
point 16
quick 16
rating 16
reader 16
reading 24
really 24
reviews 16
stars 16
start 16
story 32
tedious 16
things 16
think 16
thought 16
trailers 16
trying 16
wanted 24
words 16
worth 16
wouldn 16
writing 24
young 16
dtype: int64
Pandas also supports importing sparse matrices, which it stores using its SparseDtype:

import scipy.sparse
pd.DataFrame.sparse.from_spmatrix(Your_Sparse_Data)

You can then concatenate this with the rest of your dataframe.
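Applied to the question's code, a short sketch (assuming pandas >= 0.25 for the sparse accessor and scikit-learn >= 1.0 for get_feature_names_out):

import pandas as pd

# wrap the CountVectorizer output in a sparse-backed frame; every column keeps
# a SparseDtype, so memory stays proportional to the number of nonzeros
tf_df = pd.DataFrame.sparse.from_spmatrix(
    tf,
    index=df.index,
    columns=tf_vectorizer.get_feature_names_out())

df = pd.concat([df.drop(['reviewText', 'Summary'], axis=1), tf_df], axis=1)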