I'm working on an 8-class classification problem with a training set of roughly 400,000 labeled entries. I vectorize the data with CountVectorizer.fit(), but I get a MemoryError. I tried HashingVectorizer instead, but with no luck.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

path = 'data/products.tsv'
products = pd.read_table(path, header=None, names=['label', 'entry'])

X = products.entry
y = products.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Vectorizing the dataset
vect = CountVectorizer()
vect.fit(X_train.values.astype('U'))
X_train_dtm = vect.transform(X_train.values.astype('U'))
X_test_dtm = vect.transform(X_test.values.astype('U'))
You can set max_features to cap the vocabulary size and thereby its memory usage. The right value really depends on the task, so you should treat it as a hyperparameter and try to tune it. In NLP (for English), ~10,000 is a common choice of vocabulary size. You can achieve the same effect with HashingVectorizer (via its n_features parameter), but then you risk hash collisions, where multiple distinct words increment the same counter.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

path = 'data/products.tsv'
products = pd.read_table(path, header=None, names=['label', 'entry'])

X = products.entry
y = products.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Vectorizing the dataset with a capped vocabulary
vect = CountVectorizer(max_features=10000)  # keep only the 10,000 most frequent terms
vect.fit(X_train.values.astype('U'))
X_train_dtm = vect.transform(X_train.values.astype('U'))
X_test_dtm = vect.transform(X_test.values.astype('U'))
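If you'd rather avoid storing a vocabulary at all, a minimal sketch of the HashingVectorizer route could look like the following. The value n_features=2**14 here is an arbitrary illustration, not a recommendation; since the hasher is stateless, no fit on the training data is needed:

from sklearn.feature_extraction.text import HashingVectorizer

# Stateless hashing: no vocabulary is kept in memory.
# n_features=2**14 is an illustrative choice; smaller values save memory
# but increase the chance of hash collisions (distinct words sharing a counter).
vect = HashingVectorizer(n_features=2**14, alternate_sign=False)

X_train_dtm = vect.transform(X_train.values.astype('U'))
X_test_dtm = vect.transform(X_test.values.astype('U'))

The trade-off for the fixed memory footprint is that the hasher has no inverse mapping from feature indices back to words, so you lose interpretability of the resulting features.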