I'm trying to classify documents as deceptive or truthful using TF-IDF and an SVM. I know this has been done before, but I'm not sure I'm implementing it correctly. I have a corpus of texts and build the TF-IDF matrix like this:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, binary=False, use_idf=True, smooth_idf=False, sublinear_tf=True)
tf_idf_model = vectorizer.fit_transform(corpus)
features = tf_idf_model.toarray()
For the classification:
import numpy as np

# Shuffle features and labels in unison. A single permutation index is
# used here because random.shuffle on a 2-D NumPy array can duplicate rows.
perm = np.random.permutation(len(labels))
features = features[perm]
labels = np.asarray(labels)[perm]
import time
from sklearn import svm

features_folds = np.array_split(features, folds)
labels_folds = np.array_split(labels, folds)

for C_power in C_powers:
    scores = []
    start_time = time.time()
    svc = svm.SVC(C=2**C_power, kernel='linear')
    for k in range(folds):
        # Hold out fold k for testing and train on the remaining folds
        features_train = list(features_folds)
        features_test = features_train.pop(k)
        features_train = np.concatenate(features_train)
        labels_train = list(labels_folds)
        labels_test = labels_train.pop(k)
        labels_train = np.concatenate(labels_train)
        scores.append(svc.fit(features_train, labels_train).score(features_test, labels_test))
    print(scores)
But I'm getting an accuracy of ~50%, which is no better than chance for this binary task. My corpus is 1600 texts.
I think you may want to reduce the TF-IDF matrix before feeding it to the SVM, since SVMs don't cope well with large sparse matrices. I'd suggest using TruncatedSVD to reduce the dimensionality of the TF-IDF matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

vectorizer = TfidfVectorizer(min_df=1, binary=False, use_idf=True, smooth_idf=False, sublinear_tf=True)
svd = TruncatedSVD(n_components=20)
pipeline = Pipeline([
    ('tfidf', vectorizer),
    ('svd', svd),
])
features = pipeline.fit_transform(corpus)
Of course, you'll need to tune n_components to find the best number of components to keep.
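One way to tune n_components is to grid-search it jointly with the SVM's C over the whole pipeline, so every candidate dimensionality is scored by cross-validation. The sketch below is illustrative only: the toy corpus, labels, parameter values, and the added SVC step are assumptions, not part of your setup.

```python
# A sketch of tuning n_components and C jointly with GridSearchCV.
# The toy corpus, labels, and parameter grid are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

corpus = [
    "completely honest review of a great stay",
    "the room was clean and the staff were friendly",
    "honest opinion the breakfast was excellent",
    "we enjoyed the quiet location and helpful staff",
    "a truthful account the pool was warm and clean",
    "great value the beds were comfortable",
    "absolutely amazing best hotel ever unbelievable luxury",
    "stunning incredible perfect everything was flawless",
    "the most wonderful magical experience imaginable",
    "unbelievably perfect staff amazing rooms incredible food",
    "flawless magical stay the best ever without doubt",
    "incredible stunning perfection in every single way",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # 0 = truthful, 1 = deceptive

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(min_df=1, sublinear_tf=True)),
    ('svd', TruncatedSVD(random_state=0)),
    ('svc', SVC(kernel='linear')),
])

# Search the number of SVD components and the SVM's C together,
# so each candidate dimensionality is scored with cross-validation.
param_grid = {
    'svd__n_components': [2, 3, 5],
    'svc__C': [2 ** p for p in range(-3, 4)],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(corpus, labels)
print(search.best_params_)
print(search.best_score_)
```

A side benefit of searching over the whole pipeline is that the TF-IDF vectorizer is refit inside each fold, so no vocabulary or idf statistics leak from the test fold into training.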