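I am trying to learn multi-label text classification with scikit-learn. I am adapting one of the original example tutorials that ships with scikit-learn, the one that classifies languages using Wikipedia articles as training data. I tried to implement multi-label classification below, but the code still returns only a single label for each sentence.
Can anyone suggest the right way to enable multi-label classification?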
import sys
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import make_multilabel_classification
from sklearn.preprocessing import LabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_files
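# train_test_split lives in sklearn.model_selection since scikit-learn 0.18;
# on older versions, import it from sklearn.cross_validation instead
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.multiclass import OneVsRestClassifier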
# The training data folder must be passed as first argument - This uses the example wiki language data files
languages_data_folder = sys.argv[1]
dataset = load_files(languages_data_folder)
# Split the dataset in training and test set:
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.5)
#pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
target_names=dataset.target_names
# TASK: Fit the pipeline on the training set
clf.fit(docs_train, y_train)
# TASK: Predict the outcome on the testing set in a variable named y_predicted
y_predicted = clf.predict(docs_test)
print(target_names)
# Predict the result on some short new sentences:
sentences = [
    u'This is a language detection test.',
    u'Ceci est un test de d\xe9tection de la langue.',
    u'Dies ist ein Test, um die Sprache zu erkennen.',
    u'Bonjour Mon ami. This is a language detection test.',
]
predicted = clf.predict(sentences)
for s, p in zip(sentences, predicted):
    print(u'The language of "%s" is "%s"' % (s, target_names[p]))
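This returns:
The language of "This is a language detection test." is "en"
The language of "Ceci est un test de détection de la langue." is "fr"
The language of "Dies ist ein Test, um die Sprache zu erkennen." is "de"
The language of "Bonjour Mon ami. This is a language detection test." is "en"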
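You can do multi-label classification with scikit-multilearn, a library built on top of scikit-learn. With languages, correlations between labels are not important, so a Binary Relevance classifier should be a good fit. You can find examples of how to classify in the documentation, but in your case you need to replace:
('clf', OneVsRestClassifier(LinearSVC())),
with:
('clf', BinaryRelevance(LinearSVC())),
and add this import at the top: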
from skmultilearn.problem_transform import BinaryRelevance
Just remember to install scikit-multilearn first (for example with pip install scikit-multilearn)!
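One caveat: swapping the classifier is probably not enough on its own. load_files yields exactly one integer label per document, and with such one-dimensional targets OneVsRestClassifier always predicts a single class. To get several labels back, the training targets themselves have to be multi-label. The following is a minimal sketch of that idea using only scikit-learn. The toy sentences and their label tuples are made up for illustration, and MultiLabelBinarizer is an addition that does not appear in the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: each document carries a tuple of labels,
# and the mixed-language documents carry two.
docs = [
    "this is a test of the language detector",
    "the weather is nice today",
    "ceci est un test de la langue",
    "bonjour mon ami il fait beau",
    "bonjour mon ami this is a test",
    "the weather il fait beau today",
]
labels = [("en",), ("en",), ("fr",), ("fr",), ("en", "fr"), ("en", "fr")]

# Turn the label tuples into a binary indicator matrix, one column per label.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])

# Fitting on a 2-D indicator matrix puts OneVsRestClassifier in multi-label mode.
clf.fit(docs, Y)

# predict() now returns one row of 0/1 flags per sentence, not one class index.
predicted = clf.predict(["bonjour mon ami", "this is a test"])

# Map each indicator row back to a tuple of label names (possibly empty).
for row in mlb.inverse_transform(predicted):
    print(row)
```

For the Wikipedia language data this would mean building mixed-language training documents (or labelling existing ones with every language they contain) before fitting. The same indicator matrix should also be what BinaryRelevance expects as its target, though that part is not exercised here.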