scikit 0.14 multilabel metrics



I just installed scikit 0.14 so that I could explore the multilabel metric improvements. I got some positive results with the hamming loss metric and the classification report, but I could not get the confusion matrix to work. Also, in the classification report I could not pass a label array and have the label names printed in the report. Below is the code. Am I doing something wrong, or is this still under development?

import numpy as np
import pandas as pd
import random
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsOneClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
target_names = ['New York','London', 'DC']
X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york",
                    "DC is the nations capital",
                    "DC the home of the beltway",
                    "president obama lives in Washington",
                    "The washington monument in is Washington DC"])
y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[1,0],[1,0],[2],[2],[2],[2]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'hello welcome to new ybrk. enjoy it here and london too',
                   'What city does the washington redskins live in?'])
y_test = [[0],[1],[0,1],[2]]                   
classifier = Pipeline([
                       ('vectorizer', CountVectorizer(stop_words='english',
                             ngram_range=(1,3),
                             max_df = 1.0,
                             min_df = 0.1,
                             analyzer='word')),
                       ('tfidf', TfidfTransformer()),
                       ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
print predicted

for item, labels in zip(X_test, predicted):
    print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import hamming_loss

hl = hamming_loss(y_test, predicted, target_names)
print " "
print " "
print "---------------------------------------------------------"
print "HAMMING LOSS"
print " "
print hl
print " "
print " "
print "---------------------------------------------------------"
print "CONFUSION MATRIX"
print " "
cm = confusion_matrix(y_test, predicted)   
print cm
print " "
print " "
print "---------------------------------------------------------"
print "CLASSIFICATION REPORT"
print " "
print classification_report(y_test, predicted)
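
A minimal, self-contained sketch of one possible workaround (not from the script above): convert the label lists into a binary indicator matrix and compute per-label metrics. The to_indicator helper and the y_pred values below are made up for illustration, and the sketch assumes a scikit-learn version whose classification_report and hamming_loss accept indicator matrices; confusion_matrix only handles single-label input, so it is computed one label column at a time.

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, hamming_loss

target_names = ['New York', 'London', 'DC']

def to_indicator(label_lists, n_labels):
    # turn e.g. [[0], [0, 1]] into a 0/1 matrix of shape (n_samples, n_labels)
    indicator = np.zeros((len(label_lists), n_labels), dtype=int)
    for row, labels in enumerate(label_lists):
        for label in labels:
            indicator[row, label] = 1
    return indicator

# y_true mirrors y_test above; y_pred is an illustrative, made-up prediction
y_true = to_indicator([[0], [1], [0, 1], [2]], len(target_names))
y_pred = to_indicator([[0], [1], [0, 1], []], len(target_names))

# hamming_loss and classification_report accept the indicator matrices directly,
# and classification_report prints the label names via target_names
print(hamming_loss(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=target_names))

# confusion_matrix only supports single-label problems, so compute one
# 2x2 matrix per label column instead
for idx, name in enumerate(target_names):
    print(name)
    print(confusion_matrix(y_true[:, idx], y_pred[:, idx]))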

In version 0.14, released on August 14, 2013, the multiclass and multilabel metrics functionality appears to have been improved - scikit-learn.org/stable/whats_new.html

Issue 558 also seems to address some of this and may be part of 0.14, but I have not confirmed that yet - https://github.com/scikit-learn/scikit-learn/issues/558.

Latest update