Python sklearn Multilabel Classification: UserWarning: Label n

I am working on a multilabel classification problem. My data looks like this:

DocID   Content             Tags           
1       some text here...   [70]
2       some text here...   [59]
3       some text here...  [183]
4       some text here...  [173]
5       some text here...   [71]
6       some text here...   [98]
7       some text here...  [211]
8       some text here...  [188]
.       .............      .....
.       .............      .....
.       .............      .....

Here is my code:

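import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# On sklearn older than 0.18, import train_test_split from sklearn.cross_validation.
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

traindf = pd.read_csv("mul.csv")
print("This is what our training data looks like:")
print(traindf)

# Turn the raw text into a sparse TF-IDF feature matrix.
t = TfidfVectorizer()
X = traindf["Content"]
y = traindf["Tags"]
print("Original Content")
print(X)
X = t.fit_transform(X)
print("Content After transformation")
print(X)

# Turn the tag lists into a 0/1 indicator matrix, one column per label.
print("Original Tags")
print(y)
y = MultiLabelBinarizer().fit_transform(y)
print("Tags After transformation")
print(y)

# Newer sklearn releases rename get_feature_names() to get_feature_names_out().
print("Features extracted:")
print(t.get_feature_names())
print("Scores of features extracted")
idf = t.idf_
print(dict(zip(t.get_feature_names(), idf)))

print("Splitting into training and validation sets...")
Xtrain, Xvalidate, ytrain, yvalidate = train_test_split(X, y, test_size=.5)
print("Training Set Content and Tags")
print(Xtrain)
print(ytrain)
print("Validation Set Content and Tags")
print(Xvalidate)
print(yvalidate)

# Fit one L2-regularised logistic regression per label column.
print("Creating classifier")
clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01))
clf.fit(Xtrain, ytrain)
predictions = clf.predict(Xvalidate)
print("Predicted Tags are:")
print(predictions)
print("Correct Tags on Validation Set are :")
print(yvalidate)
print("Accuracy on validation set: %.3f" % clf.score(Xvalidate, yvalidate))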

The code runs fine, but I keep getting these messages:

X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 288 is present in all training examples.
  str(classes[c]))
X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 304 is present in all training examples.
  str(classes[c]))
X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 340 is present in all training examples.

What does this mean? Does it indicate that my data is not diverse enough?

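Some data mining algorithms have problems when certain items are present in all or many records. This is an issue, for example, when doing association rule mining with the Apriori algorithm.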
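To make the Apriori remark concrete, here is a rough sketch that counts itemset support by brute force (it is not an optimised Apriori implementation, and the transactions and the item name "bag" are made up for illustration). It shows that an item contained in every transaction has support 1.0 and can be appended to every frequent itemset, so the number of frequent itemsets roughly doubles without adding any information.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force support counting: return {itemset: support} for every
    itemset whose support is at least min_support."""
    items = sorted(set().union(*transactions))
    n = float(len(transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = fraction of transactions containing the whole candidate.
            support = sum(1 for t in transactions if set(candidate) <= t) / n
            if support >= min_support:
                result[candidate] = support
    return result

# Hypothetical toy transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]
# Same data, but every transaction also contains the item "bag".
with_common_item = [t | {"bag"} for t in transactions]

base = frequent_itemsets(transactions, min_support=0.5)
inflated = frequent_itemsets(with_common_item, min_support=0.5)
print("frequent itemsets without the common item:", len(base))
print("frequent itemsets with the common item:", len(inflated))
print("support of the common item:", inflated[("bag",)])
```

Every rule built from an itemset that includes the ubiquitous item is trivially true, which is why such items are usually filtered out before mining.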

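Whether this is a problem depends on the classifier. I don't know which particular classifier you are using, but here is an example where it could matter: fitting a decision tree with a maximum depth.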

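Suppose you are fitting a decision tree with a maximum depth, using Hunt's algorithm and the GINI index to determine the best split (see the explanation here, slide 35). The first split could be on whether or not the record has label 288. If every record has this label, the GINI index will be optimal for such a split. This means the first several splits will be useless, because you are not actually splitting the training set (you are splitting it into an empty set without 288 and the set itself, with 288). So the first several levels of the tree are useless. If you then set a maximum depth, this may result in a decision tree with low accuracy.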
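The "you are not actually splitting" point can be checked with a small calculation. The sketch below uses hypothetical toy data and helper names of my own (`gini`, `weighted_gini_after_split`); it is not taken from any particular tree library. It computes the weighted Gini impurity after splitting on a 0/1 feature. When the feature is 1 for every record, one child is empty and the other is the whole set, so the impurity after the split equals the impurity before it.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / float(n)) ** 2 for c in counts.values())

def weighted_gini_after_split(feature, labels):
    """Weighted Gini impurity of the two children of a split on a 0/1 feature."""
    left = [y for f, y in zip(feature, labels) if f == 0]
    right = [y for f, y in zip(feature, labels) if f == 1]
    n = float(len(labels))
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical toy data: four records, two classes.
classes = ["a", "b", "a", "b"]
has_288 = [1, 1, 1, 1]       # present in every record
informative = [1, 0, 1, 0]   # separates the classes perfectly

parent = gini(classes)
print("impurity before any split:", parent)
print("after splitting on has_288:", weighted_gini_after_split(has_288, classes))
print("after splitting on informative:", weighted_gini_after_split(informative, classes))
```

A split that leaves the impurity unchanged uses up one level of depth without separating anything, which is the situation described above.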

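In any case, the warning you are getting is not a problem with your code; at most it is a problem with your data set. You should check whether the classifier you use is sensitive to this kind of thing. If it is, it may give better results when you filter out the labels that occur everywhere.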
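One way to do that check is to look for label columns that are constant in the training split. As far as I can tell, this is the condition that triggers the warning in OneVsRestClassifier, and the "not 288" wording suggests that label 288 occurs in none of the training rows, which can easily happen to rare labels after a 50/50 split. Treat that reading as an assumption to verify against your sklearn version. The sketch below uses plain Python and hypothetical helper names (`constant_label_columns`, `drop_columns`); it accepts any 0/1 indicator matrix given as rows, such as the `ytrain` array from the question.

```python
def constant_label_columns(y_rows):
    """Return (never_present, always_present) column indices of a 0/1
    indicator matrix given as an iterable of rows."""
    rows = [list(row) for row in y_rows]
    n_rows = len(rows)
    never_present, always_present = [], []
    for j, column in enumerate(zip(*rows)):
        count = sum(column)
        if count == 0:
            never_present.append(j)
        elif count == n_rows:
            always_present.append(j)
    return never_present, always_present

def drop_columns(y_rows, columns):
    """Return a copy of the matrix (as nested lists) without the given columns."""
    drop = set(columns)
    return [[v for j, v in enumerate(row) if j not in drop] for row in y_rows]

# Hypothetical toy indicator matrix: 3 training rows, 4 labels.
y_train_demo = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
]
never, always = constant_label_columns(y_train_demo)
y_filtered = drop_columns(y_train_demo, never + always)
print("labels never present in training:", never)
print("labels present in every training row:", always)
print("remaining label columns:", len(y_filtered[0]))
```

If you drop columns from `ytrain`, drop the same columns from `yvalidate` as well, and map the column indices back to tag values through the `classes_` attribute of the fitted MultiLabelBinarizer (which means keeping a reference to it rather than creating it inline).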
