Classifier predictions are unreliable; is this because my GMM classifiers are not trained correctly?



I am training two GMM classifiers on MFCC values, one per label. I concatenate all the MFCC values of a class and fit them to one classifier. For each classifier, I sum up the probabilities it assigns to its label.

def createGMMClassifiers():
    label_samples = {}
    for label, sample in training.iteritems():
        labelstack = np.empty((50,13))
        for feature in sample:
            #debugger.set_trace()
            labelstack = np.concatenate((labelstack,feature))
        label_samples[label]=labelstack
    for label in label_samples:
        #debugger.set_trace()
        classifiers[label] = mixture.GMM(n_components = n_classes)
        classifiers[label].fit(label_samples[label])
    for sample in testing['happy']:
        classify(sample)
def classify(testMFCC):
    probability = {'happy':0,'sad':0}
    for name, classifier in classifiers.iteritems():
        prediction = classifier.predict_proba(testMFCC)
        for probforlabel in prediction:
            probability[name]+=probforlabel[0]
    print 'happy ',probability['happy'],'sad ',probability['sad']
    if(probability['happy']>probability['sad']):
        print 'happy'
    else:
        print 'sad'

But my results seem inconsistent, and I find it hard to believe this is caused by the RandomSeed=None state, because the predictions are usually the same label for all the test data, yet each run often gives the exact opposite result (see Output 1 and Output 2).

So my question is: am I doing something obviously wrong while training the classifiers?

Output 1:

happy  123.559202732 sad  122.409167294
happy
happy  120.000879032 sad  119.883786657
happy
happy  124.000069307 sad  123.999928962
happy
happy  118.874574047 sad  118.920941127
sad
happy  117.441353421 sad  122.71924156
sad
happy  122.210579428 sad  121.997571901
happy
happy  120.981752603 sad  120.325940128
happy
happy  126.013713257 sad  125.885047394
happy
happy  122.776016525 sad  122.12320875
happy
happy  115.064172476 sad  114.999513909
happy
Output 2:

happy  123.559202732 sad  122.409167294
happy
happy  120.000879032 sad  119.883786657
happy
happy  124.000069307 sad  123.999928962
happy
happy  118.874574047 sad  118.920941127
sad
happy  117.441353421 sad  122.71924156
sad
happy  122.210579428 sad  121.997571901
happy
happy  120.981752603 sad  120.325940128
happy
happy  126.013713257 sad  125.885047394
happy
happy  122.776016525 sad  122.12320875
happy
happy  115.064172476 sad  114.999513909
happy

I asked a related question earlier and got a correct answer; the link is below.

Different results each time with the GMM classifier
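As that answer pointed out, run-to-run variation usually comes from the random initialization of the GMM. A minimal sketch of pinning the seed (using the modern scikit-learn API, where `mixture.GMM` was replaced by `GaussianMixture`; the data here is a random stand-in for stacked MFCC frames):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = rng.randn(200, 13)  # stand-in for stacked MFCC frames

# Two fits with the same data and the same fixed random_state
# initialize and converge identically.
a = GaussianMixture(n_components=2, random_state=42).fit(X)
b = GaussianMixture(n_components=2, random_state=42).fit(X)
print(np.allclose(a.means_, b.means_))  # True
```

With `random_state=None` (the default), each fit starts from a different initialization and can converge to a different local optimum.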

Edit: added the main function, which collects the data and splits it into training and testing sets

def main():
    happyDir = dir+'happy/'
    sadDir = dir+'sad/'
    training["sad"]=[]
    training["happy"]=[]
    testing["happy"]=[]
    #TestSet
    for wavFile in os.listdir(happyDir)[::-1][:10]:
        #print wavFile
        fullPath = happyDir+wavFile
        testing["happy"].append(sf.getFeatures(fullPath))
    #TrainSet
    for wavFile in os.listdir(happyDir)[::-1][10:]:
        #print wavFile
        fullPath = happyDir+wavFile
        training["happy"].append(sf.getFeatures(fullPath))
    for wavFile in os.listdir(sadDir)[::-1][10:]:
        fullPath = sadDir+wavFile
        training["sad"].append(sf.getFeatures(fullPath))
    #Ensure the number of files in set
    print "Test(Happy): ", len(testing['happy'])
    print "Train(Happy): ", len(training['happy'])
    createGMMClassifiers()

Edit 2: changed the code according to the answer. Still getting similarly inconsistent results.

For a classification task it is important to tune the parameters given to the classifier; many classification algorithms are very sensitive to parameter selection, meaning that simply changing some model parameter can produce hugely different results. It is also important to try different algorithms rather than applying one algorithm to every classification task.

For this problem, you can try different classification algorithms to verify that your data is good, and try different parameters with different values for each classifier; then you can pinpoint where the problem is.

Another approach is to use Grid Search to explore and optimize the best parameters for a specific classifier; read: http://scikit-learn.org/stable/modules/grid_search.html
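A hedged sketch of what that could look like here: since `GaussianMixture` (the successor of `mixture.GMM` in current scikit-learn) exposes a log-likelihood `score` method, `GridSearchCV` can use it directly to select, for example, the number of components and the covariance type. The parameter grid and the synthetic data below are purely illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(300, 13)  # stand-in for one label's stacked MFCC frames

param_grid = {
    'n_components': [1, 2, 4],
    'covariance_type': ['diag', 'full'],
}
# GridSearchCV scores each candidate with GaussianMixture.score,
# i.e. the mean held-out log-likelihood, and keeps the best one.
search = GridSearchCV(GaussianMixture(random_state=0), param_grid, cv=3)
search.fit(X)
print(search.best_params_)
```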

Your code doesn't make much sense: you recreate the classifiers for every new training sample.

The correct training scheme should look like this:

label_samples = {}
classifiers = {}
# First we collect all samples per label into one array of frames
for label, sample in samples:
    if label in label_samples:
        label_samples[label] = np.concatenate((label_samples[label], sample))
    else:
        label_samples[label] = sample
# Then we train a classifier on every label's data
for label in label_samples:
    classifiers[label] = mixture.GMM(n_components = n_classes)
    classifiers[label].fit(label_samples[label])
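Once each label has its own fitted model, a test sample can be assigned to the label whose GMM gives the highest total log-likelihood over the sample's frames. A minimal self-contained sketch of that scheme, using the modern `GaussianMixture` API and synthetic data in place of real MFCC frames:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Synthetic stand-ins for the per-label MFCC stacks
label_samples = {
    'happy': rng.randn(200, 13) + 2.0,
    'sad':   rng.randn(200, 13) - 2.0,
}

# Train one GMM per label, on that label's frames only
classifiers = {
    label: GaussianMixture(n_components=2, random_state=0).fit(data)
    for label, data in label_samples.items()
}

def classify(test_frames):
    # Sum per-frame log-likelihoods under each label's model
    scores = {label: clf.score_samples(test_frames).sum()
              for label, clf in classifiers.items()}
    return max(scores, key=scores.get)

print(classify(rng.randn(50, 13) + 2.0))  # frames drawn near 'happy': 'happy'
```

Summing log-likelihoods per model (rather than per-component posterior probabilities, as in the question's `predict_proba` loop) is the usual way to compare one GMM per class.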
