I am training two GMM classifiers on MFCC values, one classifier per label. I concatenate all MFCC values of a class and fit them into one classifier. For each classifier, I sum the probabilities of its label.
def createGMMClassifiers():
    label_samples = {}
    for label, sample in training.iteritems():
        labelstack = np.empty((50,13))
        for feature in sample:
            #debugger.set_trace()
            labelstack = np.concatenate((labelstack,feature))
        label_samples[label] = labelstack
    for label in label_samples:
        #debugger.set_trace()
        classifiers[label] = mixture.GMM(n_components = n_classes)
        classifiers[label].fit(label_samples[label])
    for sample in testing['happy']:
        classify(sample)

def classify(testMFCC):
    probability = {'happy':0,'sad':0}
    for name, classifier in classifiers.iteritems():
        prediction = classifier.predict_proba(testMFCC)
        for probforlabel in prediction:
            probability[name] += probforlabel[0]
    print 'happy ',probability['happy'],'sad ',probability['sad']
    if(probability['happy']>probability['sad']):
        print 'happy'
    else:
        print 'sad'
But my results do not seem to be consistent, and I find it hard to believe this is just due to the RandomSeed=None state, because the predictions are usually all the same label for all test data, yet each run often gives the exact opposite result (see Output 1 and Output 2).
So my question is: am I doing something obviously wrong while training the classifiers?
Output 1:
happy 123.559202732 sad 122.409167294
happy
happy 120.000879032 sad 119.883786657
happy
happy 124.000069307 sad 123.999928962
happy
happy 118.874574047 sad 118.920941127
sad
happy 117.441353421 sad 122.71924156
sad
happy 122.210579428 sad 121.997571901
happy
happy 120.981752603 sad 120.325940128
happy
happy 126.013713257 sad 125.885047394
happy
happy 122.776016525 sad 122.12320875
happy
happy 115.064172476 sad 114.999513909
happy
Output 2:
happy 123.559202732 sad 122.409167294
happy
happy 120.000879032 sad 119.883786657
happy
happy 124.000069307 sad 123.999928962
happy
happy 118.874574047 sad 118.920941127
sad
happy 117.441353421 sad 122.71924156
sad
happy 122.210579428 sad 121.997571901
happy
happy 120.981752603 sad 120.325940128
happy
happy 126.013713257 sad 125.885047394
happy
happy 122.776016525 sad 122.12320875
happy
happy 115.064172476 sad 114.999513909
happy
I asked a related question earlier and got a correct answer; the link is below.
Different results every time with the GMM classifier
Edit: Added the main function, which collects the data and splits it into training and testing sets.
def main():
    happyDir = dir+'happy/'
    sadDir = dir+'sad/'
    training["sad"] = []
    training["happy"] = []
    testing["happy"] = []
    #TestSet
    for wavFile in os.listdir(happyDir)[::-1][:10]:
        #print wavFile
        fullPath = happyDir+wavFile
        testing["happy"].append(sf.getFeatures(fullPath))
    #TrainSet
    for wavFile in os.listdir(happyDir)[::-1][10:]:
        #print wavFile
        fullPath = happyDir+wavFile
        training["happy"].append(sf.getFeatures(fullPath))
    for wavFile in os.listdir(sadDir)[::-1][10:]:
        fullPath = sadDir+wavFile
        training["sad"].append(sf.getFeatures(fullPath))
    #Ensure the number of files in set
    print "Test(Happy): ", len(testing['happy'])
    print "Train(Happy): ", len(training['happy'])
    createGMMClassifiers()
Edit 2: Changed the code according to the answer. Still getting similarly inconsistent results.
For classification tasks it is important to tune the parameters you pass to the classifier: many classification algorithms are sensitive to their parameters, so simply changing one of them can produce hugely different results. It also matters which algorithm you choose; rather than using a single algorithm for every classification task, you should try several.
For this problem, you can try different classification algorithms to verify that your data is good, and try different parameter values for each classifier; that will let you pin down where the problem is.
Another approach is to use Grid Search to explore and optimize the best parameters for a specific classifier; read: http://scikit-learn.org/stable/modules/grid_search.html
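A minimal sketch of that grid search idea, assuming scikit-learn's GridSearchCV (from sklearn.model_selection in current versions) with an SVC as an example alternative classifier; the synthetic 13-dimensional data and the parameter grid are illustrative stand-ins, not the poster's setup:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Stand-in feature matrices for two well-separated classes (50 frames x 13 MFCCs each)
X = np.vstack([rng.normal(0, 1, (50, 13)),
               rng.normal(2, 1, (50, 13))])
y = np.array([0] * 50 + [1] * 50)

# Cross-validated search over a small illustrative parameter grid
grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

If the best cross-validation score is high on your real features, the data is separable and the problem lies in the GMM setup rather than the data.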
Your code doesn't make much sense: you recreate the classifier for every new training sample.
The correct training scheme should look like this:
label_samples = {}
classifiers = {}
# First we collect all samples per label
for label, sample in samples:
    label_samples.setdefault(label, []).append(sample)
# Then we train a classifier on every label's data
for label in label_samples:
    classifiers[label] = mixture.GMM(n_components = n_classes)
    classifiers[label].fit(np.concatenate(label_samples[label]))