HMM loaded from pickle looks untrained



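I am trying to serialize an nltk.tag.hmm.HiddenMarkovModelTagger to a pickle so that I can use it when needed without retraining. However, after loading it from the .pkl file, my HMM looks untrained. My two questions are: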

  1. What am I doing wrong?
  2. Is serializing an HMM a good idea when the dataset is available anyway?

Here is the code:

In [1]: import nltk
In [2]: from nltk.probability import *
In [3]: from nltk.util import unique_list
In [4]: import json
In [5]: with open('data.json') as data_file:
   ...:         corpus = json.load(data_file)
   ...:     
In [6]: corpus = [[tuple(l) for l in sentence] for sentence in corpus]
In [7]: tag_set = unique_list(tag for sent in corpus for (word,tag) in sent)
In [8]: symbols = unique_list(word for sent in corpus for (word,tag) in sent)
In [9]: trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)
In [10]: train_corpus = corpus[:4]
In [11]: test_corpus = [corpus[4]]
In [12]: hmm = trainer.train_supervised(train_corpus, estimator=LaplaceProbDist)
In [13]: print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))
100.00%

As you can see, the HMM is trained. Now I pickle it:

In [14]: import pickle
In [16]: output = open('hmm.pkl', 'wb')
In [17]: pickle.dump(hmm, output)
In [18]: output.close()

After resetting and loading, the model looks dumber than a box of rocks:

In [19]: %reset
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
In [20]: import pickle
In [21]: import json
In [22]: with open('data.json') as data_file:
   ....:     corpus = json.load(data_file)
   ....:     
In [23]: test_corpus = [corpus[4]]
In [24]: pkl_file = open('hmm.pkl', 'rb')
In [25]: hmm = pickle.load(pkl_file)
In [26]: pkl_file.close()
In [27]: type(hmm)
Out[27]: nltk.tag.hmm.HiddenMarkovModelTagger
In [28]: print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))
0.00%

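1) The pickle itself is fine. json.load returns each (word, tag) pair as a list, so after In [22] you need to repeat the conversion from In [6]: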

corpus = [[tuple(l) for l in sentence] for sentence in corpus]
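
To see why this one line matters, here is a minimal sketch using only the standard library. The two-word corpus is a made-up stand-in for data.json, and the comparison mimics, as far as I understand it, how evaluate matches the tagger's (word, tag) tuples against the gold tokens:

```python
import json

# Hypothetical miniature corpus: one sentence of (word, tag) tuples.
original = [[("the", "DT"), ("dog", "NN")]]

# JSON has no tuple type, so a round trip turns every pair into a list.
reloaded = json.loads(json.dumps(original))

# A tagger produces (word, tag) tuples, and a list never equals a tuple,
# so every token would count as a mismatch.
print(reloaded[0][0] == ("the", "DT"))  # False

# Converting the pairs back to tuples restores the original structure.
fixed = [[tuple(pair) for pair in sentence] for sentence in reloaded]
print(fixed == original)  # True
```

With list pairs in the gold data, every token is scored as wrong, which would explain the 0.00% even though the loaded model is intact.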

2) Retraining the model every time you want to test it is time-consuming, so it is better to pickle.dump your model once and load it when needed.
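
A small sketch of that save/load round trip, assuming you wrap it in two helpers. The names save_model and load_model are hypothetical, and a plain dict stands in for the trained tagger so the snippet runs without NLTK:

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Pickle `model` to `path`; the with-block closes the file for us."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously pickled model from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the trained HMM (an assumption, to keep this self-contained).
stand_in = {"states": ["DT", "NN"], "trained": True}

path = os.path.join(tempfile.mkdtemp(), "hmm.pkl")
save_model(stand_in, path)
restored = load_model(path)
print(restored == stand_in)  # True
```

In the real session you would pass hmm instead of stand_in. Only load pickle files you created yourself, since unpickling can execute arbitrary code.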
