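I am trying to instantiate a Naive Bayes classifier to classify chunks of text (using predefined categories). The example below just tries to do this with Male/Female. I have tried both loading the data from a file (CSVLOADER) and creating the instances as shown below. The problem is that the trainer.train() method throws a NullPointerException. This appears to be because the target dictionary is null. The data dictionary is populated. How do I force the instances to populate the target dictionary?
My actual goal is to classify paper abstracts that I have in a database as "Science", "Politics", "Law", "Health", and so on. A Bayes classifier seems to be the right choice for this.
I have iterated over the loaded InstanceList and it appears to be populated correctly, and the data dictionary is populated, but the TargetDictionary is null.
Using Mallet 2.0.8 on Windows.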
public TestMallet() throws IOException {
    ArrayList<Pipe> pipelist = new ArrayList<Pipe>();
    pipelist.add (new CharSequenceLowercase() ) ;
    // Backslashes must be doubled inside a Java string literal
    pipelist.add (new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) ) ;
    pipelist.add (new TokenSequenceRemoveStopwords (new File ("c:\\test\\config\\stopwords_en.txt"), "UTF-8", false, false, false) ) ;
    pipelist.add (new TokenSequence2FeatureSequence()) ;
    pipelist.add (new FeatureSequence2FeatureVector()) ; // Added, but it doesn't make any difference
    InstanceList instances = new InstanceList (new SerialPipes(pipelist)) ;
    Instance instance0 = new Instance("Hello World I am here and i am male my name is roger", "Male", "roger", "test") ;
    Instance instance1 = new Instance("Hello World I am here and i am male my name is phil", "Male", "phil", "test") ;
    Instance instance2 = new Instance("Hello World I am here and i am male my name is joe", "Male", "joe", "test") ;
    Instance instance3 = new Instance("Hello World I am here and i am female my name is vira", "Female", "vira", "test") ;
    Instance instance4 = new Instance("Hello World I am here and i am female my name is josie", "Female", "josie", "test") ;
    instances.addThruPipe (instance0) ;
    instances.addThruPipe (instance1) ;
    instances.addThruPipe (instance2) ;
    instances.addThruPipe (instance3) ;
    instances.addThruPipe (instance4) ;
    // Using Instance List to train
    // ----------------------------
    ClassifierTrainer trainer = new NaiveBayesTrainer();
    trainer.train(instances);
    // NullPointerException on the line above (in the debugger, TargetDictionary appears to be null)
}
I expected the trainer to run correctly.
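A classifier learns to predict an output from input features. In both cases we usually need to convert strings into a numeric representation. You are telling Mallet how to do this conversion for the input features, but not for the output labels.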
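To make the "numeric representation" point concrete, here is a minimal sketch in plain Java of what mapping label strings to integer indices looks like. `LabelIndex` is a hypothetical class invented for this illustration; it is not Mallet's `Alphabet` or `LabelAlphabet`, only a rough picture of the kind of bookkeeping a target dictionary does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not a Mallet class.
class LabelIndex {
    private final Map<String, Integer> indexByLabel = new HashMap<>();
    private final List<String> labelByIndex = new ArrayList<>();

    // Returns the index for a label, assigning the next free index the first time it is seen.
    int lookup(String label) {
        Integer existing = indexByLabel.get(label);
        if (existing != null) {
            return existing;
        }
        int index = labelByIndex.size();
        indexByLabel.put(label, index);
        labelByIndex.add(label);
        return index;
    }

    // Maps an index back to its label string.
    String labelAt(int index) {
        return labelByIndex.get(index);
    }

    int size() {
        return labelByIndex.size();
    }

    public static void main(String[] args) {
        LabelIndex targets = new LabelIndex();
        String[] rawTargets = {"Male", "Male", "Male", "Female", "Female"};
        for (String raw : rawTargets) {
            System.out.println(raw + " -> " + targets.lookup(raw));
        }
        System.out.println("distinct labels: " + targets.size());
    }
}
```

If no step in the pipeline performs this kind of lookup for the targets, the trainer has no label dictionary to work with, which matches the null TargetDictionary described in the question.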
Adding a Target2Label() pipe should handle the label conversion; see the Csv2Vectors class for an example.
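As a sketch of that suggestion, the pipe list from the question could be extended as follows. This assumes Mallet 2.0.8 is on the classpath. The class name `TestMalletWithLabels` is made up for this example, and the stopword pipe is left out only because it depends on a local file; it can be added back unchanged.

```java
import cc.mallet.classify.NaiveBayes;
import cc.mallet.classify.NaiveBayesTrainer;
import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.CharSequenceLowercase;
import cc.mallet.pipe.FeatureSequence2FeatureVector;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.Target2Label;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

import java.util.ArrayList;
import java.util.regex.Pattern;

// Hypothetical class name; the pipes and trainer are the ones used in the question.
class TestMalletWithLabels {
    public static void main(String[] args) {
        ArrayList<Pipe> pipelist = new ArrayList<Pipe>();
        // Converts each instance's target string ("Male"/"Female") into a Label,
        // which also populates the target alphabet.
        pipelist.add(new Target2Label());
        pipelist.add(new CharSequenceLowercase());
        pipelist.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
        pipelist.add(new TokenSequence2FeatureSequence());
        pipelist.add(new FeatureSequence2FeatureVector());

        InstanceList instances = new InstanceList(new SerialPipes(pipelist));
        instances.addThruPipe(new Instance("Hello World I am here and i am male my name is roger", "Male", "roger", "test"));
        instances.addThruPipe(new Instance("Hello World I am here and i am male my name is phil", "Male", "phil", "test"));
        instances.addThruPipe(new Instance("Hello World I am here and i am male my name is joe", "Male", "joe", "test"));
        instances.addThruPipe(new Instance("Hello World I am here and i am female my name is vira", "Female", "vira", "test"));
        instances.addThruPipe(new Instance("Hello World I am here and i am female my name is josie", "Female", "josie", "test"));

        // With Target2Label in the pipe, the target alphabet should no longer be null here.
        NaiveBayes classifier = new NaiveBayesTrainer().train(instances);
        System.out.println("number of labels: " + instances.getTargetAlphabet().size());
    }
}
```

Csv2Vectors places Target2Label at the start of its pipe list, which is the ordering followed here; since it only touches the target field, its position relative to the data pipes should not matter.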