Mallet API: getting consistent results

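I am new to LDA and Mallet, and I have the following question.

I ran Mallet LDA from the command line, and by setting --random-seed to a fixed value I was able to get consistent results across multiple runs of the algorithm.

However, when I tried the Mallet Java API, I got different output every time I ran the program. I searched around and found that the random seed needs to be fixed, which I have done in my Java code. I still get different results.

Can anyone tell me which other parameters I need to consider in order to get consistent results across multiple runs?

I might add that train-topics produces the same results across multiple command-line runs. However, when I re-run import-dir and then run train-topics, the results do not match the previous ones (probably as expected). I can run import-dir just once and then try different numbers of topics and iterations by running train-topics. Likewise, if I want to replicate the same thing with the Java API, what needs to change and what needs to stay the same?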

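I was able to solve this problem.
I will answer in detail here:
There are two ways to run Mallet:
a. From the command line
b. Using the Java API

To get consistent results across runs, we need to fix the random seed, and on the command line there is an option to set it (--random-seed).

With the API we also have the option to set the random seed, but we need to know that it has to be done at the right point, otherwise it has no effect. (See the code.)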
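Why the timing matters can be illustrated without Mallet at all. The following is a minimal sketch using only java.util.Random; the class SeedOrderDemo and its methods are hypothetical stand-ins, not Mallet code. It assumes, in line with the observation in this answer, that the random topic assignments are drawn at the moment the instances are added, so a seed that is set afterwards arrives too late.

```java
import java.util.Arrays;
import java.util.Random;

// Toy stand-in for a topic model. NOT Mallet code; every name here is hypothetical.
class SeedOrderDemo {
    // Unseeded by default, so each new object starts from a different state.
    private Random random = new Random();
    private int[] assignments = new int[0];

    // Replaces the generator with one that starts from a fixed seed.
    void setRandomSeed(long seed) {
        random = new Random(seed);
    }

    // Assigns every item a random topic, using the generator as it is right now.
    void addInstances(int numItems, int numTopics) {
        assignments = new int[numItems];
        for (int i = 0; i < numItems; i++) {
            assignments[i] = random.nextInt(numTopics);
        }
    }

    int[] getAssignments() {
        return assignments.clone();
    }

    // Seed first, then add the data: the initial assignments are reproducible.
    static int[] seedThenAdd(long seed) {
        SeedOrderDemo model = new SeedOrderDemo();
        model.setRandomSeed(seed);
        model.addInstances(1000, 10);
        return model.getAssignments();
    }

    // Add the data first, then seed: the assignments were already drawn unseeded.
    static int[] addThenSeed(long seed) {
        SeedOrderDemo model = new SeedOrderDemo();
        model.addInstances(1000, 10);
        model.setRandomSeed(seed);
        return model.getAssignments();
    }

    public static void main(String[] args) {
        System.out.println("seed, then add: identical = "
                + Arrays.equals(seedThenAdd(5), seedThenAdd(5)));
        System.out.println("add, then seed: identical = "
                + Arrays.equals(addThenSeed(5), addThenSeed(5)));
    }
}
```

In the first case both runs start from the same generator state and produce the same assignments. In the second case the assignments come from an unseeded generator, so two runs will in practice differ no matter which seed is set later.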

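I have pasted the code here. It creates a model (read: InstanceList) file from the data. We can then use that same model file, set the random seed, and make sure we get consistent (read: identical) results on every run.

Create and save the model for future use.

Note: follow this link to see the format of the input file. http://mallet.cs.umass.edu/ap.txt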

import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.types.InstanceList;

public void getModelReady(String inputFile) throws IOException {
        if(inputFile != null && (! inputFile.isEmpty())) {
            // Build the preprocessing pipeline: label, text, tokens, lowercase, stopwords, features.
            List<Pipe> pipeList = new ArrayList<Pipe>();
            pipeList.add(new Target2Label());
            pipeList.add(new Input2CharSequence("UTF-8"));
            pipeList.add(new CharSequence2TokenSequence());
            pipeList.add(new TokenSequenceLowercase());
            pipeList.add(new TokenSequenceRemoveStopwords());
            pipeList.add(new TokenSequence2FeatureSequence());
            Reader fileReader = new InputStreamReader(new FileInputStream(new File(inputFile)), "UTF-8");
            // Backslashes must be doubled inside Java string literals.
            CsvIterator ci = new CsvIterator (fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                    3, 2, 1); // data, label, name fields
            InstanceList instances = new InstanceList(new SerialPipes(pipeList));
            instances.addThruPipe(ci);
            // Serialize the InstanceList so that every later run starts from the same file.
            try (ObjectOutputStream oos = new ObjectOutputStream(
                    new FileOutputStream("Resources\\Input\\Model\\Model.vectors"))) {
                oos.writeObject(instances);
            }
        }
    }
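To see how the line pattern used with CsvIterator above splits an input line into its name, label and data fields (groups 1, 2 and 3), here is a small sketch that uses only java.util.regex. The class CsvLinePatternDemo and the sample line are made up for illustration; the real input is whatever file is passed to getModelReady.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of Mallet.
class CsvLinePatternDemo {
    // Same pattern as in getModelReady: name, then label, then the rest as data.
    static final Pattern LINE = Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

    // Returns {name, label, data}, or null if the line does not match.
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Made-up sample line: a document name, a label, then the text.
        String[] fields = split("doc1 news The quick brown fox");
        System.out.println("name=" + fields[0] + " label=" + fields[1] + " data=" + fields[2]);
        // prints: name=doc1 label=news data=The quick brown fox
    }
}
```

The arguments 3, 2, 1 passed to CsvIterator then map group 3 to the data, group 2 to the label and group 1 to the name, as the comment in the code says.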

Once the model file has been saved, use that saved file to generate the topics.

import java.io.File;
import java.io.IOException;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.FeatureSequence;
import cc.mallet.types.InstanceList;

// 'logger' is a logging field defined elsewhere in the class.
public void applyLDA(ParallelTopicModel model) throws IOException {
        // Load the InstanceList that getModelReady saved earlier.
        InstanceList training = InstanceList.load (new File("Resources\\Input\\Model\\Model.vectors"));
        logger.debug("InstanceList Data loaded.");
        if (training.size() > 0 &&
                training.get(0) != null) {
            Object data = training.get(0).getData();
            if (! (data instanceof FeatureSequence)) {
                logger.error("Topic modeling currently only supports feature sequences.");
                System.exit(1);
            }
        }
        // IT HAS TO BE SET HERE, BEFORE CALLING THE addInstances METHOD.
        model.setRandomSeed(5);
        model.addInstances(training);
        model.estimate();
        // Backslashes are doubled so that "\t" in the path is not read as a tab.
        model.printTopWords(new File("Resources\\Output\\OutputFile\\topic_keys_java.txt"), 25,
                false);
        model.printDocumentTopics(new File ("Resources\\Output\\OutputFile\\document_topicssplit_java.txt"));
    }
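One simple way to check that two runs really gave the same result is to keep the output file of each run and compare them byte for byte. The sketch below uses only the standard library; the class name OutputComparer is hypothetical, and it assumes you copy or rename the output of the first run (for example the topic keys file) before running again.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; it is not part of Mallet.
class OutputComparer {
    // True if both files contain exactly the same bytes.
    static boolean sameContent(Path first, Path second) throws IOException {
        return Arrays.equals(Files.readAllBytes(first), Files.readAllBytes(second));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java OutputComparer <file from run 1> <file from run 2>");
            return;
        }
        boolean same = sameContent(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(same ? "identical" : "different");
    }
}
```

Reading both files fully into memory is fine for the small text files written by printTopWords; for very large outputs a streaming comparison would be a better fit.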
