I am trying to extract topics from 7 million tweets. I treat each tweet as a document, so I stored all the tweets in one file where each line (i.e., each tweet) is treated as one document. I use this file as the input file for the Mallet API.
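For reference, the CsvIterator pattern used in main() below expects each line to contain three whitespace- or comma-separated groups in the order name, label, text, so a hypothetical input line (not taken from the actual data) could look like:
tweet_000001 X obama spoke about the vote in canada today
If a line contains only the raw tweet text, the first two tokens of each tweet will be consumed as the name and label fields.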
public static void LDAModel(int numofK,int numbofIteration,int numberofThread,String outputDir,InstanceList instances) throws Exception
{
// Create a model with numTopics topics, alpha_sum = 1.0, beta_w = 0.01
// Note that the first parameter is passed as the sum over topics, while
// the second is the parameter for a single dimension of the Dirichlet prior.
int numTopics = numofK;
ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);
model.addInstances(instances);
// Use two parallel samplers, which each look at one half the corpus and combine
// statistics after every iteration.
model.setNumThreads(numberofThread);
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(numbofIteration);
model.estimate();
// Show the words and topics in the first instance
// The data alphabet maps word IDs to strings
Alphabet dataAlphabet = instances.getDataAlphabet();
FeatureSequence tokens = (FeatureSequence) model.getData().get(0).instance.getData();
LabelSequence topics = model.getData().get(0).topicSequence;
Formatter out = new Formatter(new StringBuilder(), Locale.US);
for (int position = 0; position < tokens.getLength(); position++) {
// out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)), topics.getIndexAtPosition(position));
out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)), topics.getIndexAtPosition(position));
}
System.out.println(out);
// Estimate the topic distribution of the first instance,
// given the current Gibbs state.
double[] topicDistribution = model.getTopicProbabilities(0);
// Get an array of sorted sets of word ID/count pairs
ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();
// Show top 10 words in topics with proportions for the first document
String topicsoutput="";
for (int topic = 0; topic < numTopics; topic++) {
Iterator<IDSorter> iterator = topicSortedWords.get(topic).iterator();
out = new Formatter(new StringBuilder(), Locale.US);
out.format("%dt%.3ft", topic, topicDistribution[topic]);
int rank = 0;
while (iterator.hasNext() && rank < 10) {
IDSorter idCountPair = iterator.next();
out.format("%s (%.0f) ", dataAlphabet.lookupObject(idCountPair.getID()), idCountPair.getWeight());
//out.format("%s ", dataAlphabet.lookupObject(idCountPair.getID()));
rank++;
}
System.out.println(out);
topicsoutput += out.toString() + "\n"; // accumulate each topic line so it is actually written to the output file below
}
// Create a new instance with high probability of topic 0
StringBuilder topicZeroText = new StringBuilder();
Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();
int rank = 0;
while (iterator.hasNext() && rank < 10) {
IDSorter idCountPair = iterator.next();
topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID()) + " ");
rank++;
}
// Create a new instance named "test instance" with empty target and source fields.
InstanceList testing = new InstanceList(instances.getPipe());
testing.addThruPipe(new Instance(topicZeroText.toString(), null, "test instance", null));
TopicInferencer inferencer = model.getInferencer();
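// Infer a topic distribution for the new instance: 10 sampling iterations, thinning interval 1, burn-in 5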
double[] testProbabilities = inferencer.getSampledDistribution(testing.get(0), 10, 1, 5);
System.out.println("0t" + testProbabilities[0]);
File pathDir = new File(outputDir + File.separator+ "NumofTopics"+numTopics); //FIXME replace all strings with constants
pathDir.mkdir();
String DirPath = pathDir.getPath();
String stateFile = DirPath+File.separator+"output_state.gz";
String outputDocTopicsFile = DirPath+File.separator+"output_doc_topics.txt";
String topicKeysFile = DirPath+File.separator+"output_topic_keys";
PrintWriter writer=null;
String topicKeysFile_fromProgram = DirPath+File.separator+"output_topic";
try {
writer = new PrintWriter(topicKeysFile_fromProgram, "UTF-8");
writer.print(topicsoutput);
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
model.printTopWords(new File(topicKeysFile), 11, false);
model.printDocumentTopics(new File (outputDocTopicsFile));
model.printState(new File (stateFile));
}
public static void main(String[] args) throws Exception{
// Begin by importing documents from text to feature sequences
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
// Pipes: lowercase, tokenize, remove stopwords, map to features
pipeList.add( new CharSequenceLowercase() );
pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) );
pipeList.add( new TokenSequenceRemoveStopwords(new File("H:\\Data\\stoplists\\en.txt"), "UTF-8", false, false, false) );
pipeList.add( new TokenSequence2FeatureSequence() );
InstanceList instances = new InstanceList (new SerialPipes(pipeList));
Reader fileReader = new InputStreamReader(new FileInputStream(new File("E:\\Thesis Data\\DataForLDA\\freshnewData\\cleanTweets.txt")), "UTF-8");
instances.addThruPipe(new CsvIterator (fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
3, 2, 1)); // data, label, name fields
int numberofTopic=5;
int numberofIteration=50;
int numberofThread=6;
String outputDir="J:\\Topics\\";
LDAModel(numberofTopic,numberofIteration,numberofThread,outputDir,instances);
TimeUnit.SECONDS.sleep(30);
numberofTopic=10;
}
I got three output files from the above program: 1. the state file, 2. the topic proportions file, 3. the key topic list.
I want to know the number of documents assigned to each topic. For example, I got the following output from the key topic list file:
- 0.004 obama (5471) canada (5283) woman (5152) vote (4879) police (3965)
The second column is the topic weight, and the third column shows the words (with word counts) under that topic.
So here I get the word counts under the topic, but I also want to show how many documents this topic was assigned to. It would be helpful to show that as a separate output file, something like this:
Topic 1: document1 (80%) document2 (70%) .......
Can anyone provide some ideas or any source code for this? Thank you.
The information you are looking for is contained in the file "2. topic proportions" that you mention. Note that every document contains every topic with some proportion (though for one topic the proportion may be large, while for the others it may be very small). You have to decide what you want to extract from that file: the dominant topic (in the 3rd column); the dominant topic, but only if its proportion is at least 50% (sometimes two topics have almost the same proportion)......
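A minimal sketch of that grouping step, assuming the doc-topics layout where each line is "doc-index doc-name topic proportion topic proportion ..." with the pairs sorted by proportion (so the 3rd column is the dominant topic and the 4th column is its proportion); if your MALLET version prints one proportion per topic in topic order instead, take the argmax over those columns:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DocTopicCounter {
    public static void main(String[] args) throws Exception {
        // Path to the file produced by model.printDocumentTopics(...); adjust as needed
        String docTopicsFile = "output_doc_topics.txt";
        // topic id -> list of "docName(percent)" strings
        Map<Integer, List<String>> docsPerTopic = new TreeMap<Integer, List<String>>();
        BufferedReader reader = new BufferedReader(new FileReader(docTopicsFile));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("#")) continue;               // skip the header line
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4) continue;
            String docName = fields[1];                        // column 2: document name/source
            int dominantTopic = Integer.parseInt(fields[2]);   // column 3: dominant topic id
            double proportion = Double.parseDouble(fields[3]); // column 4: its proportion
            if (!docsPerTopic.containsKey(dominantTopic)) {
                docsPerTopic.put(dominantTopic, new ArrayList<String>());
            }
            docsPerTopic.get(dominantTopic).add(
                String.format("%s(%.0f%%)", docName, proportion * 100));
        }
        reader.close();
        // Print one line per topic with its document count and the per-document percentages
        for (Map.Entry<Integer, List<String>> e : docsPerTopic.entrySet()) {
            System.out.println("Topic " + e.getKey() + " (" + e.getValue().size()
                + " documents): " + e.getValue());
        }
    }
}

Running this over the output_doc_topics.txt written above would give output close to the "Topic 1: document1 (80%) document2 (70%) ..." format asked for; adding a threshold such as "proportion >= 0.5" before the add(...) call is one way to keep only documents whose dominant topic is clearly dominant.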