Finding the most popular word in someone's tweets



In one of my projects, I am trying to query the tweets of a particular user handle, find the most common word across that user's tweets, and return the frequency of that most common word.

Here is my code:

public String mostPopularWord()
{
    this.removeCommonEnglishWords();
    this.sortAndRemoveEmpties();
    Map<String, Integer> termsCount = new HashMap<>();
    for(String term : terms)
    {
        Integer c = termsCount.get(term);
        if(c == null)
            c = new Integer(0);
        c++;
        termsCount.put(term, c);
    }
    Map.Entry<String, Integer> mostRepeated = null;
    for(Map.Entry<String, Integer> curr : termsCount.entrySet())
    {
        if(mostRepeated == null || mostRepeated.getValue() < curr.getValue())
            mostRepeated = curr;
    }
    //frequencyMax = termsCount.get(mostRepeated.getKey());
    try
    {
        frequencyMax = termsCount.get(mostRepeated.getKey());
        return mostRepeated.getKey();
    }
    catch (NullPointerException e)
    {
        System.out.println("Cannot find most popular word from the tweets.");
    }
    return "";
}

I also thought it would help to show the code for the two methods called at the top of the method above. They live in the same class, which declares the following:

private Twitter twitter;
private PrintStream consolePrint;
private List<Status> statuses;
private List<String> terms;
private String popularWord;
private int frequencyMax;

@SuppressWarnings("unchecked")
public void sortAndRemoveEmpties()
{
    Collections.sort(terms);
    terms.removeAll(Arrays.asList("", null));
}
private void removeCommonEnglishWords()
{
    Scanner sc = null;
    try
    {
        sc = new Scanner(new File("commonWords.txt"));
    }
    catch(Exception e)
    {
        System.out.println("The file is not found");
    }
    List<String> commonWords = new ArrayList<String>();
    int count = 0;
    while(sc.hasNextLine())
    {
        count++;
        commonWords.add(sc.nextLine());
    }
    Iterator<String> termIt = terms.iterator();
    while(termIt.hasNext())
    {
        String term = termIt.next();
        for(String word : commonWords)
            if(term.equalsIgnoreCase(word))
                termIt.remove();
    }
}

Apologies for the rather long code snippets. The frustrating part is that even though my removeCommonEnglishWords() method appears to be correct (it was discussed in another post), when I run mostPopularWord() it returns "the", which is clearly part of my common English words list and should have been removed from terms. What might I be doing wrong?

Update 1: Here is the link to the commonWords file: https://drive.google.com/file/d/1VKNI-b883uQhfKLVg-L8QHgPTLNb22uS/view?usp=sharing

Update 2: One thing I noticed while debugging is that the while(sc.hasNextLine()) loop in removeCommonEnglishWords() is skipped entirely. I don't understand why, though.
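One way I can think of to check this (just a diagnostic sketch; loadCommonWords is a hypothetical helper, not part of my class) is to fail loudly when commonWords.txt cannot be opened, instead of printing a message and carrying on with a Scanner that has nothing to read:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.util.List;

// Diagnostic sketch only: load commonWords.txt and fail loudly if it is missing,
// so an absent or unreadable file no longer looks like the read loop being "skipped".
private List<String> loadCommonWords() throws IOException
{
    File file = new File("commonWords.txt");
    if(!file.exists())
    {
        // Printing the absolute path shows where the JVM is actually looking;
        // a wrong working directory is a common cause of this symptom.
        throw new FileNotFoundException("commonWords.txt not found at " + file.getAbsolutePath());
    }
    return Files.readAllLines(file.toPath());
}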

It might be simpler if you use streams, like this:

String mostPopularWord() {
    return terms.stream()
            .collect(Collectors.groupingBy(s -> s, Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .findFirst()
            .map(Map.Entry::getKey)
            .orElse("");
}
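For illustration, here is a small self-contained harness showing what the pipeline returns; the class name and the terms list are made up for the demo, not taken from the question:

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MostPopularWordDemo {
    public static void main(String[] args) {
        // Hypothetical sample data; in the question, terms comes from the tweets.
        List<String> terms = Arrays.asList("java", "twitter", "java", "stream", "java");

        String mostPopular = terms.stream()
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()))   // word -> count
                .entrySet().stream()
                .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))   // highest count first
                .findFirst()
                .map(Map.Entry::getKey)
                .orElse("");

        System.out.println(mostPopular); // prints "java"
    }
}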

I tried your code. Here is what you have to do. Replace the following part of removeCommonEnglishWords()

Iterator<String> termIt = terms.iterator();
while(termIt.hasNext())
{
    String term = termIt.next();
    for(String word : commonWords)
        if(term.equalsIgnoreCase(word))
            termIt.remove();
}

with this:

List<String> reducedTerms = new ArrayList<>();
for( String term : this.terms ) {
    if( !commonWords.contains( term ) ) reducedTerms.add( term );
}
this.terms = reducedTerms;
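One caveat worth adding (my note, not part of the original suggestion): List.contains uses equals, so this replacement is case-sensitive, whereas the original loop used equalsIgnoreCase. Here is a sketch that keeps case-insensitive matching and is also faster because lookups go through a HashSet; it assumes the same commonWords list and terms field as above and needs java.util.HashSet and java.util.Set imports:

// Sketch: lower-case the common words once, then compare terms in lower case.
Set<String> commonLower = new HashSet<>();
for( String word : commonWords ) {
    commonLower.add( word.toLowerCase() );
}

List<String> reducedTerms = new ArrayList<>();
for( String term : this.terms ) {
    // Case-insensitive membership test against the lower-cased common words.
    if( !commonLower.contains( term.toLowerCase() ) ) reducedTerms.add( term );
}
this.terms = reducedTerms;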

Since you did not provide the whole class, I created one with a few assumptions, but I believe this code will work.

Here is a slightly different approach using streams.

  1. It uses a fairly common frequency-counting idiom: streaming the words and storing the counts in a map.
  2. It then does a simple scan to find the largest count obtained and returns that word, or the string "No words found".
  3. It also filters out words contained in a Set<String> named ignore, so you will need to create that as well.

import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import java.util.stream.Collectors;

Set<String> ignore = Set.of("the", "of", "and", "a",
        "to", "in", "is", "that", "it", "he", "was",
        "you", "for", "on", "are", "as", "with",
        "his", "they", "at", "be", "this", "have",
        "via", "from", "or", "one", "had", "by",
        "but", "not", "what", "all", "were", "we",
        "RT", "I", "&", "when", "your", "can",
        "said", "there", "use", "an", "each",
        "which", "she", "do", "how", "their", "if",
        "will", "up", "about", "out", "many",
        "then", "them", "these", "so", "some",
        "her", "would", "make", "him", "into",
        "has", "two", "go", "see", "no", "way",
        "could", "my", "than", "been", "who", "its",
        "did", "get", "may", "…", "@", "??", "I'm",
        "me", "u", "just", "our", "like");

Map.Entry<String, Long> entry = terms.stream()
        .filter(wd -> !ignore.contains(wd))
        .map(String::trim)
        .collect(Collectors.groupingBy(a -> a,
                Collectors.counting()))
        .entrySet().stream()
        .collect(Collectors.maxBy(Comparator
                .comparing(Entry::getValue)))
        .orElse(Map.entry("No words found", 0L));

System.out.println(entry.getKey() + " " + entry.getValue());
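To see the snippet end to end, here is a minimal runnable harness; the class name and the sample terms list are made up for demonstration, and the ignore set is trimmed down:

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import java.util.stream.Collectors;

public class PopularWordExample {
    public static void main(String[] args) {
        Set<String> ignore = Set.of("the", "a", "to", "RT");                        // trimmed-down ignore set
        List<String> terms = List.of("the", "game", "the", "game", "game", "a");    // fake tweet terms

        Map.Entry<String, Long> entry = terms.stream()
                .filter(wd -> !ignore.contains(wd))                                  // drop ignored words
                .map(String::trim)
                .collect(Collectors.groupingBy(a -> a, Collectors.counting()))       // word -> count
                .entrySet().stream()
                .collect(Collectors.maxBy(Comparator.comparing(Entry::getValue)))    // entry with largest count
                .orElse(Map.entry("No words found", 0L));

        System.out.println(entry.getKey() + " " + entry.getValue());                 // prints "game 3"
    }
}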
