How can I accurately count the number of sentences in a file?
I have some text in a file. It contains 7 sentences, but my code reports 9.
String path = "C:/CT_AQA - Copy/src/main/resources/file.txt";
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path)));
String line;
int countWord = 0;
int sentenceCount = 0;
int characterCount = 0;
int paragraphCount = 0;
int countNotLetter = 0;
int letterCount = 0;
int wordInParagraph = 0;
List<Integer> wordsPerParagraph = new ArrayList<>();
while ((line = br.readLine()) != null) {
    if (line.equals("")) {
        paragraphCount++;
        wordsPerParagraph.add(wordInParagraph);
        System.out.printf("In %d paragraph there are %d words\n", paragraphCount, wordInParagraph);
        wordInParagraph = 0;
    } else {
        characterCount += line.length();
        String[] wordList = line.split("[\\s—]");
        countWord += wordList.length;
        wordInParagraph += wordList.length;
        String[] letterList = line.split("[^a-zA-Z]");
        countNotLetter += letterList.length;
        String[] sentenceList = line.split("[.:]");
        sentenceCount += sentenceList.length;
    }
    letterCount = characterCount - countNotLetter;
}
if (wordInParagraph != 0) {
    wordsPerParagraph.add(wordInParagraph);
}
br.close();
System.out.println("The amount of words are " + countWord);
System.out.println("The amount of sentences are " + sentenceCount);
System.out.println("The amount of paragraphs are " + paragraphCount);
System.out.println("The amount of letters are " + letterCount);
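Your code appears to work, although it does not follow best practices everywhere.
I suspect the root cause of the wrong answer is an inaccurate regular expression for detecting sentence endings. The code counts sentences that end with a period or a colon. The problem is in this line: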
String[] sentenceList = line.split("[.:]");
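However, a colon does not end a sentence, and sentences can also end with other characters (exclamation mark, question mark, ellipsis). In my assessment, this pattern is more accurate: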
"[!?.]+(?=$|\\s)"
Also, please show the content of the file that gives you the wrong result. That would make it possible to confirm my assumption.
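To illustrate the difference between the two patterns, here is a small sketch that applies both to a single line. The sample sentence, the class name and the `countBySplit` helper are made up for illustration; only the two regular expressions come from the question and from this answer.

```java
import java.util.regex.Pattern;

final class SentencePatternDemo {
    // Pattern from the question: splits on every period or colon.
    static final Pattern ORIGINAL = Pattern.compile("[.:]");
    // Pattern suggested in this answer: one or more of ! ? . that are
    // followed by whitespace or by the end of the input.
    static final Pattern SUGGESTED = Pattern.compile("[!?.]+(?=$|\\s)");

    // Counts the pieces produced by splitting the line on the delimiter.
    // Trailing empty strings are dropped by split, as in String.split.
    static int countBySplit(Pattern delimiter, String line) {
        return delimiter.split(line).length;
    }

    public static void main(String[] args) {
        // Hypothetical sample line containing three sentences.
        String line = "Pi is about 3.14: remember that. Is it exact? No!";
        System.out.println(countBySplit(ORIGINAL, line));  // prints 4
        System.out.println(countBySplit(SUGGESTED, line)); // prints 3
    }
}
```

The original pattern splits inside `3.14` and at the colon, and it ignores `?` and `!`, so it reports 4. The suggested pattern only treats punctuation followed by whitespace or the end of the line as a sentence ending, so it reports 3.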
Complete code that counts only the sentences in the file:
int sentenceCount = 0;
while ((line = br.readLine()) != null) {
    if (!"".equals(line)) {
        String[] sentencesArray = line.split("[!?.]+(?=$|\\s)");
        sentenceCount += sentencesArray.length;
    }
}
br.close();
System.out.println("The amount of sentences are " + sentenceCount);
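The snippet above relies on `br` and `line` from the question. If you want something you can compile on its own, a minimal sketch might look like the following. The class name, the `countSentences` helper and the command-line argument are my own assumptions; I also skip whitespace-only lines, which the snippet above would count as one sentence each.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

final class SentenceCounter {
    // Same pattern as in this answer: ! ? . followed by whitespace or end of line.
    private static final Pattern SENTENCE_END = Pattern.compile("[!?.]+(?=$|\\s)");

    // Counts sentences line by line, ignoring empty and whitespace-only lines.
    static int countSentences(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // paragraph separator, nothing to count
            }
            count += SENTENCE_END.split(line).length;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SentenceCounter <file>");
            return;
        }
        // Assumes the file is UTF-8 encoded.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        System.out.println("The number of sentences is " + countSentences(lines));
    }
}
```

Note that this still counts per line, like the original code, so a sentence that wraps onto a second line is counted twice, and a line without any closing punctuation counts as one sentence.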
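You may be picking up trailing whitespace after a sentence, which adds an extra element to your array. You can remove the whitespace in line with replaceAll("\\s+", "") before calling split to break it into sentences.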
The modified code looks like this:
String[] sentenceList = line.replaceAll("\\s+", "").split("[.:]");
I did not change the way you define a sentence, but ! and ? can obviously serve as sentence delimiters as well.
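Here is a quick sketch of the effect, using made-up sample lines. The class and method names are only for illustration, and the last variant (adding ! and ? to your delimiters, with `+` so that an ellipsis counts once) is my suggestion rather than something from your code.

```java
final class TrailingWhitespaceDemo {
    // Original behaviour: a trailing space after the last period
    // survives as an extra array element.
    static int countRaw(String line) {
        return line.split("[.:]").length;
    }

    // With all whitespace removed first, the extra element disappears.
    static int countStripped(String line) {
        return line.replaceAll("\\s+", "").split("[.:]").length;
    }

    // Assumed extension: also treat ! and ? as delimiters, and collapse
    // runs such as "..." into a single delimiter.
    static int countStrippedExtended(String line) {
        return line.replaceAll("\\s+", "").split("[.:!?]+").length;
    }

    public static void main(String[] args) {
        String line = "One. Two. "; // note the trailing space
        System.out.println(countRaw(line));      // prints 3
        System.out.println(countStripped(line)); // prints 2
        System.out.println(countStrippedExtended("Really? Yes! Done. ")); // prints 3
    }
}
```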