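So basically this is a parser / cosine-similarity matrix calculator, but I keep getting an error when I run it. I think I have the correct input path for reading the text file, but it still fails.
Here is my main class: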
import java.io.FileNotFoundException;
import java.io.IOException;
public class TfIdfMain {
public static void main(String args[]) throws FileNotFoundException, IOException {
DocumentParser dp = new DocumentParser();
dp.parseFiles("C:/Users/dachen/Documents/doc1.txt"); // give the location of source file
dp.tfIdfCalculator(); //calculates tfidf
dp.getCosineSimilarity(); //calculates cosine similarity
}
}
My parser class:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
public class DocumentParser {
//This variable will hold all terms of each document in an array.
private List<String[]> termsDocsArray = new ArrayList<String[]>();
private List<String> allTerms = new ArrayList<String>(); //to hold all terms
private List<double[]> tfidfDocsVector = new ArrayList<double[]>();
/**
* Method to read files and store in array.
*/
public void parseFiles(String filePath) throws FileNotFoundException, IOException {
File[] allfiles = new File(filePath).listFiles();
BufferedReader in = null;
for (File f : allfiles) {
if (f.getName().endsWith(".txt")) {
in = new BufferedReader(new FileReader(f));
StringBuilder sb = new StringBuilder();
String s = null;
while ((s = in.readLine()) != null) {
sb.append(s).append(' '); //keep a separator so words on adjacent lines are not merged
}
String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+"); //to get individual terms
for (String term : tokenizedTerms) {
if (!allTerms.contains(term)) { //avoid duplicate entry
allTerms.add(term);
}
}
termsDocsArray.add(tokenizedTerms);
}
}
}
/**
* Method to create termVector according to its tfidf score.
*/
public void tfIdfCalculator() {
double tf; //term frequency
double idf; //inverse document frequency
double tfidf; //term frequency inverse document frequency
for (String[] docTermsArray : termsDocsArray) {
double[] tfidfvectors = new double[allTerms.size()];
int count = 0;
for (String terms : allTerms) {
tf = new TfIdf().tfCalculator(docTermsArray, terms);
idf = new TfIdf().idfCalculator(termsDocsArray, terms);
tfidf = tf * idf;
tfidfvectors[count] = tfidf;
count++;
}
tfidfDocsVector.add(tfidfvectors); //storing document vectors;
}
}
/**
* Method to calculate cosine similarity between all the documents.
*/
public void getCosineSimilarity() {
for (int i = 0; i < tfidfDocsVector.size(); i++) {
for (int j = 0; j < tfidfDocsVector.size(); j++) {
System.out.println("between " + i + " and " + j + " = "
+ new CosineSimilarity().cosineSimilarity
(
tfidfDocsVector.get(i),
tfidfDocsVector.get(j)
)
);
}
}
}
}
Here is my error:
Exception in thread "main" java.lang.NullPointerException
at DocumentParser.parseFiles(DocumentParser.java:22)
at TfIdfMain.main(TfIdfMain.java:7)
Is the path to the text file in my Documents folder wrong?
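Windows file paths conventionally use \ rather than /, although java.io.File also accepts / on Windows, so the separator is not what causes the exception. The real error is that the code expects a directory path, not the full path to a single file. So instead of
dp.parseFiles("C:/Users/dachen/Documents/doc1.txt");
it should be
dp.parseFiles("C:\\Users\\dachen\\Documents");
Note that each backslash has to be doubled inside a Java string literal. The documentation for listFiles() states:
If this abstract pathname does not denote a directory, then this method returns null.
The path being passed is not a directory, so listFiles() returns null and the for loop over allfiles throws the NullPointerException.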
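To make this kind of mistake easier to diagnose, the null result from listFiles() can be checked before looping. The following is only a minimal sketch, not the original poster's code: the class name TxtFileLister and the method listTxtFiles are hypothetical names introduced for illustration, and parseFiles could perform the same checks inline.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the original code) that lists the
// .txt files in a directory and fails with a clear message otherwise.
public class TxtFileLister {

    public static List<File> listTxtFiles(String dirPath) throws IOException {
        File dir = new File(dirPath);
        // A path to a single file (or a missing path) is rejected up front.
        if (!dir.isDirectory()) {
            throw new IllegalArgumentException("Not a directory: " + dir.getAbsolutePath());
        }
        File[] entries = dir.listFiles();
        // listFiles() can still return null if an I/O error occurs.
        if (entries == null) {
            throw new IOException("Could not list directory: " + dir.getAbsolutePath());
        }
        List<File> txtFiles = new ArrayList<>();
        for (File f : entries) {
            if (f.isFile() && f.getName().endsWith(".txt")) {
                txtFiles.add(f);
            }
        }
        return txtFiles;
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: the directory is passed as the first argument,
        // defaulting to the current working directory.
        String dirPath = args.length > 0 ? args[0] : ".";
        for (File f : listTxtFiles(dirPath)) {
            System.out.println(f.getName());
        }
    }
}
```

With a check like this, passing the path of doc1.txt itself would produce an IllegalArgumentException naming the offending path rather than a bare NullPointerException.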