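I'm trying to run a reconciliation job: I read input strings from a (huge) file and need to verify, for each one, whether a particular directory contains a file whose name starts with that string. The directory (or directories) is huge and can hold up to 800,000 files. With that in mind, I build the File[] directoryListing only once and then iterate over the lines of the input file against it. Here is the code: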
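import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class CheckForFile {
    public static void main(String[] args) {
        // Backslashes must be escaped inside a Java string literal.
        String dirPath = "W:\\ThePath\\ToThe\\Directory";
        File dir = new File(dirPath);
        // isDirectory() is already false when the path does not exist.
        if (dir.isDirectory()) {
            // List the directory once and reuse the array for every CSV line.
            File[] directoryListing = dir.listFiles();
            String line;
            try (BufferedReader br = new BufferedReader(new FileReader("input.csv"))) {
                while ((line = br.readLine()) != null) {
                    String[] strArray = line.split(",", -1);
                    String result = fileExistsInDir(strArray[0], directoryListing);
                    if (result != null) {
                        System.out.println(result);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        } else {
            System.out.println(dirPath + " - is not a Directory.. ");
        }
    }

    // Returns null when some file name starts with the given string,
    // otherwise a message saying that no such file exists.
    public static String fileExistsInDir(String fileNameStartsWithStr, File[] directoryListing) {
        if (directoryListing == null) {
            System.out.println("directoryListing empty...");
            return null;
        }
        for (File child : directoryListing) {
            if (child.isFile() && child.getName().startsWith(fileNameStartsWithStr)) {
                return null; // found a match, nothing to report
            }
        }
        // Only after checking every file can the entry be reported as missing.
        return "file DO NOT exist for - " + fileNameStartsWithStr;
    }
}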
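I'm checking which files are missing for the entries in the input.csv file. The above code works fine, but because this is a remote Windows share path, it takes a while to get the file listing. Is there a better way to do all this? The requirement is to see file DO NOT exist for - foobar in the console. Any suggestions would be appreciated.
Update: Only a few of the files listed in input.csv are missing from the directory.
Problem statement: find the files that are missing from the directory, based on the list.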
Update 2: Based on DuncG's solution, I tried this:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public static void main(String[] args) throws IOException {
    Instant start = Instant.now();
    try (Stream<String> lines = Files.lines(Paths.get("input.csv"))) {
        // Collect the first column of every CSV line into a set.
        Set<String> scanfor = lines
                .map(line -> line.split(",", -1))
                .filter(line -> line.length > 0)
                .map(line -> line[0])
                .filter(s -> s.length() > 0)
                .collect(Collectors.toSet());
        System.out.println("scanfor size: " + scanfor.size());
        // Backslashes must be escaped inside a Java string literal.
        try (Stream<Path> scan = Files.find(Paths.get("W:\\ThePath\\ToThe\\Directory"),
                1, (p, a) -> !a.isDirectory() && !matches(p.getFileName().toString(), scanfor))) {
            long count = scan.peek(System.out::println).count();
            System.out.println("Number of files not matching CSV criteria: " + count);
        }
    }
    Instant finish = Instant.now();
    long timeElapsed = Duration.between(start, finish).toMinutes();
    System.out.println("Total time consumed :" + timeElapsed);
}

private static boolean matches(String fn, Set<String> scanfor) {
    // Look up every prefix of the file name, longest first, by exact match in the set
    for (int i = fn.length(); i >= 1; i--) {
        if (scanfor.contains(fn.substring(0, i)))
            return true;
    }
    return false;
}
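I started with half of the records in the file. My console shows: scanfor size: 472948. It now seems to run forever; I have been waiting for more than 30 minutes. Is something wrong here?
Update 3:
I tried what DuncG suggested: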
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.time.Duration;
import java.time.Instant;
import java.util.IntSummaryStatistics;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public static void main(String[] args) throws IOException {
    Instant start = Instant.now();
    System.out.println(start);
    try (Stream<String> lines = Files.lines(Paths.get("input.csv"))) {
        Set<String> scanfor = lines.map(line -> line.split(",", -1)).filter(line -> line.length > 0)
                .map(line -> line[0]).filter(s -> s.length() > 0).collect(Collectors.toSet());
        IntSummaryStatistics stats = scanfor.stream().mapToInt(String::length).summaryStatistics();
        System.out.println("scanfor stats: " + stats);
        Path out = Paths.get("app.log");
        // Note: with WRITE as the only option, app.log must already exist.
        // Backslashes must be escaped inside a Java string literal.
        try (BufferedWriter os = Files.newBufferedWriter(out, StandardCharsets.UTF_8, StandardOpenOption.WRITE);
                Stream<Path> scan = Files.find(
                        Paths.get("W:\\ThePath\\ToThe\\Directory"), 1,
                        (p, a) -> !a.isDirectory() && !matches(p.getFileName().toString(), scanfor, stats))) {
            scan.map(Path::toString).forEach(s -> write(os, s));
        }
        System.out.println("saved as: " + out);
    }
    Instant finish = Instant.now();
    System.out.println(finish);
    long timeElapsed = Duration.between(start, finish).toMinutes();
    System.out.println("Total time consumed in Minutes :" + timeElapsed);
}

private static void write(BufferedWriter wr, String s) {
    try {
        wr.write(s);
        wr.newLine();
    } catch (IOException e) {
        // forEach cannot throw checked exceptions, so wrap it
        throw new UncheckedIOException(e);
    }
}

private static boolean matches(String fn, Set<String> scanfor, IntSummaryStatistics stats) {
    // Can search by exact match in the set knowing the smallest/largest string of
    // scanfor
    for (int i = stats.getMin(), max = Math.min(fn.length(), stats.getMax()); i <= max; i++) {
        if (scanfor.contains(fn.substring(0, i)))
            return true;
    }
    return false;
}
and got the following output:
2021-10-31T18:13:02.733379900Z
scanfor stats: IntSummaryStatistics{count=472948, sum=17972024, min=38, average=38.000000, max=38}
saved as: app.log
2021-10-31T18:53:39.232551600Z
Total time consumed in Minutes :40
Not much gain compared with Update 2. It took almost the same time.
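The java.io.File classes are not very fast when scanning large file systems. As you have noticed, the dir.listFiles() call is very slow, because it checks every name in the directory and instantiates an array of 800,000 items. The NIO Files class is much better at handling large directory streams, because calls such as Files.find start returning results quickly, as each file or folder is selected.
So: if the CSV file is of a manageable size to load in one step, you can first load the strings to match into a set and then do a single (depth=1) directory scan that picks up all the files, with a simple predicate in find that skips directories and checks for a match in the CSV.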
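import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.IntSummaryStatistics;
import java.util.Set;
import java.util.stream.Collectors;

public class FindUnmatchedFiles {
    public static void main(String[] args) throws IOException {
        // Paths taken from the question; adjust as needed.
        Path csv = Paths.get("input.csv");
        Path dir = Paths.get("W:\\ThePath\\ToThe\\Directory");
        Path out = Paths.get("app.log");
        try (var lines = Files.lines(csv)) {
            // First CSV column of every non-empty line
            Set<String> scanfor = lines.map(line -> line.split(",", -1))
                    .filter(line -> line.length > 0)
                    .map(line -> line[0])
                    .filter(s -> s.length() > 0)
                    .collect(Collectors.toSet());
            IntSummaryStatistics stats = scanfor.stream().mapToInt(String::length).summaryStatistics();
            System.out.println("scanfor stats: " + stats);
            // Depth 1: only the directory's immediate children are visited.
            try (var os = Files.newBufferedWriter(out);
                 var scan = Files.find(dir, 1, (p, a) -> !a.isDirectory() && !matches(p.getFileName().toString(), scanfor, stats))) {
                scan.map(Path::toString).forEach(s -> write(os, s));
            }
            System.out.println("saved as: " + out);
        }
    }

    private static void write(BufferedWriter wr, String s) {
        try {
            wr.write(s);
            wr.newLine();
        } catch (IOException e) {
            // forEach cannot throw checked exceptions, so wrap it
            throw new UncheckedIOException(e);
        }
    }

    private static boolean matches(String fn, Set<String> scanfor, IntSummaryStatistics stats) {
        // Can search by exact match in the set knowing the smallest/largest string of scanfor
        for (int i = stats.getMin(), max = Math.min(fn.length(), stats.getMax()); i <= max; i++) {
            if (scanfor.contains(fn.substring(0, i)))
                return true;
        }
        return false;
    }
}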
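To make the bounded prefix lookup in matches easier to follow, here is a small standalone sketch that runs the same loop on a tiny set. The class name PrefixMatchDemo and the sample keys and file names are made up for illustration; only the matches logic comes from the code above.

```java
import java.util.IntSummaryStatistics;
import java.util.Set;

class PrefixMatchDemo {
    // Same loop as matches above: try each prefix of fn whose length lies
    // between the shortest and the longest string in scanfor.
    static boolean matches(String fn, Set<String> scanfor, IntSummaryStatistics stats) {
        for (int i = stats.getMin(), max = Math.min(fn.length(), stats.getMax()); i <= max; i++) {
            if (scanfor.contains(fn.substring(0, i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical CSV keys: shortest length 3, longest length 5
        Set<String> scanfor = Set.of("abc", "hello");
        IntSummaryStatistics stats = scanfor.stream().mapToInt(String::length).summaryStatistics();

        System.out.println(matches("abc_2021.txt", scanfor, stats)); // true: prefix "abc"
        System.out.println(matches("hello.csv", scanfor, stats));    // true: prefix "hello"
        System.out.println(matches("abz.txt", scanfor, stats));      // false: no prefix of length 3..5 is a key
        System.out.println(matches("ab", scanfor, stats));           // false: shorter than every key
    }
}
```

Each file name costs at most (max - min + 1) hash lookups, so when all keys have the same length the check is a single lookup per file.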
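Edit: I have just re-read your question; this answer originally found the matches. You can choose whether to find the files that match or do not match the CSV criteria by using matches or !matches in the find predicate.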
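Both variants of the predicate report files in the directory. If what is needed is the reverse, the CSV entries for which no file exists (the "file DO NOT exist for" lines from the question), one possible approach is to start from the full set of keys and remove every key that some file name matches during the same single scan. This is only a sketch under assumptions: the class and method names (MissingPrefixes, findMissing) are hypothetical, and the demo uses a temporary directory with made-up file names instead of the real share.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

class MissingPrefixes {
    // Returns the prefixes for which no regular file directly inside dir
    // has a name starting with that prefix.
    static Set<String> findMissing(Path dir, Set<String> prefixes) throws IOException {
        Set<String> missing = new TreeSet<>(prefixes);
        int min = prefixes.stream().mapToInt(String::length).min().orElse(0);
        int max = prefixes.stream().mapToInt(String::length).max().orElse(0);
        // Depth 1: only immediate children; the predicate keeps regular files.
        try (Stream<Path> scan = Files.find(dir, 1, (p, a) -> a.isRegularFile())) {
            scan.map(p -> p.getFileName().toString()).forEach(name -> {
                int end = Math.min(name.length(), max);
                // Remove every key that is a prefix of this file name.
                for (int i = min; i <= end; i++) {
                    missing.remove(name.substring(0, i));
                }
            });
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Made-up demo data in a temporary directory
        Path dir = Files.createTempDirectory("demo");
        Files.createFile(dir.resolve("A001_report.txt"));
        Files.createFile(dir.resolve("A003_report.txt"));
        for (String key : findMissing(dir, Set.of("A001", "A002", "A003"))) {
            System.out.println("file DO NOT exist for - " + key); // prints the line for A002 only
        }
    }
}
```

The directory is still listed exactly once, so the running time should remain dominated by the listing itself rather than by the set operations.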
I used the following FileVisitor to return a List of the files:
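import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.FileVisitResult;
import java.nio.file.FileVisitor;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

public static List<String> getFilesList() {
    // Backslashes must be escaped inside a Java string literal.
    String dirPath = "W:\\ThePath\\ToThe\\Directory";
    List<String> filesList = new ArrayList<String>();
    FileVisitor<Path> simpleFileVisitor = new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path visitedFile, BasicFileAttributes fileAttributes) throws IOException {
            // Keep only the file name, not the full path
            filesList.add(visitedFile.getFileName().toString());
            return FileVisitResult.CONTINUE;
        }
    };
    FileSystem fileSystem = FileSystems.getDefault();
    Path rootPath = fileSystem.getPath(dirPath);
    try {
        // Note: this overload also descends into subdirectories.
        Files.walkFileTree(rootPath, simpleFileVisitor);
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
    return filesList;
}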
I got the list in 34.x minutes.
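Once the names are in memory, each CSV entry still has to be checked against them. One option, sketched here as an assumption rather than as part of the code above, is to copy the names into a TreeSet: all names sharing a prefix sit next to each other in sorted order, so ceiling(prefix) returns a name starting with that prefix whenever one exists. The class name SortedNameLookup, the helper anyStartsWith, and the sample data are hypothetical.

```java
import java.util.List;
import java.util.TreeSet;

class SortedNameLookup {
    // True if at least one name in the sorted set starts with prefix.
    // ceiling(prefix) is the smallest name >= prefix; if any name starts
    // with prefix, that smallest name does too.
    static boolean anyStartsWith(TreeSet<String> sortedNames, String prefix) {
        String candidate = sortedNames.ceiling(prefix);
        return candidate != null && candidate.startsWith(prefix);
    }

    public static void main(String[] args) {
        // Stand-in for the result of getFilesList()
        List<String> filesList = List.of("A001_report.txt", "A003_report.txt", "B100.dat");
        TreeSet<String> sorted = new TreeSet<>(filesList);

        for (String key : List.of("A001", "A002", "A003")) {
            if (!anyStartsWith(sorted, key)) {
                System.out.println("file DO NOT exist for - " + key); // prints the line for A002 only
            }
        }
    }
}
```

Each lookup is logarithmic in the number of names, compared with a linear scan of the whole list per CSV line. The comparison is case-sensitive, like startsWith in the question.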