Remove duplicate lines from a text file using Java



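Can I read a text file line by line up to the delimiter -- and write those lines to a new file?

Then I want to read the next block of lines between two -- delimiters and compare it with the previous blocks.

If three or more lines are repeated, the block should not be written to the file.

And so on until the end of the file.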

public void removeDuplicateErr(String data) throws IOException {
    String contents = new String(Files.readAllBytes(Paths.get(data)));
    String[] blocks = contents.split("--");
    String fileName = "output.txt";
    PrintWriter pw = new PrintWriter(fileName);
    int count = 0;
    int count1 = 0;
    for (String block : blocks) {
        boolean flag = false;
        if(count > 0) {
            String contents1 = new String(Files.readAllBytes(Paths.get(fileName)));
            String[] blocks1 = contents1.split("--");
            for(String block1 : blocks1) {
                BufferedReader br1 = new BufferedReader(new StringReader(block1));
                String line1 = br1.readLine();
                while (line1 != null) {
                    BufferedReader br2 = new BufferedReader(new StringReader(block));
                    String line2 = br2.readLine();
                    while (line2 != null) {
                        if(line1.equals(line2)) {
                            count1++;
                            if(count1 >= 3) {
                                flag = true;
                                break;
                            }
                        }
                        line2 = br2.readLine();
                    }
                    line1 = br1.readLine();
                }
                if (!flag) {
                    pw.print(block);
                    pw.print("--");
                    pw.flush();
                }
            }
        }
        if(count < 1) {
            pw.print(block);
            pw.print("--");
            pw.flush();
        }
        count++;
    }
    pw.close();
}

Sample input

test 1
test 2
test 3
test 4
test 5
--
test 6
test 2
test 3
test 4
test 12
--
test 8
test 9 
test 10 
test 11 
test 12
--
test 1
test 3
test 4 
test 21
test 22
--
test 1
test 2
test 3
test 4 
test 5
--
test 50
test 51
test 52 
test 53
test 54 
test 55
--
test 53
test 54
test 55
test 56
test 57

Expected result
test 1
test 2
test 3
test 4
test 5
--
test 8 
test 9 
test 10 
test 11 
test 12
--
test 50 
test 51 
test 52
test 53
test 54
test 55

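Nice starting code (in your original version).

(!) Disclaimer: changing the code in a question after others have already answered it is frowned upon. It is even worse when you use the given answers to improve your code.

Changing your code or question forces every answer to change its solution as well.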


The problem: "duplicate" blocks

3 or more lines of the block [appear in another block]

(Note: I added some clarification inside the brackets "[]".)

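A recipe with ingredients

  • Read the whole input file into a single String buffer: String contents = new String(Files.readAllBytes(Paths.get("/path/to/file")));
  • Split the contents into blocks (separated by --) using String[] blocks = contents.split("--")
  • Loop over the blocks: for (block in blocks) { blocks_out.append(deduplicateFrom(block, blocks)); }
  • Write the resulting blocks back to a file (remember to append the separator after each block)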
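The four steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the class BlockRecipeSketch and its method names are hypothetical, the duplicate criterion is passed in as a predicate instead of being fixed, Java 11 or later is assumed (for strip, Files.readString and Files.writeString), and the separator is written between blocks so that the output looks like the expected result in the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical helper class; none of these names come from the question.
final class BlockRecipeSketch {

    static final String SEPARATOR = "--";

    // Step 2: split the contents at the separator and trim surrounding whitespace.
    // Assumption: "--" never occurs inside a regular line.
    static List<String> splitIntoBlocks(String contents) {
        List<String> blocks = new ArrayList<>();
        for (String raw : contents.split(SEPARATOR)) {
            blocks.add(raw.strip());
        }
        return blocks;
    }

    // Step 3: keep only the blocks that the given criterion accepts.
    // The predicate receives the index of the current block and the list of all blocks.
    static String filterContents(String contents, BiPredicate<Integer, List<String>> keep) {
        List<String> blocks = splitIntoBlocks(contents);
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < blocks.size(); i++) {
            if (keep.test(i, blocks)) {
                kept.add(blocks.get(i));
            }
        }
        // Formatting for step 4: the separator goes on its own line between blocks.
        return String.join("\n" + SEPARATOR + "\n", kept);
    }

    // Steps 1 and 4: read the whole input file and write the filtered result.
    static void filterFile(Path input, Path output,
                           BiPredicate<Integer, List<String>> keep) throws IOException {
        String contents = Files.readString(input);
        Files.writeString(output, filterContents(contents, keep));
    }
}
```

Passing the criterion as a parameter keeps the reading, splitting and writing code independent of how "duplicate" ends up being defined.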

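Now the key part in detail: blocks_out.append(deduplicateFrom(block, blocks));

  • In a loop, compare the current block with every other block
  • The comparison can test your criterion (3 or more duplicated lines)
  • If no duplicate is found, add the (deduplicated) block to the result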
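A minimal sketch of that comparison, assuming a block counts as duplicated as soon as it shares 3 or more distinct lines with any other block (earlier or later). The names DuplicateCheckSketch, linesOf, countSharedLines and isDuplicated are made up for this illustration. Lines are trimmed before comparing because the sample input contains trailing spaces, and a line repeated inside one block is counted only once.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class illustrating the comparison step.
final class DuplicateCheckSketch {

    static final int THRESHOLD = 3; // "3 or more lines" from the question

    // Collect the distinct, trimmed, non-empty lines of one block.
    static Set<String> linesOf(String block) {
        Set<String> lines = new HashSet<>();
        for (String line : block.split("\\R")) { // \R matches any line break
            String trimmed = line.strip();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    // Number of distinct lines that occur in both blocks.
    static int countSharedLines(String block, String other) {
        Set<String> shared = linesOf(block);
        shared.retainAll(linesOf(other));
        return shared.size();
    }

    // True if the block at 'index' shares THRESHOLD or more lines with any other block.
    static boolean isDuplicated(int index, List<String> blocks) {
        for (int i = 0; i < blocks.size(); i++) {
            if (i != index
                    && countSharedLines(blocks.get(index), blocks.get(i)) >= THRESHOLD) {
                return true;
            }
        }
        return false;
    }
}
```

Because every block is compared with all others, both members of a duplicate pair are rejected. With the sample input from the question, only the block with the numbers 8 to 12 survives under this criterion.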

What the program can log

Here is what gets printed to the console:

Split input into 7 blocks. Comparing them for duplicates in others. 
comparing block 0
duplicate lines of block 0 in 1: 3
rejecting 'duplicated' block 0
comparing block 1
duplicate lines of block 1 in 0: 3
rejecting 'duplicated' block 1
comparing block 2
duplicate lines of block 2 in 0: 0
duplicate lines of block 2 in 1: 1
duplicate lines of block 2 in 3: 0
duplicate lines of block 2 in 4: 0
duplicate lines of block 2 in 5: 0
duplicate lines of block 2 in 6: 0
adding 'unique' block 2
comparing block 3
duplicate lines of block 3 in 0: 3
rejecting 'duplicated' block 3
comparing block 4
duplicate lines of block 4 in 0: 5
rejecting 'duplicated' block 4
comparing block 5
duplicate lines of block 5 in 0: 0
duplicate lines of block 5 in 1: 0
duplicate lines of block 5 in 2: 0
duplicate lines of block 5 in 3: 0
duplicate lines of block 5 in 4: 0
duplicate lines of block 5 in 6: 3
rejecting 'duplicated' block 5
comparing block 6
duplicate lines of block 6 in 0: 0
duplicate lines of block 6 in 1: 0
duplicate lines of block 6 in 2: 0
duplicate lines of block 6 in 3: 0
duplicate lines of block 6 in 4: 0
duplicate lines of block 6 in 5: 3
rejecting 'duplicated' block 6
Total 'unique' blocks in output: 1

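The only unique block I could identify is the one with the numbers 8 to 12. Of those, only 12 shows up as a duplicate from a previous block (1 duplicated line).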

Research all the ingredients on SO or the web

  • How do I read a file into a String in Java?

for the first item/ingredient, and so on.
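For that first ingredient, a search would probably turn up something like the two variants below. This is a sketch, not the only way to do it: the class and method names are hypothetical, and UTF-8 is assumed as the file encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class showing two ways to read a whole file into a String.
final class ReadFileSketch {

    // Java 11 and later: Files.readString decodes the file as UTF-8.
    static String readWhole(Path path) throws IOException {
        return Files.readString(path);
    }

    // Java 7 and later: read all bytes, then decode them with an explicit charset.
    // The question's code omits the charset and so relies on the platform default.
    static String readWholeLegacy(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }
}
```

Both variants load the entire file into memory, which is fine for small inputs like the sample but may not suit very large files.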

Here is a suggested implementation:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collection;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BlockFilter {

    public void removeDuplicateErr(String data) throws IOException {
        String contents = new String(Files.readAllBytes(Paths.get(data)));
        String fileName = "output.txt";
        Files.writeString(Paths.get(fileName), getFilteredContent(contents));
        // TODO this method is not yet tested (but getFilteredContent() below)
    }

    public String getFilteredContent(String fileContent) {
        return getValidBlocks(fileContent).stream()
                .map(Block::getAsString)
                .collect(Collectors.joining("\n--"));
    }

    private Collection<Block> getValidBlocks(String fileContent) {
        // create a list of Blocks...
        List<Block> allBlocks = Stream.of(fileContent.split("--"))
                .map(Block::new)
                .collect(Collectors.toList());
        // ...and collect all, that are valid
        return allBlocks.stream()
                .filter(block -> isBlockValid(allBlocks, block))
                .collect(Collectors.toList());
    }

    // a block is valid if it shares fewer than 3 lines with every block before it
    private boolean isBlockValid(List<Block> allBlocks, Block block) {
        for (Block otherBlock : allBlocks) {
            if (otherBlock == block) {
                // we've reached 'block' and did not hit the invalid criterion so far
                return true;
            } else if (block.countEqualLines(otherBlock) >= 3) {
                // the block is invalid
                return false;
            }
        }
        throw new IllegalArgumentException("Argument 'block' must be contained in 'allBlocks'!");
    }

    private static class Block {

        private List<String> lines;

        private Block(String block) {
            Objects.requireNonNull(block);
            lines = Stream.of(block.split("\n")).collect(Collectors.toList());
        }

        private long countEqualLines(Block otherBlock) {
            return otherBlock.lines.stream()
                    // skip the empty line that split("--") can leave at the start of a block
                    .filter(line -> !line.isBlank())
                    .filter(line -> lines.contains(line))
                    .count();
        }

        private String getAsString() {
            return lines.stream().collect(Collectors.joining("\n"));
        }
    }
}

At least it passes the following unit test ;-)

import static org.junit.jupiter.api.Assertions.assertEquals;

class BlockFilterTest {

    @org.junit.jupiter.api.Test
    void getValidBlocks() {

        // Java 15 text block (multi-line String)
        String fileContent = """
                test 1
                test 2
                test 3
                test 4
                test 5
                --
                test 6
                test 2
                test 3
                test 4
                test 12
                --
                test 8
                test 9
                test 10
                test 11
                test 12
                --
                test 1
                test 3
                test 4
                test 21
                test 22
                --
                test 1
                test 2
                test 3
                test 4
                test 5
                --
                test 50
                test 51
                test 52
                test 53
                test 54
                test 55
                --
                test 53
                test 54
                test 55
                test 56
                test 57""";

        String expected = """
                test 1
                test 2
                test 3
                test 4
                test 5
                --
                test 8
                test 9
                test 10
                test 11
                test 12
                --
                test 50
                test 51
                test 52
                test 53
                test 54
                test 55""";

        assertEquals(expected, new BlockFilter().getFilteredContent(fileContent));
    }
}

Note that since BlockFilter has no state, all of its methods could also be static.
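As a rough illustration of that note, a stateless filter could be reduced to a single static method. The sketch below is a hypothetical, more compact variant and not the class above: StaticBlockFilterSketch and filter are made-up names, lines are trimmed before comparison, empty lines are ignored, and a block is only compared with the blocks that precede it (including rejected ones). Java 11 or later is assumed for String.strip and String.lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stateless variant; the class and method names are illustrative only.
final class StaticBlockFilterSketch {

    private StaticBlockFilterSketch() {
        // utility class, no instances
    }

    static String filter(String content) {
        List<List<String>> earlier = new ArrayList<>(); // every block read so far, kept or not
        List<String> kept = new ArrayList<>();
        for (String raw : content.split("--")) {
            List<String> lines = raw.strip().lines()
                    .map(String::strip)
                    .filter(line -> !line.isEmpty())
                    .collect(Collectors.toList());
            if (lines.isEmpty()) {
                continue; // nothing between two separators
            }
            // Reject the block when it shares 3 or more lines with any earlier block.
            boolean duplicate = earlier.stream()
                    .anyMatch(prev -> lines.stream().filter(prev::contains).count() >= 3);
            earlier.add(lines);
            if (!duplicate) {
                kept.add(String.join("\n", lines));
            }
        }
        return String.join("\n--\n", kept);
    }
}
```

A caller would then write StaticBlockFilterSketch.filter(contents) without creating an instance. For the sample input in the question, this keeps the blocks starting with test 1, test 8 and test 50, which matches the expected result.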
