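I am working on a project that compares two versions of a large text file (roughly 5000+ lines of text each). The newer version may contain both new and deleted content. The goal is to help a team that receives information from the text detect changes between versions early.
To solve this, I use the diff-match-patch library, which lets me identify content that has been removed and content that is new. In the first step, I search for the changes.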
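// Diffs from the last comparison; read by showAddedElements().
private LinkedList<Diff> diffs;

public void compareStrings(String oldText, String newText){
    DiffMatchPatch dmp = new DiffMatchPatch();
    // false = skip the line-level speedup pass and diff character by character
    diffs = dmp.diffMain(oldText, newText, false);
}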
Then I filter the list by the INSERT/DELETE operation to get only the new or deleted content.
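public String showAddedElements(){
    // Collect the text of every INSERT diff, one fragment per output line
    StringBuilder insertions = new StringBuilder();
    for(Diff elem: diffs){
        if(elem.operation == Operation.INSERT){
            insertions.append(elem.text).append(System.lineSeparator());
        }
    }
    return insertions.toString();
}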
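However, when I print the result, I sometimes get only fragments such as (o, contr, ler) when just a few characters were deleted or added. Instead, I want to output the whole sentence in which the change occurred. Is there a way to retrieve the line number of a change from DiffMatchPatch?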
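One way to get from a character-level fragment back to its whole line is to track the character offset of each diff and look that offset up in the text. The following is only a minimal sketch under that assumption: `LineLocator`, `lineNumberAt` and `lineAt` are hypothetical names and not part of diff-match-patch. It relies on the fact that the EQUAL and INSERT diffs, concatenated in order, reproduce the new text (EQUAL and DELETE reproduce the old one), so a running sum of `elem.text.length()` over those diffs gives the offset of each fragment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: maps a character offset in a text to its line. */
final class LineLocator {
    private final String text;
    // Start offset of every line; the first entry is always 0.
    private final List<Integer> lineStarts = new ArrayList<>();

    LineLocator(String text) {
        this.text = text;
        lineStarts.add(0);
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                lineStarts.add(i + 1); // the next line begins right after '\n'
            }
        }
    }

    /** Returns the 1-based number of the line containing the offset. */
    int lineNumberAt(int offset) {
        if (offset < 0 || offset > text.length()) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        // Binary search for the last line start that is <= offset.
        int lo = 0;
        int hi = lineStarts.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (lineStarts.get(mid) <= offset) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo + 1;
    }

    /** Returns the whole line containing the offset, without its terminator. */
    String lineAt(int offset) {
        int index = lineNumberAt(offset) - 1;
        int start = lineStarts.get(index);
        int end = index + 1 < lineStarts.size()
                ? lineStarts.get(index + 1) - 1 // exclude the '\n'
                : text.length();
        if (end > start && text.charAt(end - 1) == '\r') {
            end--; // also drop the '\r' of a Windows line ending
        }
        return text.substring(start, end);
    }
}
```

With this, an INSERT fragment that starts at offset `pos` in the new text could be reported as `locator.lineNumberAt(pos) + ": " + locator.lineAt(pos)` instead of the bare fragment.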
I found a solution by using another library for the line extraction. DiffUtils (the DiffUtils class by Dmitry Naumenko) helped me achieve what I wanted.
/**
 * Converts a String to a list of lines by splitting it at line breaks.
 * @param text The text to be converted to a line list
 * @return The lines of the text, in order
 */
private List<String> fileToLines(String text) {
    List<String> lines = new LinkedList<String>();
    Scanner scanner = new Scanner(text);
    // hasNextLine() keeps trailing blank lines, which hasNext() would silently drop
    while (scanner.hasNextLine()) {
        lines.add(scanner.nextLine());
    }
    scanner.close();
    return lines;
}
/**
 * Starts a line-by-line comparison between two strings. The results are stored
 * in an internal list field for further processing.
 *
 * @param firstText The first string to be compared
 * @param secondText The second string to be compared
 */
public void startLineByLineComparison(String firstText, String secondText){
    List<String> firstLines = fileToLines(firstText);
    List<String> secondLines = fileToLines(secondText);
    // changes is a field holding the deltas between the two line lists
    changes = DiffUtils.diff(firstLines, secondLines).getDeltas();
}
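After the comparison, the list of changes can be read out with the following code, where elem.getType() gives the type of difference between the texts: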
/**
 * Returns a String filled with all removed content including line position
 * @return String with removed content
 */
public String returnRemovedContent(){
    StringBuilder deletions = new StringBuilder();
    for(Delta elem: changes){
        if(elem.getType() == TYPE.DELETE){
            // getOriginal() is the affected chunk of the first text; appendLines is a helper not shown here
            deletions.append(appendLines(elem.getOriginal())).append(System.lineSeparator());
        }
    }
    return deletions.toString();
}
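To make the line-based idea concrete without either library, here is a small stand-alone sketch that reports removed lines together with their line numbers. It is not how DiffUtils works internally; it uses a plain longest-common-subsequence table, and `SimpleLineDiff` and `removedLines` are hypothetical names chosen for illustration. The table needs memory proportional to the product of the two line counts, so for files with thousands of lines a real diff library is likely the better choice.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a line diff, based on a simple LCS table. */
final class SimpleLineDiff {

    /** Returns "lineNumber: content" for each line of oldLines missing from newLines. */
    static List<String> removedLines(List<String> oldLines, List<String> newLines) {
        int n = oldLines.size();
        int m = newLines.size();
        // lcs[i][j] = length of the longest common subsequence of
        // oldLines[i..] and newLines[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = oldLines.get(i).equals(newLines.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        // Walk both lists and record every old line that is not matched.
        List<String> removed = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n) {
            if (j < m && oldLines.get(i).equals(newLines.get(j))) {
                i++; // line is unchanged in both versions
                j++;
            } else if (j < m && lcs[i][j + 1] > lcs[i + 1][j]) {
                j++; // line exists only in the new text (an insertion)
            } else {
                removed.add((i + 1) + ": " + oldLines.get(i)); // 1-based line number
                i++;
            }
        }
        return removed;
    }
}
```

Because whole lines are compared, a single changed character shows up as the complete old line being removed (and the complete new line being inserted), which is the behaviour asked for in the question.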