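I have two large files to compare (over 10 GB each). The commands below work for small files, but they seem to eat up all the RAM on my machine.
How can I get the differences between the two files without consuming a lot of memory?
Any ideas would be appreciated.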
robocopy.exe C:\Folder C:\Folder /l /nocopy /is /e /fp /ns /nc /njh /njs /tee /log:c:\temp\FolderList.txt
$path = 'C:\Folder'
$pattern = [regex]::Escape($path)
$newContent = @()
Get-Content -Path "c:\temp\FolderList.txt" | ForEach-Object {$newContent += $_ -replace $pattern, ''}
Set-Content -Path "c:\temp\FolderList.txt" -Value $newContent
(Get-Content C:\temp\FolderList.txt).Trim() -ne '' | Set-Content C:\temp\FolderList.txt
robocopy.exe C:\Folder2 C:\Folder2 /l /nocopy /is /e /fp /ns /nc /njh /njs /tee /log:c:\temp\FolderList2.txt
$path = 'C:\Folder2'
$pattern = [regex]::Escape($path)
$newContent = @()
Get-Content -Path "c:\temp\FolderList2.txt" | ForEach-Object {$newContent += $_ -replace $pattern, ''}
Set-Content -Path "c:\temp\FolderList2.txt" -Value $newContent
(Get-Content C:\temp\FolderList2.txt).Trim() -ne '' | Set-Content C:\temp\FolderList2.txt
Compare-Object -ReferenceObject (Get-Content c:\temp\FolderList.txt) -DifferenceObject (Get-Content c:\temp\FolderList2.txt)
Latest update
FolderList.txt:
C:\Folder\Data2\Documents
C:\Folder\Data2\Documents\1.txt
C:\Folder\Data2\Documents\2.txt
C:\Folder\Data2\Documents\3.txt
C:\Folder\Data2\Documents\4.txt
C:\Folder\Data2\Documents\5.txt
CompareLog1.text:
\Data2\Documents
C:\Folder\Data2\Documents
\Data2\Documents\1.txt
C:\Folder\Data2\Documents\1.txt
\Data2\Documents\2.txt
C:\Folder\Data2\Documents\2.txt
\Data2\Documents\3.txt
C:\Folder\Data2\Documents\3.txt
\Data2\Documents\4.txt
C:\Folder\Data2\Documents\4.txt
\Data2\Documents\5.txt
C:\Folder\Data2\Documents\5.txt
Expected output:
\Data2\Documents
\Data2\Documents\1.txt
\Data2\Documents\2.txt
\Data2\Documents\3.txt
\Data2\Documents\4.txt
\Data2\Documents\5.txt
Update 2:
Output:
\Data2\Documents
C:\Folder\Data2\Documents
\Data2\Documents\1.txt
C:\Folder\Data2\Documents\1.txt
\Data2\Documents\2.txt
C:\Folder\Data2\Documents\2.txt
\Data2\Documents\3.txt
C:\Folder\Data2\Documents\3.txt
\Data2\Documents\4.txt
C:\Folder\Data2\Documents\4.txt
\Data2\Documents\5.txt
C:\Folder\Data2\Documents\5.txt
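First of all, adding to an array with += is a known memory hog: arrays have a fixed length, so every time you add a new element, the complete array has to be rebuilt in memory.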
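To make that cost visible, here is a rough sketch that times += against a generic List. The 20,000 demo strings are made up for illustration, and the exact numbers will depend on your machine and PowerShell version (newer versions may narrow the gap), so treat it as an indication only.

```powershell
# Hypothetical demo data: 20,000 short strings standing in for log lines.
$lines = 1..20000 | ForEach-Object { "line $_" }

# Slow pattern: each += allocates a new array and copies all existing elements into it.
$slow = Measure-Command {
    $array = @()
    foreach ($line in $lines) { $array += $line }
}

# A generic List grows its internal buffer in steps, so most appends copy nothing.
$fast = Measure-Command {
    $list = [System.Collections.Generic.List[string]]::new()
    foreach ($line in $lines) { $list.Add($line) }
}

'{0:N0} ms with +=, {1:N0} ms with List.Add' -f $slow.TotalMilliseconds, $fast.TotalMilliseconds
```

Keep in mind that a List still holds every line in memory, so for files of 10 GB the real fix is not to collect the lines at all but to process them one at a time.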
So, for replacing the path and removing the empty lines in each log file, I would suggest this:
robocopy.exe C:\Folder C:\Folder /l /nocopy /is /e /fp /ns /nc /njh /njs /tee /log:c:\temp\FolderList.txt
robocopy.exe C:\Folder2 C:\Folder2 /l /nocopy /is /e /fp /ns /nc /njh /njs /tee /log:c:\temp\FolderList2.txt
$path = 'C:\Folder'
$newFile = 'C:\temp\CompareLog_1.txt' # have it create a new file instead of gathering all 10 GB in memory
$pattern = [regex]::Escape($path)
# use 'switch' to parse the log file line-by-line
# and write the processed lines to the new file.
# this will be lean on memory, but takes a lot of disk write actions..
switch -Regex -File 'C:\temp\FolderList.txt' {
$pattern { Add-Content $newFile -Value ($_ -replace $pattern).Trim() }
default { if ($_ -match '\S') { Add-Content $newFile -Value $_.Trim() }} # skip empty or whitespace-only lines
}
For the second log file:
$path = 'C:\Folder2'
$newFile = 'C:\temp\CompareLog_2.txt'
$pattern = [regex]::Escape($path)
switch -Regex -File 'C:\temp\FolderList2.txt' {
$pattern { Add-Content $newFile -Value ($_ -replace $pattern).Trim() }
default { if ($_ -match '\S') { Add-Content $newFile -Value $_.Trim() }} # skip empty or whitespace-only lines
}
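Next, you need to compare the new files CompareLog_1.txt and CompareLog_2.txt, but I imagine these files could still be very large, so I agree with Zilog80 that you would be better off using dedicated software.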
Depending on what you would like to see as a result, you could also consider the good old fc.exe, which is fast and light on memory.
Something like:
fc.exe /C /N 'C:\temp\CompareLog_1.txt' 'C:\temp\CompareLog_2.txt'
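If you would rather stay in PowerShell, a line-by-line comparison can be sketched with two StreamReaders, so that only one line per file is in memory at any time. Compare-FileByLine is a hypothetical helper name, not an existing cmdlet, and the sketch assumes both files list their entries in the same order: it compares by position and does not resynchronize after an inserted or missing line the way a real diff tool does.

```powershell
function Compare-FileByLine {   # hypothetical helper, not a built-in cmdlet
    param(
        [Parameter(Mandatory)][string]$ReferencePath,
        [Parameter(Mandatory)][string]$DifferencePath
    )
    # StreamReader resolves relative paths against the process working directory,
    # so pass full paths to be safe.
    $refReader  = [System.IO.StreamReader]::new($ReferencePath)
    $diffReader = [System.IO.StreamReader]::new($DifferencePath)
    try {
        $lineNumber = 0
        while ($true) {
            $refLine  = $refReader.ReadLine()    # returns $null at end of file
            $diffLine = $diffReader.ReadLine()
            if ($null -eq $refLine -and $null -eq $diffLine) { break }
            $lineNumber++
            # -ne is case-insensitive for strings, similar to the /C switch of fc.exe
            if ($refLine -ne $diffLine) {
                [pscustomobject]@{
                    Line       = $lineNumber
                    Reference  = $refLine
                    Difference = $diffLine
                }
            }
        }
    }
    finally {
        $refReader.Dispose()
        $diffReader.Dispose()
    }
}
```

Called as Compare-FileByLine -ReferencePath 'C:\temp\CompareLog_1.txt' -DifferencePath 'C:\temp\CompareLog_2.txt', it emits one object per line number where the two files differ. If one file is longer, the missing side shows up as $null.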
Instead of Add-Content, you can use a StreamWriter to speed up writing the files to compare (this creates files in UTF-8 encoding without a BOM):
$path = 'C:\Folder'
$newFile = 'C:\temp\CompareLog_1.txt'
$writer = [System.IO.StreamWriter]::new($newFile)
$pattern = [regex]::Escape($path)
switch -Regex -File 'C:\temp\FolderList.txt' {
$pattern { $writer.WriteLine(($_ -replace $pattern).Trim()) }
default { if ($_ -match '\S') { $writer.WriteLine($_.Trim()) }} # skip empty or whitespace-only lines
}
# clean up
$writer.Flush()
$writer.Dispose()
# same again for the second log file
$path = 'C:\Folder2'
$newFile = 'C:\temp\CompareLog_2.txt'
$writer = [System.IO.StreamWriter]::new($newFile)
$pattern = [regex]::Escape($path)
switch -Regex -File 'C:\temp\FolderList2.txt' {
$pattern { $writer.WriteLine(($_ -replace $pattern).Trim()) }
default { if ($_ -match '\S') { $writer.WriteLine($_.Trim()) }} # skip empty or whitespace-only lines
}
# clean up
$writer.Flush()
$writer.Dispose()
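Since the two blocks differ only in their paths, you could wrap the logic in a small function. The sketch below does that under the hypothetical name Convert-RobocopyLog (the name and parameter names are mine, for illustration) and adds try/finally so the writer is closed even if something fails halfway through a large file.

```powershell
function Convert-RobocopyLog {   # hypothetical helper name
    param(
        [Parameter(Mandatory)][string]$LogPath,    # robocopy log to read
        [Parameter(Mandatory)][string]$RootPath,   # path prefix to strip from each line
        [Parameter(Mandatory)][string]$OutPath     # cleaned file to write (use a full path)
    )
    $pattern = [regex]::Escape($RootPath)
    $writer  = [System.IO.StreamWriter]::new($OutPath)
    try {
        # switch -File reads the log one line at a time
        switch -Regex -File $LogPath {
            $pattern { $writer.WriteLine(($_ -replace $pattern).Trim()) }
            default  { if ($_ -match '\S') { $writer.WriteLine($_.Trim()) } }
        }
    }
    finally {
        # Dispose flushes any buffered text and releases the file handle
        $writer.Dispose()
    }
}
```

Usage would then be one call per log, for example Convert-RobocopyLog -LogPath 'C:\temp\FolderList.txt' -RootPath 'C:\Folder' -OutPath 'C:\temp\CompareLog_1.txt', and the same again with the Folder2 paths.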