PowerShell: I/O on large files without loading everything into memory



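I have a scenario where I need to edit very large files. The end result is fairly simple, but achieving it has become a bit of a drag on my computer and its memory. Because of a downstream system, I can't load a duplicate file (as determined by a computed hash) twice. My workaround is to move the first actual record to the end of the file without changing anything else. This approach (Method 1 below) works great for files that are small enough, but my files are now extremely large. So I started working on Method 2 below, but I haven't quite figured out how to stream lines from the input file to the output file.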

#Method 1
$Prefix = Read-Host -Prompt "What do you want to use as the prefix for the updated file names? (The number 1 is the default)"
If ([string]::IsNullOrEmpty($Prefix)){$Prefix = '1_'}
If($Prefix[-1] -ne '_'){$Prefix = "$($Prefix)_"}
$files = (Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File)
Foreach ($inputFile in $files){
    # Reads the whole file into memory as an array of lines
    $A = Get-Content -LiteralPath $inputFile.FullName
    $Header = $A[0]
    $Data = $A[2..($A.Count-1)]
    $Footer = $A[1]
    $Header, $Data, $Footer | Add-Content -LiteralPath (Join-Path $inputFile.DirectoryName "$($Prefix)$($inputFile.BaseName).csv")
}
#Work-in-progress Method 2
$inputFile = "Input.csv"
$outputFile = "Output.csv"
#Create StringReader
$sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
#Create StringWriter
$sw = [System.IO.StringWriter]::New()
#Write the Header
$sw.Write($sr.ReadLine())
#Get the first actual record as a string
$lastLine = $sr.ReadLine()
#Write the rest of the lines
$sw.Write($sr.ReadToEnd())
#Add the final line
$sw.Write($lastLine)
#Write everything to the outputFile
[System.IO.File]::WriteAllText($outputFile, $sw.ToString())
Get-Content:
Line |
5 |  $sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
|                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~
| Insufficient memory to continue the execution of the program.
MethodInvocationException:
Line |
5 |  $sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
|  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| Exception calling ".ctor" with "1" argument(s): "Value cannot be null. (Parameter 's')"

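I'm having a bit of trouble understanding the difference between a StringWriter and a StringBuilder. For example, why would I choose a StringWriter as I have here, rather than working directly with a StringBuilder? Most importantly, the current iteration of Method 2 requires more memory than my system has, and it doesn't actually stream characters/lines/data from the input file to the output file. Is there a built-in way of checking memory that I'm overlooking, or is there simply a better way to achieve my goal?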

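The nice thing about the PowerShell pipeline is that it streams by nature.
That is, if it is used correctly, meaning:

  • do not assign any pipeline results to a variable, and
  • do not use parentheses

because either of these chokes the pipeline: all items are collected in memory before the next command sees the first one.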
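
The difference is easy to see with a small log. The following is a minimal sketch (the variable names `$streamLog` and `$blockedLog` are made up for illustration) that records the order in which items are produced and consumed, first in a plain pipeline and then with the upstream part wrapped in parentheses:

```powershell
# Illustrative only: log the order of work in a streaming vs. a collected pipeline.
$streamLog  = [System.Collections.Generic.List[string]]::new()
$blockedLog = [System.Collections.Generic.List[string]]::new()

# Streaming: each item is handed downstream as soon as it is produced.
1..3 |
    ForEach-Object { $streamLog.Add("produce $_"); $_ } |
    ForEach-Object { $streamLog.Add("consume $_") }

# Parentheses: the upstream pipeline runs to completion and its output is
# collected in memory before the downstream command receives anything.
(1..3 | ForEach-Object { $blockedLog.Add("produce $_"); $_ }) |
    ForEach-Object { $blockedLog.Add("consume $_") }

$streamLog  -join ', '   # produce 1, consume 1, produce 2, consume 2, produce 3, consume 3
$blockedLog -join ', '   # produce 1, produce 2, produce 3, consume 1, consume 2, consume 3
```

With three integers the collected form costs nothing, but with a multi-gigabyte file the same pattern means every line is held in memory at once.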

In your case:

$Prefix = Read-Host -Prompt "What do you want to use as the prefix for the updated file names? (The number 1 is the default)"
If ([string]::IsNullOrEmpty($Prefix)){$Prefix = '1_'}
If($Prefix[-1] -ne '_'){$Prefix = "$($Prefix)_"}
Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File | ForEach-Object {
    # Handle one file at a time so each input gets its own output file
    $outputPath = Join-Path $_.DirectoryName "$($Prefix)$($_.BaseName).csv"
    Import-Csv -LiteralPath $_.FullName | ForEach-Object -Begin { $Index = 0 } -Process {
        # Hold back the first record; pass every later record straight through
        if ($Index++) { $_ } else { $Footer = $_ }
    } -End { $Footer } |
        Export-Csv -LiteralPath $outputPath
}

Here is how the code would look using StreamReader and StreamWriter:

Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File | ForEach-Object {
    try {
        $path   = Join-Path $_.DirectoryName "$Prefix$($_.BaseName).csv"
        $writer = [IO.StreamWriter] $path
        $stream = $_.OpenRead()
        $reader = [IO.StreamReader] $stream
        $writer.WriteLine($reader.ReadLine()) # => This is header
        $footer = $reader.ReadLine()          # => First record, written last
        while(-not $reader.EndOfStream) {
            $writer.WriteLine($reader.ReadLine())
        }
        $writer.WriteLine($footer)
    }
    finally {
        $stream, $reader, $writer | ForEach-Object Dispose
    }
}

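This method keeps memory usage low, because only one line is held in memory at a time, and it is about as efficient as a line-by-line approach gets.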
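
As a quick sanity check, the same reader/writer pattern can be run on a tiny file and the result compared with the input. This is only a sketch under assumptions: the file names `sample.csv` and `1_sample.csv` and the sample rows are hypothetical, and the files are created in the temp folder.

```powershell
# Hypothetical demo files in the temp folder; names and rows are illustrative only.
$dir = [IO.Path]::GetTempPath()
$in  = Join-Path $dir 'sample.csv'
$out = Join-Path $dir '1_sample.csv'
Set-Content -LiteralPath $in -Value 'id,name', '1,first', '2,second', '3,third'

$reader = [IO.StreamReader]::new($in)
$writer = [IO.StreamWriter]::new($out)         # overwrites an existing file
try {
    $writer.WriteLine($reader.ReadLine())      # header stays on top
    $firstRecord = $reader.ReadLine()          # held back until the end
    while (-not $reader.EndOfStream) {
        $writer.WriteLine($reader.ReadLine())  # only one line in memory at a time
    }
    $writer.WriteLine($firstRecord)
}
finally {
    $reader.Dispose()
    $writer.Dispose()
}

# The lines are the same apart from the moved record, and the file hash changes.
$result       = Get-Content -LiteralPath $out
$hashesDiffer = (Get-FileHash -LiteralPath $in).Hash -ne (Get-FileHash -LiteralPath $out).Hash
```

Here `$result` holds the header, then records 2 and 3, then record 1, and `$hashesDiffer` is `$true`, which is what the downstream duplicate check needs.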

If you need it to be a bit faster on many big files, and you are sure your csv data is clean, you could also use a binary IO.FileStream.

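Basically, the example below takes a core sample from the top of the file and scans it for the header line and the first record (which becomes the footer). It then writes the header, dumps the rest of the sample, copies the remainder with the stream class's CopyTo instead of a PowerShell while loop to gain speed, and finally writes the footer.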

# Assuming that .csv file lines end with CRLF, i.e. bytes 13,10;
# things go terribly wrong if this is not true
[byte]$ByteCR = 13 # 0D
[byte]$ByteLF = 10 # 0A
function Find-OffsetPastNextEOL{
    param([System.Collections.IEnumerator]$enu)
    $QuotedState = $false #ToDo: csv files can possibly have multiple lines per record
    $CRLF_found  = $false
    $count = 0
    # Note: the loop condition calls MoveNext() once more after CRLF is found,
    # so the enumerator ends up one byte into the next line when this returns
    while($enu.MoveNext() -and !$CRLF_found){ #expected to be a lot less iterations than the number of lines in the file
        $count++
        if($enu.Current -eq $ByteCR -and $enu.MoveNext()){
            $count++
            $CRLF_found = $enu.Current -eq $ByteLF
        }
    }
    return $count
}
function Test-EndOfFileHasEOL{
    param([System.IO.FileStream]$read)
    $null = $read.Seek(-2,'End')
    return $read.ReadByte() -eq $ByteCR -and $read.ReadByte() -eq $ByteLF
}
$BufferSize = 100mb
$SampleSize =   1mb #ideally something just big enough to make sure you get the first two lines of every file
$SampleWithHeadAndFoot = new-object byte[] $SampleSize
Foreach ($inputFile in $files){
    try{
        #[IO.FileStream]::new(($IWantThis=$FullPath),($InorderTo='Open'),($IWill='Read'),($AtTheSameTimeOthersCan='Read'),($BytesAtOnce=$BufferSize))
        $ReadFilePath  = $inputFile.FullName
        $read  = [IO.FileStream]::new($ReadFilePath ,'Open'  ,'Read' ,'Read',$BufferSize)
        $WriteFilePath = Join-Path $inputFile.DirectoryName "$Prefix$($inputFile.Name)"
        $write = [IO.FileStream]::new($WriteFilePath,'Append','Write','None',$BufferSize)
        $TotalBytesSampled = $read.Read($SampleWithHeadAndFoot, 0, $SampleSize)
        #ToDo: check for BOM or other indicators that the csv data is one-byte ASCII or UTF8
        $enu = $SampleWithHeadAndFoot.GetEnumerator()
        $HeaderLength = 0 + (Find-OffsetPastNextEOL $enu)
        $FooterLength = 1 + (Find-OffsetPastNextEOL $enu)                #1 + compensates for the byte consumed by the previous call
        $DataStartPosition = $HeaderLength + $FooterLength
        $OversampleLength  = $TotalBytesSampled - ($HeaderLength + $FooterLength)
        $write.Write($SampleWithHeadAndFoot,0,$HeaderLength)             #write the header from the sample
        if($OversampleLength -gt 0){                                     #flush the sample data after the first record
            $write.Write($SampleWithHeadAndFoot,$DataStartPosition,$OversampleLength)
        }
        $read.CopyTo($write,$BufferSize)                                 #flush the rest of the data still in the read stream
        if(!(Test-EndOfFileHasEOL $read)){                               #inject CRLF if EOF didn't already have one
            $write.WriteByte($ByteCR)
            $write.WriteByte($ByteLF)
        }
        $write.Write($SampleWithHeadAndFoot,$HeaderLength,$FooterLength) #write the first record as the footer
    }finally{
        $read.Dispose()
        $write.Dispose()
    }
}

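I tested it manually with the setup below, but you will probably want to make some modifications for production code so that it is more robust against csv data found in the wild.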

PS> sc xx.csv -Value @"
this is a header`r
this was line 1`r
this was line 2`r
this was line 3`r
this was line 4`r
this was line 5
"@ -Encoding utf8 -NoNewLine
PS> $Prefix     = '1_'
PS> $files      = ls xx.csv
PS> $SampleSize = 40
PS> Format-Hex .\xx.csv
Path: .\xx.csv
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000   EF BB BF 74 68 69 73 20 69 73 20 61 20 68 65 61  this is a hea
00000010   64 65 72 0D 0A 74 68 69 73 20 77 61 73 20 6C 69  der..this was li
00000020   6E 65 20 31 0D 0A 74 68 69 73 20 77 61 73 20 6C  ne 1..this was l
00000030   69 6E 65 20 32 0D 0A 74 68 69 73 20 77 61 73 20  ine 2..this was
00000040   6C 69 6E 65 20 33 0D 0A 74 68 69 73 20 77 61 73  line 3..this was
00000050   20 6C 69 6E 65 20 34 0D 0A 74 68 69 73 20 77 61   line 4..this wa
00000060   73 20 6C 69 6E 65 20 35                          s line 5
PS> Format-Hex .\1_xx.csv
Path: .\1_xx.csv
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000   EF BB BF 74 68 69 73 20 69 73 20 61 20 68 65 61  this is a hea
00000010   64 65 72 0D 0A 74 68 69 73 20 77 61 73 20 6C 69  der..this was li
00000020   6E 65 20 32 0D 0A 74 68 69 73 20 77 61 73 20 6C  ne 2..this was l
00000030   69 6E 65 20 33 0D 0A 74 68 69 73 20 77 61 73 20  ine 3..this was
00000040   6C 69 6E 65 20 34 0D 0A 74 68 69 73 20 77 61 73  line 4..this was
00000050   20 6C 69 6E 65 20 35 0D 0A 74 68 69 73 20 77 61   line 5..this wa
00000060   73 20 6C 69 6E 65 20 31 0D 0A                    s line 1..
