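I am using the code below to split a huge file into UTF-8 TSV files of 20k lines each.
However, I need every split file to start with the header row, followed by its 20k lines. How can we do that?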
$sourceFile = "C:\Users\lingaguru.c3\Desktop\Test\DE.txt"
$partNumber = 1
$batchSize = 20000
$pathAndFilename = "C:\Users\lingaguru.c3\Desktop\Test\Temp part $partNumber file.tsv"
[System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001) # utf8 this one
$fs = New-Object System.IO.FileStream ($sourceFile, "OpenOrCreate", "Read", "ReadWrite", 8, "None")
$streamIn = New-Object System.IO.StreamReader($fs, $enc)
$streamOut = New-Object System.IO.StreamWriter $pathAndFilename
$line = $streamIn.ReadLine()
$counter = 0
while ($line -ne $null)
{
    $streamOut.WriteLine($line)
    $counter += 1
    if ($counter -eq $batchSize)
    {
        $partNumber += 1
        $counter = 0
        $streamOut.Close()
        $pathAndFilename = "C:\Users\lingaguru.c3\Desktop\Test\Temp part $partNumber file.tsv"
        $streamOut = New-Object System.IO.StreamWriter $pathAndFilename
    }
    $line = $streamIn.ReadLine()
}
$streamIn.Close()
$streamOut.Close()
Got the answer:
$sourceFile = "C:\Users\lingaguru.c3\Desktop\Test\DE.txt"
# using a template filename saves writing
$pathOut = "C:\Users\lingaguru.c3\Desktop\Test\Temp part {0} file.tsv"
$partNumber = 1
$batchSize = 20000 # max number of data lines to write in each part
# construct the output filename using the template $pathOut
$pathAndFilename = $pathOut -f $partNumber
$enc = [System.Text.Encoding]::UTF8
$fs = [System.IO.FileStream]::new($sourceFile, "Open", "Read") # don't need write access on source file
$streamIn = [System.IO.StreamReader]::new($fs, $enc)
$streamOut = [System.IO.StreamWriter]::new($pathAndFilename)
# assuming the first line contains the headers
$header = $streamIn.ReadLine()
# write out the header on the first part
$streamOut.WriteLine($header)
$counter = 0
while ($null -ne ($line = $streamIn.ReadLine())) {
    $streamOut.WriteLine($line)
    $counter++
    if ($counter -ge $batchSize) {
        $partNumber++
        $counter = 0
        $streamOut.Flush()
        $streamOut.Dispose()
        $pathAndFilename = $pathOut -f $partNumber
        $streamOut = [System.IO.StreamWriter]::new($pathAndFilename)
        # write the header on this new part
        $streamOut.WriteLine($header)
    }
}
$streamIn.Dispose()
$streamOut.Dispose()
$fs.Dispose()
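If you want to reuse this, the same streaming loop can be wrapped in a function. This is only a sketch under a few assumptions: the name Split-FileWithHeader and its parameters are made up for illustration, the first line of the source is taken to be the header, and full paths are passed in (the .NET stream classes resolve relative paths against the process working directory, which is not always the current PowerShell location). Compared with the script above it opens a new part only when there is a data line to put in it, so a source whose line count is an exact multiple of the batch size does not end with a header-only file, and try/finally closes the streams even if something throws.

```powershell
# Hypothetical helper: the name and parameters are illustrative, not from the answer above.
function Split-FileWithHeader {
    param(
        [Parameter(Mandatory)] [string] $SourceFile,    # full path to the input file
        [Parameter(Mandatory)] [string] $PathTemplate,  # full path containing {0} for the part number
        [int] $BatchSize = 20000                        # max data lines per part
    )

    $enc = [System.Text.UTF8Encoding]::new($false)      # UTF-8 without a byte order mark
    $reader = [System.IO.StreamReader]::new($SourceFile, $enc)
    $writer = $null
    $partNumber = 0
    try {
        $header = $reader.ReadLine()   # assumption: the first line is the header
        $counter = $BatchSize          # forces a new part before the first data line
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($counter -ge $BatchSize) {
                # current part is full (or none is open yet): start the next one
                if ($writer) { $writer.Dispose() }
                $partNumber++
                $writer = [System.IO.StreamWriter]::new(($PathTemplate -f $partNumber), $false, $enc)
                $writer.WriteLine($header)
                $counter = 0
            }
            $writer.WriteLine($line)
            $counter++
        }
    }
    finally {
        if ($writer) { $writer.Dispose() }
        $reader.Dispose()
    }
    $partNumber   # emit the number of parts written
}

# Small demo in a throwaway temp folder: one header line plus five data lines, two per part.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
$src = Join-Path $dir 'source.tsv'
[System.IO.File]::WriteAllLines($src, [string[]]@("id`tname", "1`ta", "2`tb", "3`tc", "4`td", "5`te"))
$parts = Split-FileWithHeader -SourceFile $src -PathTemplate (Join-Path $dir 'part {0}.tsv') -BatchSize 2
```

In the demo, each "part N.tsv" file begins with the header line, and the last part holds whatever data lines are left over.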
Here is a slightly different example that uses the -ReadCount parameter of Get-Content.
1..100 | % { [pscustomobject]@{number=$_;name='a'*80} } | export-csv 100file.csv
get-content 100file.csv -ReadCount 20 | % { $i = 1 } { $_ | convertfrom-csv |
export-csv ("$i" + '.csv'); $i++ }
dir

    Directory: C:\users\js\foo

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----          5/7/2021   1:17 PM           1661 1.csv
-a----          5/7/2021   1:10 PM           8960 100file.csv
-a----          5/7/2021   1:17 PM           1831 2.csv
-a----          5/7/2021   1:17 PM           1831 3.csv
-a----          5/7/2021   1:17 PM           1831 4.csv
-a----          5/7/2021   1:17 PM           1831 5.csv
-a----          5/7/2021   1:17 PM            230 6.csv
I confirmed that processing a 1 GB file with a ReadCount of 20kb uses almost no memory.
1..10mb | % { [pscustomobject]@{number=$_;name='a'*80} } | export-csv bigfile.csv
get-content bigfile.csv -ReadCount 20kb | % { $i = 1 } { $_ | convertfrom-csv |
export-csv ("$i" + '.csv'); $i++ }
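One thing to watch with the -ReadCount approach: ConvertFrom-Csv treats the first line it receives as the header, so every chunk after the first would use its first data row as the column names unless the real header is supplied. A minimal sketch of one way around that is to skip CSV parsing, capture the header line once, and prepend it to each later chunk as plain text. The function name Split-TextByReadCount and its parameters are hypothetical, and it assumes the first line of the file is the header. Note that the first output file holds the header plus ReadCount - 1 data lines, because the header counts as one of the lines in the first chunk.

```powershell
# Hypothetical helper: the name and parameters are illustrative.
function Split-TextByReadCount {
    param(
        [Parameter(Mandatory)] [string] $Path,     # input file
        [Parameter(Mandatory)] [string] $OutDir,   # existing folder for 1.csv, 2.csv, ...
        [int] $ReadCount = 20000                   # lines read per chunk
    )
    # Read only the first line to capture the header (assumption: line 1 is the header).
    $header = Get-Content -LiteralPath $Path -TotalCount 1 -Encoding UTF8
    $i = 1
    Get-Content -LiteralPath $Path -ReadCount $ReadCount -Encoding UTF8 | ForEach-Object {
        # The first chunk already starts with the header; later chunks get it prepended.
        $lines = if ($i -eq 1) { $_ } else { @($header) + $_ }
        Set-Content -LiteralPath (Join-Path $OutDir "$i.csv") -Value $lines -Encoding UTF8
        $i++
    }
}

# Small demo in a throwaway temp folder: one header line plus nine data lines, four lines per chunk.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
$src = Join-Path $dir 'source.csv'
Set-Content -LiteralPath $src -Value (@('number,name') + (1..9 | ForEach-Object { "$_,a" })) -Encoding UTF8
Split-TextByReadCount -Path $src -OutDir $dir -ReadCount 4
```

Because the lines are copied as text, this also sidesteps the #TYPE line that Export-Csv writes in Windows PowerShell 5.1 when -NoTypeInformation is not given.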