Splitting a text file into UTF-8 TSV files with headers in PowerShell



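I am using the code below to split a huge file into UTF-8 TSV files of 20,000 lines each.

However, I need every 20k-line split file to start with the header row. How can I do that?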

$sourceFile = "C:Userslingaguru.c3DesktopTestDE.txt"
$partNumber = 1
$batchSize = 20000
$pathAndFilename = "C:Userslingaguru.c3DesktopTestTemp part $partNumber file.tsv"
[System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001)  # utf8 this one
$fs=New-Object System.IO.FileStream ($sourceFile,"OpenOrCreate", "Read", "ReadWrite",8,"None") 
$streamIn=New-Object System.IO.StreamReader($fs, $enc)
$streamout = new-object System.IO.StreamWriter $pathAndFilename
$line = $streamIn.readline()
$counter = 0
while ($null -ne $line)
{
    $streamout.writeline($line)
    $counter += 1
    if ($counter -eq $batchsize)
    {
        $partNumber += 1
        $counter = 0
        $streamOut.close()
        $pathAndFilename = "C:Userslingaguru.c3DesktopTestTemp part $partNumber file.tsv"
        $streamout = new-object System.IO.StreamWriter $pathAndFilename
    }
    $line = $streamIn.readline()
}
$streamin.close()
$streamout.close()

Got an answer:

$sourceFile = "C:Userslingaguru.c3DesktopTestDE.txt"
# using a template filename saves writing
$pathOut    = "C:Userslingaguru.c3DesktopTestTemp part {0} file.tsv"
$partNumber = 1
$batchSize  = 20000  # max number of data lines to write in each part
# construct the output filename using the template $pathOut
$pathAndFilename = $pathOut -f $partNumber
$enc       = [System.Text.Encoding]::UTF8
$fs        = [System.IO.FileStream]::new($sourceFile,"Open", "Read")  # don't need write access on source file
$streamIn  = [System.IO.StreamReader]::new($fs, $enc)
$streamout = [System.IO.StreamWriter]::new($pathAndFilename)
# assuming the first line contains the headers
$header = $streamIn.ReadLine()
# write out the header on the first part
$streamout.WriteLine($header)
$counter = 0
while ($null -ne ($line = $streamIn.ReadLine())) {
    $streamOut.WriteLine($line)
    $counter++
    if ($counter -ge $batchSize) {
        $partNumber++
        $counter = 0
        $streamOut.Flush()
        $streamOut.Dispose()
        $pathAndFilename = $pathOut -f $partNumber
        $streamOut = [System.IO.StreamWriter]::new($pathAndFilename)
        # write the header on this new part
        $streamOut.WriteLine($header)
    }
}
$streamin.Dispose()
$streamout.Dispose()
$fs.Dispose()
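
One edge case in the loop above: when the number of data lines is an exact multiple of $batchSize, a new part is opened right after the last line is written, so the final file contains only the header. The following is a minimal sketch of one way around that, opening each part only once there is a data line to put in it. The function name Split-FileWithHeader and its parameters are hypothetical, not part of the original answer, and it assumes full paths are passed in, since .NET resolves relative paths against the process directory rather than the PowerShell location.

```powershell
# Hypothetical helper: name and parameters are illustrative only.
function Split-FileWithHeader {
    param(
        [Parameter(Mandatory)] [string] $SourceFile,    # full path to the input file
        [Parameter(Mandatory)] [string] $PathTemplate,  # full path with {0} for the part number
        [int] $BatchSize = 20000                        # max data lines per part
    )

    $enc     = [System.Text.UTF8Encoding]::new($false)  # UTF-8 without a BOM
    $reader  = [System.IO.StreamReader]::new($SourceFile, $enc)
    $writer  = $null
    $part    = 0
    $counter = 0
    try {
        # assumes the first line of the source file is the header
        $header = $reader.ReadLine()
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($null -eq $writer) {
                # open the next part only when a data line is waiting for it
                $part++
                $writer = [System.IO.StreamWriter]::new(($PathTemplate -f $part), $false, $enc)
                $writer.WriteLine($header)
            }
            $writer.WriteLine($line)
            $counter++
            if ($counter -ge $BatchSize) {
                # part is full: close it and let the next data line open a new one
                $writer.Dispose()
                $writer  = $null
                $counter = 0
            }
        }
    }
    finally {
        if ($null -ne $writer) { $writer.Dispose() }
        $reader.Dispose()
    }
    $part  # number of parts written
}
```

Called as `Split-FileWithHeader -SourceFile $sourceFile -PathTemplate $pathOut -BatchSize 20000`, this should produce the same parts as the loop above, minus any trailing header-only file. The try/finally is there so the streams are closed even if a write fails partway through.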

A slightly different example, using the -ReadCount parameter of Get-Content.


1..100 | % { [pscustomobject]@{number=$_;name='a'*80} } | export-csv 100file.csv
get-content 100file.csv -ReadCount 20 | % { $i = 1 } { $_ | convertfrom-csv | 
export-csv ("$i" + '.csv'); $i++ }
dir

Directory: C:usersjsfoo

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----          5/7/2021   1:17 PM           1661 1.csv
-a----          5/7/2021   1:10 PM           8960 100file.csv
-a----          5/7/2021   1:17 PM           1831 2.csv
-a----          5/7/2021   1:17 PM           1831 3.csv
-a----          5/7/2021   1:17 PM           1831 4.csv
-a----          5/7/2021   1:17 PM           1831 5.csv
-a----          5/7/2021   1:17 PM            230 6.csv

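I confirmed that processing a roughly 1 GB file with a ReadCount of 20kb (the PowerShell literal 20kb evaluates to 20480, so 20,480 lines per chunk) uses almost no memory.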

1..10mb | % { [pscustomobject]@{number=$_;name='a'*80} } | export-csv bigfile.csv
get-content bigfile.csv -ReadCount 20kb | % { $i = 1 } { $_ | convertfrom-csv |
export-csv ("$i" + '.csv'); $i++ }
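
Note that in the Get-Content pipelines above only the first chunk contains the original header row, so ConvertFrom-Csv treats the first data row of every later chunk as that chunk's header. A minimal sketch of a variant that skips CSV parsing and instead prepends the real header to each later chunk is shown below. The function name Split-WithReadCount, its parameters, and the "N.tsv" output naming are assumptions made for illustration. Because the first chunk already starts with the header, the first part holds one data line fewer than the others.

```powershell
# Hypothetical helper: name, parameters, and output naming are illustrative only.
function Split-WithReadCount {
    param(
        [Parameter(Mandatory)] [string] $Source,  # path to the input TSV
        [Parameter(Mandatory)] [string] $OutDir,  # existing directory for the parts
        [int] $BatchSize = 20000                  # lines read per chunk
    )

    # read only the first line; assumes it is the header row
    $header = Get-Content -LiteralPath $Source -TotalCount 1 -Encoding UTF8

    Get-Content -LiteralPath $Source -Encoding UTF8 -ReadCount $BatchSize |
        ForEach-Object -Begin { $i = 1 } -Process {
            $chunk = @($_)  # each pipeline object is an array of up to $BatchSize lines
            if ($i -eq 1) {
                $lines = $chunk               # first chunk already begins with the header
            } else {
                $lines = @($header) + $chunk  # later chunks get the header prepended
            }
            Set-Content -LiteralPath (Join-Path $OutDir "$i.tsv") -Value $lines -Encoding UTF8
            $i++
        }
}
```

One caveat: with -Encoding UTF8, Set-Content in Windows PowerShell 5.1 writes a byte order mark, while PowerShell 7 writes UTF-8 without one. If the consumer of the TSV files is sensitive to that, the StreamWriter approach gives more direct control over the encoding.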
