I have about 2500 CSV files, each around 20 MB in size. I am trying to filter certain rows out of each file and save them to a new file.
So, if I have:
File 1 :
Row1
Row2
Row3
File 2 :
Row2
Row3
and so on..
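If I filter all the files and choose "Row2" as the filter text, the new folder should contain all the files, each holding only the rows that match the filter text.
After browsing a few forums I came up with the following, which may help me filter the rows, but I'm not sure how to do this recursively, and I also don't know whether it is a fast enough method. Any help is appreciated.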
Get-Content "C:\Path to file" | Where-Object { $_ -match "Rowfiltertext*" } | Out-File "Path to Out file"
I'm on Windows, so I guess a PowerShell-type solution would be best here.
The text to filter on will always be in the first column.
Thanks, Siddhant
Here are two fast ways to search for a string in (text) files:
1) Using switch
$searchPattern = [regex]::Escape('Rowfiltertext')  # for safety escape regex special characters
$sourcePath    = 'X:\Path\To\The\CsvFiles'
$outputPath    = 'X:\FilteredCsv.txt'
# if you also need to search inside subfolders, append -Recurse to the Get-ChildItem cmdlet
Get-ChildItem -Path $sourcePath -Filter '*.csv' -File | ForEach-Object {
    # iterate through the lines in the file and output the ones that match the search pattern
    switch -Regex -File $_.FullName {
        $searchPattern { $_ }
    }
} | Set-Content -Path $outputPath  # add -PassThru to also show on screen
2) Using Select-String
$searchPattern = [regex]::Escape('Rowfiltertext')  # for safety escape regex special characters
$sourcePath    = 'X:\Path\To\The\CsvFiles'
$outputPath    = 'X:\FilteredCsv.txt'
# if you also need to search inside subfolders, append -Recurse to the Get-ChildItem cmdlet
Get-ChildItem -Path $sourcePath -Filter '*.csv' -File | ForEach-Object {
    ($_ | Select-String -Pattern $searchPattern).Line
} | Set-Content -Path $outputPath  # add -PassThru to also show on screen
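The question says the filter text is always in the first column, so the pattern could also be anchored to the start of the line, which avoids matching the same text in a later column. A minimal sketch, assuming comma-delimited rows whose first field is not quoted; `$filterText` and `$sampleLines` are illustrative names that do not appear in the code above:

```powershell
# Assumption: comma-delimited rows, first field is not wrapped in quotes.
$filterText    = 'Row2'   # hypothetical filter text
# ^ anchors to the start of the line; (,|$) requires the field to end right after the text
$searchPattern = '^' + [regex]::Escape($filterText) + '(,|$)'

# hypothetical sample rows, only to show which ones are kept
$sampleLines = @(
    'Row2,a,b'
    'Row1,Row2,c'
    'Row22,d,e'
    'Row2'
)
$matched = $sampleLines | Where-Object { $_ -match $searchPattern }
$matched
```

With these sample rows only 'Row2,a,b' and 'Row2' are kept. The same `$searchPattern` should work unchanged with `switch -Regex` or `Select-String -Pattern`.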
If you want to output a new CSV file for each original file, use:
3) Using switch
$searchPattern = [regex]::Escape('Rowfiltertext')  # for safety escape regex special characters
$sourcePath    = 'X:\Path\To\The\CsvFiles'
$outputPath    = 'X:\FilteredCsv'
# create the output folder if it does not exist yet
if (!(Test-Path -Path $outputPath -PathType Container)) {
    $null = New-Item -Path $outputPath -ItemType Directory
}
# if you also need to search inside subfolders, append -Recurse to the Get-ChildItem cmdlet
(Get-ChildItem -Path $sourcePath -Filter '*.csv' -File) | ForEach-Object {
    # create a full target filename for the filtered output csv
    $outFile = Join-Path -Path $outputPath -ChildPath ('New_{0}' -f $_.Name)
    # iterate through the lines in the file and output the ones that match the search pattern
    $result = switch -Regex -File $_.FullName {
        $searchPattern { $_ }
    }
    $result | Set-Content -Path $outFile  # add -PassThru to also show on screen
}
4) Using Select-String
$searchPattern = [regex]::Escape('Rowfiltertext')  # for safety escape regex special characters
$sourcePath    = 'X:\Path\To\The\CsvFiles'
$outputPath    = 'X:\FilteredCsv'
# create the output folder if it does not exist yet (Set-Content does not create it)
if (!(Test-Path -Path $outputPath -PathType Container)) {
    $null = New-Item -Path $outputPath -ItemType Directory
}
# if you also need to search inside subfolders, append -Recurse to the Get-ChildItem cmdlet
(Get-ChildItem -Path $sourcePath -Filter '*.csv' -File) | ForEach-Object {
    # create a full target filename for the filtered output csv
    $outFile = Join-Path -Path $outputPath -ChildPath ('New_{0}' -f $_.Name)
    ($_ | Select-String -Pattern $searchPattern).Line | Set-Content -Path $outFile  # add -PassThru to also show on screen
}
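On the "recursively" part of the question: when -Recurse is combined with one output file per input file, files with the same name in different subfolders would overwrite each other in a flat output folder. One possible way around that is to mirror the relative subfolder structure under the output folder. A rough sketch under that assumption; the function name `Export-FilteredCsv` and its parameters are hypothetical, and the output folder is assumed to lie outside the source folder:

```powershell
function Export-FilteredCsv {   # hypothetical helper name
    param(
        [Parameter(Mandatory)][string]$SourcePath,
        [Parameter(Mandatory)][string]$OutputPath,   # assumed to be outside $SourcePath
        [Parameter(Mandatory)][string]$FilterText
    )
    $pattern = [regex]::Escape($FilterText)
    $seps    = [char[]]'\/'
    # resolve to a full path so the relative part of each file can be cut off reliably
    $root    = (Resolve-Path -LiteralPath $SourcePath).Path.TrimEnd($seps)

    Get-ChildItem -LiteralPath $root -Filter '*.csv' -File -Recurse | ForEach-Object {
        # path of the file relative to the source root, e.g. 'sub\data.csv'
        $relative = $_.FullName.Substring($root.Length).TrimStart($seps)
        $outFile  = Join-Path -Path $OutputPath -ChildPath $relative
        $outDir   = Split-Path -Path $outFile -Parent
        # create the mirrored subfolder on demand
        if (!(Test-Path -LiteralPath $outDir -PathType Container)) {
            $null = New-Item -Path $outDir -ItemType Directory -Force
        }
        # same line filter as in the switch-based methods
        $result = switch -Regex -File $_.FullName {
            $pattern { $_ }
        }
        $result | Set-Content -LiteralPath $outFile
    }
}
```

A call might look like `Export-FilteredCsv -SourcePath 'X:\Path\To\The\CsvFiles' -OutputPath 'X:\FilteredCsv' -FilterText 'Row2'`, using the same placeholder paths as above.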
Hope that helps
Re. "fast enough method":
Get-Content is very slow. You can use "System.IO.StreamReader" instead, i.e. read the complete file content into one string, then split that string into lines, and so on, for example:
# $Csv is assumed to be a FileInfo object, e.g. the current item of a Get-ChildItem loop
[System.IO.FileStream]$objFileStream = New-Object System.IO.FileStream($Csv.FullName, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
[System.IO.StreamReader]$objStreamReader = New-Object System.IO.StreamReader($objFileStream, [System.Text.Encoding]::UTF8)
# read the whole file into a single string
$strFileContent = ($objStreamReader.ReadToEnd())
$objStreamReader.Close()
$objStreamReader.Dispose()
$objFileStream.Close()
$objFileStream.Dispose()
# split into lines; this assumes Windows (CRLF) line endings
[string[]]$arrFileContent = $strFileContent -split("`r`n")
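As a variation on the snippet above (not the same approach), the filter could also be applied while reading line by line, so that the whole file is never held in memory and matching lines are written out immediately. A minimal sketch, assuming UTF-8 files; the function name `Copy-MatchingLine` and its parameters are hypothetical, and the case-insensitive option is only there to mirror the default behaviour of `switch -Regex`:

```powershell
function Copy-MatchingLine {   # hypothetical helper name
    param(
        [Parameter(Mandatory)][string]$InFile,    # use full paths: .NET ignores the PowerShell location
        [Parameter(Mandatory)][string]$OutFile,
        [Parameter(Mandatory)][string]$FilterText
    )
    # build the escaped pattern once and reuse it for every line
    $options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
    $regex   = [regex]::new([regex]::Escape($FilterText), $options)

    $reader = [System.IO.StreamReader]::new($InFile, [System.Text.Encoding]::UTF8)
    try {
        # $false = overwrite an existing output file instead of appending
        $writer = [System.IO.StreamWriter]::new($OutFile, $false, [System.Text.Encoding]::UTF8)
        try {
            # ReadLine returns $null at end of file
            while ($null -ne ($line = $reader.ReadLine())) {
                if ($regex.IsMatch($line)) { $writer.WriteLine($line) }
            }
        }
        finally { $writer.Dispose() }
    }
    finally { $reader.Dispose() }
}
```

Inside a Get-ChildItem loop this could be called with `$_.FullName` as `-InFile`. Whether it is actually faster than `switch -Regex -File` for files of this size is something to measure rather than assume.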