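PowerShell: trouble ignoring the header row (first line) and footer row (last line) of a file

I want to find the lines in my file that contain an extra delimiter. However, I want to ignore the file's header row (first line) and footer row (last line) and look only at the detail records.

I'm not sure how to ignore the first and last lines when using the ReadLine() method. I don't want to change the file in any way; this script is only meant to identify the rows in a CSV file that have extra delimiters.

Please note: the files I need to search have millions of rows, so I have to rely on the ReadLine() method rather than on Get-Content.

I did try using Select-Object -Skip 1 | Select-Object -SkipLast 1 in the Get-Content statement that feeds $measure, but I didn't get the desired result.

For example: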

H|Transaction|2017-10-03 12:00:00|Vendor --> This is the Header
D|918a39230a098134|2017-08-31 00:00:00.000|2017-08-15 00:00:00.000|SLICK-2340|...
D|918g39230b095134|2017-08-31 00:00:00.000|2017-08-15 00:00:00.000|EX|SRE-68|...
T|1268698 Records --> This is Footer

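Basically, I want my script to ignore the header and footer and to use the first data row (D|918...) as the example of a correct record, against which the other detail records are compared. In this example, the second detail row should be returned, because one of its fields contains an invalid delimiter (EX|SRE-68...).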
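The comparison described above comes down to counting fields per line. A minimal sketch, assuming shortened, hypothetical sample values (the real records are elided in the question):

```powershell
# Hypothetical sample lines modeled on the question's example (values shortened).
$Delimiter = '|'
$reference = 'D|918a|2017-08-31|2017-08-15|SLICK-2340'
$suspect   = 'D|918g|2017-08-31|2017-08-15|EX|SRE-68'

# String.Split() returns one element per field, so Count is the number of fields.
$expected = $reference.Split($Delimiter).Count
$actual   = $suspect.Split($Delimiter).Count

if ($actual -ne $expected) {
  "Suspect line has $actual fields, but the reference line has $expected."
}
```

Here the stray delimiter in EX|SRE-68 adds one field, so the second line no longer matches the reference count.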

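When I tried using -Skip 1 and -SkipLast 1 in the Get-Content statement, the process still used the header row for the comparison and returned all detail records as invalid.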
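A likely explanation (an assumption, since the modified statement isn't shown): -Skip and -SkipLast only trim the objects flowing through that one pipeline. They change the count stored in $measure, but a reader opened separately still starts at the header. The sketch below, which uses a hypothetical in-memory sample and requires PSv5+ for -SkipLast, shows what the two Select-Object calls do:

```powershell
# Hypothetical four-line sample: header, two detail rows, footer.
$sample = @(
  'H|Transaction|2017-10-03 12:00:00|Vendor'
  'D|row-1'
  'D|row-2'
  'T|2 Records'
)

# Only the output of this pipeline is trimmed; $sample itself is unchanged.
$details = @($sample | Select-Object -Skip 1 | Select-Object -SkipLast 1)

$details.Count   # 2
$details[0]      # D|row-1
```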

Here is what I have so far...

^{Editor's note: Despite the stated intent, this code does use the header row (the first line) to determine the reference column count.}

$File = "test.csv"
$Delimiter = "|"
$measure = Get-Content -Path $File | Measure-Object
$lines = $measure.Count
Write-Host "$File has ${lines} rows."
$i = 1
$reader = [System.IO.File]::OpenText($File)
$line = $reader.ReadLine()
$reader.Close()
$header = $line.Split($Delimiter).Count
$reader = [System.IO.File]::OpenText($File)
try
{
    for()
    {
        $line = $reader.ReadLine()
        if($line -eq $null) { break }
        $c = $line.Split($Delimiter).Count
        if($c -ne $header -and $i -ne $lines)
        {
            Write-Host "$File - Line $i has $c fields, but it should be $header"
        }
        $i++
    }
}
finally
{
    $reader.Close()
}

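Is there a reason you're using ReadLine()? Your approach already reads the entire CSV with Get-Content (to count the lines), so I would store the result in a variable and then walk through it with a for loop (starting at 1 to skip the first line).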

Something like this:

$File = "test.csv"
$Delimiter = "|"
$contents = Get-Content -Path $File
$lines = $contents.Count
Write-Host "$File has ${lines} rows."
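# Use the first detail row (index 1), not the header, as the reference.
$expected = $contents[1].Split($Delimiter).Count
for ($i = 1; $i -lt ($lines - 1); $i++)
{ 
    $c = $contents[$i].Split($Delimiter).Count
    if($c -ne $expected)
    {
        # $i is a zero-based index, so the line number is $i + 1.
        Write-Host "$File - Line $($i + 1) has $c fields, but it should be $expected"
    }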
}

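Now that we know that performance matters, here is a solution that reads the large input file using only the ReadLine() method of the reader returned by [System.IO.File]::OpenText() (a faster alternative to Get-Content), and does so only once: there is no up-front line count via Get-Content ... | Measure-Object, and the file is not opened a separate time just to read the header line.

Keeping the file open after reading the header line has the added advantage that you can simply continue reading (no logic is needed to skip the header line).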

$File = "test.csv"
$Delimiter = "|"
# Open the CSV file as a text file for line-based reading.
$reader = [System.IO.File]::OpenText($File)
# Read the lines.
try {
  # Read the header line and discard it.
  $null = $reader.ReadLine()
  # Read the first data line - the reference line - and count its columns.
  $refColCount = $reader.ReadLine().Split($Delimiter).Count
  # Read the remaining lines in a loop, skipping the final line.
  $i = 2 # initialize the line number to 2, given that we've already read the header and the first data line.
  while ($null -ne ($line = $reader.ReadLine())) { # $null indicates EOF
    ++$i # increment line number
    # If we're now at EOF, we've just read the last line - the footer - 
    # which we want to ignore, so we exit the loop here.
    if ($reader.EndOfStream) { break }
    # Count this line's columns and warn if the count differs from the
    # reference line's.
    if (($colCount = $line.Split($Delimiter).Count) -ne $refColCount) {
      Write-Warning "$File - Line $i has $colCount fields rather than the expected $refColCount."
    }
  } 
} finally {
  $reader.Close()
}
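As a variation on the footer handling above, a one-line lookahead can replace the EndOfStream check: a line is examined only once another line has been read after it, so the final line is never examined. This is a sketch rather than the answer's own method; the function name Find-BadRow and the temporary sample file are hypothetical.

```powershell
# Hypothetical helper: outputs the 1-based line numbers of detail rows whose
# field count differs from that of the first detail row.
function Find-BadRow {
  param([string] $Path, [string] $Delimiter = '|')
  $reader = [System.IO.File]::OpenText($Path)
  try {
    $null = $reader.ReadLine()      # read and discard the header
    $pending = $reader.ReadLine()   # line 2: the first detail row
    $lineNo = 2                     # line number of $pending
    $refCount = $null
    while ($null -ne ($next = $reader.ReadLine())) {
      # Another line follows, so $pending cannot be the footer: check it.
      $count = $pending.Split($Delimiter).Count
      if ($null -eq $refCount) { $refCount = $count } elseif ($count -ne $refCount) { $lineNo }
      $pending = $next
      ++$lineNo
    }
    # $pending now holds the last line (the footer), which is deliberately ignored.
  } finally {
    $reader.Close()
  }
}

# Demo with a temporary file; a full path is used because .NET file APIs
# do not resolve relative paths against PowerShell's current location.
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tmp -Value @(
  'H|Transaction|2017-10-03 12:00:00|Vendor'
  'D|918a|2017-08-31|2017-08-15|SLICK-2340'
  'D|918g|2017-08-31|2017-08-15|EX|SRE-68'
  'T|2 Records'
)
$bad = @(Find-BadRow -Path $tmp)
Remove-Item $tmp
$bad   # 3
```

The lookahead costs one extra variable but does not depend on how the reader reports end-of-stream, which may be easier to reason about.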

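^{Note: This answer was written before the OP clarified that performance is paramount and that a Get-Content-based solution is therefore not an option. My other answer now addresses that.
This answer may still be of interest as a slower, but more concise, PowerShell-idiomatic solution.}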

the_sw's helpful answer shows that you can use PowerShell's own Get-Content cmdlet to read a file conveniently, without resorting to direct use of the .NET Framework.

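PSv5+ enables an idiomatic single-pipeline solution that is more concise and more memory-efficient (it processes the lines one by one), albeit at the expense of performance. Especially with large files, however, you may not want to read everything into memory at once, so a pipeline solution is preferable.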

^{PSv5+ is required because of the use of Select-Object's -SkipLast parameter.}

$File = "test.csv"
$Delimiter = '|'
Get-Content $File | Select-Object -SkipLast 1 | ForEach-Object { $i = 0 } {
  if (++$i -eq 1) { 
    return # ignore the actual header row
  } elseif ($i -eq 2) { # reference row
    $refColumnCount = $_.Split($Delimiter).Count
  } else { # remaining rows, except the footer, thanks to -SkipLast 1
    $columnCount = $_.Split($Delimiter).Count
    if ($columnCount -ne $refColumnCount) {
      "$File - Line $i has $columnCount fields rather than the expected $refColumnCount."
    }
  }
}
