PowerShell 7.x: how to select a text substring of unknown length using only boundary substrings



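I am trying to store a string from a text file, where a known beginning and end delimit it as a substring of the original text file. I am new to PowerShell, so my approach is simple/crude. Basically, my approach is:

  1. Get roughly what I want, starting from the beginning of the string
  2. Worry about trimming off what I don't want later

My minimal reproducible example is as follows: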

# selectStringTest.ps    

$inputFile = Get-Content -Path "C:\test\test3\Copy of 31832_226140__0001-00006.txt"
#  selected text string needs to span from $refName up to $boundaryName 
[string]$refName = "001 BARTLETT"
[string]$boundaryName = "001 BEECH"
# a rough estimate of the text file lines required
[int]$lines = 200

if (Select-String -InputObject $inputFile -pattern $refName) {
    Write-Host "Selected shortened string found!"
    # this selects the start of required string but with extra text
    [string]$newFileStart = $inputFile | Select-String $refName -CaseSensitive -SimpleMatch -Context 0, $lines
}
else {
    Write-Host "Selected string NOT FOUND."
}
# tidy up the start of the string by removing rubbish
$newFileStart = $newFileStart.TrimStart('> ')
# this is the kind of thing I want but it doesn't work
$newFileStart = $newFileStart - $newFileStart.StartsWith($boundaryName)
$newFileStart | Out-File tempOutputFile

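As it stands: the output starts correctly, but I am unable to remove the text from $boundaryName onwards (including $boundaryName itself).

The original text file is OCR-generated (Optical Character Recognition), so it is unevenly formatted. There are line breaks in odd places. So my options are limited when it comes to delimiting.

I am not sure whether my if (Select-String -InputObject $inputFile -pattern $refName) is valid. It appears to work correctly. The overall design seems crude, in that I am guessing how many lines I will need. Finally, I have tried various methods to trim the string from $boundaryName onwards, but none were successful. For this:

  • string.split() is not practical
  • replacing spaces with newlines to get an array and looping up to the $boundaryName element is possible, but I don't know how to terminate the array there before converting it back to a string

Any suggestions would be appreciated.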

Abbreviated content of the single file Copy of 31832_226140__0001-00006.txt (x2 200 listings) is:

Start of text file

________________
BARTLETT-BEDGGOOD
PENCARROW COMPOSITE ROLL
PAGE 6
PAGE 7
PENCARROW COMPOSITE ROLL
BEECH-BEST
www.
.......................
001 BARTLETT. Lois Elizabeth

Middle of text file

............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
001 BEECH, Margaret ..........

End of text file

..............312 Munita Rood Eastbourne, Civil Eng 200 BEST, Dons Amy .........
..........50 Man Street, Wamuomata, Marned
SO NON

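To use a regex across line breaks, the file needs to be read as a single string. Get-Content -Raw will do that. This assumes that you do not want the lines containing refName and boundaryName included in the output.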

$c = Get-Content -Path '.\beech.txt' -Raw
$refName = "001 BARTLETT"
$boundaryName = "001 BEECH"
if ($c -match "(?smi).*$refName.*?`r`n(.*)$boundaryName.*?`r`n.*") {
    $result = $Matches[1]
}
$result

For more information, see https://stackoverflow.com/a/12573413/447901
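As a variation on the same idea, here is a minimal sketch (not the answerer's code) that escapes the two names with [regex]::Escape and accepts either CRLF or LF line endings, since OCR output is not guaranteed to use CRLF. The inline $text sample and the $start, $end, $pattern and $between names are assumptions made for illustration. Note that the lazy (.*?) stops at the first occurrence of the boundary, whereas the greedy (.*) above runs to the last one.

```powershell
# Hypothetical sample standing in for the contents of the OCR file
$text = @'
BEECH-BEST
001 BARTLETT. Lois Elizabeth
............. 15 St Ronans Av. Lower Hutt
............15 St Ronans Av, Lower Mutt. Coachbuild
001 BEECH, Margaret ..........
..............312 Munita Rood Eastbourne
'@

$refName = '001 BARTLETT'
$boundaryName = '001 BEECH'

# Escape the names so that characters such as '.' or '(' are matched literally
$start = [regex]::Escape($refName)
$end = [regex]::Escape($boundaryName)

# (?s) lets '.' match line breaks; \r?\n accepts both CRLF and LF line endings.
# The rest of the $refName line is skipped, then everything up to the first
# occurrence of $boundaryName is captured.
$pattern = '(?s){0}[^\r\n]*\r?\n(.*?){1}' -f $start, $end

$between = $null
if ($text -match $pattern) {
    $between = $Matches[1]
}
$between
```

With a real file, $text would come from Get-Content -Raw as in the answer above.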

How close is this to what you want?

function Process-File {
    param (
        [Parameter(Mandatory = $true, Position = 0)]
        [string]$HeadText,
        [Parameter(Mandatory = $true, Position = 1)]
        [string]$TailText,
        [Parameter(ValueFromPipeline)]
        $File
    )
    Process {
        $Inside = $false;
        switch -Regex -File $File.FullName {
            #'^\s*$' { continue }
            "(?i)^\s*$TailText(?<Tail>.*)`$"    { $Matches.Tail; $Inside = $false }
            '^(?<Line>.+)$'                     { if($Inside) { $Matches.Line } }
            "(?i)^\s*$HeadText(?<Head>.*)`$"    { $Matches.Head; $Inside = $true }
            default { continue }
        }
    }
}
$File = 'Copy of 31832_226140__0001-00006.txt'
#$Path = $PSScriptRoot
$Path = 'C:\test\test3'
$Result = Get-ChildItem -Path "$Path\$File" | Process-File '001 BARTLETT' '001 BEECH'
$Result | Out-File -FilePath "$Path\SpanText.txt"

This is the output:

. Lois Elizabeth
............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
, Margaret ..........
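If the head text should stay in the result and everything from the boundary onwards should be dropped, as the question describes, plain string indexing may be enough. The following is only a sketch under that assumption: Get-TextBetween is a hypothetical helper, not part of either answer, and the $sample string is made up from the abbreviated file content.

```powershell
# Hypothetical helper: returns the text from $HeadText up to, but not including,
# the first $TailText that follows it, or $null if either marker is missing.
function Get-TextBetween {
    param (
        [Parameter(Mandatory = $true)] [string]$Text,
        [Parameter(Mandatory = $true)] [string]$HeadText,
        [Parameter(Mandatory = $true)] [string]$TailText
    )
    # Ordinal comparison: exact, case-sensitive, culture-independent
    $comparison = [System.StringComparison]::Ordinal
    $startIndex = $Text.IndexOf($HeadText, $comparison)
    if ($startIndex -lt 0) { return $null }
    # Look for the boundary only after the end of the head text
    $endIndex = $Text.IndexOf($TailText, $startIndex + $HeadText.Length, $comparison)
    if ($endIndex -lt 0) { return $null }
    return $Text.Substring($startIndex, $endIndex - $startIndex)
}

# Made-up sample; with a real file, pass the result of Get-Content -Raw instead
$sample = "www.`n001 BARTLETT. Lois Elizabeth`n............15 St Ronans Av`n001 BEECH, Margaret"
Get-TextBetween -Text $sample -HeadText '001 BARTLETT' -TailText '001 BEECH'
```

Because the whole text is treated as one string, the odd OCR line breaks do not matter here, and no guess at the number of lines is needed.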
