PowerShell: return the highest 4-digit number found in a string pattern - searching Word documents



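I am trying to return the highest 4-digit number that appears in a string pattern across a set of documents.

String pattern: 3 letters, a dash, 4 digits

The Word documents contain document identifier codes like the ones shown below.

Example files: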

Car Parts.docx > CPW - 2345

Car Handles.docx > CPW - 8723

Car List.docx > CPA - 9083

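I have included the sample code I am trying to adapt. I am not a VBA or PowerShell programmer, so what I am attempting may well be wrong.

I am happy to consider alternatives on the Windows platform.

I used the following references to get started: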

http://chris-nullpayload.rhcloud.com/2012/07/find-and-replace-string-in-all-docx-files-recursively/

PowerShell: return the number of instances found in a file for a search pattern

PowerShell: return the file name with the highest number

$list = gci "C:\Users\WP\Desktop\SearchFiles" -Include *.docx -Force -recurse
foreach ($foo in $list) {
$objWord = New-Object -ComObject word.application
$objWord.Visible = $False
$objDoc = $objWord.Documents.Open("$foo")
$objSelection = $objWord.Selection 
$Pat1 = [regex]'[A-Z]{3}-[0-9]{4}'   # Find the regex match 3 letters  followed by 4 numbers eg     HGW - 1024
$findtext= "$Pat1"
 $highestNumber = 
 # Find the highest occurrence of this pattern found in the documents searched - output to text file or on screen
Sort-Object |                   # This may also be wrong -I added it for when I find the pattern
Select-Object -Last 1 -ExpandProperty Name

<#   The below may not be needed  - ?
$ReplaceText = ""
$ReplaceAll = 2
$FindContinue = 1
$MatchFuzzy = $False
$MatchCase = $False
$MatchPhrase = $false
$MatchWholeWord = $True
$MatchWildcards = $True
$MatchSoundsLike = $False
$MatchAllWordForms = $False
$Forward = $True
$Wrap = $FindContinue
$Format = $False
$objSelection.Find.execute(
    $FindText,
    $MatchCase,
    $MatchWholeWord,
    $MatchWildcards,
    $MatchSoundsLike,
    $MatchAllWordForms,
    $Forward,
    $Wrap,
    $Format,
    $ReplaceText,
    $ReplaceAll
  }
}
#>

I would appreciate any advice on how to proceed.

Try this:

# This assembly is needed to read zip archives; a .docx file is a zip archive.
# .NET 4.5 or later is required.
Add-Type -AssemblyName System.IO.Compression.FileSystem
# This function gets the plain text from a Word document.
# Adapted from http://stackoverflow.com/a/19503654/284111
# It is not ideal, but good enough.
function Extract-Text([string]$fileName) {
  # Generate a random temporary file name (in the temp folder) for text extraction from the .docx
  $tempFileName = Join-Path ([System.IO.Path]::GetTempPath()) ([Guid]::NewGuid().Guid)
  # Extract the document XML into a variable ($text)
  $zip = [System.IO.Compression.ZipFile]::OpenRead($fileName)
  try {
    $entry = $zip.GetEntry("word/document.xml")
    [System.IO.Compression.ZipFileExtensions]::ExtractToFile($entry, $tempFileName)
  }
  finally {
    # Close the archive so the .docx file is not left locked
    $zip.Dispose()
  }
  $text = [System.IO.File]::ReadAllText($tempFileName)
  Remove-Item $tempFileName
  # Remove the XML tags and leave the text behind
  $text = $text -replace '</w:r></w:p></w:tc><w:tc>', " "
  $text = $text -replace '</w:r></w:p>', "`r`n"
  $text = $text -replace "<[^>]*>",""
  return $text
}
$fileList = Get-ChildItem "C:\Users\WP\Desktop\SearchFiles" -Include *.docx -Force -Recurse
# Adapted from http://stackoverflow.com/a/36023783/284111
$fileList |
  ForEach-Object { [regex]::Matches((Extract-Text $_.FullName), '(?<=[A-Za-z]{3}\s*(?:-|–)\s*)\d{4}') } |
  Select-Object -ExpandProperty Captures |
  Sort-Object Value -Descending |
  Select-Object -First 1 -ExpandProperty Value

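The main idea here is to avoid fiddling with Word's COM API and simply extract the text from the document by hand.

The way to get the highest number is to isolate it with a regular expression first, then sort the matches and select the first item. Like this: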
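As a rough illustration of reading the text without Word, the sketch below reads word/document.xml straight from the archive stream, so no temporary file is involved. The function name Get-DocxXml and the stand-in archive are assumptions made for this example only; the archive holds just one XML entry and is not a complete Word document.

```powershell
Add-Type -AssemblyName System.IO.Compression
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: returns the raw XML of word/document.xml as a string.
function Get-DocxXml([string]$Path) {
  $zip = [System.IO.Compression.ZipFile]::OpenRead($Path)
  try {
    $entry  = $zip.GetEntry('word/document.xml')
    $reader = New-Object System.IO.StreamReader($entry.Open())
    try { return $reader.ReadToEnd() } finally { $reader.Dispose() }
  }
  finally { $zip.Dispose() }
}

# Build a stand-in archive in the temp folder so the sketch runs on its own.
$demoPath = Join-Path ([System.IO.Path]::GetTempPath()) ("demo-{0}.docx" -f [Guid]::NewGuid().Guid)
$zip = [System.IO.Compression.ZipFile]::Open($demoPath, [System.IO.Compression.ZipArchiveMode]::Create)
try {
  $writer = New-Object System.IO.StreamWriter($zip.CreateEntry('word/document.xml').Open())
  try { $writer.Write('<w:p><w:r><w:t>CPW - 2345</w:t></w:r></w:p>') } finally { $writer.Dispose() }
}
finally { $zip.Dispose() }

# Read the XML back and strip the tags, leaving only the text.
$xml  = Get-DocxXml $demoPath
$text = $xml -replace '<[^>]*>', ''
Remove-Item $demoPath
$text
```

Reading from the stream keeps the working folder clean, and disposing the archive in a finally block releases the file even if the entry cannot be read.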

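# Match against the document text; $objDoc.Content.Text returns it as a string
[regex]::Matches($objDoc.Content.Text, '(?<=[A-Z]{3}\s*-\s*)\d{4}') |
  Select-Object -ExpandProperty Captures |
  Sort-Object Value -Descending |
  Select-Object -First 1 -ExpandProperty Value |
  Add-Content outfile.txt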
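If the file that holds the highest number is also of interest, the same isolate, sort, and select steps can carry the file name along. The following is only a sketch: the file names and text in $documents are made-up stand-ins for text already extracted from each document, and the numbers are cast to [int] so the sort is numeric rather than alphabetical.

```powershell
# Hypothetical input: file name -> text already extracted from that document.
$documents = [ordered]@{
  'CarParts.docx'   = 'Document code: CPW - 2345'
  'CarHandles.docx' = 'Document code: CPW - 8723'
  'CarList.docx'    = 'Document code: CPA - 9083'
}

$pattern = '(?<=[A-Z]{3}\s*-\s*)\d{4}'

# Emit one object per match, pairing the number with the file it came from.
$results = foreach ($name in $documents.Keys) {
  foreach ($m in [regex]::Matches($documents[$name], $pattern)) {
    [pscustomobject]@{ File = $name; Number = [int]$m.Value }
  }
}

# Sort numerically and keep the top entry.
$top = $results | Sort-Object Number -Descending | Select-Object -First 1
"{0} found in {1}" -f $top.Number, $top.File
```

With the sample values above this reports 9083 for CarList.docx.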

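I think the problem with your regular expression is that the codes in your sample data have spaces around the dash, and your pattern does not allow those spaces.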
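A quick way to check this is to run both patterns against the sample identifiers from the question. This is a minimal sketch; the variable names are illustrative, and the relaxed pattern is just one possible way of allowing optional whitespace around the dash.

```powershell
# Sample identifiers as they appear in the question.
$samples = 'CPW - 2345', 'CPW - 8723', 'CPA - 9083'

$strict  = '[A-Z]{3}-[0-9]{4}'          # pattern from the question: no spaces allowed
$relaxed = '[A-Z]{3}\s*-\s*[0-9]{4}'    # same pattern with optional whitespace around the dash

# -cmatch is the case-sensitive regex match operator.
$strictHits  = @($samples | Where-Object { $_ -cmatch $strict })
$relaxedHits = @($samples | Where-Object { $_ -cmatch $relaxed })

"Strict pattern matched  : $($strictHits.Count)"
"Relaxed pattern matched : $($relaxedHits.Count)"
```

The strict pattern matches none of the three samples, while the relaxed one matches all of them.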
