Faster way to determine whether a file is a PDF



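I'm looking for pointers/tips to improve the speed and/or effectiveness of the script below. I'm open to other approaches, but I only dabble in PowerShell, cmd, and Python.

Credit where credit is due: this is a hack job on the following: https://stackoverflow.com/a/44183234/12834479

I'm not working locally but against a network share over a VPN, and the connection speed is terrible. Roughly, it runs at 8 seconds per PDF.

The problem I'm trying to solve: the goal is to make sure every PDF can be read by Adobe. Images saved with a .pdf extension (but that are not really PDFs) open in some PDF software, but Adobe hates them. I have a way to convert them, but my bottleneck is identifying them.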

  • Adobe PDFs - start with %PDF
  • Some bank PDFs - start with "blank space", then %PDF
  • Third-party software - junk header, but %PDF is contained in the document

$items = Get-ChildItem | Where-Object {$_.Extension -eq ".pdf"}
$arrary = @()
$logFile = "RESULTS_$(get-date -Format yyyymmdd).log"
$badCounter = 0
$goodCounter = 0
$msg = "`n`nProcessing " + $items.count + " files... "
Write-Host -nonewline -foregroundcolor Yellow $msg
foreach ($item in $items)
{
    trap { Write-Output "Error trapped: $_"; continue; }
    try {
        $pdfText = Get-Content $item -raw
        $ptr3 = '%PDF'
        if ('%PDF' -ne $pdfText.SubString(([System.Math]::Max(0,$pdfText.IndexOf($ptr3))),4)) { $arrary+= "$item |-failed" >>$logfile;$badCounter += 1; $badCounter} else { $goodCounter += 1; $goodCounter}
        continue;}
    catch [System.Exception]{write-output "$item $_";}}
$totalCounter = $badCounter + $goodCounter
Write-Output $arrary >> $logFile
1..3 | %{ Write-Output "" >> $logFile }
Write-Output "Total: $totalCounter / BAD: $badCounter / GOOD: $goodCounter" >> $logFile
Write-Output "DONE!`n`n"

Currently running on PS version 7.1.3, if that makes any difference, but I also have 5.1.18 locally.

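Actually, PDF files are not plain text files at all but binary files, so they should not be read as strings.
What you're looking for is the FourCC magic number in the file. This four-character code can be regarded as a magic number that identifies the file type. For PDF files, these 4 bytes are 0x25, 0x50, 0x44, 0x46 ("%PDF"), and the file should start with these bytes.

For those true PDF files, you can test using: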

# Windows PowerShell 5.1 syntax; in PowerShell 6+ use -AsByteStream instead of -Encoding Byte
[byte[]]$fourCC = Get-Content -Encoding Byte -ReadCount 4 -TotalCount 4 -Path 'X:\TheFile.pdf'
if ([System.Text.Encoding]::ASCII.GetString($fourCC) -ceq '%PDF') {
    Write-Host "This is a true PDF file"
}
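
Since you mention Python is an option too, here is a minimal sketch of the same four-byte check. The name `has_pdf_magic` is only an illustrative choice, not something from your script. It opens the file in binary mode and reads exactly four bytes, so very little data should have to travel over the slow share per file.

```python
PDF_MAGIC = b"%PDF"  # the bytes 0x25, 0x50, 0x44, 0x46


def has_pdf_magic(path):
    """Return True if the first four bytes of the file are exactly %PDF."""
    # Binary mode: the bytes are compared as-is, with no text decoding.
    with open(path, "rb") as handle:
        return handle.read(len(PDF_MAGIC)) == PDF_MAGIC
```

A file shorter than four bytes simply yields a shorter read, so the comparison fails without raising an error.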

However, since you say "bank PDFs often start with whitespace", to also consider those files "good" you can do:

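# Reads the first 6 bytes, so up to 2 leading bytes are tolerated before '%PDF'
# (in PowerShell 6+ use -AsByteStream instead of -Encoding Byte)
[byte[]]$sixCC = Get-Content -Encoding Byte -ReadCount 6 -TotalCount 6 -Path 'X:\TheFile.pdf'
if ([System.Text.Encoding]::ASCII.GetString($sixCC) -cmatch '%PDF') {
    Write-Host "This is a PDF file"
}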
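
If the amount of leading whitespace varies, a fixed six-byte read may be too short. As a rough Python sketch (the function name and the 64-byte probe size are assumptions of mine, not values taken from your files), you could read a small prefix, strip the leading whitespace, and then test for the marker:

```python
PDF_MAGIC = b"%PDF"


def starts_with_pdf_after_whitespace(path, probe_size=64):
    """Return True if %PDF is the first non-whitespace content in the probe."""
    with open(path, "rb") as handle:
        head = handle.read(probe_size)
    # bytes.lstrip() without arguments removes leading ASCII whitespace
    # (space, tab, CR, LF, vertical tab, form feed).
    return head.lstrip().startswith(PDF_MAGIC)
```

Unlike the `-cmatch` test, this only accepts whitespace before the marker. A junk header followed by %PDF is rejected, which may or may not be what you want.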

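If you also want to treat files where "%PDF" can be found anywhere in the file as "good", you need to read the whole file as a string, but using an encoding that maps each byte one-to-one to a character. For that, you can use the following helper function: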

function ConvertTo-BinaryString {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes.
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [String]$Path
    )
    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding     = [Text.Encoding]::GetEncoding(28591)
    $Stream       = [System.IO.FileStream]::new($Path, 'Open', 'Read')
    $StreamReader = [System.IO.StreamReader]::new($Stream, $Encoding)
    $BinaryText   = $StreamReader.ReadToEnd()
    $StreamReader.Close()
    $Stream.Close()
    return $BinaryText
}

Next, you can use that function like this:

$binString = ConvertTo-BinaryString -Path 'X:\TheFile.pdf'
if ($binString.IndexOf("%PDF") -ge 0) {
    Write-Host "This is a PDF file"
}
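
Reading the entire file is the expensive part on a slow connection. As a hedged alternative in Python, the sketch below scans the file in chunks and stops at the first hit, so files with the marker near the start are settled after a single read. `find_marker`, the 64 KiB chunk size, and the optional `limit` parameter are illustrative assumptions rather than anything from your setup.

```python
def find_marker(path, marker=b"%PDF", chunk_size=64 * 1024, limit=None):
    """Return the byte offset of the first marker occurrence, or -1.

    limit: optional maximum number of bytes to examine from the file start.
    """
    overlap = len(marker) - 1  # bytes carried over between chunks
    tail = b""                 # end of the previous chunk
    offset = 0                 # file offset where the next chunk starts
    with open(path, "rb") as handle:
        while limit is None or offset < limit:
            want = chunk_size if limit is None else min(chunk_size, limit - offset)
            chunk = handle.read(want)
            if not chunk:
                break  # end of file
            # Prepend the carried-over bytes so a marker that straddles
            # two chunks is still found.
            window = tail + chunk
            hit = window.find(marker)
            if hit != -1:
                return offset - len(tail) + hit
            tail = window[-overlap:] if overlap else b""
            offset += len(chunk)
    return -1
```

Something like `find_marker(path) >= 0` then plays the role of the `IndexOf` test, and passing a `limit` caps how much of each file is pulled across the VPN.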

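Putting it all together, and assuming you want to treat as "good" every file with a .pdf extension in which the magic number "%PDF" (case-sensitive) can be found anywhere in the file: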

function ConvertTo-BinaryString {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes.
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [String]$Path
    )
    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding     = [Text.Encoding]::GetEncoding(28591)
    $Stream       = [System.IO.FileStream]::new($Path, 'Open', 'Read')
    $StreamReader = [System.IO.StreamReader]::new($Stream, $Encoding)
    $BinaryText   = $StreamReader.ReadToEnd()
    $StreamReader.Close()
    $Stream.Close()
    return $BinaryText
}

$badCounter  = 0
$goodCounter = 0
$logFile     = "RESULTS_{0:yyyyMMdd}.log" -f (Get-Date)

# get an array of pdf file FullNames
$files = @(Get-ChildItem -File -Filter '*.pdf').FullName
Write-Host "Processing $($files.Count) files... " -ForegroundColor Yellow

# loop through the array, test if '%PDF' is found and output strings for the log file
$result = foreach ($item in $files) {
    $pdfText = ConvertTo-BinaryString -Path $item
    if ($pdfText.IndexOf("%PDF") -ge 0) {
        $goodCounter++
        "Success - $item"
    }
    else {
        $badCounter++
        "Fail - $item"
    }
}

# write the output to the log file
$result | Set-Content -Path $logFile
"=" * 25 | Add-Content -Path $logFile
"BAD:   $badCounter"  | Add-Content -Path $logFile
"GOOD:  $goodCounter" | Add-Content -Path $logFile
"Total: $($files.Count)" | Add-Content -Path $logFile
Write-Host "DONE!" -ForegroundColor Green
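
If you would rather sort the files into the three groups from your question than get a single good/bad verdict, a rough Python sketch could look like the one below. The category labels, the function names, and the 1024-byte probe are all assumptions on my part. As far as I recall, the implementation notes in Adobe's PDF Reference say Acrobat accepts the header anywhere in the first 1024 bytes, but please verify that against your own files before relying on it.

```python
from collections import Counter
from pathlib import Path

PDF_MAGIC = b"%PDF"


def classify_pdf(path, probe_size=1024):
    """Classify a file by where %PDF sits within its first probe_size bytes."""
    with open(path, "rb") as handle:
        head = handle.read(probe_size)  # only this prefix is read
    if head.startswith(PDF_MAGIC):
        return "standard"            # starts with %PDF
    if head.lstrip().startswith(PDF_MAGIC):
        return "leading-whitespace"  # whitespace, then %PDF
    if PDF_MAGIC in head:
        return "junk-header"         # other bytes before %PDF
    return "not-found"               # no %PDF inside the probe


def summarize(folder, probe_size=1024):
    """Count the categories for every *.pdf file directly inside folder."""
    counts = Counter()
    for path in sorted(Path(folder).glob("*.pdf")):
        counts[classify_pdf(path, probe_size)] += 1
    return counts
```

For example, `summarize(".")` returns a `Counter` keyed by category for the current folder. Files reported as "not-found" are the candidates for conversion, although a marker located beyond the probe would be missed.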
