PHP regular expression to get a base64 string



I have a file, smime.p7m, with a lot of content in it. One or more parts of that content look like this:

--_3821f5f5-222-4a90-82e0-d8922ee62cc8_
Content-Type: application/pdf;
name="001235_0001.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="001235_0001.pdf"
JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=
--------------ms021111111111111111111107--

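Is there a way to get the file name, if the part is a PDF, together with the base64 code below it? For example, with a regex? The file may contain more than one PDF.

The file name is not the problem. I already get it with "filename="(.*).pdf". What I don't know is how to get the base64 code that follows the file name.

Base64 consists of the characters A…Z and a…z, the digits 0…9, and the symbols + and /. It can also end with one or two =, and it may be split across several lines.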

if (preg_match('/filename="(?P<filename>[^"]*?\.pdf)"\s*(?P<base64>([A-Za-z0-9+\/]+\s*)+=?=?)/', $s, $regres)) {
    print("FileName: {$regres['filename']}\n");
    print("Base64: {$regres['base64']}\n");
}
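
Once the base64 group has been captured, it still contains the line breaks used for wrapping. A minimal sketch of decoding such a capture, assuming the payload is already isolated as a string; the helper name `decodeWrappedBase64` is hypothetical and not part of the original answer:

```php
<?php
// Hypothetical helper: decode base64 text that may be wrapped across several lines.
function decodeWrappedBase64(string $payload): ?string
{
    // Remove the line breaks (and any other whitespace) used for wrapping.
    $compact = preg_replace('/\s+/', '', $payload);

    // Strict mode makes base64_decode() return false for characters outside the alphabet.
    $decoded = base64_decode($compact, true);

    return $decoded === false ? null : $decoded;
}

$wrapped = "SGVsbG8s\r\nIFdvcmxk\r\nIQ==";
var_dump(decodeWrappedBase64($wrapped));      // string(13) "Hello, World!"
var_dump(decodeWrappedBase64("not*base64"));  // NULL
```

Returning null for invalid input is only one possible design; throwing an exception would work as well.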

Use

(?im)^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)

See proof.

PHP

preg_match_all('/^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)/im', $str, $matches);

Explanation

--------------------------------------------------------------------------------
(?im)                    set flags for this block (case-insensitive)
                         (with ^ and $ matching start and end of line)
                         (with . not matching \n)
                         (matching whitespace and # normally)
--------------------------------------------------------------------------------
^                        the beginning of a "line"
--------------------------------------------------------------------------------
filename="               'filename="'
--------------------------------------------------------------------------------
(                        group and capture to \1:
--------------------------------------------------------------------------------
[^"]*                    any character except: '"' (0 or more times
                         (matching the most amount possible))
--------------------------------------------------------------------------------
\.                       '.'
--------------------------------------------------------------------------------
pdf                      'pdf'
--------------------------------------------------------------------------------
)                        end of \1
--------------------------------------------------------------------------------
"                        '"'
--------------------------------------------------------------------------------
\R+                      any line break sequence (1 or more times
                         (matching the most amount possible))
--------------------------------------------------------------------------------
(                        group and capture to \2:
--------------------------------------------------------------------------------
.+                       any character except \n (1 or more times
                         (matching the most amount possible))
--------------------------------------------------------------------------------
(?:                      group, but do not capture (1 or more times
                         (matching the most amount possible)):
--------------------------------------------------------------------------------
\R                       any line break sequence
--------------------------------------------------------------------------------
.+                       any character except \n (1 or more times
                         (matching the most amount possible))
--------------------------------------------------------------------------------
)+                       end of grouping
--------------------------------------------------------------------------------
)                        end of \2

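I see this task as data extraction only, not validation, so there is no need to tighten the regex logic.

You only need a pattern that matches filename=" at the start of a line, then captures the quoted substring (as long as it ends in .pdf), then, after any amount of whitespace, captures every character until one or two = are encountered.

Using greedy negated character classes lets the regex engine move quickly. The m pattern modifier tells the regex engine that the ^ metacharacter (not the ^ used inside square brackets) may match at the start of a line as well as at the start of the string.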
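
The effect of the m modifier can be seen with a small sketch. The `$text` value is a made-up two-line sample, not taken from the real file:

```php
<?php
// Hypothetical two-line sample in which filename= starts the second line.
$text = "Content-Disposition: attachment;\nfilename=\"a.pdf\"";

// Without m, ^ only matches at the very start of the subject string.
var_dump(preg_match('/^filename=/', $text));   // int(0)

// With m, ^ also matches immediately after each newline.
var_dump(preg_match('/^filename=/m', $text));  // int(1)
```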

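Perhaps you want to generate an associative array in which the file name strings are the keys and the encoded strings are the values; array_column() sets that up quickly when there are qualifying matches.

Code: (Demo)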

var_export(
    preg_match_all(
        '~^filename="([^"]+)\.pdf"\s*([^=]+={1,2})~m',
        $fileContents,
        $out,
        PREG_SET_ORDER
    )
    ? array_column($out, 2, 1)
    : "no pdf's found"
);

Output:

array (
'001235_0001' => 'JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=',
)
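
The extracted strings still have to be decoded to get the PDF bytes. A minimal sketch of that step, assuming a tiny made-up message in `$fileContents` (a real smime.p7m is far larger) and a hypothetical `$decoded` result array:

```php
<?php
// Hypothetical miniature of the message, with the payload wrapped over two lines.
$fileContents = "Content-Disposition: attachment;\n"
    . "filename=\"tiny.pdf\"\n"
    . "JVBERi0x\n"
    . "LjM=\n"
    . "--boundary--\n";

preg_match_all(
    '~^filename="([^"]+)\.pdf"\s*([^=]+={1,2})~m',
    $fileContents,
    $out,
    PREG_SET_ORDER
);

$decoded = [];
foreach (array_column($out, 2, 1) as $name => $encoded) {
    // Drop the line breaks used for wrapping, then decode in strict mode.
    $bytes = base64_decode(preg_replace('/\s+/', '', $encoded), true);
    if ($bytes !== false) {
        // The pattern captures the name without its extension, so add it back.
        $decoded[$name . '.pdf'] = $bytes;
    }
}

var_dump($decoded);
```

Each value could then be written out with file_put_contents(). Note that this pattern relies on the payload ending in =; a payload with no padding would need a different terminator, such as the boundary line.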
