PHP: replace all matched tags and escape the unmatched text



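I have some HTML from users that may contain tags. I want to do two things: first, find all the tags I need and replace them with new tags; second, escape the rest of the user's text.

For example, suppose this is the HTML from my user, containing some of the tags I need: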

Hello! This is John. I have attached image <img class="someImg" unic="img.jpg">. And Visit this URL example.com
Yes! some injection like <script>alert('Hello');</script>

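I want to capture the complete <img> tag, including its attributes, and replace it with my own tag. I also want to capture any URLs the text contains. All the remaining text should be escaped and sanitized.

I found the function preg_replace_callback_array, which searches for the img and URL patterns and replaces them. That works well, but the only problem is that it does nothing to the text that matches none of the regexes.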

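$patterns = [
    // URL pattern; the regex metacharacters (/, ., \b) are escaped.
    '/(http(s)?:\/\/)?([a-z]*\.)?[-a-zA-Z0-9@:%._+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_+.~#?&\/=]*)/' => 'functionforFoundURL',
    // <img> tags carrying the class "someImg" and a unic attribute.
    '/<img [^>]*class="[^"]*someImg\b[^"]*"[^>]* unic="(.*?)"[^>]*>/' => 'functionforFoundImg',
];
$result = preg_replace_callback_array($patterns, $html);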

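It runs my functions to replace the img and the URL, but it leaves the remaining text, which should be escaped, untouched.

If I escape the result of preg_replace_callback_array with htmlspecialchars after the tags have been replaced, it escapes the entire string, including the new tags.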
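One possible way to escape only the unmatched text is to split the input on the patterns and treat matches and non-matches separately. The following is a minimal sketch, not a complete sanitizer: the function name `replaceAndEscape`, the simplified patterns, and the replacement tags (`<img src>` and `<a href>`) are assumptions made for illustration.

```php
<?php
// Hypothetical helper: rebuilds matching <img> tags and URLs, escapes everything else.
function replaceAndEscape(string $html): string
{
    // One combined pattern. The single outer capture group makes preg_split
    // return the matches as well, so text and matches alternate in the result.
    $pattern = '~(<img\b[^>]*\bunic="[^"]*"[^>]*>|https?://[^\s<>"\']+)~i';
    $parts = preg_split($pattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Even indexes hold the text between matches: escape it.
            $out .= htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        } elseif (preg_match('~^<img\b[^>]*\bunic="([^"]*)"~i', $part, $m)) {
            // Rebuild the image tag from the captured unic value only,
            // so no other attribute of the original tag survives.
            $out .= '<img src="' . htmlspecialchars($m[1], ENT_QUOTES, 'UTF-8') . '">';
        } else {
            // Anything else that matched is a URL.
            $url = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
            $out .= '<a href="' . $url . '">' . $url . '</a>';
        }
    }
    return $out;
}

echo replaceAndEscape('Hi <img class="someImg" unic="img.jpg"> see https://example.com <script>alert(1)</script>');
```

Note that this simplified URL pattern only matches URLs that start with http:// or https://, so a bare domain such as example.com would be escaped as ordinary text.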

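In this question, there is an answer that advises against using regular expressions with HTML. I'm not sure about that and don't know much about it, so I will use PHP's DOMDocument because it is easier.

Code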

$string = 'Hello! This is John. I have attached image <img class="someImg" unic="img.jpg">. And Visit this URL example.com
Image with on events. <img class="someImg" unic="img.jpg" onclick="alert(\'no!\');" onload="console.log(\'img load\');">
Yes! some injection like <script>alert(\'Hello\');</script>. And child item like <div>div<p>paragraph</p></div>
<p>The end</p>';
$Dom = new DOMDocument();
// Wrap the input in a custom `<body>` so the parser does not wrap it in `<p>` automatically.
// Without the wrapper, strip `<p>` and `</p>` at the end instead of `<body>` and `</body>`.
$Dom->loadHTML('<body>' . $string . '</body>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$foundUrls = [];
grabUrlsAndSanitize($Dom, $foundUrls);
// Get the result and remove the custom `<body>` and `</body>`.
$output = str_replace(['<body>', '</body>'], '', $Dom->saveHTML());
echo '<strong>output:</strong> ' . htmlspecialchars($output, ENT_QUOTES);
echo '<br>' . PHP_EOL;
echo '<strong>found URLs:</strong> <pre>' . print_r($foundUrls, true) . '</pre>';

/**
 * Grab URLs and strip scripting such as <script> elements and on* event attributes.
 *
 * @param DOMNode $DomNode
 * @param array $foundUrls
 * @return void
 */
function grabUrlsAndSanitize(DOMNode $DomNode, array &$foundUrls)
{
    // Remove all <script> elements. Copy the live node list first,
    // because removing nodes while iterating over it can skip elements.
    foreach (iterator_to_array($DomNode->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }
    foreach ($DomNode->childNodes as $Node) {
        if ($Node->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        // Walk the attributes backwards so removals do not shift the remaining indexes.
        for ($i = $Node->attributes->length - 1; $i >= 0; --$i) {
            $attribute = $Node->attributes->item($i);
            if (preg_match('/^on[a-z]+$/i', $attribute->name)) {
                // Remove event handlers such as onload or onclick.
                // The anchors keep names like "content" from matching.
                $Node->removeAttributeNode($attribute);
                continue;
            }
            if (strtolower($attribute->name) === 'unic') {
                // Grab the URL from unic="...".
                $foundUrls[] = $attribute->value;
            }
        }
        if ($Node->hasChildNodes()) {
            grabUrlsAndSanitize($Node, $foundUrls);
        }
    }
}

Result:

output: Hello! This is John. I have attached image <img class="someImg" unic="img.jpg">. And Visit this URL example.com
Image with on events. <img class="someImg" unic="img.jpg">
Yes! some injection like . And child item like <div>div<p>paragraph</p></div>
<p>The end</p>
found URLs:
Array
(
[0] => img.jpg
[1] => img.jpg
)
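The code above collects the unic values but leaves the original <img> tags in place. If the tags should also be replaced with your own tag, one option is to build a new element and swap it in. This is only a sketch: the function name `replaceUnicImages`, the `src` attribute, and the class name `attachment` are assumptions, not part of the question.

```php
<?php
// Hypothetical helper: swaps every <img unic="..."> for a freshly built <img src="...">.
function replaceUnicImages(string $html): string
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    // Same custom <body> wrapper as in the code above.
    $dom->loadHTML('<body>' . $html . '</body>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Copy the result into an array so replacing nodes cannot disturb the iteration.
    foreach (iterator_to_array($xpath->query('//img[@unic]')) as $img) {
        // The new element carries only the attributes set here,
        // so attributes such as onclick on the original tag are dropped.
        $new = $dom->createElement('img');
        $new->setAttribute('src', $img->getAttribute('unic'));
        $new->setAttribute('class', 'attachment');
        $img->parentNode->replaceChild($new, $img);
    }
    return str_replace(['<body>', '</body>'], '', $dom->saveHTML());
}

echo replaceUnicImages('Hello <img class="someImg" unic="img.jpg" onclick="x()"> world');
```

This sketch touches only the images. Other tags in the input are left as they are, so it would still need to be combined with sanitizing such as grabUrlsAndSanitize.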

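Security considerations

Using the function above to sanitize scripts is not 100% safe against XSS. There is much more to do. Please read more about this in the OWASP XSS Filter Evasion Cheat Sheet.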
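As one example of what the function misses: removing on* attributes does not touch URL attributes such as href or src, so a `javascript:` URL can still get through. A minimal sketch of a scheme check follows. The function name `hasSafeScheme` and the allowlist are assumptions, and this covers only one of many evasion techniques.

```php
<?php
// Hypothetical helper: true only for relative URLs or URLs whose scheme is allowlisted.
function hasSafeScheme(string $url): bool
{
    // Drop whitespace and control characters, which can be used to
    // disguise a scheme (for example "java\tscript:").
    $clean = preg_replace('/[\x00-\x20]+/', '', $url);
    if (!preg_match('/^([a-z][a-z0-9+.\-]*):/i', $clean, $m)) {
        // No scheme at the start: treat it as a relative URL such as "img.jpg".
        return true;
    }
    // Assumed allowlist; adjust it to what the application really needs.
    return in_array(strtolower($m[1]), ['http', 'https', 'mailto'], true);
}

var_dump(hasSafeScheme('https://example.com'));
var_dump(hasSafeScheme('JaVaScRiPt:alert(1)'));
```

The check is deliberately strict: anything that looks like an unknown scheme is rejected, which is usually the safer default for user-supplied HTML.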

For better XSS prevention, I recommend using HTML Purifier, an HTML sanitizer library, or something similar.
