Malicious code injection: is removing script tags with a regex safe enough?

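So I've built a page where people can submit tutorials. These tutorials are basically written with the TinyMCE editor.

Anyway, people could abuse it by posting their own unescaped text and inserting a malicious <script>.

So my question is: is removing <script> tags with a regular expression safe enough? I would run this regex on the backend before storing the content.

I found this expression, for example: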

<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>

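No. They could use multibyte characters to get around your regexp, or sneak in combinations of mismatched opening and closing tags, create fake closing script tags, quote them inside attributes, and so on. Don't try to parse potentially noisy or malformed HTML with a regex; use an HTML parsing engine designed to handle these problems. The famous answer about parsing HTML with regular expressions is here: "RegEx match open tags except XHTML self-contained tags".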
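To illustrate the point about mismatched closing tags, here is a minimal JavaScript sketch. The function name `stripScripts` is hypothetical and only wraps the regex from the question. Browsers accept whitespace before the `>` of an end tag, so `</script >` still closes the script element, but the regex insists on a literal `</script>` and therefore never matches.

```javascript
// The regex from the question, with the "gi" flags.
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical helper: remove everything the regex considers a script element.
function stripScripts(html) {
  return html.replace(SCRIPT_RE, '');
}

// A closing tag with a space before ">" is valid for a browser,
// but the regex requires the exact text "</script>".
const payload = "<script>alert('XSS!')</script >";

// The payload comes back unchanged, so it would be stored as-is.
console.log(stripScripts(payload) === payload); // true
```

This is only one bypass; it is meant as evidence that the pattern is not a complete filter, not as a list of everything to patch.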

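If you're looking for one, I swear by this PHP library: http://simplehtmldom.sourceforge.net/
It first cleans up the document by converting noise to entities, and it accounts for the "noise" in "script", "style", and "textarea" elements, where anything found between the opening and closing tags is text rather than HTML. It then parses the result into a DOM structure that you can traverse much as you would with DOM methods in JavaScript. It comes with a "save" method (which produces a string), so after stripping tags from the page you end up with a modified, well-formed document. I have also tested this library with large inputs: when I previously used a regexp on large documents, it failed because the regexp hit the PHP memory limit, whereas this library parsed the same documents without memory problems. So I have tested it quite thoroughly and used it in large projects, and it has never failed me the way the built-in PHP functions and classes have with malformed data.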

Edit: here is an example of how to break it:

<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>

Just because jQuery uses this regex doesn't guarantee that it is safe on the server.

Even if you use the "gi" flags, it makes no difference:

var str="<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>";
str=str.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi,'');
//the "g" flag doesn't help here: after a match the search continues past it, instead of restarting from the beginning of the string
alert(str);

But if you apply it in a loop, rather than relying on the "g" flag, you would handle the case I raised.
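The loop idea can be sketched as follows. The function name `stripScriptsRepeatedly` is hypothetical; the regex is the one from the question, used without the "g" flag and re-applied from the start of the string until nothing changes. This handles the nested payload shown earlier, but it is still the same pattern, so it inherits the pattern's other blind spots.

```javascript
// The regex from the question, without the "g" flag: one removal per pass.
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/i;

// Hypothetical helper: keep removing matches until the string stops changing,
// so a tag that is reassembled by one removal is caught on the next pass.
function stripScriptsRepeatedly(html) {
  let previous;
  do {
    previous = html;
    html = html.replace(SCRIPT_RE, '');
  } while (html !== previous);
  return html;
}

const payload = "<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>";

// Pass 1 leaves "<script>alert('XSS!')</script>", pass 2 removes that too.
console.log(JSON.stringify(stripScriptsRepeatedly(payload))); // ""
```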

Edit 2: If the goal is to clean user input of every JavaScript problem, including "onload" and "onclick" attributes, why reinvent the wheel? There is http://htmlpurifier.org/ (see the demo).
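As a small illustration of why script tags are only part of the problem, the sketch below (with a hypothetical helper name, `stripScriptTags`) runs the question's regex over markup that carries its JavaScript in an event-handler attribute. There is no script element to match, so the input is returned untouched.

```javascript
// The regex from the question, with the "gi" flags.
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical helper: strip only what the regex recognises as script elements.
function stripScriptTags(html) {
  return html.replace(SCRIPT_RE, '');
}

// JavaScript delivered through an attribute instead of a <script> element.
const payload = '<img src="x" onerror="alert(\'XSS!\')">';

// The handler survives because the pattern only looks for <script>...</script>.
console.log(stripScriptTags(payload).includes('onerror')); // true
```

A whitelist-based sanitizer such as the one linked above works from the other direction: it keeps only the tags and attributes you explicitly allow.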

Why not use the DOM instead of a regex?

$content = "<h1>title</h1><p> test <span>1<!-- regular comment --><script> my script</script></span><script> my script</script></p><script> my script</script> <!--[if IE]><script>alert('XSS');</script><![endif]-->";
// create a DOMDocument from the string (without adding a doctype, html, or other extra tags) and wrap the content in a div
$dom = new DOMDocument();
$dom->loadHTML("<div>{$content}</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
//Removing any comments or conditional comments
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}
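// recursively remove every unwanted tag below $node
function verifyNodes(DOMNode $node) {
    $removedTags = ['script', 'iframe']; // tags I want to remove
    // Iterate over a copy of the child list: childNodes is live, and removing
    // a node while iterating over it makes foreach skip the next sibling.
    foreach (iterator_to_array($node->childNodes) as $childNode)
    {
        if (in_array($childNode->nodeName, $removedTags)) {
            $childNode->parentNode->removeChild($childNode);
        } elseif ($childNode->hasChildNodes()) {
            verifyNodes($childNode);
        }
    }
}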
// calling verifyNodes
verifyNodes($dom);
// get all the content of my first div, and print it
$newContent = $dom->getElementsByTagName('div')->item(0);
foreach ($newContent->childNodes as $childNode) {
    var_dump($dom->saveHTML($childNode));
}

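Just as I use nodeName to check the name of the tag, we can also use nodeType if we want to remove other kinds of content (see the list of predefined XML node constants).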

If you can use an engine that supports atomic groups, this might work. It comes closest to how a browser parses script tags.

Find:
(?><script(?:(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)|/)>)(?<=/>)|(?><script(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)?>)(?<!/>)[\S\s]*?</script\s*>

Replace: empty string

Formatted:

    # If script tags can be <script .... />
    (?>
         <
         script 
         (?:
              (?:
                   \s+ 
                   (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
              )
           |  / 
         )
         > 
    )
    (?<= /> )
 |  
    # Or, if script tags with content can be <script .... > ... </script>
    (?>
         <
         script 
         (?:
              \s+ 
              (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
         )?
         > 
    )
    (?<! /> )
    [\S\s]*? 
    </script \s* >
