<?php
$badWords = array("ban","bad","user","pass","stack","name","html");
$string = "Hello my name is user.";
$matches = array();
// \b is a word boundary, so only whole words are matched.
$matchFound = preg_match_all(
    "/\b(" . implode("|", $badWords) . ")\b/i",
    $string,
    $matches
);
if ($matchFound) {
    $words = array_unique($matches[0]);
    echo "<ul>";
    foreach ($words as $word) {
        echo "<li>" . $word . "</li>";
    }
    echo "</ul>";
}
?>
But when I change $badWords to Hebrew:
$badWords = array("עזה","חמאס");
and change the text ($string) to Hebrew:
$string = "חמאס רוצה להרוג אותנו ולא יצליח";
it doesn't work.
Why?
It works fine in English!
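You just need to tell the regex engine that the pattern contains UTF-8 characters, and change the meaning of the character class \w and the word boundary \b so that they handle those characters (by default, \w matches only ASCII letters, digits, and the underscore, so \b finds no boundary between a space and a Hebrew letter). There are two ways to do this:
Use the u modifier: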
$matchFound = preg_match_all(
    "/\b(" . implode("|", $badWords) . ")\b/iu",
    $string,
    $matches
);
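As a minimal sketch of the u-modifier approach, the snippet below wraps the match in a function. The name findBadWords is hypothetical and not part of the original code, and the preg_quote() call is an added precaution (an assumption, not something the question asked for) so that a word containing regex metacharacters is still matched literally.

```php
<?php
// Hypothetical helper (not from the original post): returns the distinct
// banned words found in $text, in order of first appearance.
function findBadWords(array $badWords, string $text): array
{
    // Escape each word so any regex metacharacters in it are matched literally.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $badWords);

    // "u" makes PCRE treat pattern and subject as UTF-8 and, in PHP,
    // also makes \w and \b Unicode-aware.
    $pattern = '/\b(' . implode('|', $escaped) . ')\b/iu';

    // preg_match_all() returns 0 for no match and false on error.
    if (!preg_match_all($pattern, $text, $matches)) {
        return [];
    }
    return array_values(array_unique($matches[0]));
}

$found = findBadWords(["עזה", "חמאס"], "חמאס רוצה להרוג אותנו ולא יצליח");
print_r($found);
```

The same function works unchanged for the English word list, since ASCII text is valid UTF-8.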
Or put (*UTF8)(*UCP) at the very beginning of the pattern:
$matchFound = preg_match_all(
    "/(*UTF8)(*UCP)\b(" . implode("|", $badWords) . ")\b/i",
    $string,
    $matches
);
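(*UTF8) tells the regex engine that the pattern and the subject string must be treated as UTF-8 strings.
(*UCP) changes \w from the default [a-zA-Z0-9_] to [\p{L}\p{N}_]; since \b is defined in terms of \w, word boundaries become Unicode-aware as well.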
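To see the effect on \w in isolation, here is a small check against a single Hebrew letter. It is only an illustration; the variable names are made up for this example, and it assumes the script file is saved as UTF-8.

```php
<?php
$letter = "ח"; // one Hebrew letter, two bytes in UTF-8

// Default mode: the subject is a sequence of bytes and \w is ASCII-only,
// so the letter is not recognized as a single word character.
$asciiOnly = preg_match('/^\w$/', $letter);

// With (*UTF8)(*UCP), \w also matches Unicode letters and digits.
$withVerbs = preg_match('/(*UTF8)(*UCP)^\w$/', $letter);

// In PHP the u modifier has the same effect on \w.
$withModifier = preg_match('/^\w$/u', $letter);

var_dump($asciiOnly, $withVerbs, $withModifier); // int(0), int(1), int(1)
```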