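I use the following code to search for and highlight accented text. The problem is that it strips the accents from the text while highlighting. Is there a way to keep the accents?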
echo highlightTerm("Would you like a café, Mister Kàpêk?", "kape caf");
function highlightTerm($text, $keyword) {
$text = iconv('utf-8', 'ISO-8859-1//IGNORE', Normalizer::normalize($text, Normalizer::FORM_D));
$words = explode(" ", $keyword);
$p = implode('|', array_map('preg_quote', $words));
return preg_replace(
"/($p)/ui",
'<span style="background:yellow;">$1</span>',
$text
);
}
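A simple replacement will not solve this. You have to split the text into words and compare the normalized words. You should also use the DOM to iterate over and replace text nodes. This avoids replacing terms inside other node types (attributes, comments, and so on) and takes care of escaping.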
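To illustrate why the DOM matters here, the following sketch compares a naive string replacement with a walk over text nodes. The markup and the term "cafe" are made up for the illustration, and the fragment is ASCII-only so no encoding hint is needed.

```php
// Hypothetical input: the term occurs in two attributes and in one text node.
$html = '<a href="/cafe" title="cafe">cafe</a>';

// A naive string replacement also rewrites the attribute values.
$naive = preg_replace('/cafe/', '<b>$0</b>', $html);

// Walking the DOM only visits real text nodes.
$document = new DOMDocument();
$document->loadHTML("<div>$html</div>");
$xpath = new DOMXPath($document);

$texts = [];
foreach ($xpath->evaluate('//body//text()') as $textNode) {
    $texts[] = $textNode->textContent;
}

// how many times the naive approach injected markup
echo substr_count($naive, '<b>'), "\n";
// the only places where a highlight belongs
echo implode(',', $texts), "\n";
```

The naive version injects markup into both attribute values, which breaks the link. The text-node walk only ever sees the visible text.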
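The splitting could be done with a regular expression, but the ext/intl extension has a dedicated tool for it called IntlBreakIterator. The extension also provides a Collator for string comparison.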
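As a quick sketch of those two tools in isolation (this assumes ext/intl is installed, and the sample strings are only for illustration):

```php
// PRIMARY strength compares base letters only,
// so differences in accents and case are ignored.
$collator = new Collator('en_US');
$collator->setStrength(Collator::PRIMARY);

$sameAccent = $collator->compare('café', 'cafe') === 0;   // true
$sameCase   = $collator->compare('Kàpêk', 'kapek') === 0; // true
$different  = $collator->compare('cafe', 'cake') === 0;   // false

// The word breaker returns every part of the text, including the
// whitespace and punctuation between the words.
$text = 'a café, Mister';
$breaker = IntlBreakIterator::createWordInstance('en_US');
$breaker->setText($text);
$parts = iterator_to_array($breaker->getPartsIterator(), false);

var_dump($sameAccent, $sameCase, $different);
var_dump($parts);
```

Because the parts cover the whole text, concatenating them gives back the original string. That is what makes it possible to rebuild a text node from them without losing anything.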
Here is an example that matches whole words:
$html = <<<'HTML'
<div>
Would you like a café, Mister Kàpêk?
</div>
HTML;
// prepare the text breaker
$breaker = IntlBreakIterator::createWordInstance('en_US');
// prepare the compare
$collator = new Collator('en_US');
$collator->setStrength(Collator::PRIMARY);
// wrap terms for easy use
$terms = new Terms(
function($word) use ($collator) {
return $collator->getSortKey($word);
},
'cafe',
'kapek'
);
// load HTML fragment into DOM
$document = new DOMDocument();
$document->loadHTML(
// the XML declaration tells the parser to treat the input as UTF-8
"<?xml encoding='UTF-8'?>\n$html"
);
$xpath = new DOMXpath($document);
// iterate text nodes
foreach ($xpath->evaluate('//text()') as $textNode) {
// feed text into word breaker
$breaker->setText($textNode->textContent);
// prepare a fragment for new nodes
$fragment = $document->createDocumentFragment();
$replace = false;
// iterate words
foreach ($breaker->getPartsIterator() as $word) {
// find word in terms
$index = $terms->indexOf($word) + 1;
if ($index > 0) {
$replace = true;
// wrap in a "span" element
$span = $document->createElement('span');
$span->textContent = $word;
$span->setAttribute('class', 'term');
$span->setAttribute('data-term-index', $index);
$fragment->appendChild($span);
} else {
$fragment->appendChild($document->createTextNode($word));
}
}
if ($replace) {
// replace original text node with new fragment
$textNode->parentNode->replaceChild($fragment, $textNode);
}
}
// DOMDocument::loadHTML() will have wrapped the HTML to
// create a whole document
$result = '';
foreach ($xpath->evaluate('//body/node()') as $node) {
$result .= $document->saveHTML($node);
}
echo $result;
class Terms {
private $_normalize;
private $_hashes;
public function __construct(
callable $normalize,
string ...$terms
) {
$this->_normalize = $normalize;
$this->_hashes = array_flip(
array_map(
function(string $term): string {
$normalize = $this->_normalize;
return $normalize($term);
},
$terms
)
);
}
public function indexOf(string $word): int {
$normalize = $this->_normalize;
$hash = $normalize($word);
return $this->_hashes[$hash] ?? -1;
}
}
Output:
<div>
Would you like a <span class="term" data-term-index="1">café</span>, Mister <span class="term" data-term-index="2">Kàpêk</span>?
</div>
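Extending this to partial matches is possible, but it can get complicated. You would have to shorten the current word (keeping track of the position) until it matches a term, then build the output fragment from the pieces.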
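A rough sketch of the "shorten until it matches" step might look like this. The function name longestMatchingPrefix() is hypothetical, it only handles terms at the start of a word, and it assumes precomposed (NFC) input so that cutting by character does not separate a letter from its accent.

```php
// Returns the number of leading characters of $word that match one of
// the terms at the collator's strength, or 0 if no prefix matches.
function longestMatchingPrefix(Collator $collator, string $word, array $terms): int
{
    for ($length = mb_strlen($word, 'UTF-8'); $length > 0; $length--) {
        $prefix = mb_substr($word, 0, $length, 'UTF-8');
        foreach ($terms as $term) {
            if ($collator->compare($prefix, $term) === 0) {
                return $length;
            }
        }
    }
    return 0;
}

$collator = new Collator('en_US');
$collator->setStrength(Collator::PRIMARY);
$terms = ['caf', 'kape'];

var_dump(longestMatchingPrefix($collator, 'café', $terms));
var_dump(longestMatchingPrefix($collator, 'Kàpêk', $terms));
var_dump(longestMatchingPrefix($collator, 'Mister', $terms));
```

With the returned length you could put the first characters of the word into the span element and append the remainder as a plain text node.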
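Here is a less elegant approach. It locates the search terms in a normalized copy of the input string, then uses each match's offset and length to perform multibyte-safe operations on the original string.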
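I replaced the pattern delimiter with a symbol that preg_quote() escapes by default.
The replacements must be made in reverse order so that the offset and length calculations are not skewed.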
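A task like this would normally call for preg_replace_callback(), but because the search runs on the normalized string and the replacement happens on the original string, the replacement step has to be separate from the matching step.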
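The offsets are the delicate part, so here is a small sketch of what is going on. The $normalized string is a hand-written stand-in for the result of the normalization step, and it assumes every character was replaced by exactly one ASCII character.

```php
$original   = "café, Mister Kàpêk";
$normalized = "cafe, Mister Kapek"; // hypothetical one-to-one ASCII copy

// Match on the ASCII copy: byte offsets and character offsets coincide.
preg_match('/kape/i', $normalized, $m, PREG_OFFSET_CAPTURE);
[$match, $offset] = $m[0];
$fromNormalized = mb_substr($original, $offset, strlen($match), 'UTF-8');

// Match on the original: the offset is counted in bytes, and the
// two-byte "é" earlier in the string pushes it past the character position.
preg_match('/K/u', $original, $m2, PREG_OFFSET_CAPTURE);
$byteOffset = $m2[0][1];
$charOffset = mb_strpos($original, 'K', 0, 'UTF-8');

var_dump($fromNormalized, $byteOffset, $charOffset);
```

This also shows the limit of the technique: any map entry that turns one character into two (for example "ß" into "ss") shifts every offset after it, so the highlighted range would drift.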
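I use strtr() to force the normalization because I am not sure what the most reliable way to normalize accented characters is. Feel free to swap out that sub-process.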
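One possible replacement for that sub-process, assuming ext/intl is available, is to decompose the text and drop the combining marks. The helper name stripAccents() is made up for this sketch.

```php
// Decompose each accented letter into base letter + combining marks (NFD),
// then remove the marks. Falls back to the input if normalization fails.
function stripAccents(string $text): string
{
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text;
    }
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

$original   = "café, Mister Kàpêk";
$unaccented = stripAccents($original);

var_dump($unaccented);
var_dump(mb_strlen($original, 'UTF-8') === mb_strlen($unaccented, 'UTF-8'));
```

For precomposed input this keeps a one-to-one character correspondence with the original, which the offset logic relies on. It does not touch letters that have no canonical decomposition, such as "ø" or "ł", so a lookup table may still be needed for those.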
Code (demo):
define(
'ACCENT_MAP',
[
"ъ" => "-", "ь" => "-", "Ъ" => "-", "Ь" => "-",
"А" => "A", "Ă" => "A", "Ǎ" => "A", "Ą" => "A", "À" => "A", "Ã" => "A", "Á" => "A", "Æ" => "A", "Â" => "A", "Å" => "A", "Ǻ" => "A", "Ā" => "A", "א" => "A",
"Б" => "B", "ב" => "B", "Þ" => "B",
"Ĉ" => "C", "Ć" => "C", "Ç" => "C", "Ц" => "C", "צ" => "C", "Ċ" => "C", "Č" => "C", "©" => "C", "ץ" => "C",
"Д" => "D", "Ď" => "D", "Đ" => "D", "ד" => "D", "Ð" => "D",
"È" => "E", "Ę" => "E", "É" => "E", "Ë" => "E", "Ê" => "E", "Е" => "E", "Ē" => "E", "Ė" => "E", "Ě" => "E", "Ĕ" => "E", "Є" => "E", "Ə" => "E", "ע" => "E",
"Ф" => "F", "Ƒ" => "F",
"Ğ" => "G", "Ġ" => "G", "Ģ" => "G", "Ĝ" => "G", "Г" => "G", "ג" => "G", "Ґ" => "G",
"ח" => "H", "Ħ" => "H", "Х" => "H", "Ĥ" => "H", "ה" => "H",
"I" => "I", "Ï" => "I", "Î" => "I", "Í" => "I", "Ì" => "I", "Į" => "I", "Ĭ" => "I", "I" => "I", "И" => "I", "Ĩ" => "I", "Ǐ" => "I", "י" => "I", "Ї" => "I", "Ī" => "I", "І" => "I",
"Й" => "J", "Ĵ" => "J",
"ĸ" => "K", "כ" => "K", "Ķ" => "K", "К" => "K", "ך" => "K",
"Ł" => "L", "Ŀ" => "L", "Л" => "L", "Ļ" => "L", "Ĺ" => "L", "Ľ" => "L", "ל" => "L",
"מ" => "M", "М" => "M", "ם" => "M",
"Ñ" => "N", "Ń" => "N", "Н" => "N", "Ņ" => "N", "ן" => "N", "Ŋ" => "N", "נ" => "N", "ʼn" => "N", "Ň" => "N",
"Ø" => "O", "Ó" => "O", "Ò" => "O", "Ô" => "O", "Õ" => "O", "О" => "O", "Ő" => "O", "Ŏ" => "O", "Ō" => "O", "Ǿ" => "O", "Ǒ" => "O", "Ơ" => "O",
"פ" => "P", "ף" => "P", "П" => "P",
"ק" => "Q",
"Ŕ" => "R", "Ř" => "R", "Ŗ" => "R", "ר" => "R", "Р" => "R", "®" => "R",
"Ş" => "S", "Ś" => "S", "Ș" => "S", "Š" => "S", "С" => "S", "Ŝ" => "S", "ס" => "S",
"Т" => "T", "Ț" => "T", "ט" => "T", "Ŧ" => "T", "ת" => "T", "Ť" => "T", "Ţ" => "T",
"Ù" => "U", "Û" => "U", "Ú" => "U", "Ū" => "U", "У" => "U", "Ũ" => "U", "Ư" => "U", "Ǔ" => "U", "Ų" => "U", "Ŭ" => "U", "Ů" => "U", "Ű" => "U", "Ǖ" => "U", "Ǜ" => "U", "Ǚ" => "U", "Ǘ" => "U",
"В" => "V", "ו" => "V",
"Ý" => "Y", "Ы" => "Y", "Ŷ" => "Y", "Ÿ" => "Y",
"Ź" => "Z", "Ž" => "Z", "Ż" => "Z", "З" => "Z", "ז" => "Z",
"а" => "a", "ă" => "a", "ǎ" => "a", "ą" => "a", "à" => "a", "ã" => "a", "á" => "a", "æ" => "a", "â" => "a", "å" => "a", "ǻ" => "a", "ā" => "a", "א" => "a",
"б" => "b", "ב" => "b", "þ" => "b",
"ĉ" => "c", "ć" => "c", "ç" => "c", "ц" => "c", "צ" => "c", "ċ" => "c", "č" => "c", "©" => "c", "ץ" => "c",
"Ч" => "ch", "ч" => "ch",
"д" => "d", "ď" => "d", "đ" => "d", "ד" => "d", "ð" => "d",
"è" => "e", "ę" => "e", "é" => "e", "ë" => "e", "ê" => "e", "е" => "e", "ē" => "e", "ė" => "e", "ě" => "e", "ĕ" => "e", "є" => "e", "ə" => "e", "ע" => "e",
"ф" => "f", "ƒ" => "f",
"ğ" => "g", "ġ" => "g", "ģ" => "g", "ĝ" => "g", "г" => "g", "ג" => "g", "ґ" => "g",
"ח" => "h", "ħ" => "h", "х" => "h", "ĥ" => "h", "ה" => "h",
"i" => "i", "ï" => "i", "î" => "i", "í" => "i", "ì" => "i", "į" => "i", "ĭ" => "i", "ı" => "i", "и" => "i", "ĩ" => "i", "ǐ" => "i", "י" => "i", "ї" => "i", "ī" => "i", "і" => "i",
"й" => "j", "Й" => "j", "Ĵ" => "j", "ĵ" => "j",
"ĸ" => "k", "כ" => "k", "ķ" => "k", "к" => "k", "ך" => "k",
"ł" => "l", "ŀ" => "l", "л" => "l", "ļ" => "l", "ĺ" => "l", "ľ" => "l", "ל" => "l",
"מ" => "m", "м" => "m", "ם" => "m",
"ñ" => "n", "ń" => "n", "н" => "n", "ņ" => "n", "ן" => "n", "ŋ" => "n", "נ" => "n", "ʼn" => "n", "ň" => "n",
"ø" => "o", "ó" => "o", "ò" => "o", "ô" => "o", "õ" => "o", "о" => "o", "ő" => "o", "ŏ" => "o", "ō" => "o", "ǿ" => "o", "ǒ" => "o", "ơ" => "o",
"פ" => "p", "ף" => "p", "п" => "p",
"ק" => "q",
"ŕ" => "r", "ř" => "r", "ŗ" => "r", "ר" => "r", "р" => "r", "®" => "r",
"ş" => "s", "ś" => "s", "ș" => "s", "š" => "s", "с" => "s", "ŝ" => "s", "ס" => "s",
"т" => "t", "ț" => "t", "ט" => "t", "ŧ" => "t", "ת" => "t", "ť" => "t", "ţ" => "t",
"ù" => "u", "û" => "u", "ú" => "u", "ū" => "u", "у" => "u", "ũ" => "u", "ư" => "u", "ǔ" => "u", "ų" => "u", "ŭ" => "u", "ů" => "u", "ű" => "u", "ǖ" => "u", "ǜ" => "u", "ǚ" => "u", "ǘ" => "u",
"в" => "v", "ו" => "v",
"ý" => "y", "ы" => "y", "ŷ" => "y", "ÿ" => "y",
"ź" => "z", "ž" => "z", "ż" => "z", "з" => "z", "ז" => "z", "ſ" => "z",
"™" => "tm",
"@" => "at",
"Ä" => "ae", "Ǽ" => "ae", "ä" => "ae", "æ" => "ae", "ǽ" => "ae",
"ij" => "ij", "IJ" => "ij",
"я" => "ja", "Я" => "ja",
"Э" => "je", "э" => "je",
"ё" => "jo", "Ё" => "jo",
"ю" => "ju", "Ю" => "ju",
"œ" => "oe", "Œ" => "oe", "ö" => "oe", "Ö" => "oe",
"щ" => "sch", "Щ" => "sch",
"ш" => "sh", "Ш" => "sh",
"ß" => "ss",
"Ü" => "ue",
"Ж" => "zh", "ж" => "zh",
]);
function highlightTerm($text, $keyword) {
$unaccented = strtr($text, ACCENT_MAP);
$words = explode(" ", $keyword);
$regex = implode('|', array_map('preg_quote', $words));
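// PREG_OFFSET_CAPTURE returns byte offsets; they only equal character
// offsets in $text if the normalized text before each match is single-byte
// and the map replaced each character with exactly one character
if (preg_match_all("#$regex#ui", $unaccented, $m, PREG_OFFSET_CAPTURE)) {
// process the matches back to front so earlier offsets stay valid
foreach (array_reverse($m[0]) as [$match, $offset]) {
// length of the match in the normalized string
$length = strlen($match);
// slice the same range out of the original, multibyte-safe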
$tag = '<span style="background:yellow;">'
. mb_substr($text, $offset, $length)
. '</span>';
// actual multibyte-safe replacement on original text
$text = mb_substr($text, 0, $offset)
. $tag
. mb_substr($text, $offset + $length);
}
}
return $text;
}
echo highlightTerm("Would you like a café, Mister Kàpêk?", "kape caf");
Output:
Would you like a <span style="background:yellow;">caf</span>é, Mister <span style="background:yellow;">Kàpê</span>k?
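Alternatively, without normalizing the text at all, you can take a more verbose route: build a dynamic, accent-insensitive regex pattern and perform the replacement directly on the input string.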
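To show what such a generated pattern looks like, here is a sketch with a deliberately tiny map. The $miniMap array is made up for the illustration and is not meant to be complete.

```php
// Each plain letter maps to a character class of its accented variants.
$miniMap = [
    'a' => '[aàáâ]',
    'c' => '[cç]',
    'e' => '[eèéêë]',
];

// strtr() with an array never rescans its own output, so the letters
// inside the inserted character classes are not replaced again.
$pattern = strtr(preg_quote('cafe'), $miniMap);

$highlighted = preg_replace("#$pattern#u", '<b>$0</b>', 'Un café');

var_dump($pattern);
var_dump($highlighted);
```

Because the match runs on the original string, the accents in the output are preserved without any offset bookkeeping.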
Regex map (based on the second code block of this answer):
define(
'ACCENT_MAP',
[
"A" => "[AАĂǍĄÀÃÁÆÂÅǺĀא]",
"B" => "[BБבÞ]",
"C" => "[CĈĆÇЦצĊČץ]",
"D" => "[DДĎĐדÐ]",
"E" => "[EÈĘÉËÊЕĒĖĚĔЄƏע]",
"F" => "[FФƑ]",
"G" => "[GĞĠĢĜГגҐ]",
"H" => "[HחĦХĤה]",
"I" => "[IIÏÎÍÌĮĬIИĨǏיЇĪІ]",
"J" => "[JЙĴ]",
"K" => "[KĸכĶКך]",
"L" => "[LŁĿЛĻĹĽל]",
"M" => "[MמМם]",
"N" => "[NÑŃНŅןŊנʼnŇ]",
"O" => "[OØÓÒÔÕОŐŎŌǾǑƠ]",
"P" => "[PפףП]",
"Q" => "[Qק]",
"R" => "[RŔŘŖרР]",
"S" => "[SŞŚȘŠСŜס]",
"T" => "[TТȚטŦתŤŢ]",
"U" => "[UÙÛÚŪУŨƯǓŲŬŮŰǕǛǙǗ]",
"V" => "[VВו]",
"Y" => "[YÝЫŶŸ]",
"Z" => "[ZŹŽŻЗז]",
"a" => "[aаăǎąàãáæâåǻāא]",
"b" => "[bбבþ]",
"c" => "[cĉćçцצċčץ]",
"d" => "[dдďđדð]",
"e" => "[eèęéëêеēėěĕєəע]",
"f" => "[fфƒ]",
"g" => "[gğġģĝгגґ]",
"h" => "[hחħхĥה]",
"i" => "[iiïîíìįĭıиĩǐיїīі]",
"j" => "[jйĵ]",
"k" => "[kĸכķкך]",
"l" => "[lłŀлļĺľל]",
"m" => "[mמмם]",
"n" => "[nñńнņןŋנʼnň]",
"o" => "[oøóòôõоőŏōǿǒơ]",
"p" => "[pפףп]",
"q" => "[qק]",
"r" => "[rŕřŗרр]",
"s" => "[sşśșšсŝס]",
"t" => "[tтțטŧתťţ]",
"u" => "[uùûúūуũưǔųŭůűǖǜǚǘ]",
"v" => "[vвו]",
"y" => "[yýыŷÿ]",
"z" => "[zźžżзזſ]",
"ae" => "(?:ae|[ÄǼäæǽ])",
"ch" => "(?:ch|[Чч])",
"ij" => "(?:ij|[ijIJ])",
"ja" => "(?:ja|[яЯ])",
"je" => "(?:je|[Ээ])",
"jo" => "(?:jo|[ёЁ])",
"ju" => "(?:ju|[юЮ])",
"oe" => "(?:oe|[œŒöÖ])",
"sch" => "(?:sch|[щЩ])",
"sh" => "(?:sh|[шШ])",
"ss" => "(?:ss|[ß])",
"ue" => "(?:ue|[Ü])",
"zh" => "(?:zh|[Жж])"
]);
Code (demo):
function highlightTerm($text, $keyword) {
$regex = implode(
'|',
array_map(
fn($w) => strtr(preg_quote($w), ACCENT_MAP),
explode(" ", $keyword)
)
);
return preg_replace(
"#$regex#ui",
'<span style="background:yellow;">$0</span>',
$text
);
}
echo highlightTerm("Would you like a café, Mister Kàpêk?", "kape caf");
Output:
Would you like a <span style="background:yellow;">caf</span>é, Mister <span style="background:yellow;">Kàpê</span>k?