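Goal: search an array of tens of thousands of Chinese sentences to find the sentences that contain only characters from an array of "known characters".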
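For example, suppose my corpus consists of the following sentences: 1) 我去中国。 2) 妳爱他。 3) 你在哪里? I only "know", or want, sentences made up exclusively of these characters: 1) 我 2) 中 3) 国 4) 你 5) 在 6) 去 7) 爱 8) 哪 9) 里. The first sentence would be returned as a result because all four of its characters are in my second array. The second sentence would be rejected because I did not ask for 妳 or 他. The third sentence would be returned as well, because every one of its characters is known. Punctuation (as well as any alphanumeric characters) is ignored.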
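I have a working script that does this (below). I would like to know whether this is an efficient way to do it. If you are interested, please take a look and suggest changes, write your own version, or offer some advice. I picked up a few ideas from this script and looked through some Stack Overflow questions, but they did not address this situation.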
<?php
$known_characters = parse_file("FILENAME"); // retrieves target characters
$sentences = parse_csv("FILENAME"); // retrieves the text corpus
$number_wanted = 30; // number of sentences to attempt to retrieve
$found = array(); // stores results
$number_found = 0; // number of results
$character_known = false; // assume character is not known
$sentence_known = true; // assume sentence matches target characters
foreach ($sentences as $s) {
    // retrieves an array of the sentence's characters
    $sentence_characters = mb_str_split($s->ttext);
    foreach ($sentence_characters as $sc) {
        // check to see if the character is alpha-numeric or punctuation
        // if so, then ignore.
        $pattern = '/[a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}]/u';
        if (!preg_match($pattern, $sc)) {
            foreach ($known_characters as $kc) {
                if ($sc == $kc) {
                    // if character is known, move to next character
                    $character_known = true;
                    break;
                }
            }
        } else {
            // character is known if it is alpha-numeric or punctuation
            $character_known = true;
        }
        if (!$character_known) {
            // if character is unknown, move to next sentence
            $sentence_known = false;
            break;
        }
        $character_known = false; // reset for next iteration
    }
    if ($sentence_known) {
        // if sentence is known, add it to results array
        $found[] = $s->ttext;
        $number_found++;
    }
    if ($number_found == $number_wanted) {
        break; // if required number of results are found, break
    }
    $sentence_known = true; // reset for next iteration
}
?>
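On the efficiency question, one possible variation on the script above is to replace the inner loop over `$known_characters` with a keyed lookup. This is only a sketch under some assumptions: `sentence_is_known` is a hypothetical helper name, the sentences are plain strings rather than objects with a `ttext` property, and the sample data uses fullwidth punctuation (。 and ?), which the ignored ranges cover. ASCII punctuation such as `?` is not in the ignored set.

```php
<?php
// Hypothetical helper (not part of the original script): returns true when
// every character of $sentence is either a key of $knownSet or ignorable
// (ASCII alphanumerics, whitespace, CJK and fullwidth punctuation).
function sentence_is_known(string $sentence, array $knownSet): bool
{
    $ignorable = '/[a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}]/u';
    // Split the UTF-8 string into single characters using PCRE only.
    $chars = preg_split('//u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $char) {
        // isset() on an array key is a hash lookup, so no inner loop is needed.
        if (!isset($knownSet[$char]) && !preg_match($ignorable, $char)) {
            return false; // unknown character: reject the sentence early
        }
    }
    return true;
}

// Sample data taken from the example in the question.
$known = ['我', '中', '国', '你', '在', '去', '爱', '哪', '里'];
$knownSet = array_flip($known); // character => index

$sentences = ['我去中国。', '妳爱他。', '你在哪里?'];
$numberWanted = 30;

$found = [];
foreach ($sentences as $s) {
    if (sentence_is_known($s, $knownSet)) {
        $found[] = $s;
        if (count($found) === $numberWanted) {
            break; // stop once enough sentences have been collected
        }
    }
}
print_r($found);
```

With the sample data, `$found` should hold the first and third sentences. The per-character cost no longer grows with the number of known characters, which may matter once that list runs into the thousands.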
This should do it:
$pattern = '/[^a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}我中国你在去爱哪里]/u';
if (preg_match($pattern, $sentence)) {
    // the sentence contains characters besides a-zA-Z0-9, punctuation
    // and the selected characters
} else {
    // the sentence contains only the allowed characters
}
Make sure to save the source code file as UTF-8.
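Since the known characters come from a file rather than being typed into the pattern, the character class would probably need to be built at runtime. The following is a minimal sketch of that idea, not a definitive implementation: `build_disallowed_pattern` is a hypothetical helper name, and the sentences are assumed to be plain strings using fullwidth punctuation.

```php
<?php
// Hypothetical helper: builds a pattern that matches any character that is
// NOT ignorable and NOT in the list of known characters.
function build_disallowed_pattern(array $knownCharacters): string
{
    $class = '';
    foreach ($knownCharacters as $char) {
        // preg_quote() escapes characters that are special in a regex,
        // including "-", "]" and the "/" delimiter passed as second argument.
        $class .= preg_quote($char, '/');
    }
    return '/[^a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}' . $class . ']/u';
}

// Sample data taken from the example in the question.
$known = ['我', '中', '国', '你', '在', '去', '爱', '哪', '里'];
$sentences = ['我去中国。', '妳爱他。', '你在哪里?'];
$numberWanted = 30;

$pattern = build_disallowed_pattern($known);

// PREG_GREP_INVERT keeps the entries that do NOT match, i.e. the sentences
// without any disallowed character. Original keys are preserved, so reindex.
$matches = preg_grep($pattern, $sentences, PREG_GREP_INVERT);
$found = array_slice(array_values($matches), 0, $numberWanted);

print_r($found);
```

Note that `preg_grep` scans the whole array before `array_slice` trims the result. If only a few sentences are wanted from a very large corpus, a loop with `preg_match` and an early `break` may be the better fit.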