Group words by their first character



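I have a paragraph, and I want to group its words by their first character and sort the words within each group by the remaining characters.

Sample text: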

$text = "Why end might ask civil again spoil. She dinner she our horses depend. Remember at children by reserved to vicinity. In affronting unreserved delightful simplicity ye. Law own advantage furniture continual sweetness bed agreeable perpetual. Oh song well four only head busy it. Afford son she had lively living. Tastes lovers myself too formal season our valley boy. Lived it their their walls might to by young.";

Expected result for the first sentence:

Why end might ask civil again spoil

a => again, ask
c => civil
e => end
m => might
s => spoil
w => Why

There are many ways to do this... I just picked one that I find slightly interesting (and not merely a "gimme the script" answer ;-))

<?php
// see http://docs.php.net/splheap
class StrcasecmpHeap extends SplHeap {
    protected function compare ($a,$b) { return strcasecmp($b,$a); }
}
$text = "Why end might ask civil again spoil. She dinner she our horses depend. Remember at children by reserved to vicinity. In affronting unreserved delightful simplicity ye. Law own advantage furniture continual sweetness bed agreeable perpetual. Oh song well four only head busy it. Afford son she had lively living. Tastes lovers myself too formal season our valley boy. Lived it their their walls might to by young.";
// create
$result = [];
// see http://docs.php.net/preg_split
foreach( preg_split('![^a-zA-Z]+!', $text, -1, PREG_SPLIT_NO_EMPTY) as $word ) {
    $char = strtolower($word[0]);
    if ( !isset($result[$char]) ) {
        $result[$char] = new StrcasecmpHeap;
    }
    $result[$char]->insert($word);
}
// print
foreach( $result as $char=>$list ) {
    echo "--- $char ---", PHP_EOL;
    foreach($list as $word ) {
        echo ' ', $word, PHP_EOL;
    }
}
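
The argument swap in compare() is what makes the heap hand the words back in ascending order. The following is only a minimal sketch of that behaviour on a made-up word list; DemoHeap is a hypothetical name that mirrors StrcasecmpHeap, and the int return type is added here to match newer PHP versions.

```php
<?php
// Hypothetical demo class; it mirrors StrcasecmpHeap from the code above.
class DemoHeap extends SplHeap {
    // SplHeap yields the "largest" element first. Swapping the arguments
    // makes the alphabetically smallest word count as the largest one.
    protected function compare($a, $b): int { return strcasecmp($b, $a); }
}

$heap = new DemoHeap;
foreach (['spoil', 'season', 'song', 'sweetness'] as $word) {
    $heap->insert($word);
}

$sorted = [];
foreach ($heap as $word) {   // iterating extracts (and removes) each element
    $sorted[] = $word;
}
print_r($sorted);            // season, song, spoil, sweetness
```

Because iterating an SplHeap removes its elements, each group can be printed only once unless it is copied first.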

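The SplHeap version above keeps doublets (repeated words); e.g. the s group lists "she" three times between these two entries:

--- s ---
 season
 ...
 simplicity

If you do not want doublets, store the words as array keys instead: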

<?php
$text = "Why end might ask civil again spoil. She dinner she our horses depend. Remember at children by reserved to vicinity. In affronting unreserved delightful simplicity ye. Law own advantage furniture continual sweetness bed agreeable perpetual. Oh song well four only head busy it. Afford son she had lively living. Tastes lovers myself too formal season our valley boy. Lived it their their walls might to by young.";
// build
$result = [];
foreach( preg_split('![^a-zA-Z]+!', $text, -1, PREG_SPLIT_NO_EMPTY) as $word ) {
    // here goes the case-sensitivity; it's all lower-case from now on....
    $word = strtolower($word);
    $char = $word[0];
    // not storing as the element's value but the key
    // takes care of doublets
    $result[$char][$word] = true;
}
// get keys & sort
$result = array_map(
    function($e) {
        // remember? The actual words have been stored as the keys
        $e = array_keys($e);
        usort($e, 'strcasecmp');
        return $e;
    },
    $result
);

// print
var_export($result);
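
To check the key-based approach against the expected result from the question, it can be wrapped in a function and applied to the first sentence only. This is a sketch under assumptions: groupWords() is a hypothetical helper name, and the ksort() call that orders the groups by letter is an addition not present in the code above.

```php
<?php
// groupWords() is a hypothetical helper, not part of the original answer.
function groupWords(string $text): array
{
    $groups = [];
    foreach (preg_split('![^a-zA-Z]+!', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $word = strtolower($word);
        // Using the word as the key drops doublets automatically.
        $groups[$word[0]][$word] = true;
    }
    // Turn each key set into a sorted list of words.
    foreach ($groups as $char => $set) {
        $words = array_keys($set);
        sort($words, SORT_STRING);
        $groups[$char] = $words;
    }
    ksort($groups);   // assumption: the groups should be ordered by letter too
    return $groups;
}

$groups = groupWords("Why end might ask civil again spoil.");
foreach ($groups as $char => $words) {
    echo $char, ' => ', implode(', ', $words), PHP_EOL;
}
```

Since every word is lower-cased before it is stored, the last group comes out as "w => why" rather than "w => Why".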

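My solution is built around a regular expression that splits the sorted words into phrases by initial letter.

  • (\w): a capture group matching any letter (technically any "word" character); it matches the first letter of a word, followed by
  • .*?: as few characters as possible (possibly from just one word, possibly from several), followed by
  • ($| (?!\1)): either the end of the text, or a space that is NOT followed by the same letter as the initial capture group.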
$text = "Why end might ask civil again spoil. She dinner she our horses"
    . " depend. Remember at children by reserved to vicinity. In affronting"
    . " unreserved delightful simplicity ye. Law own advantage furniture"
    . " continual sweetness bed agreeable perpetual. Oh song well four only"
    . " head busy it. Afford son she had lively living. Tastes lovers"
    . " myself too formal season our valley boy. Lived it their their walls"
    . " might to by young.";
// Split the text into individual words and sort them, case insensitively.
$words = preg_split('/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY);
natcasesort($words);
// Join the sorted words back together and break them into phrases by
// initial letter.
preg_match_all('/(\w).*?($| (?!\1))/i', implode(" ", $words), $matches);
// Arrange the phrases into an array keyed by lower-case initial letter,
// split them back into an array of words.
$words = array_combine(
    array_map("strtolower", $matches[1]),
    array_map(function($phrase){ return explode(" ", trim($phrase)); },
              $matches[0]));
var_dump($words);
/*
array (size=19)
  'a' => 
    array (size=7)
      0 => string 'advantage' (length=9)
      1 => string 'Afford' (length=6)
      2 => string 'affronting' (length=10)
      3 => string 'again' (length=5)
      4 => string 'agreeable' (length=9)
      5 => string 'ask' (length=3)
      6 => string 'at' (length=2)
  'b' => 
    array (size=5)
      0 => string 'bed' (length=3)
      1 => string 'boy' (length=3)
      2 => string 'busy' (length=4)
      3 => string 'by' (length=2)
      4 => string 'by' (length=2)
  ...
 */
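
To see how the pattern cuts a sorted word list into one phrase per initial letter, it can be tried on a short input first. This is a minimal sketch; the sample string is made up and assumed to be sorted already.

```php
<?php
// Already sorted, space-separated words (a made-up sample).
$sorted = 'again ask At civil end';

// (\w) captures an initial letter; the lazy .*? runs until the end of the
// string or a space whose next character differs from that letter
// (the i flag makes the back-reference comparison case-insensitive).
preg_match_all('/(\w).*?($| (?!\1))/i', $sorted, $matches);

$phrases  = array_map('trim', $matches[0]);        // whole matches, one per letter
$initials = array_map('strtolower', $matches[1]);  // captured first letters
$byLetter = array_combine($initials, $phrases);
print_r($byLetter);
// a => "again ask At", c => "civil", e => "end"
```

The approach relies on the words being sorted beforehand, so that all words sharing an initial letter sit next to each other.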
