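I've been scratching my head over this but can't find a solution.
Suppose you have a text of 5000 characters and I want to split it into blocks of fewer than 500 characters without breaking a sentence. For example: if a paragraph is 550 characters long and its last sentence starts at character 450 and ends at character 550, I want to save this particular block with at most 450 characters (so that the sentence isn't broken).
Do you know how to do this?
My goal is to save each block into an array so that I can process them separately.
I was thinking of using preg_split, summing the lengths of the pieces and, if the sum exceeds 500 characters, dropping the last piece. But I find it hard to split the sentences without errors.
Do you know which preg_split pattern I should use to make sure every sentence is separated cleanly?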
I tried this tool but couldn't get the right output: https://www.phpliveregex.com/#tab-preg-split
Thanks
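First of all: thanks for your question.
The solution isn't robust, and you will have to adapt it later. But it shows you one possible way to achieve this.
Split the text into individual sentences and store each sentence as an element of an array. That way you can check each sentence's length while iterating over the array. As long as the current sentence plus the sentences collected so far stay within the maximum block length, append the sentence to a temporary variable. As soon as the length of the temporary variable plus the current sentence exceeds the maximum block length, the temporary variable is stored as a block in a new array and a new block starts with the current sentence.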
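<?php
$txt = "111. 222 222. 333 333 333. 444 444 444 444. 555 555 555 555 555. 333 333 333. 222 222. 111.";
$length = 30; // maximum block length (strlen() counts bytes, not characters)

// Naive sentence split: every ". " ends a sentence. The last element keeps its period.
$arr = explode(". ", $txt);
$b = [];   // finished blocks
$tmp = ''; // block currently being filled

foreach ($arr as $k => $s) {
    // +1 accounts for the period that is appended again below
    if (strlen($tmp) + strlen($s) + 1 <= $length) {
        // The sentence still fits: append it and restore the ". " removed by explode()
        $tmp = $tmp . $s . '. ';
    } else {
        // The block is full: store it (unless it is still empty) and start a new one
        if ($tmp !== '') {
            $b[] = rtrim($tmp);
        }
        $tmp = $s . '. ';
    }
    if ((count($arr) - 1) === $k) {
        // Last sentence: flush the remainder and cut the extra ". " appended above
        $b[] = substr($tmp, 0, -2);
    }
}

print_r($arr);
print_r($b);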
Output
// Sentence Array
Array
(
[0] => 111
[1] => 222 222
[2] => 333 333 333
[3] => 444 444 444 444
[4] => 555 555 555 555 555
[5] => 333 333 333
[6] => 222 222
[7] => 111.
)
// Your new Block Array
Array
(
[0] => 111. 222 222. 333 333 333.
[1] => 444 444 444 444.
[2] => 555 555 555 555 555.
[3] => 333 333 333. 222 222. 111.
)
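explode(". ") only recognises a period followed by one space as a sentence end. As a rough sketch of a preg_split pattern for the question, the hypothetical helper split_sentences() below (the name is an assumption, not part of the answer above) splits on whitespace that follows `.`, `!` or `?` and keeps the punctuation attached to its sentence. It assumes plain prose; abbreviations such as "e.g. " would still be treated as sentence ends.

```php
<?php
// Hypothetical helper (not from the original answer): split text into
// sentences while keeping the closing punctuation on each sentence.
function split_sentences(string $text): array
{
    // The lookbehind (?<=[.!?]) requires sentence-ending punctuation right
    // before the whitespace, so only the whitespace itself is consumed.
    $parts = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on a pattern error; fall back to an empty list.
    return $parts === false ? [] : $parts;
}

print_r(split_sentences("111. 222 222. 333 333 333! Is it 444? Yes."));
```

Because the punctuation stays on each element, there is no need to append ". " again when the sentences are joined into blocks.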
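Splitting by sentence seems easier; after that you should be able to loop over the sentences and concatenate them as long as the limit isn't exceeded.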
$data = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Id cursus metus aliquam eleifend mi in nulla posuere. Hac habitasse platea dictumst vestibulum rhoncus. Elementum facilisis leo vel fringilla est. Sem et tortor consequat id. Eleifend donec pretium vulputate sapien nec. Elit pellentesque habitant morbi tristique. Dictumst vestibulum rhoncus est pellentesque elit. Quis commodo odio aenean sed adipiscing. Id volutpat lacus laoreet non curabitur gravida arcu. Sit amet massa vitae tortor condimentum. Morbi blandit cursus risus at ultrices mi tempus.
Tortor consequat id porta nibh venenatis cras sed. Urna et pharetra pharetra massa massa. Ut consequat semper viverra nam. Hac habitasse platea dictumst quisque sagittis. Commodo odio aenean sed adipiscing diam donec. Imperdiet proin fermentum leo vel orci porta. Quisque non tellus orci ac auctor augue. In cursus turpis massa tincidunt dui. Purus faucibus ornare suspendisse sed. Tristique senectus et netus et malesuada fames ac turpis.';
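$splited = preg_split('/([^.]+\.)/mU', $data, -1, PREG_SPLIT_DELIM_CAPTURE);
// Each captured delimiter is one sentence: everything up to and including the next `.`
// The pieces between the sentences are empty or whitespace; trim() and array_filter() drop them
$cleaned = array_filter(array_map('trim', $splited));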
var_dump($cleaned);
This gives me:
array(22) {
[1]=>
string(123) "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
[3]=>
string(53) "Id cursus metus aliquam eleifend mi in nulla posuere."
[5]=>
string(49) "Hac habitasse platea dictumst vestibulum rhoncus."
[7]=>
string(42) "Elementum facilisis leo vel fringilla est."
[9]=>
string(27) "Sem et tortor consequat id."
[11]=>
string(44) "Eleifend donec pretium vulputate sapien nec."
[13]=>
string(43) "Elit pellentesque habitant morbi tristique."
[15]=>
string(50) "Dictumst vestibulum rhoncus est pellentesque elit."
[17]=>
string(40) "Quis commodo odio aenean sed adipiscing."
[19]=>
string(53) "Id volutpat lacus laoreet non curabitur gravida arcu."
[21]=>
string(40) "Sit amet massa vitae tortor condimentum."
[23]=>
string(49) "Morbi blandit cursus risus at ultrices mi tempus."
[25]=>
string(50) "Tortor consequat id porta nibh venenatis cras sed."
[27]=>
string(38) "Urna et pharetra pharetra massa massa."
[29]=>
string(32) "Ut consequat semper viverra nam."
[31]=>
string(47) "Hac habitasse platea dictumst quisque sagittis."
[33]=>
string(46) "Commodo odio aenean sed adipiscing diam donec."
[35]=>
string(45) "Imperdiet proin fermentum leo vel orci porta."
[37]=>
string(40) "Quisque non tellus orci ac auctor augue."
[39]=>
string(37) "In cursus turpis massa tincidunt dui."
[41]=>
string(38) "Purus faucibus ornare suspendisse sed."
[43]=>
string(57) "Tristique senectus et netus et malesuada fames ac turpis."
}
Quick update for Maik ;)
$data = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Id cursus metus aliquam eleifend mi in nulla posuere. Hac habitasse platea dictumst vestibulum rhoncus. Elementum facilisis leo vel fringilla est. Sem et tortor consequat id. Eleifend donec pretium vulputate sapien nec. Elit pellentesque habitant morbi tristique. Dictumst vestibulum rhoncus est pellentesque elit. Quis commodo odio aenean sed adipiscing. Id volutpat lacus laoreet non curabitur gravida arcu. Sit amet massa vitae tortor condimentum. Morbi blandit cursus risus at ultrices mi tempus.
Tortor consequat id porta nibh venenatis cras sed. Urna et pharetra pharetra massa massa. Ut consequat semper viverra nam. Hac habitasse platea dictumst quisque sagittis. Commodo odio aenean sed adipiscing diam donec. Imperdiet proin fermentum leo vel orci porta. Quisque non tellus orci ac auctor augue. In cursus turpis massa tincidunt dui. Purus faucibus ornare suspendisse sed. Tristique senectus et netus et malesuada fames ac turpis.';
$splited = preg_split('/([^.]+\.)/mU', $data, -1, PREG_SPLIT_DELIM_CAPTURE);
// Each captured delimiter is one sentence: everything up to and including the next `.`
$cleaned = array_filter(array_map('trim', $splited));
$lines = [];
$current = '';
$min = 50; // a block is closed once it holds at least this many characters (no upper bound)
foreach ($cleaned as $sentence) {
    $current .= $sentence . ' '; // trailing space so another sentence can be appended
    $len_current = strlen($current);
    if ($len_current >= $min) {
        array_push($lines, trim($current)); // remove the extra trailing space before storing
        $current = '';
    }
}
if ($current !== '') {
    array_push($lines, trim($current)); // keep a final block that never reached $min
}
var_dump($lines);
The result looks like this:
array(14) {
[0]=>
string(123) "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
[1]=>
string(53) "Id cursus metus aliquam eleifend mi in nulla posuere."
[2]=>
string(49) "Hac habitasse platea dictumst vestibulum rhoncus."
[3]=>
string(70) "Elementum facilisis leo vel fringilla est. Sem et tortor consequat id."
[4]=>
string(88) "Eleifend donec pretium vulputate sapien nec. Elit pellentesque habitant morbi tristique."
[5]=>
string(50) "Dictumst vestibulum rhoncus est pellentesque elit."
[6]=>
string(94) "Quis commodo odio aenean sed adipiscing. Id volutpat lacus laoreet non curabitur gravida arcu."
[7]=>
string(90) "Sit amet massa vitae tortor condimentum. Morbi blandit cursus risus at ultrices mi tempus."
[8]=>
string(50) "Tortor consequat id porta nibh venenatis cras sed."
[9]=>
string(71) "Urna et pharetra pharetra massa massa. Ut consequat semper viverra nam."
[10]=>
string(94) "Hac habitasse platea dictumst quisque sagittis. Commodo odio aenean sed adipiscing diam donec."
[11]=>
string(86) "Imperdiet proin fermentum leo vel orci porta. Quisque non tellus orci ac auctor augue."
[12]=>
string(76) "In cursus turpis massa tincidunt dui. Purus faucibus ornare suspendisse sed."
[13]=>
string(57) "Tristique senectus et netus et malesuada fames ac turpis."
}
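The loop above closes a block once it reaches a minimum length, whereas the question asks for a maximum. A minimal sketch of the maximum variant follows. The helper name chunk_sentences() and its parameters are assumptions for illustration. It expects sentences that already carry their punctuation and measures length with strlen(), which counts bytes, so mb_strlen() may be preferable for multibyte text. A single sentence longer than the limit is kept as its own oversized block, since it cannot be shortened without breaking it.

```php
<?php
// Hypothetical helper (not from the original answers): greedily pack whole
// sentences into blocks of at most $max bytes, joined by single spaces.
function chunk_sentences(array $sentences, int $max): array
{
    $blocks = [];
    $current = '';

    foreach ($sentences as $sentence) {
        // What the block would look like with this sentence appended.
        $candidate = $current === '' ? $sentence : $current . ' ' . $sentence;

        if ($current !== '' && strlen($candidate) > $max) {
            // Appending would exceed the limit: close the block, start a new one.
            $blocks[] = $current;
            $current = $sentence;
        } else {
            $current = $candidate;
        }
    }

    // Flush whatever is left after the last sentence.
    if ($current !== '') {
        $blocks[] = $current;
    }

    return $blocks;
}

$sentences = ['111.', '222 222.', '333 333 333.', '444 444 444 444.',
              '555 555 555 555 555.', '333 333 333.', '222 222.', '111.'];
print_r(chunk_sentences($sentences, 30));
```

Building the candidate string before comparing keeps the separator inside the length check, so a stored block never exceeds the limit unless one sentence alone is already too long.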
I think this is what you need:
$string = "Hello world php is fun";
$array = explode(" ", $string);
The output is:
Array ( [0] => Hello [1] => world [2] => php [3] => is [4] => fun )
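Splitting on words does not keep sentences intact on its own, but it can serve as a fallback when one sentence is longer than the block limit. The sketch below uses PHP's built-in wordwrap() for that case. The helper name split_long_sentence() is an assumption, and it presumes the sentence contains no newline characters of its own.

```php
<?php
// Hypothetical helper (not from the original answer): break one over-long
// sentence into pieces of at most $max bytes, preferring word boundaries.
function split_long_sentence(string $sentence, int $max): array
{
    // wordwrap() inserts "\n" at word boundaries so that no line is longer
    // than $max; the final `true` also cuts single words longer than $max.
    return explode("\n", wordwrap($sentence, $max, "\n", true));
}

print_r(split_long_sentence("Hello world php is fun", 11));
```

With a limit of 11 the example string yields the two pieces "Hello world" and "php is fun".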