How do I get the string position of the middle-most element in HTML content?

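I am working with news articles in HTML that come from a WYSIWYG editor, and I need to find the middle of an article, but in the visual/HTML sense, meaning the gap between two root elements. Think of splitting an article into two pages with, as far as possible, the same number of paragraphs on each page.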

All root elements seem to be paragraphs, which are easy to count. A simple

$p_count = substr_count($article_text, '<p');

returns the total number of opening paragraph tags, and I can then use strpos to look up the ($p_count/2)th occurrence of a paragraph.
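As a rough sketch of that naive approach: the helper name nth_strpos and the sample article are assumptions made for illustration, not part of the original code. It walks strpos forward n times. Note that '<p' also matches tags such as <pre> or <param>, and that the split point between two equal halves is the opening tag of paragraph number ($p_count/2) + 1.

```php
<?php
// Hypothetical helper: byte offset of the $n-th (1-based) occurrence of
// $needle in $haystack, or null when there are fewer than $n occurrences.
function nth_strpos(string $haystack, string $needle, int $n): ?int {
    $offset = 0;
    $pos = null;
    for ($i = 0; $i < $n; $i++) {
        $pos = strpos($haystack, $needle, $offset);
        if ($pos === false) {
            return null;
        }
        // Continue searching just past the occurrence we found.
        $offset = $pos + strlen($needle);
    }
    return $pos;
}

// Illustrative article with four root paragraphs.
$article_text = '<p>one</p><p>two</p><p>three</p><p>four</p>';
$p_count = substr_count($article_text, '<p');

// Opening tag of the first paragraph of the second half.
$middle = nth_strpos($article_text, '<p', intdiv($p_count, 2) + 1);

$first_page  = substr($article_text, 0, $middle);
$second_page = substr($article_text, $middle);
```

This only holds while every '<p' really is a root-level paragraph, which is exactly what embedded content breaks.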

The problem is embedded tweets, which contain paragraphs of their own, sometimes as blockquote > p and sometimes as center > blockquote > p.

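So I turned to DOMDocument. This small snippet gives me the index of the middle root element (it works even when an element is a div rather than a paragraph, which is nice):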

$dom = new DOMDocument();
$dom->loadHTML($article_text);
$body = $dom->getElementsByTagName('body');
$rootNodes = $body->item(0)->childNodes;
$empty_nodes = 0;
foreach ($rootNodes as $node) {
    if ($node->nodeType === XML_TEXT_NODE && strlen(trim($node->nodeValue)) === 0) {
        $empty_nodes++;
    }
}
$total_elements = $rootNodes->length - $empty_nodes;
$middle_element = floor($total_elements / 2);

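But how do I now find the string offset of this middle element in the original HTML source, so that I can point to that middle position in the article text string? This is made harder by the fact that DOMDocument turns the HTML I give it into a complete HTML page (with a doctype, head and so on), so its output HTML is larger than my original article source.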

OK, I solved it.

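What I did was match every HTML tag in the article using preg_match_all with the PREG_OFFSET_CAPTURE flag, which records the string offset of each match. I then walk through the matches in order and keep track of the current depth: an opening tag adds 1 to the depth and a closing tag subtracts 1 (watch out for self-closing tags, which must not change it). Every time the depth returns to zero after a closing tag, I count that as one more closed root element. If I end up at depth 0 at the very end, I assume I counted correctly.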
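To make the depth counting concrete, here is a minimal sketch on a made-up input with one paragraph and one embedded tweet. The sample HTML, the simplified tag pattern, and the variable names are assumptions for illustration. Unlike the offsets preg_match_all reports (the start of each tag), this sketch records the position just past each closing tag, and it skips the self-closing check because the sample contains no such tags.

```php
<?php
// Illustrative input: one plain paragraph followed by an embedded tweet.
$html = '<p>Intro</p><center><blockquote><p>Tweet</p></blockquote></center>';

// Simplified tag pattern for this demo; it does not handle ">" inside
// quoted attribute values.
preg_match_all('/<[^>]+>/', $html, $matches, PREG_OFFSET_CAPTURE);

$depth = 0;
$root_ends = [];   // offsets just past each tag that brings the depth back to 0
foreach ($matches[0] as [$tag, $offset]) {
    // "</...>" closes an element, anything else opens one.
    $depth += ($tag[1] === '/') ? -1 : 1;
    if ($depth === 0) {
        $root_ends[] = $offset + strlen($tag);
    }
}

// The nested <p> inside the tweet never brings the depth to 0,
// so only two root elements are counted.
$root_count = count($root_ends);
```

The paragraph inside the blockquote is seen at depth 3, which is why it no longer inflates the count the way substr_count did.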

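Now I can take the number of root elements I counted, divide it by 2 to get the middle root element (give or take 1 for odd counts), and look up the offset of the element at that index, as reported earlier by preg_match_all.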

The complete code is below, in case anyone needs to do the same thing.

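It would probably be faster to write is_self_closing() with a regular expression and then check in_array($self_closing_tags) instead of using a foreach loop, but in my case it did not make enough of a difference for me to bother.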
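One possible shape for that regex plus in_array variant is sketched below. The function name is_self_closing_fast is hypothetical and the code has not been benchmarked against the foreach version. It extracts the full tag name first, so a list entry such as 'col' only matches <col> and not <colgroup>.

```php
<?php
// Hypothetical variant of is_self_closing(); the name and exact behaviour
// are assumptions, not the original author's code.
function is_self_closing_fast(string $tag, array $self_closing_tags): bool {
    // Comments never have a closing tag, so treat them as self-closing.
    if (strncmp($tag, '<!--', 4) === 0) {
        return true;
    }
    // Capture the tag name right after "<"; closing tags such as "</p>"
    // do not match because "/" is not a letter.
    if (!preg_match('/^<([a-zA-Z][a-zA-Z0-9]*)/', $tag, $m)) {
        return false;
    }
    return in_array(strtolower($m[1]), $self_closing_tags, true);
}

// Shortened list for the demo.
$void_tags = ['br', 'col', 'hr', 'img'];

$results = [
    is_self_closing_fast('<br>', $void_tags),
    is_self_closing_fast('<IMG src="a.png">', $void_tags),
    is_self_closing_fast('<colgroup>', $void_tags),
    is_self_closing_fast('</p>', $void_tags),
    is_self_closing_fast('<!-- note -->', $void_tags),
];
```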

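// Declared at top level: a function nested inside calculate_middle_of_article()
// would be redeclared (a fatal error) on the second call.
function is_self_closing(string $input, array $self_closing_tags): bool {
    foreach ($self_closing_tags as $tag) {
        $length = strlen($tag);
        if (substr($input, 1, $length) !== $tag) {
            continue;
        }
        // The name has to end here, so that 'col' does not also match <colgroup>.
        if ($tag === '!--' || !ctype_alnum(substr($input, 1 + $length, 1))) {
            return true;
        }
    }
    return false;
}

function calculate_middle_of_article(string $text, bool $debug = false): ?int {
    // Comments and void elements never get a closing tag, so they must not change the depth.
    $self_closing_tags = [
        '!--', 'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link',
        'meta', 'param', 'source', 'track', 'wbr', 'command', 'keygen', 'menuitem',
    ];
    // Match every tag, allowing ">" inside quoted attribute values.
    $regex = '/<("[^"]*"|\'[^\']*\'|[^\'">])*>/';
    preg_match_all($regex, $text, $matches, PREG_OFFSET_CAPTURE);
    $debug && print count($matches[0]) . " tags found\n";
    $root_elements = [];
    $depth = 0;
    foreach ($matches[0] as $match) {
        if (!is_self_closing($match[0], $self_closing_tags)) {
            $depth += (substr($match[0], 1, 1) === '/') ? -1 : 1;
        }
        $debug && print "level {$depth} after tag: " . htmlentities($match[0]) . "\n";
        // Back at depth zero: this tag closes (or is) a root element.
        if ($depth === 0) {
            $root_elements[] = $match;
        }
    }
    // has to end at depth zero to confirm counting is correct
    $ok = ($depth === 0);
    $debug && print ($ok ? 'ok' : 'not ok') . "\n";
    if (!$ok || count($root_elements) === 0) {
        return null;
    }
    $debug && print count($root_elements) . " root elements\n";
    $element_index_at_middle = intdiv(count($root_elements), 2);
    // Offset at which the tag that ends the middle root element starts,
    // as reported by preg_match_all(); add the tag's length to land after it.
    $half_char = $root_elements[$element_index_at_middle][1];
    $debug && print "which makes the half the {$half_char}th character at the {$element_index_at_middle}th element\n";
    return $half_char;
}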
