How to correctly grab some nodes from an HTML string



I am trying to grab some nodes from a given HTML string:

$html = <<<'HTML'
<h1>Details au&szlig;en</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Au&szlig;enseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
HTML;

I need the first h1 node and the last img node in the string.

I used DOMDocument for this, because with preg_match_all or something similar we might miss something.

Full code:

$html = <<<'HTML'
<h1>Details au&szlig;en</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Au&szlig;enseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
HTML;
$dom = new DOMDocument();
// since the libxml was designed for ISO-8859-1, this is a backwards hack
// @see https://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258
$dom->loadHTML(iconv('UTF-8', 'ISO-8859-1', $html),
    LIBXML_HTML_NOIMPLIED
);
$h1List = $dom->getElementsByTagName('h1');
$h1 = $h1List->item(0);
$imgList = $dom->getElementsByTagName('img');
$img = $imgList->item($imgList->length - 1);
$data = array(
    'tabTitle' => trim($dom->saveHTML($h1)),
    'tabImg' => trim($dom->saveHTML($img))
);

// remove both wrappers if empty
$imgWrapper = $img->parentNode;
$imgWrapper->removeChild($img);
if (!$imgWrapper->hasChildNodes()) {
    $imgWrapper->parentNode->removeChild($imgWrapper);
}
$h1Wrapper = $h1->parentNode;
$h1Wrapper->removeChild($h1);
if (!$h1Wrapper->hasChildNodes()) {
    $h1Wrapper->parentNode->removeChild($h1Wrapper);
}
$data['content'] = $dom->saveHTML();
var_dump($data);

Expected output:

array(
    'tabTitle' => '<h1>Details außen</h1>',
    'tabImg' => '<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg">',
    'content' => '
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Au&szlig;enseite [...]</p>
<p class="own-branding">[...]</p>
<p>
'
);

But I get the following output:

array(3) {
  'tabTitle' =>
  string(501) "<h1>Details außen<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Außenseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="%7Bmedia%20path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'%7D" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
</h1>"
  'tabImg' =>
  string(373) "<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="%7Bmedia%20path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'%7D" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg">"
  'content' =>
  string(108) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
"
}

What is going on here? I am using PHP 5.6. If the problem is related to the PHP version, switching to PHP 7 would be possible.

This should get you started. First I query the DOMDocument with XPath, then I print the nodes with saveXML.

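$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = array();
// first <h1> in document order
$nodes[] = $xpath->query('//h1')->item(0);
// last <img> in document order, as the question asks
$nodes[] = $xpath->query('(//img)[last()]')->item(0);
foreach ($nodes as $node) {
    // saveXML() returns UTF-8; utf8_decode() converts it to ISO-8859-1 for display
    echo utf8_decode($dom->saveXML($node)) . PHP_EOL;
}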

This is the output for the example:

<h1>Details außen</h1>
<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"/>

You can then format this into the desired output.
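
A possible explanation for the output in the question, offered as an assumption rather than a confirmed diagnosis: with LIBXML_HTML_NOIMPLIED, libxml appears to treat the first element (here the first h1) as the document root and nests everything after it inside that element. That would also explain why removing the h1 leaves only the DOCTYPE. The sketch below avoids the flag, wraps the fragment in a div so it has one known parent, and serializes with saveXML as in the answer above. The function name extractParts, the wrapper div, and the shortened sample string are illustrative and not part of the original post. The sketch also assumes ASCII input with entities, as in the question; raw UTF-8 input would need extra encoding handling.

```php
<?php
// Hypothetical helper (not from the original post): returns the first <h1>,
// the last <img>, and the remaining markup of an HTML fragment.
function extractParts($html)
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // No LIBXML_HTML_NOIMPLIED: libxml adds <html><body> itself, and the
    // wrapper <div> (an assumption of this sketch) holds the whole fragment.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first <div> in document order.
    $root = $dom->getElementsByTagName('div')->item(0);
    $h1List = $root->getElementsByTagName('h1');
    $imgList = $root->getElementsByTagName('img');
    $h1 = $h1List->item(0);
    $img = $imgList->item($imgList->length - 1);
    if ($h1 === null || $img === null) {
        return null; // fragment has no <h1> or no <img>
    }

    // Serialize before detaching the nodes from the tree.
    $data = array(
        'tabTitle' => $dom->saveXML($h1),
        'tabImg'   => $dom->saveXML($img),
    );

    // Detach both nodes and drop a direct parent that is left empty,
    // but never the wrapper itself.
    foreach (array($h1, $img) as $node) {
        $parent = $node->parentNode;
        $parent->removeChild($node);
        if (!$parent->isSameNode($root) && !$parent->hasChildNodes()) {
            $parent->parentNode->removeChild($parent);
        }
    }

    // The remaining content is whatever is still inside the wrapper.
    $content = '';
    foreach ($root->childNodes as $child) {
        $content .= $dom->saveXML($child);
    }
    $data['content'] = trim($content);

    return $data;
}

// Shortened sample input, for illustration only.
$sample = '<h1>First</h1><h1>Second</h1><p>Text</p>'
    . '<p><img src="{media path=\'a.jpg\'}" alt="a"></p>';
$parts = extractParts($sample);
var_dump($parts);
```

Because the nodes are written with saveXML, void elements such as img and hr come out self-closed (for example `<hr/>`), so some post-processing may still be needed if strict HTML syntax is required.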
