How to remove elements by CSS selector (tag, class, and id) with DomCrawler



How can I implement this solution here with DomCrawler?

<?php
use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler();
$content = file_get_contents('http://example.com/somepage.html');
$crawler->addHtmlContent($content, 'UTF-8');
$content = $crawler->filter('#main-content');
// Remove content by tag and by css selector.
?>
    $crawler = new Crawler($html,$url);
    $document = new DOMDocument('1.0', 'UTF-8');
    $root = $document->appendChild($document->createElement('_root'));
    $crawler->rewind();
    $root->appendChild($document->importNode($crawler->current(), true));
    $domxpath = new DOMXPath($document);
    foreach ($selectorsToRemove as $selector) {
        $crawlerInverse = $domxpath->query(CssSelector::toXPath($selector));
        foreach ($crawlerInverse as $elementToRemove) {
            $parent = $elementToRemove->parentNode;
            $parent->removeChild($elementToRemove);
        }
    }
    $crawler->clear();
    $crawler->add($document);
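
The removal loop above boils down to plain DOM calls: query for the matching nodes, then call removeChild() on each node's parent. Here is a minimal sketch of that mechanism using only PHP's bundled DOM extension. The sample HTML and the `removeByXPath` helper are hypothetical, and the hand-written XPath expressions are only rough stand-ins for what a CSS-to-XPath converter would generate for `script`, `.ad`, and `#tracker`.

```php
<?php
// Hypothetical helper: removes every node matched by the given XPath
// expressions and returns how many nodes were removed.
function removeByXPath(DOMDocument $document, array $expressions): int
{
    $xpath = new DOMXPath($document);
    $removed = 0;
    foreach ($expressions as $expression) {
        // Copy the matches into an array first, as a defensive measure,
        // so that removing nodes cannot disturb the iteration.
        $matches = iterator_to_array($xpath->query($expression));
        foreach ($matches as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
                $removed++;
            }
        }
    }
    return $removed;
}

// Made-up sample markup for illustration only.
$html = '<html><body><div id="main-content"><p>Keep me</p>'
      . '<script>var x = 1;</script>'
      . '<div class="ad banner">Ad</div>'
      . '<span id="tracker">t</span></div></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);

$removed = removeByXPath($document, [
    '//script',                                                          // by tag
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' ad ')]",  // by class
    "//*[@id='tracker']",                                                // by id
]);

// Serialize only the #main-content element to inspect the result.
$main = (new DOMXPath($document))->query("//*[@id='main-content']")->item(0);
$result = $document->saveHTML($main);
echo $result, PHP_EOL;
```

After this runs, `$result` should still contain the paragraph but none of the three removed elements.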

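The Crawler class extends SplObjectStorage. When the crawler receives HTML, it uses the attach() method to add each element to the storage.

This means the detach() method is also available on the crawler object. I haven't tested the code below, but I think it should do the job.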

$crawlerInverse = $crawler->filter('script');
foreach ($crawlerInverse as $elementToRemove) {
    if ($crawler->contains($elementToRemove)) {
       $crawler->detach($elementToRemove);
    }
}

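As stated in the documentation:

The DomCrawler component eases DOM navigation for HTML and XML documents.

Also:

While possible, the DomCrawler component is not designed for manipulation of the DOM or re-dumping HTML/XML.

DomCrawler is meant for extracting details from DOM documents, not for modifying them.

However…

Since PHP passes objects by reference, and the Crawler is basically a wrapper around DOM nodes, it is technically possible to modify the underlying DOM document: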

// Removes all divs with the class "toRemove".
// each() passes a Crawler wrapping a single matched node to the closure.
$crawler->filter('div.toRemove')->each(function (Crawler $crawler) {
    foreach ($crawler as $node) {
        $node->parentNode->removeChild($node);
    }
});

Here is a working example: https://gist.github.com/jakzal/8dd52d3df9a49c1e5922
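
For the three selector kinds from the question (tag, class, id), the same idea could be put together roughly as follows. This is only a sketch: it assumes symfony/dom-crawler and symfony/css-selector are installed through Composer and that `vendor/autoload.php` sits next to the script. The sample HTML and the selector list are made up for illustration.

```php
<?php
use Symfony\Component\DomCrawler\Crawler;

// Assumption: Composer autoloader with symfony/dom-crawler and
// symfony/css-selector available.
require __DIR__ . '/vendor/autoload.php';

// Made-up sample markup for illustration only.
$html = <<<'HTML'
<html><body>
<div id="main-content">
  <p>Keep me</p>
  <script>var x = 1;</script>
  <div class="toRemove">Remove me</div>
  <span id="tracker">Remove me too</span>
</div>
</body></html>
HTML;

$crawler = new Crawler($html);

// One tag selector, one class selector, one id selector.
foreach (['script', 'div.toRemove', '#tracker'] as $selector) {
    $crawler->filter($selector)->each(function (Crawler $match) {
        // $match wraps a single DOMNode; remove it from its parent.
        foreach ($match as $node) {
            $node->parentNode->removeChild($node);
        }
    });
}

// The crawler still points at the same document, which is now modified.
$result = $crawler->filter('#main-content')->html();
echo $result, PHP_EOL;
```

Because the nodes are removed from the underlying DOMDocument, any later filter() call on `$crawler` should no longer find them.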

Use a common helper function, such as:

function removeCrawlerNode($crawler_node) {
    foreach($crawler_node as $node) {
        $node->parentNode->removeChild($node);
    }
}

Then find the part of the crawler you want to search (say, the class .sample_section) and, if it exists, create a remove_tag_array containing all the tags you want to remove:

if($crawler->filter('.sample_section')->count() > 0) {
    $remove_tag_array = array("br", "b", "img", "div", "u", "i");
    $sub_crawler = $crawler->filter('.sample_section');
    foreach ($remove_tag_array as $tag) {
        $sub_crawler->filter($tag)->each(function ($node) {
            removeCrawlerNode($node);
        });
    }
}
