How can I implement this solution here with DomCrawler?
<?php
use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler();
$content = file_get_contents('http://example.com/somepage.html');
$crawler->addHtmlContent($content, 'UTF-8');
$content = $crawler->filter('#main-content');
// Remove content by tag and by css selector.
?>
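// Relies on the Symfony 2.x API: Crawler is an SplObjectStorage there,
// and CssSelector::toXPath() is a static helper.
use Symfony\Component\CssSelector\CssSelector;
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html, $url);

// Copy the crawler's first node into a fresh document under a placeholder root.
$document = new DOMDocument('1.0', 'UTF-8');
$root = $document->appendChild($document->createElement('_root'));
$crawler->rewind();
$root->appendChild($document->importNode($crawler->current(), true));

// Remove every node that matches one of the CSS selectors.
$domxpath = new DOMXPath($document);
foreach ($selectorsToRemove as $selector) {
    $crawlerInverse = $domxpath->query(CssSelector::toXPath($selector));
    foreach ($crawlerInverse as $elementToRemove) {
        $parent = $elementToRemove->parentNode;
        $parent->removeChild($elementToRemove);
    }
}

// Point the crawler at the cleaned document.
$crawler->clear();
$crawler->add($document);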
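The Crawler class extends SplObjectStorage, and when the crawler receives HTML it uses the attach() method to add each element to the storage. This means the detach() method is also available on the crawler object. I haven't tested the code below, but I think it should do the job.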
$crawlerInverse = $crawler->filter('script');
foreach ($crawlerInverse as $elementToRemove) {
    if ($crawler->contains($elementToRemove)) {
        $crawler->detach($elementToRemove);
    }
}
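One thing worth checking with this approach: detach() takes a node out of the collection, but it does not take the node out of its DOM document. The sketch below illustrates that with a plain SplObjectStorage standing in for the crawler. It uses only PHP's bundled DOM and SPL classes, and the HTML string and variable names are made up for the illustration, so treat it as a demonstration of the principle rather than of DomCrawler itself.

```php
<?php
// Illustration only: a plain SplObjectStorage plays the role of the crawler.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML('<html><body><p>Keep me</p><script>alert(1);</script></body></html>');

// Store every element of the document, the way a node collection would.
$storage = new SplObjectStorage();
foreach ($document->getElementsByTagName('*') as $node) {
    $storage->attach($node);
}

$script = $document->getElementsByTagName('script')->item(0);

$countBefore = count($storage);
$storage->detach($script);
$countAfter = count($storage);

// The collection shrank, but the <script> element is still in the document.
$stillInDom = $document->getElementsByTagName('script')->length === 1;
$html = $document->saveHTML();
```

If the goal is cleaned HTML output, the node also has to be removed from its parent, as the other answers do with removeChild().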
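As stated in the documentation:
The DomCrawler component eases DOM navigation for HTML and XML documents.
And also:
While possible, the DomCrawler component is not designed for manipulation of the DOM or re-dumping HTML/XML.
DomCrawler is meant for extracting details from DOM documents, not for modifying them.
However, since PHP passes objects by reference and Crawler is basically a wrapper around DOM nodes, it is technically possible to modify the underlying DOM document: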
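use Symfony\Component\DomCrawler\Crawler;

// Removes every div with the class "toRemove".
$crawler->filter('div.toRemove')->each(function (Crawler $node) {
    // each() passes a Crawler wrapping one match; iterating it yields the DOM node.
    foreach ($node as $domNode) {
        $domNode->parentNode->removeChild($domNode);
    }
});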
Here is a working example: https://gist.github.com/jakzal/8dd52d3df9a49c1e5922
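To see the underlying mechanism without DomCrawler installed, here is a minimal sketch using only PHP's bundled DOM extension. The HTML, the class name, and the variable names are invented for the example. The point is that each matched node is a handle to the node inside the document, so calling removeChild() on its parent changes the document itself.

```php
<?php
// Illustration only: plain DOMDocument and DOMXPath, no Symfony classes.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML(
    '<html><body><div class="toRemove">ad</div><p>text</p>'
    . '<div class="toRemove">ad 2</div></body></html>'
);

$xpath = new DOMXPath($document);

// Copy the matches into a plain array before changing the tree.
$matches = iterator_to_array(
    $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " toRemove ")]')
);
$matchCount = count($matches);

foreach ($matches as $node) {
    // $node refers to the node inside $document, so this edits $document.
    $node->parentNode->removeChild($node);
}

$remainingDivs = $xpath->query('//div')->length;
$html = $document->saveHTML();
```

Copying the matches into an array first is a defensive habit: some node lists, such as the one returned by getElementsByTagName(), are live and can skip elements if nodes are removed during iteration.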
Use a shared helper function such as:
function removeCrawlerNode($crawler_node) {
    // Iterating a Crawler yields its DOM nodes; detach each one from its parent.
    foreach ($crawler_node as $node) {
        $node->parentNode->removeChild($node);
    }
}
Then find the part of the crawler you want to search (for example the class .sample_section) and, if it exists, create a remove_tag_array with all the tags you want to remove:
if ($crawler->filter('.sample_section')->count() > 0) {
    $remove_tag_array = array("br", "b", "img", "div", "u", "i");
    $sub_crawler = $crawler->filter('.sample_section');
    foreach ($remove_tag_array as $tag) {
        // Remove every element with this tag name inside the section.
        $sub_crawler->filter($tag)->each(function ($node) {
            removeCrawlerNode($node);
        });
    }
}