Here is the HTML I want to parse:
$html = '
<h1>title</h1>
<div id="main">
<div id="page">
<div class="article">
<h2><span>date1</span> <a href="link1">title1</a></h2>
<p>text1</p>
</div>
<div class="article">
<h2><span>date2</span> <a href="link2">title2</a></h2>
<p>text2</p>
</div>
</div>
</div>';
Here is what I want:
Array
(
[0] => Array
(
[link] => link1
[title] => title1
[description] => description1
[date] => date1
)
[1] => Array
(
[link] => link2
[title] => title2
[description] => description2
[date] => date2
)
)
Here is my PHP:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$nodes = $xpath->query("//div[@class='article']/h2/a");
$list = array();
$i = 0;
if ($nodes) {
    foreach ($nodes as $node) {
        if ($node->getAttribute('href')) {
            $link = $node->getAttribute('href');
            $list[$i]['link'] = $link;
        }
        if ($node->nodeValue) {
            $title = $node->nodeValue;
            $list[$i]['title'] = $title;
        }
        if ($node->nodeValue) {
            $description = $node->nodeValue;
            $list[$i]['description'] = $description;
        }
        if ($node->nodeValue) {
            $date = $node->nodeValue;
            $list[$i]['date'] = $date;
        }
        $i++;
    }
}
echo '<pre>';
echo print_r ($list);
echo '</pre>';
The results are fine for link1, title1, link2, and title2, but not for description1, date1, description2, and date2.
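I looked through the PHP manual for concrete cases similar to mine, but most of the time, when it comes to DOMDocument, everything is theoretical. Can you help me, or point me to some more concrete resources?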
Edit: here is the content of $node
DOMElement Object
(
[tagName] => a
[schemaTypeInfo] =>
[nodeName] => a
[nodeValue] => title1
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] =>
[prefix] =>
[localName] => a
[baseURI] =>
[textContent] => title1
)
1
DOMElement Object
(
[tagName] => a
[schemaTypeInfo] =>
[nodeName] => a
[nodeValue] => title2
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] =>
[prefix] =>
[localName] => a
[baseURI] =>
[textContent] => title2
)
1
I wouldn't normally work this way, but here is a solution to your problem. I would select the article divs rather than the anchors:
// Select each article container, so the span, a and p are all reachable from one node.
$aNodes = $xpath->query("//div[@class='article']");
$aList = array();
$i = 0;
if ($aNodes) {
    foreach ($aNodes as $aNode) {
        // The date is the text of the <span> inside this article.
        $aDates = $aNode->getElementsByTagName('span');
        foreach ($aDates as $sDate) {
            $aList[$i]['date'] = $sDate->nodeValue;
        }
        // The link and its text both come from the <a> element.
        $aLinks = $aNode->getElementsByTagName('a');
        foreach ($aLinks as $sLink) {
            $aList[$i]['link'] = $sLink->getAttribute('href');
            $aList[$i]['linktext'] = $sLink->nodeValue;
        }
        // The description is the text of the <p> element.
        $aTexts = $aNode->getElementsByTagName('p');
        foreach ($aTexts as $sText) {
            $aList[$i]['descript'] = $sText->nodeValue;
        }
        $i++;
    }
}
echo '<pre>';
print_r ($aList);
echo '</pre>';
Or, if you are sure the layout is always the same:
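// $aNodes, $aList and $i are set up as in the previous snippet.
// item(0) returns null when the element is missing, so this only works if every article has a span, an a and a p.
foreach ($aNodes as $aNode) {
    $aList[$i]['date'] = $aNode->getElementsByTagName('span')->item(0)->nodeValue;
    $aList[$i]['link'] = $aNode->getElementsByTagName('a')->item(0)->getAttribute('href');
    $aList[$i]['linktext'] = $aNode->getElementsByTagName('a')->item(0)->nodeValue;
    $aList[$i]['descript'] = $aNode->getElementsByTagName('p')->item(0)->nodeValue;
    $i++;
}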
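Another possible variation, offered only as a sketch: keep the article div as the context node and let XPath pull each field relative to it with DOMXPath::evaluate(). The key names (link, title, description, date) are taken from the desired output in the question, and I am assuming the description is meant to be the text of the <p> element, so it would come out as text1 rather than description1. Because string() yields an empty string when nothing matches, a missing element should not trigger a call on null here.

```php
<?php
// Sample markup copied from the question.
$html = '
<h1>title</h1>
<div id="main">
<div id="page">
<div class="article">
<h2><span>date1</span> <a href="link1">title1</a></h2>
<p>text1</p>
</div>
<div class="article">
<h2><span>date2</span> <a href="link2">title2</a></h2>
<p>text2</p>
</div>
</div>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$list = array();
// Each article div becomes the context node for the relative expressions below.
foreach ($xpath->query("//div[@class='article']") as $article) {
    $list[] = array(
        // string(...) converts the first matching node to text, or '' if none matches.
        'link'        => $xpath->evaluate('string(h2/a/@href)', $article),
        'title'       => $xpath->evaluate('string(h2/a)', $article),
        // Assumption: the description is the <p> text of the article.
        'description' => $xpath->evaluate('string(p)', $article),
        'date'        => $xpath->evaluate('string(h2/span)', $article),
    );
}

print_r($list);
```

The relative paths (h2/a, p, h2/span) mirror the structure shown in the question. If the real pages nest these elements differently, the expressions would need adjusting.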