DOM and XPath query for parsing an HTML fragment



Here is the HTML I want to parse:

$html = '
<h1>title</h1>
<div id="main">
<div id="page">
<div class="article">
<h2><span>date1</span> <a href="link1">title1</a></h2>
<p>text1</p>
</div>
<div class="article">
<h2><span>date2</span> <a href="link2">title2</a></h2>
<p>text2</p>
</div>
</div>
</div>';

Here is what I want:

Array
(
[0] => Array
    (
        [link] => link1
        [title] => title1
        [description] => description1
        [date] => date1
    )
[1] => Array
    (
        [link] => link2
        [title] => title2
        [description] => description2
        [date] => date2
    )
)

Here is my PHP:

$doc = new DOMDocument(); $doc->loadHTML($html); $xpath = new DOMXpath($doc);
$nodes = $xpath->query("//div[@class='article']/h2/a");
$list = array(); $i = 0;
if($nodes)
{
foreach($nodes as $node) {
    if($node->getAttribute('href')) 
    { $link = $node->getAttribute('href'); $list[$i]['link'] = $link; }
    if($node->nodeValue) 
    { $title = $node->nodeValue; $list[$i]['title'] = $title; }
    if($node->nodeValue) 
    { $description = $node->nodeValue; $list[$i]['description'] = $description; }
    if($node->nodeValue) 
    { $date = $node->nodeValue; $list[$i]['date'] = $date; }
    $i++;
}
}
echo '<pre>';
echo print_r ($list);
echo '</pre>';

The result is fine for link1, title1, link2 and title2, but not for description1, date1, description2 and date2.

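I looked in the PHP manual for concrete cases similar to mine, but most of what I find about DOMDocument is theoretical. Can you help me, or point me to some more concrete resources?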

Edit: here is the content of $node

DOMElement Object
(
[tagName] => a
[schemaTypeInfo] => 
[nodeName] => a
[nodeValue] => title1
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] => 
[prefix] => 
[localName] => a
[baseURI] => 
[textContent] => title1
)
1
DOMElement Object
(
[tagName] => a
[schemaTypeInfo] => 
[nodeName] => a
[nodeValue] => title2
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] => 
[prefix] => 
[localName] => a
[baseURI] => 
[textContent] => title2
)
1

I wouldn't normally work this way, but here is a solution to your problem. I select the article divs rather than the anchors:

$aNodes = $xpath->query("//div[@class='article']");
$aList = array(); 
$i = 0;
if($aNodes){
    foreach($aNodes as $aNode) {
        $aDates = $aNode->getElementsByTagName('span');
        foreach ($aDates as $sDate){
            $aList[$i]['date'] = $sDate->nodeValue;
        }
        $aLinks = $aNode->getElementsByTagName('a');
        foreach ($aLinks as $sLink){
            $aList[$i]['link']  = $sLink->getAttribute('href');
            $aList[$i]['linktext'] = $sLink->nodeValue;
        }
        $aTexts = $aNode->getElementsByTagName('p');
        foreach ($aTexts as $sText){
            $aList[$i]['descript'] = $sText->nodeValue;
        }
        $i++;
    }
}
echo '<pre>';
print_r ($aList);
echo '</pre>';
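
The loop in the question fails because every $node is the <a> element, so nodeValue is the link text each time; the date lives in the sibling <span> and the description in the <p> next to the <h2>. As an alternative sketch (not the method used in the snippet above), relative XPath expressions can be evaluated against each article div by passing it as the context node to DOMXPath::evaluate(). The function name extractArticles is made up for illustration, and the array keys are the ones requested in the question.

```php
<?php
// Hypothetical helper: builds one record per <div class="article">.
function extractArticles(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $list = array();
    foreach ($xpath->query("//div[@class='article']") as $article) {
        // Each expression is relative to $article (the context node).
        // string() yields '' when nothing matches.
        $list[] = array(
            'link'        => $xpath->evaluate('string(h2/a/@href)', $article),
            'title'       => $xpath->evaluate('string(h2/a)', $article),
            'description' => $xpath->evaluate('string(p)', $article),
            'date'        => $xpath->evaluate('string(h2/span)', $article),
        );
    }
    return $list;
}

$html = '
<h1>title</h1>
<div id="main">
<div id="page">
<div class="article">
<h2><span>date1</span> <a href="link1">title1</a></h2>
<p>text1</p>
</div>
<div class="article">
<h2><span>date2</span> <a href="link2">title2</a></h2>
<p>text2</p>
</div>
</div>
</div>';

print_r(extractArticles($html));
```

With the sample HTML, the description values come out as text1 and text2, since that is what the <p> elements contain.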

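Alternatively, if you are sure the layout is always the same, the inner loops over the $aNodes elements can be replaced with item(0):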

// Reuses $aNodes, $aList and $i (starting at 0) from the snippet above.
foreach($aNodes as $aNode) {
    $aList[$i]['date'] = $aNode->getElementsByTagName('span')->item(0)->nodeValue;
    $aList[$i]['link'] = $aNode->getElementsByTagName('a')->item(0)->getAttribute('href');
    $aList[$i]['linktext'] = $aNode->getElementsByTagName('a')->item(0)->nodeValue;
    $aList[$i]['descript'] = $aNode->getElementsByTagName('p')->item(0)->nodeValue;
    $i++;
}
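
Note that item(0) returns null when no matching element exists, so the shortcut above raises an error on any article that lacks a <span>, <a> or <p>. A minimal sketch of a guard, assuming a hypothetical helper firstText() that is not part of the DOM extension:

```php
<?php
// Hypothetical helper: text of the first descendant with the given
// tag name, or null when the parent has no such descendant.
function firstText(DOMElement $parent, string $tag): ?string
{
    $el = $parent->getElementsByTagName($tag)->item(0);
    return $el === null ? null : $el->nodeValue;
}

// An article with a link but no date <span>.
$doc = new DOMDocument();
$doc->loadHTML('<div class="article"><h2><a href="link1">title1</a></h2></div>');
$article = $doc->getElementsByTagName('div')->item(0);

var_dump(firstText($article, 'a'));    // string(6) "title1"
var_dump(firstText($article, 'span')); // NULL
```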
