PHP DOMNode manipulation is slow and breaks



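I have a script that takes some HTML and tries to extract some data from it. The field names are in CardTitle spans, with the data following each one. Unfortunately, all the fields and their data are siblings, which makes them hard to extract. Here is my current script (trimmed to the relevant parts):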

$time = microtime(true);
$curr_card = array();
$item = $list->item($i);
$cardPath = getHTML($base . $item->getAttribute('href'));
$time = microtime(true) - $time;
echo 'Time to download and load card info: ' . $time . '<br />';
$title = $cardPath->evaluate('//div[@class="WordSection1"]/h4')->item(0)->textContent;
preg_match('/\s\(([A-Za-z0-9]+)\)/', $title, $curr_set);
$curr_card['set'] = $curr_set[1];
$curr_card['card_name'] = preg_replace('/\s\([A-Za-z0-9]+\)/', '', $title);
echo 'Getting field data for ' . $curr_card['card_name'] . '<br />';
$fields = $cardPath->evaluate('//div[@class="WordSection1"]/p[@class="Definition"]/span[@class="CardTitle"]');
$time = $field_time = microtime(true);
echo '# of fields: ' . $fields->length . '<br />';
for($a = 0; $a < $fields->length; $a++)
{
    $field = $fields->item($a);
    $fieldName = $field->textContent;
    echo 'Field Name: ' . $fieldName . '<br />';
    $fieldData = recursiveSibling($field->nextSibling);
    echo 'Field Data: ' . $fieldData . '<br />';
    $field_time = microtime(true) - $field_time;
    $fieldnum = $a + 1;
    echo 'Field #' . $fieldnum . ' took ' . $field_time . ' to process. <br />';
    $field_time = microtime(true);
}
$time = microtime(true) - $time;
echo 'Time to extract card info: ' . $time . '<br />';
function getHTML($url, $xpath = true)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, 'Firefox (WindowsXP) – Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6');
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number:" .curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
        exit;
    }
    if($xpath)
    {
        $dom = new DOMDocument();
        @$dom->loadHTML($html); 
        return new DOMXPath($dom);
    }
    else
        return $html;
}
function recursiveSibling($node)
{
    if(strstr($node->nodeName, 'span') === false)
    {
        $text = $node->textContent . recursiveSibling($node->nextSibling);
        return $text;
    }
}

This is what the script outputs:

Time to download and load master list: 0.495495080948
Time to download and load card info: 0.106231927872
Getting field data for A Child is Born
# of fields: 9
Field Name: Type: 
Field Data: Hero Enh. • 
Field #1 took 3.60012054443E-5 to process. 
Field Name: Brigade: 
Field Data: White • 
Field #2 took 1.00135803223E-5 to process. 
Field Name: Ability: 
Field Data: None • 
Field #3 took 8.10623168945E-6 to process. 
Field Name: Class: 
Field Data: None • 
Field #4 took 7.15255737305E-6 to process. 
Field Name: Special Ability: 
Field Data: Discard all Demons in Play. Cannot be interrupted, negated, or prevented. • 
Field #5 took 3.31401824951E-5 to process. 
Field Name: Errata: 
Field Data: Discard all demons in play. Cannot be negated. • 
Field #6 took 1.50203704834E-5 to process. 
Field Name: Identifiers: 
Field Data: None • 
Field #7 took 6.91413879395E-6 to process. 
Field Name: Verse: 
Field Data: None • 
Field #8 took 5.96046447754E-6 to process. 
Field Name: Availability: 

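I don't understand why it takes so long to execute (about 40 seconds), and I understand even less why the last field breaks the script. In case it helps, this is the page I'm extracting from: http://www.redemptionreg.com/REG/Master/achildisbornp.htm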

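If anyone can explain what I'm doing wrong and how to make it faster, I would be very grateful. There are more than 2000 cards to process, and at 45 seconds each the script would run for over 24 hours!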

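I solved the problem. The whole issue was that there is no span after the last field, "Availability". The recursiveSibling function therefore ran past the last sibling, never found a span to stop at, and went into infinite recursion. After I added a condition that checks whether there is another node, it worked.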
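The exact condition that was added is not shown above, so the following is only a sketch of one way to write the check. It assumes an iterative helper instead of recursion; the name collectSiblingText and the sample HTML are made up for illustration and are not taken from the real page. It also compares nodeName to 'span' exactly, where the original used strstr.

```php
<?php
// Hypothetical helper (not the original code): gather the text of all
// siblings after a field label. It stops at the next <span>, or when
// there are no siblings left (nextSibling is null after the last field).
function collectSiblingText(?DOMNode $node): string
{
    $text = '';
    while ($node !== null && $node->nodeName !== 'span') {
        $text .= $node->textContent;
        $node = $node->nextSibling;
    }
    return $text;
}

// Made-up sample markup that mimics the sibling layout described in the
// question. The last field has no span after it.
$html = '<div class="WordSection1"><p class="Definition">'
      . '<span class="CardTitle">Type: </span>Example A '
      . '<span class="CardTitle">Availability: </span>Example B'
      . '</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = [];
$fields = $xpath->query('//p[@class="Definition"]/span[@class="CardTitle"]');
foreach ($fields as $field) {
    $name = trim($field->textContent);
    $result[$name] = trim(collectSiblingText($field->nextSibling));
}

foreach ($result as $name => $data) {
    echo $name . ' ' . $data . "\n";
}
```

A loop also avoids deep call stacks when a field is followed by many sibling nodes. A recursive version works too, as long as it returns an empty string when the node is null.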
