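I have a script that takes some HTML and tries to extract data from it. Each field name sits in a span with the class CardTitle, and the field's data follows it. Unfortunately, all the field names and their data are siblings, which makes them hard to extract. Here is my current script (trimmed to the relevant parts):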
$time = microtime(true);
$curr_card = array();
$item = $list->item($i);
$cardPath = getHTML($base . $item->getAttribute('href'));
$time = microtime(true) - $time;
echo 'Time to download and load card info: ' . $time . '<br />';
$title = $cardPath->evaluate("//div[@class='WordSection1']/h4")->item(0)->textContent;
preg_match('/\s\(([A-Za-z0-9]+)\)/', $title, $curr_set);
$curr_card['set'] = $curr_set[1];
$curr_card['card_name'] = preg_replace('/\s\([A-Za-z0-9]+\)/', '', $title);
echo 'Getting field data for ' . $curr_card['card_name'] . '<br />';
$fields = $cardPath->evaluate("//div[@class='WordSection1']/p[@class='Definition']/span[@class='CardTitle']");
$time = $field_time = microtime(true);
echo '# of fields: ' . $fields->length . '<br />';
for($a = 0; $a < $fields->length; $a++)
{
    $field = $fields->item($a);
    $fieldName = $field->textContent;
    echo 'Field Name: ' . $fieldName . '<br />';
    $fieldData = recursiveSibling($field->nextSibling);
    echo 'Field Data: ' . $fieldData . '<br />';
    $field_time = microtime(true) - $field_time;
    $fieldnum = $a + 1;
    echo 'Field #' . $fieldnum . ' took ' . $field_time . ' to process. <br />';
    $field_time = microtime(true);
}
$time = microtime(true) - $time;
echo 'Time to extract card info: ' . $time . '<br />';
function getHTML($url, $xpath = true)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, 'Firefox (WindowsXP) – Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6');
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number:" . curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
        exit;
    }
    if($xpath)
    {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        return new DOMXPath($dom);
    }
    else
        return $html;
}
function recursiveSibling($node)
{
    if(strstr($node->nodeName, 'span') === false)
    {
        $text = $node->textContent . recursiveSibling($node->nextSibling);
        return $text;
    }
}
This is what the script outputs:
Time to download and load master list: 0.495495080948
Time to download and load card info: 0.106231927872
Getting field data for A Child is Born
# of fields: 9
Field Name: Type:
Field Data: Hero Enh. •
Field #1 took 3.60012054443E-5 to process.
Field Name: Brigade:
Field Data: White •
Field #2 took 1.00135803223E-5 to process.
Field Name: Ability:
Field Data: None •
Field #3 took 8.10623168945E-6 to process.
Field Name: Class:
Field Data: None •
Field #4 took 7.15255737305E-6 to process.
Field Name: Special Ability:
Field Data: Discard all Demons in Play. Cannot be interrupted, negated, or prevented. •
Field #5 took 3.31401824951E-5 to process.
Field Name: Errata:
Field Data: Discard all demons in play. Cannot be negated. •
Field #6 took 1.50203704834E-5 to process.
Field Name: Identifiers:
Field Data: None •
Field #7 took 6.91413879395E-6 to process.
Field Name: Verse:
Field Data: None •
Field #8 took 5.96046447754E-6 to process.
Field Name: Availability:
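I can't see why it takes so long to execute (around 40 seconds), and I understand even less why the last field breaks the script. In case it helps, this is the page I'm extracting from: http://www.redemptionreg.com/REG/Master/achildisbornp.htm
If anyone can explain what I'm doing wrong and how to make it faster, I'd be very grateful. There are more than 2000 cards to process, and at 45 seconds each the script would take over 24 hours to run!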
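I solved the problem. The whole issue was that there is no span after the last field, "Availability". Once the siblings ran out, recursiveSibling kept calling itself with a null node and went into infinite recursion. After I added a condition that checks whether there is another node, it worked.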
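For anyone hitting the same thing, here is a minimal sketch of what that guard might look like. It is not the exact code from the final script: the null check is written the way I would assume it goes, and the HTML fragment is made-up sample markup, not taken from the real page. The function returns an empty string both when it reaches the next span and when there are no siblings left.

```php
<?php
// Sketch of recursiveSibling with a guard for the end of the sibling list.
// Stops at the next <span> (the next field name) or when $node is null.
function recursiveSibling($node)
{
    if ($node === null || strstr($node->nodeName, 'span') !== false) {
        return '';
    }
    return $node->textContent . recursiveSibling($node->nextSibling);
}

// Hypothetical sample markup: the last field has no span after it.
$html = '<p class="Definition"><span class="CardTitle">Type:</span> Hero Enh. '
      . '<span class="CardTitle">Availability:</span> Sample text</p>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$fields = $xpath->evaluate("//p[@class='Definition']/span[@class='CardTitle']");

foreach ($fields as $field) {
    // trim() only tidies the whitespace around the text nodes for display.
    echo $field->textContent . ' ' . trim(recursiveSibling($field->nextSibling)) . PHP_EOL;
}
```

With the guard in place the last field ends as soon as nextSibling is null, so the call returns instead of recursing until PHP hits its time or memory limit.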