PHP cURL web scraping: I get a blank page when I try to extract specific content



I'm scraping a web page with PHP cURL and have run into a problem: when I try to extract specific content, I get a blank page back.

For example:

<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
  <table>
    <tr>
       <td>test1</td>
       <td>test1</td>
       <td>test1</td>
    </tr>
  </table>
</body>
</html>

I only need the contents of the <tr></tr> element.

Here is my code:

// Defining the basic cURL function
function curl($url) {
    // Assigning cURL options to an array
    $options = Array(
        CURLOPT_SSL_VERIFYPEER => false,
        // CURLOPT_CAINFO => 'cacert.pem',
        CURLOPT_RETURNTRANSFER => TRUE,  // Setting cURL's option to return the webpage data
        CURLOPT_FOLLOWLOCATION => TRUE,  // Setting cURL to follow 'location' HTTP headers
        CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120,   // Setting the amount of time (in seconds) before the request times out
        CURLOPT_TIMEOUT => 120,  // Setting the maximum amount of time for cURL to execute queries
        CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
        CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
    );
    $ch = curl_init();  // Initialising cURL 
    curl_setopt_array($ch, $options);   // Setting cURL's options using the previously assigned array data in $options
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch);    // Closing cURL 
    return $data;   // Returning the data from the function 
}


// Defining the basic scraping function
function scrape_between($data, $start, $end){
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start));  // Stripping $start
    $stop = stripos($data, $end);   // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop);    // Stripping all data from after and including the $end of the data to scrape
    return $data;   // Returning the scraped data from the function
}

$url = "https://www.weddingwire.com/c/ak-alaska/wedding-officiants/9-sca.html";      
$results_page = curl($url); // Downloading the results page using our curl() function
$results_page = scrape_between($results_page, '<div class="js-search-results">', '<div class="col-xs-12 testing-catalog-pagination-links">'); // Scraping out only the middle section of the results page that contains our results

This is the data I need to parse!

Credit for the source code goes to: http://www.jacobward.co.uk/web-scraping-with-php-curl-part-1/

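Using plain string functions is an ineffective way to parse markup data. Regular expressions offer more flexibility, but solutions based on them are usually not robust against even small changes in the markup structure.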
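To make that fragility concrete: in the question's scrape_between(), stristr() returns false when the start marker does not occur in the page, and the following substr() calls then produce an empty string, which is one possible cause of the blank output. The sketch below is a hypothetical stricter variant (the name scrape_between_checked and the sample markup are assumptions made for illustration) that returns null when a marker is missing, so the failure is at least visible.

```php
<?php
// Hypothetical stricter variant of scrape_between() from the question:
// returns null instead of an empty string when a marker is missing.
function scrape_between_checked(string $data, string $start, string $end): ?string {
    // Case-insensitive search for the start marker.
    $from = stripos($data, $start);
    if ($from === false) {
        return null; // start marker not found
    }
    $from += strlen($start); // skip past the start marker itself

    // Look for the end marker only after the start marker.
    $to = stripos($data, $end, $from);
    if ($to === false) {
        return null; // end marker not found
    }
    return substr($data, $from, $to - $from);
}

// Made-up sample markup, not taken from the real page.
$html = '<div class="a">hello</div><div class="b">';
var_dump(scrape_between_checked($html, '<div class="a">', '</div>'));       // string(5) "hello"
var_dump(scrape_between_checked($html, '<div class="missing">', '</div>')); // NULL
```

It is also worth checking the return value of curl_exec() before scraping, since it returns false when the request fails and curl_error() then describes the reason.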

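The best approach is to use a DOM parser. DOM parsers are the right tool, built specifically for this kind of task. You hand the markup to a parser object, then navigate the structure and select and extract whatever data you need.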

Take a look at this simple example:

<?php
require_once 'simple_html_dom.php';
$markup = <<<EOT
<html>
  <body>
    <div>foo 1</div>
    <div class="category-landing-links">
      <div class="col-xs-4">bla 1</div>
      <div class="col-xs-4">bla 2</div>
      <div class="col-xs-4">bla 3</div>
    </div>
    <div>foo 2</div>
  </body>
</html>
EOT;
$htmlDom = str_get_html($markup);
$outerDivs = $htmlDom->find('div[class=category-landing-links]');
$finalData = [];
foreach ($outerDivs as $key=>$outerDiv) {
  foreach ($outerDiv->children() as $innerDiv) {
    $finalData[$key][] = $innerDiv->innertext;
  }
}
var_dump($finalData);

The output of the above is:

array(1) {
  [0] =>
  array(3) {
    [0] =>
    string(5) "bla 1"
    [1] =>
    string(5) "bla 2"
    [2] =>
    string(5) "bla 3"
  }
}

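So you get an array with one entry per matched outer <div> tag, and each entry in turn holds the inner text of all of its child <div> tags. Depending on your situation, you may have to adapt this to your needs.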

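simple_html_dom.php is a very simple implementation of a DOM parser. It is a bit old, but it works fine. It is free software available from SourceForge. The online documentation offers examples that are easy to understand and demonstrate most of its features.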
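If you would rather not add a third-party file, the same idea can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath). This is a swapped-in alternative, not what the example above uses. It takes the sample markup from the question and collects the text of every <td> inside each <tr>; the helper name extract_rows is an assumption made for illustration.

```php
<?php
// Sample markup copied from the question.
$markup = <<<EOT
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
  <table>
    <tr>
       <td>test1</td>
       <td>test1</td>
       <td>test1</td>
    </tr>
  </table>
</body>
</html>
EOT;

// Hypothetical helper: returns one array per <tr>, holding the text of its <td> cells.
function extract_rows(string $html): array {
    $dom = new DOMDocument();
    // Real-world pages are rarely valid markup; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $rows = [];
    // '//table//tr' matches rows whether or not a <tbody> sits in between.
    foreach ($xpath->query('//table//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

$rows = extract_rows($markup);
var_dump($rows);
```

For the real page, you would pass the string returned by your curl() function to extract_rows() and adjust the XPath expression to the elements you are after.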
