How to crawl with PHP Goutte and Guzzle when the data is loaded by JavaScript

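Many times when crawling we run into the problem that the content rendered on the page is generated with JavaScript (e.g. AJAX requests, jQuery), and therefore Scrapy is unable to crawl it.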

You will want to have a look at PhantomJS. There is a PHP implementation:

http://jonnnnyw.github.io/php-phantomjs/

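That is, of course, if you need it to work with PHP.

You could read the page and then feed the contents to Guzzle, in order to use the nice functions Guzzle gives you (searching the contents, and so on). It depends on your needs; maybe you can simply use the DOM, as in: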

How to get elements by class name?
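As a rough illustration of the plain-DOM route, here is a minimal sketch that selects elements by class name using PHP's built-in DOMDocument and DOMXPath. The `$renderedHtml` string and the `textsByClass()` helper are hypothetical names made up for this example; in practice the HTML would be whatever the headless browser returned.

```php
<?php
// Hypothetical stand-in for the HTML a headless browser returned
// after the page's scripts had run.
$renderedHtml = <<<'HTML'
<html><body>
  <ul>
    <li class="item featured">First</li>
    <li class="item">Second</li>
    <li class="other">Third</li>
  </ul>
</body></html>
HTML;

/**
 * Return the trimmed text of every element that carries the given class.
 * Assumes $class is a simple class name without quotes or spaces.
 */
function textsByClass(string $html, string $class): array
{
    $dom = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Pad with spaces so "item" matches class="item featured"
    // but not class="items".
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$items = textsByClass($renderedHtml, 'item');
print_r($items);
```

The space-padded `contains()` test is the usual XPath 1.0 workaround for matching one class among several, since XPath 1.0 has no class selector of its own.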

Here is some working code that fetches the page through PhantomJS.

    // Assumption: PhantomClient is an alias for the php-phantomjs client class.
    use JonnyW\PhantomJs\Client as PhantomClient;

    // Usage: fetch the rendered page once and feed it into the crawler.
    $content = $this->getHeadlessResponse($url);
    $this->crawler->addContent($content);

    /**
     * Get response using a headless browser (phantom in this case).
     *
     * @param string $url
     *   URL to fetch headless.
     *
     * @return string
     *   Rendered page content, or an empty string on a non-200 status.
     */
    public function getHeadlessResponse($url) {
        // Fetch with PhantomJS.
        $phantomClient = PhantomClient::getInstance();
        $request = $phantomClient->getMessageFactory()->createRequest($url, 'GET');
        /**
         * @see \JonnyW\PhantomJs\Http\Response
         */
        $response = $phantomClient->getMessageFactory()->createResponse();
        // Send the request; the rendered page ends up in $response.
        $phantomClient->send($request, $response);
        if ($response->getStatus() === 200) {
            return $response->getContent();
        }
        // Any other status: return an empty string so callers always get a string.
        return '';
    }

The only drawback of using phantom is that it will be slower than Guzzle, but of course you have to wait for all that pesky JS to load.

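Guzzle (which Goutte uses internally) is an HTTP client. As a result, JavaScript content will not be parsed or executed. JavaScript files that reside outside the requested endpoint will not be downloaded.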
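To make that concrete, here is a small sketch of what a plain HTTP client ends up with. The `$rawHtml` string is a hypothetical response body invented for this example, and PHP's built-in DOMDocument stands in for the crawler's parser; the point is only that the inline script arrives as text and never runs.

```php
<?php
// Hypothetical response body, exactly as an HTTP client would receive it.
$rawHtml = <<<'HTML'
<html><body>
  <div id="app"></div>
  <script>document.getElementById('app').textContent = 'Loaded by JS';</script>
</body></html>
HTML;

$dom = new DOMDocument();
// Suppress parser warnings for the duration of the load.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($rawHtml);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$app = $xpath->query('//div[@id="app"]')->item(0);

// The parser does not execute the script, so the placeholder stays empty.
$appText = trim($app->textContent);
// The script itself is still present in the document, as inert text.
$scriptCount = $xpath->query('//script')->length;

var_dump($appText === '', $scriptCount);
```

A browser would show "Loaded by JS" inside the div; the parsed response does not, which is why a headless browser (or another JavaScript runtime) is needed for such pages.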

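Depending on your environment, I suppose it might be possible to use PHPv8 (a PHP extension that embeds the Google V8 JavaScript engine) together with a custom handler/middleware to do what you want.

Then again, depending on your environment, it might be easier to simply perform the scraping with a JavaScript client.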

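I would suggest trying to get the response content. Parse it into new HTML (if you have to) and use it as $html when initializing a new Crawler object; after that you can use all the data in the response as with any other Crawler object.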

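    // Assumption: Crawler is the Symfony DomCrawler class that Goutte builds on.
    use Symfony\Component\DomCrawler\Crawler;

    $crawler = $client->submit($form);
    $html = $client->getResponse()->getContent();
    $newCrawler = new Crawler($html);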
