PHP Simple HTML DOM Parser: errors when fetching images



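I am trying to download all images from a given URL using the Simple HTML DOM parser. It does save images into my folder (I created the folder "hinh" beforehand), but after that it still returns many warnings about file_get_contects() and file_put_contents().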

The errors printed on my screen:

Warning: file_get_contents(): Filename cannot be empty in D:\xampp\htdocs\cra\index.php on line 11
Warning: file_put_contents(hinh/): failed to open stream: No such file or directory in D:\xampp\htdocs\cra\index.php on line 11

I already created the folder "hinh", but the errors still appear. Here is my code:

<?php
    include('simple_html_dom.php'); // Use PHP Simple HTML DOM Parser to read elements from the page
    if(isset($_POST['submit'])){    // Only run when the form has been submitted; $_POST['url'] should be checked as well
    $url=file_get_html($_POST['url']); // Load the page at the URL typed into the text field
    $image = $url->find("img"); // Find all <img> tags in the page
    foreach($image as $img) // Loop over every <img> tag found
    {
        $s=$img->src; // Get the image's src
        $img_name = 'hinh/'.basename($s); // Build the local path from the file name only, not the full link
        file_put_contents($img_name, file_get_contents($s)); // Fetch the image and save it to that path
    }
    }
?>

This is index.php:

<form id="form1" name="form1" method="post" action="index.php">
  <table width="700" border="1" align="center" cellpadding="1" cellspacing="1">
    <tr>
      <td colspan="2"><label for="textfield"></label>
      <input style="width:100%;" type="text" name="url" id="textfield" />
      </td>
    </tr>
    <tr>
      <td colspan="2" align="center" valign="middle"><input type="submit" name="submit" id="button" value="Submit" /></td>
    </tr>
  </table>
</form>

You are treating every image URL as if it were an absolute URL. Suppose the page you are scraping is located at http://example.com/pages/index.html.

This works: <img src="http://example.com/images/1.jpg" /> resolves to http://example.com/images/1.jpg in the browser and to http://example.com/images/1.jpg in your code.

This does not work: <img src="/images/1.jpg" /> resolves to http://example.com/images/1.jpg in the browser, but to /images/1.jpg in your code.

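You have to check whether the image src contains a relative or an absolute URL. Otherwise file_get_contents() will look for the image on your local file system, which can expose sensitive data (for example <img src="/etc/shadow" />).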
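
One way to do that check is sketched below. The function name resolve_image_url() is hypothetical and not part of Simple HTML DOM; it assumes you pass in the URL of the page being scraped. It only handles the common cases (absolute, protocol-relative, root-relative and document-relative src values) and does not normalize "../" segments, so treat it as a starting point rather than a complete URL resolver.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): turn an <img> src into an
// absolute http(s) URL, or return null when it cannot be resolved safely.
function resolve_image_url(string $src, string $pageUrl): ?string
{
    $src = trim($src);
    if ($src === '') {
        return null; // empty src, e.g. an image filled in later by JavaScript
    }
    // Already absolute: accept only http and https.
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    $base = parse_url($pageUrl);
    if ($base === false || empty($base['scheme']) || empty($base['host'])) {
        return null; // cannot resolve without a valid base URL
    }
    // Protocol-relative: //cdn.example.com/a.png
    if (strpos($src, '//') === 0) {
        return $base['scheme'] . ':' . $src;
    }
    // Any other scheme (data:, javascript:, file:, ...) is rejected.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $src)) {
        return null;
    }
    $origin = $base['scheme'] . '://' . $base['host']
            . (isset($base['port']) ? ':' . $base['port'] : '');
    // Root-relative: /images/1.jpg
    if ($src[0] === '/') {
        return $origin . $src;
    }
    // Document-relative: images/1.jpg, resolved against the page's directory.
    $path = $base['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $src;
}
```

With this, a src of /etc/shadow is turned into a URL on the remote host instead of being read from your own disk, and anything that is not http or https is skipped.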

Edit:

The example page contains some img tags with an empty src attribute; those images are loaded with JavaScript. You can skip them like this:

<?php
    include('simple_html_dom.php');
    if(isset($_POST['submit'])) {
        $url=file_get_html($_POST['url']);
        $image = $url->find("img");
        foreach($image as $img) {
            // Skip <img> tags whose src is empty (filled in later by JavaScript)
            if(!empty($img->src)) {
                $s=$img->src;
                $img_name = 'hinh/'.basename($s);
                file_put_contents($img_name, file_get_contents($s));
            }
        }
    }
?>
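
The second warning, file_put_contents(hinh/), appears when basename() yields an empty name, so the target is just the directory. A minimal sketch of guarding against that follows; local_image_path() and save_image() are hypothetical names, the "hinh" default simply mirrors the folder from the question, and the code assumes the image URL is already absolute.

```php
<?php
// Hypothetical helper: derive a local file path for an image URL, or null
// when the URL has no usable file name (which would otherwise yield "hinh/").
function local_image_path(string $imageUrl, string $dir = 'hinh'): ?string
{
    // Use only the path component so "?v=2" style query strings are dropped.
    $path = parse_url($imageUrl, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }
    $name = basename($path);
    if ($name === '' || $name === '.' || $name === '..') {
        return null;
    }
    return rtrim($dir, '/') . '/' . $name;
}

// Hypothetical helper: download one image, returning true on success.
function save_image(string $imageUrl, string $dir = 'hinh'): bool
{
    $target = local_image_path($imageUrl, $dir);
    if ($target === null) {
        return false;
    }
    // @ hides the warning; the false return value is checked instead.
    $data = @file_get_contents($imageUrl);
    if ($data === false) {
        return false;
    }
    return file_put_contents($target, $data) !== false;
}
```

Inside the foreach loop you could then call save_image($s) instead of the combined file_put_contents()/file_get_contents() line and decide what to do when it returns false.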

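Note that while testing I found that in some cases Simple HTML DOM fails to load a remote file; it seems to be intended for loading local files. So I used this instead, which may be a more stable solution: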

$html  = file_get_contents($_POST['url']);
$url   = str_get_html($html);
$image = $url->find("img");
...
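
Fetching the HTML yourself also lets you check for failure before handing the string to str_get_html(). The sketch below is one possible way to do that; fetch_html() is a hypothetical name, and the 10 second timeout is an arbitrary assumption.

```php
<?php
// Hypothetical helper: fetch a page's HTML over http(s), or return null.
function fetch_html(string $url, int $timeoutSeconds = 10): ?string
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return null; // refuses local paths, file://, ftp://, etc.
    }
    // The "http" context options are also used for https:// URLs.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);
    // @ hides the warning; the false return value is checked instead.
    $html = @file_get_contents($url, false, $context);
    return ($html === false || $html === '') ? null : $html;
}
```

You would call $html = fetch_html($_POST['url']); and only pass the result to str_get_html() when it is not null.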
