Screen scraping a web server using its IP address instead of its domain name



Is this possible? It works when baseUrl="http://mashable.com", but it stops working when I give it an IP address.

<script src='https://raw.github.com/padolsey/jQuery-Plugins/master/cross-domain-ajax/jquery.xdomainajax.js'></script>
<script>
$(document).ready(function () {
    // Target server addressed by IP and port instead of a domain name
    var baseUrl = "https://12.34.56.78:8000/";
    $.ajax({
        url: baseUrl,
        type: "get",
        dataType: "",
        success: function (data) {
            alert("Yeah we are om jere");
        }
    });
});
</script>

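This will be difficult, because many websites may be hosted on the same server and therefore share the same IP address. It works with a domain name because your client sends the domain name in the Host header along with the GET request.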
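To illustrate why a shared IP is ambiguous, here is a minimal sketch of how a server might pick a site from the Host header. The `sites` table, its domain names, and the `resolveVirtualHost` function are hypothetical and only stand in for real server configuration.

```javascript
// Hypothetical table of sites hosted on one IP address (names are made up).
const sites = {
  "site-a.example": "Welcome to site A",
  "site-b.example": "Welcome to site B",
};

// Mimics name-based virtual hosting: the server chooses a site from the Host header.
function resolveVirtualHost(hostHeader, siteTable) {
  // Strip an optional port (e.g. "site-a.example:8000") and normalise case.
  // Note: this simple split does not handle bracketed IPv6 literals.
  const name = String(hostHeader || "").split(":")[0].toLowerCase();
  if (Object.prototype.hasOwnProperty.call(siteTable, name)) {
    return { status: 200, body: siteTable[name] };
  }
  // A bare IP matches no configured site, so there is nothing specific to serve.
  return { status: 404, body: "Not Found" };
}
```

With this table, a Host value of `site-a.example` selects site A, while a Host value that is just the IP address matches nothing. A real server may instead fall back to a default site or redirect.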

See the curl output for Stack Overflow:

C:\Users\Yeah>curl --head -i -v stackoverflow.com/
* Hostname was NOT found in DNS cache
*   Trying 198.252.206.140...
* Connected to stackoverflow.com (198.252.206.140) port 80 (#0)
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: stackoverflow.com
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< [...]

You can see the domain name being passed as a header. By contrast, if I query the IP address found above, I get a 404 error:

C:\Users\Yeah>curl --head -i -v 198.252.206.140/
* Hostname was NOT found in DNS cache
*   Trying 198.252.206.140...
* Connected to 198.252.206.140 (198.252.206.140) port 80 (#0)
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: 198.252.206.140
> Accept: */*
>
< HTTP/1.1 404 Not Found
HTTP/1.1 404 Not Found
< [...]
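One possible workaround, sketched here for Node.js rather than the browser, is to connect to the IP address while sending the domain name in the Host header yourself. Browser JavaScript cannot do this, because Host is a header that scripts are not allowed to set. The function names `buildOptions` and `headByIp` are made up for this example, and whether the server answers with a 200 still depends on its configuration.

```javascript
const http = require("http");

// Build options that connect to `ip` but present `hostname` in the Host header.
function buildOptions(ip, hostname, port) {
  return {
    host: ip,                    // where the TCP connection goes
    port: port || 80,
    method: "HEAD",
    path: "/",
    headers: { Host: hostname }, // which site we ask the server for
  };
}

// Send the request and report the status code via callback(err, statusCode).
function headByIp(ip, hostname, port, callback) {
  const req = http.request(buildOptions(ip, hostname, port), (res) => {
    res.resume(); // discard any body so the socket is released
    callback(null, res.statusCode);
  });
  req.on("error", callback);
  req.end();
}
```

A call such as `headByIp("198.252.206.140", "stackoverflow.com", 80, console.log)` would mirror the first curl request above while still addressing the server by IP.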

As a counterexample, if I try something similar against the Facebook website, I get the following:

C:\Users\Yeah>curl --head -i -v --insecure -L https://www.facebook.com/
* Hostname was NOT found in DNS cache
*   Trying 31.13.93.3...
* Connected to www.facebook.com (31.13.93.3) port 443 (#0)
* [SSL stuff ...]
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: www.facebook.com
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< [...]

If I try the IP address from above:

C:\Users\Yeah>curl --head -i -v --insecure -L https://31.13.93.3/
* Hostname was NOT found in DNS cache
*   Trying 31.13.93.3...
* Connected to 31.13.93.3 (31.13.93.3) port 443 (#0)
* [SSL stuff ...]
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: 31.13.93.3
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
HTTP/1.1 301 Moved Permanently
< Location: http://www.facebook.com/
Location: http://www.facebook.com/
< [...]
<
* Connection #0 to host 31.13.93.3 left intact
* Issue another request to this URL: 'http://www.facebook.com/'
* Hostname was NOT found in DNS cache
*   Trying 31.13.93.3...
* Connected to www.facebook.com (31.13.93.3) port 80 (#1)
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: www.facebook.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
HTTP/1.1 301 Moved Permanently
< [...]
<
* Connection #1 to host www.facebook.com left intact
* Issue another request to this URL: 'https://www.facebook.com/'
* Found bundle for host www.facebook.com: 0x6097814fe0
* Hostname was NOT found in DNS cache
*   Trying 31.13.93.3...
* Connected to www.facebook.com (31.13.93.3) port 443 (#2)
* [SSL stuff ...]
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: www.facebook.com
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< [...]

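Here, -L (follow redirects) and --insecure (accept any certificate) are needed for curl to finally reach the Facebook website, but these are things common clients (i.e. browsers) do anyway.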
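For a script written in Node.js, rough equivalents of those two curl flags might look like the sketch below. The helper names `buildTlsOptions` and `nextLocation` are hypothetical, and passing the domain as `servername` is an extra assumption that the curl run above did not make. Disabling certificate checks is only reasonable for experiments like this one.

```javascript
// Options for https.request that connect to `ip` while naming `hostname`
// in both the TLS handshake (SNI) and the Host header.
function buildTlsOptions(ip, hostname) {
  return {
    host: ip,
    port: 443,
    method: "HEAD",
    path: "/",
    servername: hostname,      // assumption: send the domain as the SNI name
    headers: { Host: hostname },
    rejectUnauthorized: false, // like curl --insecure: skips certificate checks
  };
}

// One step of what curl -L does: return the URL to request next,
// or null if the response is not a redirect.
function nextLocation(statusCode, headers, currentUrl) {
  const redirectCodes = [301, 302, 303, 307, 308];
  if (!redirectCodes.includes(statusCode) || !headers.location) {
    return null;
  }
  // Location may be relative, so resolve it against the current URL.
  return new URL(headers.location, currentUrl).href;
}
```

Node's core `https` module does not follow redirects on its own, so a scraper would call something like `nextLocation` in a loop, with a cap on the number of hops.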

So it really depends on the specific website you want to screen scrape and on how its server is configured.
