How to download a copy of a website's HTML



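How can I download a copy of the HTML from a website that uses language detection and redirects (for example Google or YouTube)? I have already tried file_get_contents, but it is limited.

I am trying to fetch the HTML of www.google.com with cURL in PHP, but Google detects that I am in the UK and sends me a 302 redirect to www.google.co.uk.

I have tried a lot of different things with no joy. Is this possible? Sites like www.markosweb.com manage to do it.

My code: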

$ch  = curl_init( "http://www.google.com/" );    
// $userAgent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)";
//  $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

$header = array(
         "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
         "Accept-Language: en-US,us;q=0.7,en-us;q=0.5,en;q=0.3",
         "Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7",
         "Keep-Alive: 300");
curl_setopt($ch,CURLOPT_RETURNTRANSFER,TRUE); //TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,5); //The number of seconds to wait while trying to connect. 
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); //The contents of the "User-Agent: " header to be used in a HTTP request.
curl_setopt($ch, CURLOPT_FAILONERROR, TRUE); //To fail silently if the HTTP code returned is greater than or equal to 400.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); //To follow any "Location: " header that the server sends as part of the HTTP header.
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE); //To automatically set the Referer: field in requests where it follows a Location: redirect.
curl_setopt($ch, CURLOPT_TIMEOUT, 10); //The maximum number of seconds to allow cURL functions to execute.  
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
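curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com/"); //The contents of the "Referer: " header to be used in a HTTP request.
curl_setopt($ch, CURLOPT_HTTPHEADER, $header); //An array of HTTP header fields to set.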
$content = curl_exec( $ch );    
$err     = curl_errno( $ch );    
$errmsg  = curl_error( $ch );    
$header  = curl_getinfo( $ch );    
curl_close( $ch );    
$header['errno']   = $err;    
$header['errmsg']  = $errmsg;    
$header['content'] = $content;    
return $header;

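I have tried changing the user agent to many different values, and tried with and without the header details. With the header "Accept-Language: ru-ru,ru;q=0.7,en-us;q=0.5,en;q=0.3" I did manage to get something back, but it was in Russian or something similar.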
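To make experiments with the language header repeatable, a small helper can build the Accept-Language value from an ordered list of language tags. This is only a sketch: build_accept_language is a hypothetical name, and stepping the quality values down by 0.1 is an assumption, not something Google is known to require.

```php
<?php
// Hypothetical helper: builds an Accept-Language value from an ordered
// list of language tags, most preferred first.
function build_accept_language(array $tags): string
{
    $parts = [];
    foreach (array_values($tags) as $i => $tag) {
        // The first tag carries the implicit q=1; each later tag drops
        // by 0.1, never going below 0.1.
        $q = max(0.1, 1.0 - 0.1 * $i);
        $parts[] = $i === 0 ? $tag : sprintf('%s;q=%.1F', $tag, $q);
    }
    return implode(',', $parts);
}
```

The result can then be sent as "Accept-Language: " . build_accept_language(array('en-US', 'en')) inside the array passed to CURLOPT_HTTPHEADER.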

Thanks for your help. Carl

Try this proxy script:

// Change these configuration options if needed, see above descriptions for info.
$enable_jsonp    = false;
$enable_native   = false;
$valid_url_regex = '/.*/';
// ############################################################################
$url = $_GET['url'];
if ( !$url ) {
  // Passed url not specified.
  $contents = 'ERROR: url not specified';
  $status = array( 'http_code' => 'ERROR' );
} else if ( !preg_match( $valid_url_regex, $url ) ) {
  // Passed url doesn't match $valid_url_regex.
  $contents = 'ERROR: invalid url';
  $status = array( 'http_code' => 'ERROR' );
} else {
  $ch = curl_init( $url );
  if ( strtolower($_SERVER['REQUEST_METHOD']) == 'post' ) {
    curl_setopt( $ch, CURLOPT_POST, true );
    curl_setopt( $ch, CURLOPT_POSTFIELDS, $_POST );
  }
  if ( $_GET['send_cookies'] ) {
    $cookie = array();
    foreach ( $_COOKIE as $key => $value ) {
      $cookie[] = $key . '=' . $value;
    }
    if ( $_GET['send_session'] ) {
      $cookie[] = SID;
    }
    $cookie = implode( '; ', $cookie );
    curl_setopt( $ch, CURLOPT_COOKIE, $cookie );
  }
  curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
  curl_setopt( $ch, CURLOPT_HEADER, true );
  curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
  curl_setopt( $ch, CURLOPT_USERAGENT, $_GET['user_agent'] ? $_GET['user_agent'] : $_SERVER['HTTP_USER_AGENT'] );
  list( $header, $contents ) = preg_split( '/([\r\n][\r\n])\1/', curl_exec( $ch ), 2 );
  $status = curl_getinfo( $ch );
  curl_close( $ch );
}
// Split header text into an array.
$header_text = preg_split( '/[\r\n]+/', $header );
if ( $_GET['mode'] == 'native' ) {
  if ( !$enable_native ) {
    $contents = 'ERROR: invalid mode';
    $status = array( 'http_code' => 'ERROR' );
  }
  // Propagate headers to response.
  foreach ( $header_text as $header ) {
    if ( preg_match( '/^(?:Content-Type|Content-Language|Set-Cookie):/i', $header ) ) {
      header( $header );
    }
  }
  print $contents;
} else {
  // $data will be serialized into JSON data.
  $data = array();
  // Propagate all HTTP headers into the JSON data object.
  if ( $_GET['full_headers'] ) {
    $data['headers'] = array();
    foreach ( $header_text as $header ) {
      preg_match( '/^(.+?):\s+(.*)$/', $header, $matches );
      if ( $matches ) {
        $data['headers'][ $matches[1] ] = $matches[2];
      }
    }
  }
  // Propagate all cURL request / response info to the JSON data object.
  if ( $_GET['full_status'] ) {
    $data['status'] = $status;
  } else {
    $data['status'] = array();
    $data['status']['http_code'] = $status['http_code'];
  }
  // Set the JSON data object contents, decoding it from JSON if possible.
  $decoded_json = json_decode( $contents );
  $data['contents'] = $decoded_json ? $decoded_json : $contents;
  // Generate appropriate content-type header.
  $is_xhr = strtolower($_SERVER['HTTP_X_REQUESTED_WITH']) == 'xmlhttprequest';
  header( 'Content-type: application/' . ( $is_xhr ? 'json' : 'x-javascript' ) );
  // Get JSONP callback.
  $jsonp_callback = $enable_jsonp && isset($_GET['callback']) ? $_GET['callback'] : null;
  // Generate JSON/JSONP string
  $json = json_encode( $data );
  print $jsonp_callback ? "$jsonp_callback($json)" : $json;
}
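
One caveat with the script above: with CURLOPT_FOLLOWLOCATION and CURLOPT_HEADER both enabled, cURL returns one header block per response in the redirect chain, so splitting at the first blank line leaves the later header blocks inside the body. A possible alternative, sketched here, is to ask cURL for the total header size via CURLINFO_HEADER_SIZE and cut the response there. The names split_curl_response and fetch_with_headers are hypothetical and not part of the original script.

```php
<?php
// Splits a raw cURL response (fetched with CURLOPT_HEADER enabled) into the
// header lines of the final response and the body.
function split_curl_response(string $raw, int $headerSize): array
{
    $headerText = substr($raw, 0, $headerSize);
    $body = (string) substr($raw, $headerSize);

    // Each response in a redirect chain contributes one header block;
    // blocks are separated by a blank line. Keep only the last one.
    $blocks = preg_split('/\r?\n\r?\n/', trim($headerText));
    $lastBlock = (string) end($blocks);

    return [
        'headers' => preg_split('/\r?\n/', $lastBlock),
        'body'    => $body,
    ];
}

// Fetches $url, following redirects, and returns headers and body separately.
function fetch_with_headers(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    $raw = curl_exec($ch);
    // Total size of all header blocks received, including redirects.
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);

    if ($raw === false) {
        return ['headers' => [], 'body' => ''];
    }
    return split_curl_response($raw, (int) $headerSize);
}
```

In the proxy script, the result of fetch_with_headers( $url ) could stand in for the list( $header, $contents ) line, with 'headers' playing the role of $header_text.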

Make sure to call it like this: http://example.com/script?url=http://whateverurl.com/
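
If the target URL carries its own query string, it should be URL-encoded, otherwise its parameters are read as parameters of the proxy script. A minimal sketch, assuming the proxy lives at the placeholder address used above; build_proxy_url is a hypothetical helper, and the extra options are the $_GET switches the script reads (full_headers, full_status and so on).

```php
<?php
// Hypothetical helper: builds the URL used to call the proxy script,
// URL-encoding the target so its own query string survives intact.
function build_proxy_url(string $proxyScript, string $target, array $options = []): string
{
    // 'url' goes first, followed by any optional switches.
    $params = array_merge(['url' => $target], $options);
    return $proxyScript . '?' . http_build_query($params, '', '&');
}
```

For example, build_proxy_url('http://example.com/script', 'http://whateverurl.com/', array('full_headers' => 1)) yields a URL that also asks the script for the full response headers.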

Oh, and this PHP script returns the result as JSON. From there you can parse it with jQuery.

For example, I use this jQuery code:

<script type="text/javascript">
// Once the DOM is ready, fetch the proxy's JSON and append the returned HTML to #resu.
$(document).ready(function () {
    var url = '+++++URL WHICH THE PHP PROXY SCRIPT IS IN++++++';
    $.getJSON(url, function (json) {
        $("#resu").append(String(json.contents));
    });
});
</script>
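
Since the question is about PHP, the same JSON can also be read on the server instead of in the browser. The sketch below assumes the default (non-native) mode of the script, where the response is an object with a status.http_code field and a contents field; parse_proxy_response is a hypothetical name.

```php
<?php
// Hypothetical helper: extracts the HTTP status code and page contents
// from the JSON emitted by the proxy script in its default mode.
function parse_proxy_response(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !array_key_exists('contents', $data)) {
        throw new InvalidArgumentException('Unexpected proxy response');
    }
    return [
        // 'ERROR' when the script rejected the request, otherwise a number.
        'http_code' => $data['status']['http_code'] ?? null,
        // Usually the HTML as a string; the script decodes JSON bodies,
        // so this may also be an array.
        'contents'  => $data['contents'],
    ];
}
```

The JSON string itself could come from file_get_contents() or cURL pointed at the proxy URL.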

Edit: this script is not a true proxy, because it does not spoof the IP address.
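
Because the request still leaves from your own server, the target site sees that server's location. To see what the site does before following it, redirects can be switched off and the redirect target inspected. The sketch below assumes the cURL extension is available; is_cross_host_redirect and probe_redirect are hypothetical names, and the optional $proxy argument (for example an HTTP proxy you control in another country) is likewise an assumption, not something the script above provides.

```php
<?php
// Returns true when $redirectUrl points at a different host than $requestedUrl.
function is_cross_host_redirect(string $requestedUrl, ?string $redirectUrl): bool
{
    if ($redirectUrl === null || $redirectUrl === '') {
        return false;
    }
    $from = strtolower((string) parse_url($requestedUrl, PHP_URL_HOST));
    $to   = strtolower((string) parse_url($redirectUrl, PHP_URL_HOST));
    return $to !== '' && $from !== $to;
}

// Requests $url without following redirects and reports where the server
// wanted to send us. $proxy is an optional "host:port" string (assumption).
function probe_redirect(string $url, ?string $proxy = null): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => false,
        CURLOPT_TIMEOUT        => 10,
    ]);
    if ($proxy !== null) {
        // Route the request through the given proxy instead of directly.
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $target = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
    $target = (is_string($target) && $target !== '') ? $target : null;

    return [
        'http_code'    => $status,
        'redirect_url' => $target,
        'cross_host'   => is_cross_host_redirect($url, $target),
    ];
}
```

Calling probe_redirect('http://www.google.com/') from a UK server would be expected to report the 302 and the www.google.co.uk target described in the question.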
