HTML and PHP cURL response UTF-8 encoding issue

I am fetching HTML from two websites with cURL.

Website 1: https://xperia.sony.jp/campaign/360RA/?s_tc=somc_co_ext_docomo_360RA_banner

Website 2: https://www.fidelity.jp/fwe-top/?utm_source=outbrain&utm_medium=display&utm_campaign=similar-gdw&utm_content=FS001&dicbo=v1-b6eb7c5f86a6978bba74e3703a046886-00d8ad90c4cb65b2bdcc239bcccf5ec378-mnrtcytfgu4toljwgjrwgljumu4wmljzg5tgkljxgzsdgzbqmyzwenbsgy

My cURL code looks like this:

$ua = "Mozilla/5.0 (X11; Linux i686; rv:36.0) Gecko/20100101 Firefox/36.0 SeaMonkey/2.33.1";
$options = array(
    CURLOPT_RETURNTRANSFER => true, // return the web page as a string
    CURLOPT_FAILONERROR    => true, // fail on HTTP status codes >= 400
    CURLOPT_FOLLOWLOCATION => true, // follow redirects
    CURLOPT_ENCODING       => "",   // accept all supported content encodings (gzip, deflate, ...)
    CURLOPT_USERAGENT      => $ua,  // who am I
    CURLOPT_AUTOREFERER    => true, // set Referer on redirect
    CURLOPT_CONNECTTIMEOUT => 10,   // timeout on connect (seconds)
    CURLOPT_TIMEOUT        => 10,   // timeout on the whole transfer (seconds)
    CURLOPT_MAXREDIRS      => 5,    // stop after 5 redirects
    CURLOPT_FORBID_REUSE   => true, // close the connection after the transfer
);

$ch = curl_init($url);
curl_setopt_array($ch, $options);
$content = curl_exec($ch);
// Use XPath or str_get_html($content) to parse

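The first URL comes through properly encoded and displays the expected characters.

Example: $title_string = $html->find("title", 0)->plaintext returns the <title> tag text with correctly encoded characters.

The second URL shows square boxes and garbage such as ¤ããªãããi��Ɨ�. But when I apply utf8_decode($title_string), this SECOND URL displays the expected, correctly encoded characters.

The problem is that when I apply utf8_decode($title_string), the FIRST URL now shows SQUARE BOXES.

Is there a general way to solve this?

I tried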

$charset = mb_detect_encoding($str);
if ($charset == "UTF-8") {
    return utf8_decode($str);
} else {
    return $str;
}

It seems cURL returns both strings as UTF-8, so the check above treats them identically. One works, the other shows square boxes.
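One plausible reading of these symptoms is double encoding: somewhere after cURL, the UTF-8 bytes of the second page are read as ISO-8859-1 and re-encoded as UTF-8. The sketch below simulates that with a hypothetical sample string (not taken from either site). It uses mb_convert_encoding() as a stand-in for utf8_encode()/utf8_decode(), which are deprecated as of PHP 8.2, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical sample text; any UTF-8 string outside Latin-1 behaves the same way.
$original = "日本語のタイトル";

// Simulate a parser that reads the UTF-8 bytes as ISO-8859-1 and then
// re-encodes every byte as its own UTF-8 character (double encoding).
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Both strings are valid UTF-8, so a byte-level check cannot tell them apart.
var_dump(mb_check_encoding($original, 'UTF-8'));
var_dump(mb_check_encoding($doubleEncoded, 'UTF-8'));

// Removing one layer (what utf8_decode() does) repairs the double-encoded string...
$repaired = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $original);

// ...but damages the string that was correct to begin with.
$damaged = mb_convert_encoding($original, 'ISO-8859-1', 'UTF-8');
var_dump($damaged === $original);
```

If this is what is happening, no check on the final string can decide when to decode, because both variants are valid UTF-8. The fix has to be made where the bytes are first interpreted.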

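I also tried

PHP cURL response encoding

Strange behaviour when encoding cURL response as UTF-8

Replace unicode character

https://www.php.net/manual/en/function.mb-convert-encoding.php

Which charset should I use for a multilingual website?

French and Chinese characters are not displaying correctly

and more.

I have spent many hours trying to solve this. Any ideas are welcome.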

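Both pages are UTF-8 encoded, and cURL returns them as is. The problem is the processing that follows. Assuming libxml2 is involved, it tries to guess the encoding from the <meta> elements, and if there is no such declaration, it assumes ISO-8859-1. You can force it to assume UTF-8 by prepending a UTF-8 BOM ("\xEF\xBB\xBF") to the HTML.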
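A minimal sketch of that behaviour, assuming PHP's DOM extension (which is built on libxml2) does the parsing. The sample markup and the extractTitle() helper are made up for illustration; the markup deliberately has no <meta charset> declaration.

```php
<?php
// Hypothetical UTF-8 page without any <meta> encoding declaration.
$expected = "日本語のタイトル";
$html = "<html><head><title>$expected</title></head><body></body></html>";

// Illustrative helper: parse the markup and return the <title> text.
function extractTitle(string $markup): string
{
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($markup);
    libxml_clear_errors();

    $nodes = $doc->getElementsByTagName('title');
    return $nodes->length > 0 ? $nodes->item(0)->textContent : '';
}

// No BOM: the parser has to guess the encoding of the title bytes.
$withoutBom = extractTitle($html);

// With a UTF-8 BOM in front, the parser is told the input is UTF-8.
$withBom = extractTitle("\xEF\xBB\xBF" . $html);

var_dump($withoutBom === $expected);
var_dump($withBom === $expected);
```

Because the BOM only matters when the page does not declare its own encoding, prepending it to every response should leave a page that already parses correctly as UTF-8 unaffected, which is what makes it a general fix for both URLs.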

This follows what @cmb mentioned in the answer above. For those who want to see my final code in full detail, here you go:

$url = "https://stackoverflow.com/";

$html = file_get_contents($url); // fetch the raw HTML (the cURL code from the question works too)
libxml_use_internal_errors(true); // Yeah if you are so worried about using @ with warnings
$doc = new DOMDocument();
$doc->loadHTML("\xEF\xBB\xBF" . $html); // This is where and how you put the BOM
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, "og:")]'; // all Open Graph <meta> tags
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
    $property = $meta->getAttribute('property');
    $content = $meta->getAttribute('content');
    $rmetas[$property] = $content;
}
var_dump($rmetas);

Hope it helps anyone stuck with the same problem.
