Converting Unicode references to UTF-8 characters with mbstring in PHP



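I have a set of data in a database that was entered with Unicode characters, but the characters are stored as literal escape text. That is, where there should be an apostrophe, I actually get \u2019.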

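So I now need to convert that into its character representation, i.e. ’. First, it is easy enough to change the string to its entity version, &#x2019;, and then I need to convert that into a proper UTF-8 multi-byte string.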

I have tried a number of ways to do this. On my local server I can use preg_match to extract the characters and then pass each one to the following function:

mb_convert_encoding($string, "UTF-8", "HTML-ENTITIES");

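This seems reasonable and works without a problem. Turning off the UTF-8 character set in the browser shows that, when read in the browser's default encoding, the entity really has been converted to the three bytes that display as â€™.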
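The two steps described above (find the escapes, then convert each one) can also be sketched in a single pass. This is only a sketch under some assumptions: the stored text contains literal \uXXXX escapes with the backslash intact, the mbstring extension is available, and the function name decode_unicode_escapes is made up for illustration. Instead of going through HTML entities, it treats each run of escapes as UTF-16BE code units, so surrogate pairs are converted together.

```php
<?php
// Hypothetical helper: replace literal \uXXXX escapes with UTF-8 characters.
function decode_unicode_escapes(string $text): string
{
    // Match runs of consecutive \uXXXX escapes so surrogate pairs stay together.
    return preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $m): string {
            // Drop the "\u" prefixes; what remains is the hex of UTF-16BE code units.
            $hex = str_replace('\\u', '', $m[0]);
            return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
        },
        $text
    );
}

// Prints the decoded text followed by its bytes in hex.
$decoded = decode_unicode_escapes('It\\u2019s');
echo $decoded, "\n", bin2hex($decoded), "\n";
```

Because the conversion names both encodings explicitly, it does not depend on mbstring.internal_encoding, which may make it easier to compare behaviour between two servers.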

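However, when I run exactly the same code in my production environment, it produces the dreaded "missing symbol" box when rendered as UTF-8. With UTF-8 turned off, the byte stream it produces renders as ò°‘£. It appears to output 4 bytes instead of 3; I don't know whether that is relevant, because I don't know much about character encodings.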
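The byte count is worth checking directly, because U+2019 lies in the Basic Multilingual Plane and its UTF-8 encoding is exactly three bytes (E2 80 99). Four bytes would mean the output is not the plain UTF-8 encoding of that character. A minimal sketch for inspecting what a server really produces, using html_entity_decode rather than mbstring so the two can be compared; the helper name entity_utf8_hex is an assumption for illustration:

```php
<?php
// Hypothetical helper: decode one HTML entity and return its UTF-8 bytes as hex.
function entity_utf8_hex(string $entity): string
{
    $char = html_entity_decode($entity, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return bin2hex($char);
}

$hex = entity_utf8_hex('&#x2019;');
// Two hex digits per byte, so half the string length is the byte count.
echo $hex, ' (', strlen($hex) / 2, " bytes)\n";
```

Running the same check on both machines, once with this helper and once with the mb_convert_encoding call from the question, should show whether the extra byte comes from the conversion itself or from a later output step.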

I think the problem is my mbstring setup. Here are the mbstring settings on my local server:

Multibyte Support   enabled
Multibyte string engine libmbfl
HTTP input encoding translation disabled
Multibyte (japanese) regex support  enabled
Multibyte regex (oniguruma) version 4.7.1
mbstring.detect_order   no value    no value
mbstring.encoding_translation   Off Off
mbstring.func_overload  0   0
mbstring.http_input auto    auto
mbstring.http_output    UTF-8   UTF-8
mbstring.http_output_conv_mimetypes ^(text/|application/xhtml\+xml)    ^(text/|application/xhtml\+xml)
mbstring.internal_encoding  UTF-8   UTF-8
mbstring.language   neutral neutral
mbstring.strict_detection   Off Off
mbstring.substitute_character   no value    no value

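My production environment differs in a few places (an older oniguruma version, and no "HTTP input encoding translation" or mbstring.http_output_conv_mimetypes entries):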

Multibyte Support   enabled
Multibyte string engine libmbfl
Multibyte (japanese) regex support  enabled
Multibyte regex (oniguruma) version 3.7.1
mbstring.detect_order   no value    no value
mbstring.encoding_translation   Off Off
mbstring.func_overload  0   0
mbstring.http_input auto    auto
mbstring.http_output    UTF-8   UTF-8
mbstring.internal_encoding  UTF-8   UTF-8
mbstring.language   neutral neutral
mbstring.strict_detection   Off Off
mbstring.substitute_character   no value    no value

Does anyone see what I am doing wrong?
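Comparing two phpinfo() listings by eye is error-prone, so it may help to dump the settings from code on both servers and diff the output. A minimal sketch, assuming nothing beyond core PHP; the helper name mbstring_settings is hypothetical:

```php
<?php
// Hypothetical helper: current mbstring ini values, or an empty array
// when the extension is not loaded.
function mbstring_settings(): array
{
    if (!extension_loaded('mbstring')) {
        return [];
    }
    // Second argument false: return plain values instead of detail arrays.
    $settings = ini_get_all('mbstring', false);
    return is_array($settings) ? $settings : [];
}

// The PHP version is printed too, since mbstring behaviour can change between releases.
echo 'PHP ', PHP_VERSION, "\n";
foreach (mbstring_settings() as $name => $value) {
    echo str_pad($name, 40), var_export($value, true), "\n";
}
```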

I guess what you are looking for is multibyte versions of ord and chr.

I wrote the following polyfill:

if (!function_exists('mb_internal_encoding')) {
    // iconv-based fallback: return the internal encoding, or set it when one is given.
    function mb_internal_encoding($encoding = NULL) {
        return ($encoding === NULL)
            ? iconv_get_encoding('internal_encoding')
            : iconv_set_encoding('internal_encoding', $encoding);
    }
}
if (!function_exists('mb_convert_encoding')) {
    // iconv-based fallback; note that iconv takes the source encoding first.
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}
if (!function_exists('mb_chr')) {
    // Code point -> character: build the 4-byte UCS-4BE form, then convert it.
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}
if (!function_exists('mb_ord')) {
    // Character -> code point: convert to UCS-4BE, then unpack the big-endian integer.
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}

Demo:
echo "\nGet string from numeric DEC value\n";
var_dump(mb_chr(25105));
var_dump(mb_chr(22909));
echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0x6211));
var_dump(mb_chr(0x597D));
echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('我'));
var_dump(mb_ord('好'));
echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('我')));
var_dump(dechex(mb_ord('好')));
Output:

Get string from numeric DEC value
string(3) "我"
string(3) "好"
Get string from numeric HEX value
string(3) "我"
string(3) "好"
Get numeric value of character as DEC int
int(25105)
int(22909)
Get numeric value of character as HEX string
string(4) "6211"
string(4) "597d"
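Applied to the apostrophe from the question, the same round trip looks roughly like this. The sketch assumes PHP 7.2 or later with mbstring loaded, where mb_chr() and mb_ord() are built in, so the polyfill's function_exists guards would simply skip their own definitions:

```php
<?php
// Assumes PHP 7.2+ with the mbstring extension (built-in mb_chr / mb_ord).
// hexdec() turns the hex digits of the escape into an integer code point.
$apostrophe = mb_chr(hexdec('2019'), 'UTF-8');

// Show the UTF-8 bytes, then convert back to the code point in hex.
echo bin2hex($apostrophe), "\n";
echo dechex(mb_ord($apostrophe, 'UTF-8')), "\n";
```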

See if this helps you:

Added on 2012-09-19:

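// Return the bytes of $ascii as upper-case hex pairs, each followed by a space.
function ascii2hex($ascii)
{
    $hex = '';
    for ($i = 0; $i < strlen($ascii); $i++)
    {
        // Square brackets: the $ascii{$i} offset syntax was removed in PHP 8.
        $byte = strtoupper(dechex(ord($ascii[$i])));
        // Left-pad single-digit values with a zero.
        $byte = str_repeat('0', 2 - strlen($byte)).$byte;
        $hex .= $byte." ";
    }
    return $hex;
}
// Inverse of ascii2hex: rebuild the byte string from (optionally spaced) hex pairs.
function hex2ascii($hex)
{
    $ascii = '';
    $hex = str_replace(" ", "", $hex);
    for ($i = 0; $i < strlen($hex); $i = $i + 2)
        $ascii .= chr(hexdec(substr($hex, $i, 2)));
    return $ascii;
}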
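PHP's built-in bin2hex() and hex2bin() cover much the same ground as the two helpers above. A rough equivalent, offered as a sketch: the names hex_dump and hex_undump are made up for illustration, and unlike ascii2hex this version leaves no trailing space.

```php
<?php
// Hypothetical helper: upper-case, space-separated hex dump of a byte string.
function hex_dump(string $bytes): string
{
    if ($bytes === '') {
        return '';
    }
    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

// Hypothetical helper: turn a (optionally spaced) hex dump back into bytes.
// Returns an empty string when the input is not valid hex.
function hex_undump(string $hex): string
{
    $bytes = hex2bin(str_replace(' ', '', $hex));
    return $bytes === false ? '' : $bytes;
}

// Dump the three UTF-8 bytes of the right single quotation mark (U+2019).
echo hex_dump("\xE2\x80\x99"), "\n";
```

Dumping the converted string this way on both servers shows exactly which bytes each one emits, independent of how the browser decodes them.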
