Assembly AT&T x86 - How to compare a specific byte in a long?

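I'm writing a function in assembly that takes an unsigned long. This long is a UTF-8 character.

I want to check whether it is a 1, 2, 3 or 4 byte UTF-8 character. So far I have this (I've modified the code so that it isn't affected by endianness, I think...):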

movl    12(%ebp),%eax   # Move long u to %eax
movl    %eax,buff       # Move long u to buff
andl    $128,buff       # &-mask 1 MSB (from LSByte)
cmpl    $0,buff         # Compare buff to 0
je      wu8_1byte       # If 0, 1 byte UTF8
movl    12(%ebp),%eax   # Move long u to %eax
movl    %eax,buff       # Move long u to buff
andl    $0xE000,buff    # &-mask 3 MSB (from byte LSByte 2)
cmpl    $0xC000,buff    # Compare the 3 MSB to binary 110
je      wu8_2byte       # If =, 2 byte UTF8
movl    12(%ebp),%eax   # Move long u to %eax
movl    %eax,buff       # Move long u to buff
andl    $0xF00000,buff  # &-mask 4 MSB (from byte MSByte 3)
cmpl    $0xE00000,buff  # Compare the 4 MSB to binary 1110
je      wu8_3byte       # If =, 3 byte UTF8
jmp     wu8_4byte       # If no, 4 byte UTF8

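12(%ebp) is the long I want to use. buff is a 4-byte variable.

It works for 1 byte, but not for the others.

Any suggestions on how to figure out which kind of UTF-8 character it is?

UTF-8 encoding: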

                           0xxxxxxx    # 1 byte
                  110xxxxx 10xxxxxx    # 2 byte
         1110xxxx 10xxxxxx 10xxxxxx    # 3 byte
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    # 4 byte
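
The table above can be read as a test on the high bits of the first (lead) byte. Here is a minimal C sketch of that reading, assuming the lead byte has already been isolated. The function name utf8_len_from_lead is made up for illustration and does not appear in the original post.

```c
#include <stdint.h>

/* Hypothetical helper: returns the sequence length (1 to 4) implied by a
   UTF-8 lead byte, or 0 if the byte cannot start a sequence
   (continuation byte 10xxxxxx, or 11111xxx). */
int utf8_len_from_lead(uint8_t lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                            /* not a lead byte */
}
```

Each test masks off exactly the prefix bits shown in the table and compares them with the expected pattern, so payload bits never affect the result.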

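It shouldn't work for any of them, for one simple reason.

You take a 32-bit value and shift it right. Then you compare it with a constant, forgetting that the value still holds more bits than the ones you are comparing.

You have to AND the value so that only the bits you want remain: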

movl 12(%ebp),%eax
movl %eax,buff
shrl $13,buff   # A 2-byte UTF-8 character looks like 110xxxxx 10xxxxxx,
                # so shift right by 13 to bring the lead byte's top 3 bits down
andl $7,buff    # Keep only the three lowest bits
cmpl $6,buff    # 110 binary = 6
je wu8_2byte    # If buff == 6, 2-byte UTF-8

I would also do this in a register rather than in a memory location, which is faster. You can also get by with a single AND and no shift at all.
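
To illustrate why the shift is optional, here is a small C sketch of both variants. It assumes, as the snippet above does, that a 2-byte sequence is packed as (lead << 8) | continuation, so the lead byte's top three bits sit in bits 15..13. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: a 2-byte sequence is packed as (lead << 8) | continuation. */

/* Shift the three prefix bits down, mask them, compare with binary 110. */
int is_2byte_shift(uint32_t u)
{
    return ((u >> 13) & 0x7u) == 0x6u;
}

/* Same check without a shift: mask in place and compare with 110 << 13. */
int is_2byte_mask(uint32_t u)
{
    return (u & 0xE000u) == 0xC000u;
}
```

Both functions give the same answer for every input. The mask-only form maps to a single andl followed by a cmpl on a register.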

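Depending on how much error checking you want to do, you can simply test the bits with the test instruction. I'm assuming that the unsigned long was loaded from a UTF-8 encoded byte sequence, least significant byte first, which should give the same result as aliasing a char* to an unsigned long* on a little-endian machine.

If these assumptions are wrong, you will need to change the code accordingly, and it may get more complicated, because you may not know which byte is the lead byte.

For example: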

movl 12(%ebp),%eax
testl $128,%eax
jz wu8_1byte
testl $32,%eax     # We know that the top bit is set, it's not valid for it to be
                   # 10xxxxxx so we test this bit: 11?xxxxx
jz wu8_2byte
testl $16,%eax     # 111?xxxx
jz wu8_3byte
# Must be 4 byte
jmp wu8_4byte
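
For comparison, the same bit tests can be sketched in C. This assumes the layout described above (first byte of the sequence in the least significant position) and does no validation beyond the three bit tests. Both function names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack up to 4 bytes with the first byte in the
   lowest position, independent of the host's endianness. */
uint32_t pack_lsb_first(const unsigned char *s, size_t n)
{
    uint32_t u = 0;
    for (size_t i = 0; i < n && i < 4; i++)
        u |= (uint32_t)s[i] << (8 * i);
    return u;
}

/* Mirrors the testl sequence: inspect bits 7, 5 and 4 of the lead byte,
   which sits in the low byte of u. No error checking is attempted. */
int utf8_len_lsb_first(uint32_t u)
{
    if ((u & 0x80u) == 0) return 1; /* 0xxxxxxx */
    if ((u & 0x20u) == 0) return 2; /* 110xxxxx */
    if ((u & 0x10u) == 0) return 3; /* 1110xxxx */
    return 4;                       /* 1111xxxx, taken to be 4 bytes */
}
```

Like the assembly, this treats a stray continuation byte in the lead position as a 2-byte character, so it is only suitable for input that is already known to be valid.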

This snippet makes the same assumptions as your original code.

movl 12(%ebp),%eax
testl $0x80,%eax
jz wu8_1byte
                     # We can assume that the last byte is of the form 10xxxxxx
testl $0x4000,%eax   # Testing this bit in byte n - 1: 1?xxxxxx
jnz wu8_2byte
testl $0x400000,%eax # Testing this bit in byte n - 2: 1?xxxxxx
jnz wu8_3byte
# Must be 4 byte
jmp wu8_4byte
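
A C sketch of the same idea, under the assumption that the last byte of the sequence sits in the lowest position (for example, a 2-byte character packed as (lead << 8) | continuation). Bit 6 of a byte tells a lead byte (11xxxxxx) apart from a continuation byte (10xxxxxx), so the sketch tests that single bit in each higher byte (masks 0x4000 and 0x400000). The function name is hypothetical.

```c
#include <stdint.h>

/* Assumption: the sequence is packed with its LAST byte in bits 7..0,
   so the lead byte ends up in the highest occupied byte. */
int utf8_len_last_byte_low(uint32_t u)
{
    if ((u & 0x80u) == 0)     return 1; /* single byte 0xxxxxxx */
    if ((u & 0x4000u) != 0)   return 2; /* byte n-1 is 11xxxxxx: a lead byte */
    if ((u & 0x400000u) != 0) return 3; /* byte n-2 is 11xxxxxx: a lead byte */
    return 4;                           /* otherwise assume 4 bytes */
}
```

Testing only bit 6 matters here: the neighbouring bits of a continuation byte are payload and may be set, so a wider mask would misclassify some 3-byte and 4-byte characters.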

I solved this by reading up on UTF-8 and finding a simpler solution:

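cmpl    $0x7F,12(%ebp)     # Compare unsigned long to 1 byte UTF-8 max value
jbe     wu8_1byte
cmpl    $0x7FF,12(%ebp)    # Compare unsigned long to 2 byte UTF-8 max value
jbe     wu8_2byte
cmpl    $0xFFFF,12(%ebp)   # Compare unsigned long to 3 byte UTF-8 max value
jbe     wu8_3byte
cmpl    $0x10FFFF,12(%ebp) # Compare unsigned long to 4 byte UTF-8 max value
jbe     wu8_4byte
                           # Falls through for values above 0x10FFFF (not valid)

UTF-8 is encoded in such a way that the maximum value of a 1-byte character is 0x7F, of a 2-byte character 0x7FF, of a 3-byte character 0xFFFF, and of a 4-byte character 0x10FFFF. So by comparing the unsigned long (treated as a code point value, not as already-encoded bytes) against these limits, I can determine the number of bytes the character needs.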
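
The range comparison translates directly to C. This is a minimal sketch that assumes the input is a Unicode code point, not already-encoded bytes. It uses 0x10FFFF, the highest code point UTF-8 may encode, as the 4-byte limit, and does not reject surrogates (0xD800 to 0xDFFF). The function name is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode code point cp
   in UTF-8, or 0 if cp is above the Unicode range. */
int utf8_encoded_len(uint32_t cp)
{
    if (cp <= 0x7Fu)     return 1;
    if (cp <= 0x7FFu)    return 2;
    if (cp <= 0xFFFFu)   return 3;
    if (cp <= 0x10FFFFu) return 4;
    return 0; /* not a valid code point */
}
```

For instance, U+00E9 needs 2 bytes, U+20AC needs 3, and U+1F600 needs 4.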
