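Erlang equivalent of JavaScript codePointAt?

Is there an Erlang equivalent of JavaScript's codePointAt? Something that returns the code point starting at a given byte offset, without modifying the underlying string/binary?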

You can use bit syntax pattern matching to skip the first N bytes and decode the first character of the remaining bytes as UTF-8:

1> CodePointAt = fun(Binary, Offset) ->
       <<_:Offset/binary, Char/utf8, _/binary>> = Binary,
       Char
   end.

Testing:

2> CodePointAt(<<"πr²"/utf8>>, 0).
960
3> CodePointAt(<<"πr²"/utf8>>, 1).
** exception error: no match of right hand side value <<207,128,114,194,178>>
4> CodePointAt(<<"πr²"/utf8>>, 2).
114
5> CodePointAt(<<"πr²"/utf8>>, 3).
178
6> CodePointAt(<<"πr²"/utf8>>, 4).
** exception error: no match of right hand side value <<207,128,114,194,178>>
7> CodePointAt(<<"πr²"/utf8>>, 5).
** exception error: no match of right hand side value <<207,128,114,194,178>>

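As you can see, the function raises a badmatch error if the offset is not on a valid UTF-8 character boundary or lies past the end of the binary. If needed, you can use a case expression to handle that situation differently.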
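As a minimal sketch of the case-expression variant mentioned above, the fun below returns a tagged tuple instead of crashing. The name SafeCodePointAt and the {error, badarg} return value are illustrative choices, not part of any library:

```erlang
%% Hypothetical variant of CodePointAt: returns a tagged tuple
%% instead of raising badmatch on a bad offset.
SafeCodePointAt = fun(Binary, Offset) ->
    case Binary of
        %% Skip Offset bytes, then decode one UTF-8 code point.
        <<_:Offset/binary, Char/utf8, _/binary>> -> {ok, Char};
        %% Offset is mid-character, past the end, or otherwise invalid.
        _ -> {error, badarg}
    end
end.

SafeCodePointAt(<<"πr²"/utf8>>, 0).  %% {ok,960}
SafeCodePointAt(<<"πr²"/utf8>>, 1).  %% {error,badarg}
```

Callers can then pattern match on {ok, Char} or {error, _} rather than wrapping the call in try/catch.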

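First, remember that in Erlang only binary strings use UTF-8. Ordinary double-quoted strings are already just lists of code points (much like UTF-32). The chardata() type covers both kinds of string, including mixed lists such as ["Hello", $s, [<<"Filip"/utf8>>, $!]]. If needed, you can use unicode:characters_to_list(Chardata) or unicode:characters_to_binary(Chardata) to get a flat version.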
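A short shell sketch of the flattening described above. The variable names are only for illustration, and the results in the comments are the values of the expressions, not exact shell output:

```erlang
%% Mixed chardata: lists, code points, and UTF-8 binaries nested together.
Chardata = ["Hello", $s, [<<"Filip"/utf8>>, $!]].

unicode:characters_to_list(Chardata).    %% "HellosFilip!"
unicode:characters_to_binary(Chardata).  %% <<"HellosFilip!">>

%% Once flattened to a list, each element is one code point,
%% so indexing is by code point rather than by byte.
CodePoints = unicode:characters_to_list(<<"πr²"/utf8>>).  %% [960,114,178]
lists:nth(3, CodePoints).                                 %% 178 (lists:nth is 1-based)
```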

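Meanwhile, JS's codePointAt function works on UTF-16 encoded strings, which is what JavaScript uses. Note that the index in this case is not a byte position but the index of an encoded 16-bit unit. UTF-16 is also a variable-length encoding: code points that need more than 16 bits, such as the emoji 👍, are encoded with a kind of escape sequence called a "surrogate pair". So if such characters can occur, the index is misleading: in "a👍z" (in JavaScript), a is at 0, but z is not at 2, it is at 3.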
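To make the UTF-16 indexing concrete, here is a rough sketch that emulates it in Erlang by converting to big-endian UTF-16 and skipping 16-bit units. Utf16CodePointAt is a hypothetical helper, and it assumes the input is valid chardata:

```erlang
%% Hypothetical helper: look up a code point by UTF-16 code unit index,
%% roughly how JavaScript indexes strings.
Utf16CodePointAt = fun(Chardata, Index) ->
    Utf16 = unicode:characters_to_binary(Chardata, unicode, {utf16, big}),
    %% Skip Index 16-bit units, then decode one UTF-16 code point.
    <<_:Index/binary-unit:16, Char/utf16, _/binary>> = Utf16,
    Char
end.

Str = [$a, 16#1F44D, $z].  %% "a👍z" as a list of code points
Utf16CodePointAt(Str, 0).  %% 97 ($a)
Utf16CodePointAt(Str, 1).  %% 128077 (the thumbs-up emoji, two 16-bit units)
Utf16CodePointAt(Str, 3).  %% 122 ($z), at index 3 rather than 2
```

Unlike JavaScript, which returns the lone low surrogate at index 2, this sketch fails with a badmatch there, because the utf16 segment type only matches well-formed sequences.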

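What you want is probably what are called "grapheme clusters": the things that look like a single unit when printed (see the documentation for Erlang's string module: https://www.erlang.org/doc/man/string.html). You can't really use numeric indexes to dig grapheme clusters out of a string; you need to iterate from the start of the string, taking them out one at a time. This can be done with string:next_grapheme(Chardata) (see https://www.erlang.org/doc/man/string.html#next_grapheme-1), or, if for some reason you really need to index them numerically, you can insert the individual cluster substrings into an array (see https://www.erlang.org/doc/man/array.html). For example: array:from_list(string:to_graphemes(Chardata)).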
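A small sketch of both approaches, assuming an OTP release where the string module is grapheme-aware (OTP 20 or later). The sample text and variable names are illustrative:

```erlang
%% "é" written as $e plus a combining acute accent (U+0301), then $a.
Text = [$e, 16#301, $a].

length(Text).         %% 3 code points
string:length(Text).  %% 2 grapheme clusters

%% Iterate from the start, one cluster at a time.
[First | Rest] = string:next_grapheme(Text).
%% First is [101,769]: the base letter plus its combining mark.
%% Rest is the remaining chardata, ready for the next call.

%% Index the clusters numerically (arrays are 0-based).
Clusters = array:from_list(string:to_graphemes(Text)).
array:get(0, Clusters).  %% [101,769]
array:get(1, Clusters).  %% 97
```

Building the array costs one full pass over the string, so it only pays off when the same text is indexed repeatedly.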
