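API or method to replace all non-Latin-1 characters

I am working with a third-party API/web service that only allows the Latin-1 character set in its XML. Is there an existing API/method that finds and replaces all non-Latin-1 characters in a string?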

For example: Kévin

Is there any way to turn that into Kevin?

Using ICU4J:

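import com.ibm.icu.text.Normalizer; // ICU4J

public String removeAccents(String text) {
    // Canonical decomposition, then strip the combining marks.
    return Normalizer.decompose(text, false, 0)
                 .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}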

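I found this example at http://glaforge.appspot.com/article/how-to-remove-accents-from-a-string

In Java 1.6 and later, the necessary normalizer is built in (java.text.Normalizer).

I have come across many posts on how to remove all accents. This (old!) post covers my use case, so I will share my solution here. In my case, I only want to replace the characters that do not exist in the ISO-8859-1 character set. The use case: read a UTF-8 file and write it to an ISO-8859-1 file while preserving as many special characters as possible (but preventing an UnmappableCharacterException).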

@Test
void proofOfConcept() {
    final String input = "Bełchatöw";
    final String expected = "Belchatöw";
    final String actual = MyMapper.utf8ToLatin1(input);
    Assertions.assertEquals(expected, actual);
}
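The UnmappableCharacterException mentioned above can be reproduced without any file I/O. The following is a minimal sketch, assuming a strict encoder (CodingErrorAction.REPORT); the class and method names are hypothetical and not part of the original answer.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnmappableCharacterException;

// Hypothetical demo class, for illustration only.
public class StrictLatin1Demo {

    // Returns true if strict ISO-8859-1 encoding rejects the input
    // because it contains a character the charset cannot represent.
    public static boolean failsStrictEncoding(String input) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(input));
            return false;
        } catch (UnmappableCharacterException e) {
            return true;
        } catch (CharacterCodingException e) {
            // Malformed UTF-16 input (e.g. a lone surrogate) ends up here.
            throw new IllegalArgumentException("Malformed input", e);
        }
    }

    public static void main(String[] args) {
        // "Be\u0142chat\u00F6w" is the test input; \u0142 is the l with stroke.
        System.out.println(failsStrictEncoding("Be\u0142chat\u00F6w"));
        System.out.println(failsStrictEncoding("Belchat\u00F6w"));
    }
}
```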

The Normalizer looked interesting, but I only found ways to remove all accents.

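import java.text.Normalizer;

public static String utf8ToLatin1(final String input) {
    // NFD splits accented letters into base letter + combining mark; the marks are then removed.
    return Normalizer.normalize(input, Normalizer.Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}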

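Oddly, the code above not only fails, it fails in the opposite direction: NFD decomposes ö (which Latin-1 can represent) and its diaeresis is stripped, while ł has no canonical decomposition and passes through unchanged: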

expected: <Belchatöw> but was: <Bełchatow>
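To see why the result comes out this way, it helps to look at the code points NFD actually produces. The sketch below uses a hypothetical helper, nfdCodePoints, that is not part of the original answer; it only relies on java.text.Normalizer.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

// Hypothetical demo class, for illustration only.
public class NfdDemo {

    // Normalizes the input to NFD and lists the resulting code points as U+XXXX.
    public static String nfdCodePoints(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return nfd.codePoints()
            .mapToObj(cp -> String.format("U+%04X", cp))
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // \u00F6 (o with diaeresis) splits into 'o' plus a combining mark,
        // so the regex removes the mark and leaves a plain 'o'.
        System.out.println(nfdCodePoints("\u00F6")); // U+006F U+0308
        // \u0142 (l with stroke) has no canonical decomposition,
        // so there is no combining mark for the regex to remove.
        System.out.println(nfdCodePoints("\u0142")); // U+0142
    }
}
```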

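CharsetEncoder looked interesting, but it seems I can only set one static "replacement" character (really: a byte array), so every unmappable character becomes "?" or similar.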

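import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public static String utf8ToLatin1(final String input) throws CharacterCodingException {
    final ByteBuffer byteBuffer = StandardCharsets.ISO_8859_1.newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE)
        .replaceWith(new byte[] { (byte) '?' })
        .encode(CharBuffer.wrap(input));
    // Decode only up to the buffer's limit; array() may expose unused capacity.
    return StandardCharsets.ISO_8859_1.decode(byteBuffer).toString();
}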

This fails with:

expected: <Belchatöw> but was: <Be?chatöw>
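Although the encoder's replacement is static, CharsetEncoder.canEncode can still be used to find out which characters would need a replacement in the first place. A minimal sketch, assuming a hypothetical helper named unmappable that is not part of the original answer:

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
public class UnmappableFinder {

    // Collects, in order, every code point of the input that ISO-8859-1 cannot encode.
    public static String unmappable(String input) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder result = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (!encoder.canEncode(ch)) {
                result.append(ch);
            }
        });
        return result.toString();
    }

    public static void main(String[] args) {
        // Only \u0142 (l with stroke) is reported; \u00F6 is part of Latin-1.
        System.out.println(unmappable("Be\u0142chat\u00F6w"));
    }
}
```

This only identifies the problem characters; choosing a per-character replacement still needs a lookup of some kind.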

So my final solution is:

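import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String utf8ToLatin1(final String input) {
    // Explicit replacements for characters that NFD cannot decompose.
    final Map<String, String> characterMap = new HashMap<>();
    characterMap.put("ł", "l");
    characterMap.put("Ł", "L");
    characterMap.put("œ", "ö");
    final StringBuffer resultBuffer = new StringBuffer();
    // Match every character outside the Basic Latin and Latin-1 Supplement blocks.
    final Matcher matcher = Pattern.compile("[^\\p{InBasic_Latin}\\p{InLatin-1Supplement}]").matcher(input);
    while (matcher.find()) {
        // Use the explicit mapping if there is one, otherwise strip the accents.
        final String replacement = characterMap.computeIfAbsent(matcher.group(),
            s -> Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        matcher.appendReplacement(resultBuffer, Matcher.quoteReplacement(replacement));
    }
    matcher.appendTail(resultBuffer);
    return resultBuffer.toString();
}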

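A few points:

characterMap needs to be extended to fit your needs. The Normalizer is useful for accented characters, but you may have others. Also, extract characterMap into a field (beware that computeIfAbsent updates the map, so watch out for concurrency!).
Pattern.compile() should not be called on every invocation; extract the pattern into a static final field.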
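The two points above could be applied roughly as follows. This is a sketch rather than the author's code: the class name Latin1Mapper and the constant names are hypothetical, and it sidesteps the concurrency concern by reading from an unmodifiable map with get instead of calling computeIfAbsent.

```java
import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
public class Latin1Mapper {

    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern NON_LATIN1 =
        Pattern.compile("[^\\p{InBasic_Latin}\\p{InLatin-1Supplement}]");
    private static final Pattern DIACRITICS =
        Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Read-only lookup table; extend it to fit your data.
    private static final Map<String, String> CHARACTER_MAP = createMap();

    private static Map<String, String> createMap() {
        Map<String, String> map = new HashMap<>();
        map.put("\u0142", "l");      // l with stroke
        map.put("\u0141", "L");      // L with stroke
        map.put("\u0153", "\u00F6"); // oe ligature, mapped as in the answer above
        return Collections.unmodifiableMap(map);
    }

    public static String utf8ToLatin1(String input) {
        StringBuffer result = new StringBuffer();
        Matcher matcher = NON_LATIN1.matcher(input);
        while (matcher.find()) {
            String ch = matcher.group();
            String replacement = CHARACTER_MAP.get(ch);
            if (replacement == null) {
                // No explicit mapping: fall back to stripping combining marks.
                replacement = DIACRITICS
                    .matcher(Normalizer.normalize(ch, Normalizer.Form.NFD))
                    .replaceAll("");
            }
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

Characters with neither a map entry nor a canonical decomposition pass through unchanged in this sketch, so the map still has to cover them.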
