How do I find the frequency of graphemes in a non-ASCII string?



I need to find the frequency of graphemes in a Unicode string. Consider the input

String[] input = new String[]{"人物","Χαρακτήρες", "पात्र", "எழுத்துக்குறிகள்", "キャラクター"};

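I am using the Character.isUnicodeIdentifierStart(int codePoint) API to check whether a new letter has started. Does this work for all languages? Is it error-prone in some languages? Is there a better way to find where a letter starts and ends in a Unicode string?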

import java.util.*;

class Solution {
    public Map<String, Integer> findFrequency(String text) {

        Map<String, Integer> counts = new HashMap<>();

        int start = 0;
        for (int index = 1; index < text.length(); index++) {
            // If the current index is a valid start of a new Unicode character,
            // increase the frequency of the last seen character.
            if (Character.isUnicodeIdentifierStart(text.codePointAt(index))) {
                String unicodeChar = text.substring(start, index);
                counts.put(unicodeChar, counts.getOrDefault(unicodeChar, 0) + 1);
                start = index;
            }
        }

        String unicodeChar = text.substring(start, text.length());
        counts.put(unicodeChar, counts.getOrDefault(unicodeChar, 0) + 1);

        return counts;
    }
}

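For example, take the fifth visible letter, க், in "எழுத்துக்குறிகள்". It should be counted as one unit rather than as separate code points, because they combine to form the letter க்.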

Use CharSequence.codePoints() to get a stream of Unicode code points, then group them:

Map<String, Long> frequencies =
    text.codePoints()
        // Turn each code point into a one-code-point String to use as the map key.
        .mapToObj(i -> new String(new int[]{i}, 0, 1))
        .collect(Collectors.groupingBy(a -> a, Collectors.counting()));

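Or, more simply: since you want String keys, you can split the string into single-character strings (which correspond to code points for text in the Basic Multilingual Plane) and collect them the same way: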

Map<String, Long> frequencies =
    Arrays.stream(text.split(""))
        .collect(Collectors.groupingBy(a -> a, Collectors.counting()));
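One caveat worth checking for yourself: grouping by code point counts each code point on its own, so a consonant and its combining sign end up under separate keys. The following is a minimal sketch of that behaviour; the class name CodePointFrequency and the helper count are made up for this illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class CodePointFrequency {
    // Counts occurrences of each code point, keyed by its one-code-point String.
    static Map<String, Long> count(String text) {
        return text.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.groupingBy(s -> s, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // U+0B95 (TAMIL LETTER KA) followed by U+0BCD (TAMIL SIGN VIRAMA), written twice.
        String text = "\u0B95\u0BCD\u0B95\u0BCD";
        Map<String, Long> counts = count(text);
        // Two keys (the consonant and the virama), each with a count of 2.
        System.out.println(counts.size());
    }
}
```

So for the Tamil example in the question, this approach reports the two parts of க் separately instead of one combined letter.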

First, a few points:

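  • I suspect that hand-coding your requirement for all possible cases is non-trivial. For example, how does Character.isUnicodeIdentifierStart() handle right-to-left Arabic text, or garbage data (i.e. invalid Unicode)? So use an existing library instead, one that has (hopefully!) already solved these problems. The JDK class java.text.BreakIterator should do what you want, and Oracle's Java Tutorials have useful documentation on its use in the section Detecting Text Boundaries.
  • Also, the Unicode technical report Unicode Text Segmentation (UAX #29) describes how to handle graphemes in great detail. See section 3, Grapheme Cluster Boundaries.
  • Although you did not mention it in your question, it is important to specify a language for the text being processed, using Locale, because some boundary rules are language dependent.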
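To illustrate the first point, here is a small sketch of what the identifier-start test reports for ordinary ASCII text. The class IdentifierStartCheck and the helper startsNewLetter are hypothetical names used only for this example; the helper simply wraps the check from the question.

```java
public class IdentifierStartCheck {
    // Mirrors the question's boundary test: a new "letter" begins only
    // at a code point that may start a Unicode identifier.
    static boolean startsNewLetter(int codePoint) {
        return Character.isUnicodeIdentifierStart(codePoint);
    }

    public static void main(String[] args) {
        String text = "a1 b";
        // Prints true for the letters and false for the digit and the space.
        text.codePoints().forEach(cp ->
                System.out.println("[" + new String(Character.toChars(cp)) + "] -> " + startsNewLetter(cp)));
    }
}
```

Because the digit and the space are not identifier starts, the loop in the question would count "a1 " as a single letter, which suggests the test is not a reliable boundary detector even for simple input.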

The following code uses the BreakIterator class to count the graphemes in the sample data provided in the OP, plus some Arabic text:

package graphemecounter;

import java.text.BreakIterator;
import java.util.Locale;

public class GraphemeCounter {

    public static void main(String[] args) {
        // Declare the texts to be processed.
        String houseInArabic = "\u0628" + "\u064e" + "\u064a" + "\u0652" + "\u067a" + "\u064f";
        String[] input = new String[]{"人物", "Χαρακτήρες", "पात्र", "எழுத்துக்குறிகள்", "キャラクター", "க்", houseInArabic};

        // Associate a locale with each of the texts to be processed.
        Locale[] locales = new Locale[] {
            Locale.CHINESE,
            new Locale.Builder().setLanguage("el").setRegion("GR").build(), // "el" is the ISO 639 code for Greek
            new Locale.Builder().setLanguage("hi").setRegion("IN").build(),
            new Locale.Builder().setLanguage("ta").setRegion("IN").build(),
            Locale.JAPANESE,
            new Locale.Builder().setLanguage("ta").setRegion("IN").build(),
            new Locale.Builder().setLanguage("ar").build()
        };

        for (int i = 0; i < input.length; i++) {
            int count = GraphemeCounter.getGraphemesFromText(locales[i], input[i]);
            System.out.println("Grapheme count for [" + input[i] + "] is " + count);
            System.out.println("=======================================");
        }
    }

    // Walks the character boundaries reported by BreakIterator and counts the segments between them.
    public static int getGraphemesFromText(Locale loc, String text) {
        System.out.println("Sample data: " + text);
        BreakIterator bi = BreakIterator.getCharacterInstance(loc);
        bi.setText(text);
        int graphemeCount = 0;
        int prev;
        int next = bi.first();
        while (next != BreakIterator.DONE) {
            prev = next;
            next = bi.next();
            if (next != BreakIterator.DONE) {
                graphemeCount++;
                String grapheme = text.substring(prev, next);
                System.out.println("Boundary detected: prev=" + prev + ", next=" + next + ", grapheme=[" + grapheme + "]");
            }
        }
        return graphemeCount; // Amend to return a list of graphemes instead, to get a total for each grapheme.
    }
}

Here is the output from running that code:

run:
Sample data: 人物
Boundary detected: prev=0, next=1, grapheme=[人]
Boundary detected: prev=1, next=2, grapheme=[物]
Grapheme count for [人物] is 2
=======================================
Sample data: Χαρακτήρες
Boundary detected: prev=0, next=1, grapheme=[Χ]
Boundary detected: prev=1, next=2, grapheme=[α]
Boundary detected: prev=2, next=3, grapheme=[ρ]
Boundary detected: prev=3, next=4, grapheme=[α]
Boundary detected: prev=4, next=5, grapheme=[κ]
Boundary detected: prev=5, next=6, grapheme=[τ]
Boundary detected: prev=6, next=7, grapheme=[ή]
Boundary detected: prev=7, next=8, grapheme=[ρ]
Boundary detected: prev=8, next=9, grapheme=[ε]
Boundary detected: prev=9, next=10, grapheme=[ς]
Grapheme count for [Χαρακτήρες] is 10
=======================================
Sample data: पात्र
Boundary detected: prev=0, next=2, grapheme=[पा]
Boundary detected: prev=2, next=5, grapheme=[त्र]
Grapheme count for [पात्र] is 2
=======================================
Sample data: எழுத்துக்குறிகள்
Boundary detected: prev=0, next=1, grapheme=[எ]
Boundary detected: prev=1, next=2, grapheme=[ழ]
Boundary detected: prev=2, next=3, grapheme=[ு]
Boundary detected: prev=3, next=5, grapheme=[த்]
Boundary detected: prev=5, next=6, grapheme=[த]
Boundary detected: prev=6, next=7, grapheme=[ு]
Boundary detected: prev=7, next=9, grapheme=[க்]
Boundary detected: prev=9, next=10, grapheme=[க]
Boundary detected: prev=10, next=11, grapheme=[ு]
Boundary detected: prev=11, next=12, grapheme=[ற]
Boundary detected: prev=12, next=13, grapheme=[ி]
Boundary detected: prev=13, next=14, grapheme=[க]
Boundary detected: prev=14, next=16, grapheme=[ள்]
Grapheme count for [எழுத்துக்குறிகள்] is 13
=======================================
Sample data: キャラクター
Boundary detected: prev=0, next=1, grapheme=[キ]
Boundary detected: prev=1, next=2, grapheme=[ャ]
Boundary detected: prev=2, next=3, grapheme=[ラ]
Boundary detected: prev=3, next=4, grapheme=[ク]
Boundary detected: prev=4, next=5, grapheme=[タ]
Boundary detected: prev=5, next=6, grapheme=[ー]
Grapheme count for [キャラクター] is 6
=======================================
Sample data: க்
Boundary detected: prev=0, next=2, grapheme=[க்]
Grapheme count for [க்] is 1
=======================================
Sample data: بَيْٺُ
Boundary detected: prev=0, next=2, grapheme=[بَ]
Boundary detected: prev=2, next=4, grapheme=[يْ]
Boundary detected: prev=4, next=6, grapheme=[ٺُ]
Grapheme count for [بَيْٺُ] is 3
=======================================
BUILD SUCCESSFUL (total time: 0 seconds)
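
The counter above returns only a total per string. To get a frequency for each grapheme, as the question asks, the same boundary loop can feed a map. This is a minimal sketch rather than a definitive implementation; the class name GraphemeFrequency and the method count are names chosen for this example.

```java
import java.text.BreakIterator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class GraphemeFrequency {
    // Splits the text at BreakIterator character boundaries and tallies each segment.
    static Map<String, Integer> count(String text, Locale locale) {
        Map<String, Integer> counts = new LinkedHashMap<>(); // keeps first-seen order
        BreakIterator bi = BreakIterator.getCharacterInstance(locale);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            counts.merge(text.substring(start, end), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Locale tamil = new Locale.Builder().setLanguage("ta").setRegion("IN").build();
        // Prints a map from each detected segment to its number of occurrences.
        System.out.println(count("எழுத்துக்குறிகள்", tamil));
    }
}
```

The segments are whatever the installed JDK's BreakIterator reports, so the exact keys for scripts such as Tamil may differ between JDK versions.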

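Notes:

  • I used the font Arial Unicode MS for the code and output. It was the only one I could find that supports all of these scripts.
  • There are other ways to solve this problem, including third-party libraries and regular expressions, but this approach is probably the simplest.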
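
For comparison, the regular-expression route can be sketched as follows, assuming Java 9 or later, where java.util.regex.Pattern supports \X as an extended grapheme cluster matcher. The class name RegexGraphemeFrequency and the method count are made up for this illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGraphemeFrequency {
    // \X matches one extended grapheme cluster (assumes Java 9 or later).
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    // Tallies every cluster matched in the text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = CLUSTER.matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "e" plus U+0301 (COMBINING ACUTE ACCENT) twice, then a plain "a".
        System.out.println(count("e\u0301e\u0301a"));
    }
}
```

This version takes no Locale, so it cannot apply language-specific boundary rules, and its results for complex scripts are not guaranteed to match BreakIterator.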
