Convert Arabic words to Unicode format in C#

I am designing an API whose users need Arabic text returned in Unicode format. To do that, I tried the following:

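using System.Text;

public static class StringExtensions
{
    // Returns the text as a sequence of \uXXXX escapes, one per UTF-16 code unit.
    public static string ToUnicodeString(this string str)
    {
        StringBuilder sb = new StringBuilder();
        foreach (var c in str)
        {
            // Verbatim string: a bare "\u" is an invalid escape sequence in C#.
            sb.Append(@"\u" + ((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}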

The problem with the code above is that it returns the Unicode of each letter regardless of the letter's position in the word.

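Example: suppose we have the following word:

"سمير", which consists of:

"س", which is written like "سـ" because it is the first letter of the word.

"م", which is written like "ـمـ" because it is in the middle of the word.

"ي", which is written like "ـيـ" because it is in the middle of the word.

"ر", which is written like "ـر" because it is the last letter of the word.

The code above returns the Unicode of { "س", "م", "ي", "ر" }, which is:

\u0633\u0645\u064A\u0631

instead of that of { "سـ", "ـمـ", "ـيـ", "ـر" }, which is:

\uFEB3\uFEE4\uFEF4\uFEAE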

Any ideas on how to update the code to get the correct Unicode?


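A string is just a sequence of Unicode code points; it does not know the rules of Arabic. You are getting out exactly the data you put in. If you want different data out, then put different data in!

Try this: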

Console.WriteLine("\u0633\u0645\u064A\u0631");
Console.WriteLine("\u0633\u0645\u064A\u0631".ToUnicodeString());
Console.WriteLine("\uFEB3\uFEE4\uFEF4\uFEAE");
Console.WriteLine("\uFEB3\uFEE4\uFEF4\uFEAE".ToUnicodeString());

As expected, the output is

سمير
\u0633\u0645\u064A\u0631
ﺳﻤﻴﺮ
\uFEB3\uFEE4\uFEF4\uFEAE

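Those two sequences of Unicode code points render the same way in the browser, but they are different sequences. If you want to write out the second sequence, then don't put in the first sequence.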
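To make the distinction concrete, here is a small sketch. The class name `CodePointDemo` and the helper `Escape` are made up for illustration. It shows that the two strings compare as unequal, and that compatibility normalization (NFKC) folds the presentation forms back to the base letters. It assumes a runtime with full Unicode normalization support (not, for example, invariant globalization mode).

```csharp
using System;
using System.Linq;
using System.Text;

public static class CodePointDemo
{
    // Hypothetical helper: formats each UTF-16 code unit as a \uXXXX escape.
    public static string Escape(string s) =>
        string.Concat(s.Select(c => @"\u" + ((int)c).ToString("X4")));

    public static void Main()
    {
        string logical = "\u0633\u0645\u064A\u0631";       // base Arabic letters
        string presentation = "\uFEB3\uFEE4\uFEF4\uFEAE";  // positional presentation forms

        // Same appearance on screen, different code points.
        Console.WriteLine(logical == presentation);        // False

        // NFKC maps each presentation form back to its base letter.
        string folded = presentation.Normalize(NormalizationForm.FormKC);
        Console.WriteLine(Escape(folded));                 // \u0633\u0645\u064A\u0631
    }
}
```

Normalization only goes in this direction: there is no built-in `string` method that turns base letters into positional forms, which is why a lookup table is needed for the opposite conversion.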

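Based on Eric's answer I worked out how to solve my problem, and I have created a solution on GitHub.

There you will find a simple tool that runs on Windows. If you want to use the code in your own project, just copy and paste UnicodesTable.cs and Unshaper.cs.

Basically, you need a table of Unicode values for each Arabic letter, and then you can use something like the following extension method.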

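// Requires: using System.Text; using System.Text.RegularExpressions;
// As an extension method, this must be declared inside a static class.
public static string GetUnShapedUnicode(this string original)
{
    // Turn any literal \uXXXX escapes in the input into real characters.
    original = Regex.Unescape(original.Trim());
    var words = original.Split(' ');
    StringBuilder builder = new StringBuilder();
    // Keyed by the letter's \uXXXX escape. Judging by how it is used below,
    // entries [0]..[3] appear to hold the isolated, initial, medial and final
    // forms, and [4] == "2" appears to mark a letter that does not join to
    // the letter after it.
    var unicodesTable = UnicodesTable.GetArabicGliphes();
    // Index-based loop: looking the word up with IndexOf gives the wrong
    // position when the same word occurs more than once.
    for (int w = 0; w < words.Length; w++)
    {
        string word = words[w];
        string previous = null;
        for (int i = 0; i < word.Length; i++)
        {
            string shapedUnicode = @"\u" + ((int)word[i]).ToString("X4");
            if (!unicodesTable.ContainsKey(shapedUnicode))
            {
                // Not in the table: emit unchanged and break the joining run.
                builder.Append(shapedUnicode);
                previous = null;
                continue;
            }

            var forms = unicodesTable[shapedUnicode];
            bool isLast = i == word.Length - 1;
            if (previous == null)
            {
                // First letter of the word, or first letter after an unknown character.
                builder.Append(forms[1]);
            }
            else
            {
                bool previousJoins = unicodesTable[previous][4] != "2";
                if (isLast)
                    builder.Append(previousJoins ? forms[3] : forms[0]);
                else
                    builder.Append(previousJoins ? forms[2] : forms[1]);
            }
            previous = shapedUnicode;
        }
        // Separate words with an escaped space, except after the last word.
        if (w != words.Length - 1)
            builder.Append(@"\u" + ((int)' ').ToString("X4"));
    }
    return builder.ToString();
}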
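Since UnicodesTable.cs is not reproduced here, the following is a self-contained sketch of the same table-driven idea, not the code from the GitHub project. The names `MiniShaper`, `Letters`, `Shape` and `Escape` are hypothetical, and the table covers only the four letters of the example word. It relies on the layout of the Arabic Presentation Forms-B block, where a letter's isolated, final, initial and medial forms occupy consecutive code points in that order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class MiniShaper
{
    // Hypothetical mini table: base letter -> (isolated presentation form,
    // whether the letter joins on both sides). Only four letters are listed.
    private static readonly Dictionary<char, (char Isolated, bool DualJoining)> Letters =
        new Dictionary<char, (char Isolated, bool DualJoining)>
        {
            ['\u0633'] = ('\uFEB1', true),   // seen
            ['\u0645'] = ('\uFEE1', true),   // meem
            ['\u064A'] = ('\uFEF1', true),   // yeh
            ['\u0631'] = ('\uFEAD', false),  // reh: joins only to the preceding letter
        };

    // Replaces each known base letter with its positional presentation form.
    public static string Shape(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            if (!Letters.TryGetValue(text[i], out var cur))
            {
                sb.Append(text[i]);  // unknown character: copy unchanged
                continue;
            }
            bool joinsPrev = i > 0
                && Letters.TryGetValue(text[i - 1], out var prev)
                && prev.DualJoining;
            bool joinsNext = cur.DualJoining
                && i + 1 < text.Length
                && Letters.ContainsKey(text[i + 1]);
            // Offsets from the isolated form: +1 final, +2 initial, +3 medial.
            int offset = joinsPrev ? (joinsNext ? 3 : 1) : (joinsNext ? 2 : 0);
            sb.Append((char)(cur.Isolated + offset));
        }
        return sb.ToString();
    }

    // Formats each UTF-16 code unit as a \uXXXX escape.
    public static string Escape(string s) =>
        string.Concat(s.Select(c => @"\u" + ((int)c).ToString("X4")));

    public static void Main()
    {
        Console.WriteLine(Escape(Shape("\u0633\u0645\u064A\u0631")));
        // \uFEB3\uFEE4\uFEF4\uFEAE
    }
}
```

A real table would need every Arabic letter, and a complete shaper would also have to handle ligatures such as lam-alef and skip over diacritics when deciding whether neighbours join. This sketch ignores both.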
