Get the index of the first non-standard English character

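I am trying to process a string and split it into two parts at the point where I find a character that is not in the standard English alphabet. For example, given This is a stríng with áccents., I need to know the index of the first (or of every) accented character, such as í.

I think the solution lies somewhere between System.Text.Encoding and System.Globalization, but I am missing something...

The important part is knowing whether a character is accented, and excluding spaces if possible.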

void Main()
{
    var str = "This is a stríng with áccents.";
    var strBeforeFirstAccent = str.Substring(0, getIndexOfFirstCharWithAccent(str));
    Console.WriteLine(strBeforeFirstAccent);
}
int getIndexOfFirstCharWithAccent(string str){
    //Process logic
    return 13;
}

Thanks!

The regular expression [^a-zA-Z ] matches any character other than an unaccented Latin letter or a space.

var regex = new Regex("[^a-zA-Z ]");
var match = regex.Match("This is a stríng with áccents.");

This matches í, and match.Index contains its position (13).
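
To tie the regex back to the method in the question, here is a minimal sketch. The class name AccentFinder and the helper GetAllIndexes are made up for illustration, and returning -1 when nothing matches is my own assumption about what the caller wants. Keep in mind that this pattern also flags digits and punctuation, not only accented letters.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class AccentFinder // hypothetical name, not from the question
{
    // Matches anything other than unaccented A-Z, a-z, or a space.
    // This includes digits and punctuation such as '.'.
    private static readonly Regex NonPlain = new Regex("[^a-zA-Z ]");

    // Index of the first match, or -1 if every character is plain (assumed convention).
    public static int GetIndexOfFirstCharWithAccent(string str)
    {
        Match match = NonPlain.Match(str);
        return match.Success ? match.Index : -1;
    }

    // Indexes of every match, in order of appearance.
    public static int[] GetAllIndexes(string str)
    {
        return NonPlain.Matches(str).Cast<Match>().Select(m => m.Index).ToArray();
    }

    static void Main()
    {
        var str = "This is a stríng with áccents.";
        int index = GetIndexOfFirstCharWithAccent(str);
        // Guard against -1 before calling Substring.
        Console.WriteLine(index < 0 ? str : str.Substring(0, index));
        Console.WriteLine(string.Join(", ", GetAllIndexes(str)));
    }
}
```

For the sample string the first line printed is "This is a str". The list of indexes also contains the position of the trailing '.', so add the period to the character class if you do not want it reported.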

Another possible solution (fixed up and adapted from Cortright's answer) is to enumerate the byte pairs of the string's UTF-16 encoding.

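const string input = "This is a stríng with áccents 𤭢.";
// Encoding.Unicode is UTF-16LE: two bytes per code unit, low byte first.
byte[] array = Encoding.Unicode.GetBytes(input);
for (int i = 0; i < array.Length; i += 2)
{
    // Reassemble the 16-bit code unit from its two bytes.
    int codeUnit = array[i] | (array[i + 1] << 8);
    if (codeUnit > 127) // ASCII is 0-127, so 128 and above is outside it
    {
        Console.WriteLine(codeUnit + " at index " + (i / 2) + " is not within the ASCII range");
    }
}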

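This prints every value that falls outside the allowed ASCII range. (I am using the original definition of ASCII, 0-127.)

Personally, I recommend David Arno's solution; I am only offering this as a potential alternative. (It may turn out to be faster if you benchmark it. Equally, it may turn out to be easier to manage.)

Update: I just tested it, and it still correctly identifies characters in the higher range (U+10000 - U+10FFFF) as not allowed. This is because both halves of a surrogate pair are also outside the ASCII range. The only problem is that it reports such a character as two separate values rather than one. Output: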

237 at index 13 is not within the ASCII range
225 at index 22 is not within the ASCII range
55378 at index 30 is not within the ASCII range
57186 at index 31 is not within the ASCII range
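
If you want a character above U+FFFF reported once instead of twice, one option is to walk the string directly and combine surrogate pairs as you go. This is only a sketch: the class name NonAsciiScanner and the method FindNonAscii are hypothetical, and I chose to report a lone (unpaired) surrogate as its own value rather than throw.

```csharp
using System;
using System.Collections.Generic;

static class NonAsciiScanner // hypothetical name
{
    // Returns one line per code point above 127, with its UTF-16 index.
    public static List<string> FindNonAscii(string input)
    {
        var lines = new List<string>();
        for (int i = 0; i < input.Length; i++)
        {
            bool isPair = char.IsSurrogatePair(input, i);
            // Combine a valid surrogate pair into one code point;
            // otherwise use the single UTF-16 code unit as-is.
            int codePoint = isPair ? char.ConvertToUtf32(input, i) : input[i];
            if (codePoint > 127)
            {
                lines.Add("U+" + codePoint.ToString("X4") + " at index " + i + " is not within the ASCII range");
            }
            if (isPair)
            {
                i++; // skip the low surrogate, it was already consumed
            }
        }
        return lines;
    }

    static void Main()
    {
        foreach (string line in FindNonAscii("This is a stríng with áccents 𤭢."))
        {
            Console.WriteLine(line);
        }
    }
}
```

With the same input this reports three entries instead of four: U+00ED at index 13, U+00E1 at index 22, and U+24B62 at index 30. The indexes are still UTF-16 positions, so they can be passed straight to Substring.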
