Handling unencodable characters in C#



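Given an input string and an encoding, I want to process each character of the input string as follows:

  • if the code point can be encoded, encode it;

  • if not, output the (encoded) string &#xUUUU;, where UUUU is the hexadecimal value of the Unicode code point.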
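The two rules above can be pinned down with a slow reference sketch that does not use the fallback machinery at all. It assumes the characters of the entity itself (&, #, x, hex digits, ;) are encodable in the target encoding; EntityEncodingSketch and Encode are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class EntityEncodingSketch
{
    // Hypothetical reference implementation: encodes one code point at a time.
    public static byte[] Encode(string input, Encoding encoding)
    {
        // Writable copy that throws on unencodable input instead of substituting '?'.
        Encoding strict = (Encoding)encoding.Clone();
        strict.EncoderFallback = EncoderFallback.ExceptionFallback;

        using (var output = new MemoryStream())
        {
            for (int i = 0; i < input.Length; i++)
            {
                bool isPair = char.IsSurrogatePair(input, i);
                string unit = input.Substring(i, isPair ? 2 : 1);
                byte[] bytes;
                try
                {
                    bytes = strict.GetBytes(unit);   // rule 1: encodable, so encode it
                }
                catch (EncoderFallbackException)
                {
                    // Rule 2: emit &#xUUUU; for the whole code point.
                    int codePoint = isPair ? char.ConvertToUtf32(input, i) : input[i];
                    bytes = strict.GetBytes("&#x" + codePoint.ToString("X4") + ";");
                }
                output.Write(bytes, 0, bytes.Length);
                if (isPair) i++;   // skip the low surrogate already consumed
            }
            return output.ToArray();
        }
    }
}
```

Catching an exception per character is expensive, so this is only useful as a statement of the intended result, not as the mechanism being asked about.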

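I have read the .NET documentation for Encoder and EncoderFallback, and I can see how to get notified when an unencodable character is found, but I can't see any way to output something that actually depends on the particular character in question.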
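For context, the notification mechanism can be sketched roughly as follows, assuming ASCII as the target encoding. FallbackProbe and FirstUnencodable are hypothetical names, not part of .NET. The exception identifies the offending character, but throwing abandons the encode rather than substituting output for that character.

```csharp
using System;
using System.Text;

public static class FallbackProbe
{
    // Hypothetical helper: describes the first character ASCII cannot encode,
    // or returns null when the whole string is encodable.
    public static string FirstUnencodable(string input)
    {
        // An ASCII encoding that throws instead of writing '?'.
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strictAscii.GetBytes(input);
            return null;
        }
        catch (EncoderFallbackException ex)
        {
            // For a surrogate pair, CharUnknownHigh/CharUnknownLow are set
            // instead of CharUnknown.
            int codePoint = ex.IsUnknownSurrogate()
                ? char.ConvertToUtf32(ex.CharUnknownHigh, ex.CharUnknownLow)
                : ex.CharUnknown;
            return string.Format("U+{0:X4} at index {1}", codePoint, ex.Index);
        }
    }
}
```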

Any ideas?

Looking a little deeper (thanks @JosefZ), I see that the description of the EncoderFallback class says it supports three mechanisms, including:

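Best-fit fallback, which maps valid Unicode characters that cannot be encoded to an approximate equivalent. For example, a best-fit fallback handler for the ASCIIEncoding class might map Æ (U+00C6) to AE (U+0041 + U+0045). A best-fit fallback handler might also be implemented to transliterate one alphabet (such as Cyrillic) to another (such as Latin or Roman). The .NET Framework does not provide any public best-fit fallback implementations.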

This seems to be what I'm after. So do I have to work out how to write my own EncoderFallback implementation?

You can do what you want with the EncoderFallback and EncoderFallbackBuffer implementations below:

using System;
using System.Text;

public class HexFallback : EncoderFallback
{
    // Upper bound on the chars one fallback can return: 8 for "&#xUUUU;", 16 for a surrogate pair
    public override int MaxCharCount { get { return 16; } }

    public override EncoderFallbackBuffer CreateFallbackBuffer() { return new HexFallbackBuffer(); }
}
public class HexFallbackBuffer : EncoderFallbackBuffer
{
    const int EntityLength = 8;       // "&#xUUUU;" is 8 chars per UTF-16 code unit

    int _currentPos = EntityLength;   // position within the current entity; EntityLength means nothing is pending
    char _charToEncode;               // first or main char
    char _charToEncode2;              // low surrogate of a pair, if any

    public override bool Fallback(char charUnknown, int index)
    {
        Reset();
        _charToEncode = charUnknown;   // store char
        _currentPos = 0;               // start emitting the entity
        return true;
    }

    public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
    {
        Reset();
        _charToEncode = charUnknownHigh;   // store high and low surrogates;
        _charToEncode2 = charUnknownLow;   // each one is emitted as its own &#xUUUU; entity
        _currentPos = 0;
        return true;
    }

    // 8 chars per invalid char; 0 when no fallback is pending
    public override int Remaining
    {
        get { return EntityLength - _currentPos + (_charToEncode2 != (char)0 ? EntityLength : 0); }
    }

    public override void Reset()
    {
        _charToEncode = (char)0;
        _charToEncode2 = (char)0;
        _currentPos = EntityLength;   // nothing pending, so Remaining is 0
    }

    public override bool MovePrevious()   // can we move backwards in our encoding?
    {
        if (_currentPos == 0)
            return false;
        _currentPos -= 1;
        return true;
    }

    public override char GetNextChar()
    {
        if (_charToEncode2 != (char)0 && _currentPos == EntityLength)   // if we have a surrogate
        {
            _charToEncode = _charToEncode2;   // move low surrogate to main
            _charToEncode2 = (char)0;
            _currentPos = 0;                  // and start again
        }

        char result;
        switch (_currentPos)
        {
            case 0:
                result = '&';
                break;
            case 1:
                result = '#';
                break;
            case 2:
                result = 'x';
                break;
            case 3:
                result = NibbleToHex(((int)_charToEncode) >> 12);          // shift 12 bits
                break;
            case 4:
                result = NibbleToHex(((int)_charToEncode) >> 8 & 0x0F);    // shift 8 and mask the rest
                break;
            case 5:
                result = NibbleToHex(((int)_charToEncode) >> 4 & 0x0F);    // shift 4 and mask the rest
                break;
            case 6:
                result = NibbleToHex(((int)_charToEncode) & 0x0F);         // mask all high bits
                break;
            case 7:
                result = ';';
                break;
            default:
                return (char)0;   // nothing left to return
        }

        _currentPos++;
        return result;
    }

    char NibbleToHex(int nibble)   // convert 4 bits to a hex char
    {
        return (char)(
            nibble < 10
                ? nibble + (int)'0'   // '0' to '9'
                : nibble + (int)'7'   // 'A' to 'F' ('7' is 55, so 10 + 55 = 'A')
        );
    }
}


You can use it like this:

var encoder = Encoding.ASCII.GetEncoder();
encoder.Fallback = new HexFallback();
var str = "Æ";
var buffer = new byte[1000];
var length = encoder.GetBytes(str.ToCharArray(), 0, str.Length, buffer, 0, true);
// write out the encoded string; for this input it prints &#x00C6;
Console.WriteLine(Encoding.ASCII.GetString(buffer, 0, length));
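
Note that HexFallback emits one entity per UTF-16 code unit, so a character outside the Basic Multilingual Plane comes out as two entities (one per surrogate) rather than one entity for the code point. If a single entity per code point is wanted, a variant along these lines might work. This is a sketch, not a tested drop-in replacement: CodePointHexFallback, CodePointHexFallbackBuffer and CodePointHexDemo are hypothetical names, and the target encoding is assumed to be able to encode the entity's own ASCII characters.

```csharp
using System;
using System.Text;

// Hypothetical variant: builds the whole entity as a string, one per code point.
public sealed class CodePointHexFallback : EncoderFallback
{
    // The longest output is "&#x10FFFF;" (10 chars).
    public override int MaxCharCount { get { return 10; } }

    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new CodePointHexFallbackBuffer();
    }
}

public sealed class CodePointHexFallbackBuffer : EncoderFallbackBuffer
{
    private string _pending = string.Empty;   // entity text still to be emitted
    private int _pos;                          // index of the next char to return

    public override bool Fallback(char charUnknown, int index)
    {
        return Start(charUnknown);
    }

    public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
    {
        // Combine the pair into a single code point, e.g. U+1F600.
        return Start(char.ConvertToUtf32(charUnknownHigh, charUnknownLow));
    }

    private bool Start(int codePoint)
    {
        _pending = "&#x" + codePoint.ToString("X4") + ";";
        _pos = 0;
        return true;
    }

    public override int Remaining { get { return _pending.Length - _pos; } }

    public override char GetNextChar()
    {
        // '\0' signals that nothing is left.
        return _pos < _pending.Length ? _pending[_pos++] : '\0';
    }

    public override bool MovePrevious()
    {
        if (_pos == 0) return false;
        _pos--;
        return true;
    }

    public override void Reset()
    {
        _pending = string.Empty;
        _pos = 0;
    }
}

public static class CodePointHexDemo
{
    // Attaches the fallback to an Encoding instead of an Encoder.
    public static string Encode(string text)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii", new CodePointHexFallback(), DecoderFallback.ReplacementFallback);
        return Encoding.ASCII.GetString(ascii.GetBytes(text));
    }
}
```

With this sketch, CodePointHexDemo.Encode("Æ") should give &#x00C6; and a pair such as "\uD83D\uDE00" should give the single entity &#x1F600;.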
