How to convert ANSI to UTF-8 in Hive



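I want to use a custom input format in Hive, and I found code for one here: https://github.com/msukmanowsky/OmnitureDataFileInputFormat. However, when I finished testing the code, I found that the FTP log files I want to parse in Hive are encoded in "ANSI" (actually GBK), so the results do not display correctly in the Java console.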

Could you show me how to change the code so that the output displays correctly? An example based on OmnitureDataFileInputFormat would be helpful. The code is at https://github.com/msukmanowsky/OmnitureDataFileInputFormat.

Thanks a lot!

The following generic UDF can be used to convert a field encoded in GBK to UTF-8. Apply it to the field before performing any other operation on it.

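import java.nio.charset.Charset;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.Text;

public class GUDFTestGBK extends GenericUDF {
    private static final Charset GBK = Charset.forName("GBK");
    private StringObjectInspector oi;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 1) {
            throw new UDFArgumentLengthException(
                "The function GUDFTestGBK(s) takes exactly 1 argument.");
        }
        // Reject non-string arguments up front instead of failing on the cast.
        if (!(arguments[0] instanceof StringObjectInspector)) {
            throw new UDFArgumentTypeException(0,
                "The function GUDFTestGBK(s) expects a string argument.");
        }
        oi = (StringObjectInspector) arguments[0];
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object arg = arguments[0].get();
        if (arg == null) {
            return null; // NULL in, NULL out
        }
        Text str = oi.getPrimitiveWritableObject(arg);
        // Text.getBytes() returns the backing array, which may be longer than
        // the valid data, so decode only the first getLength() bytes.
        String s = new String(str.getBytes(), 0, str.getLength(), GBK);
        // Text stores its contents as UTF-8, so this re-encodes the string.
        return new Text(s);
    }

    @Override
    public String getDisplayString(String[] children) {
        return "GBKToUTF8( " + children[0] + " )";
    }
}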
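
The conversion at the heart of the UDF does not depend on Hive, so it can be tried with the plain JDK first. The following is a minimal sketch, not part of the original answer: the class name `GbkToUtf8Demo` and the helper `gbkToUtf8` are made up for illustration, and it assumes the JDK in use ships the GBK charset (standard full JDK builds do). It also shows why only the valid prefix of a reused buffer should be decoded, which matters for Hadoop's `Text`, whose `getBytes()` returns a backing array that is only valid up to `getLength()`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GbkToUtf8Demo {
    // Assumption: the running JDK provides the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    /** Decodes the first `length` bytes of `buffer` as GBK and re-encodes them as UTF-8. */
    static byte[] gbkToUtf8(byte[] buffer, int length) {
        String decoded = new String(buffer, 0, length, GBK);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the source
        // file's own encoding does not matter.
        byte[] gbk = "\u4e2d\u6587".getBytes(GBK);

        // Simulate a reused buffer: valid data followed by stale bytes.
        byte[] buffer = Arrays.copyOf(gbk, gbk.length + 4);
        Arrays.fill(buffer, gbk.length, buffer.length, (byte) 'x');

        // Passing the valid length keeps the stale bytes out of the result.
        byte[] utf8 = gbkToUtf8(buffer, gbk.length);
        System.out.println(gbk.length + " GBK bytes -> " + utf8.length + " UTF-8 bytes");
    }
}
```

The same bounded decode could be applied inside a custom input format's record reader before the value is handed to Hive, although that placement is a suggestion and has not been tested against OmnitureDataFileInputFormat.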
