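I want to use a custom input format in Hive, and I found code for one here: https://github.com/msukmanowsky/OmnitureDataFileInputFormat. After getting my test code working, though, I found that the FTP log files I want to parse in Hive are encoded as "ANSI" (actually GBK), so the results do not display correctly in the Java console.
Could you help me change the code so the output displays correctly? An example based on OmnitureDataFileInputFormat would be fine. The code is at https://github.com/msukmanowsky/OmnitureDataFileInputFormat.
Thanks a lot!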
The following generic UDF can be used to convert a field encoded in GBK to UTF-8. Apply this UDF to the field before doing any other operation on it.
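import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.Text;

public class GUDFTestGBK extends GenericUDF {
    // Inspector used to read the single string argument.
    private StringObjectInspector oi;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 1) {
            throw new UDFArgumentLengthException(
                "The function GUDFTestGBK(s) takes exactly 1 argument.");
        }
        if (!(arguments[0] instanceof StringObjectInspector)) {
            throw new UDFArgumentException("GUDFTestGBK(s) expects a string argument.");
        }
        oi = (StringObjectInspector) arguments[0];
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        try {
            Object value = arguments[0].get();
            if (value == null) {
                return null; // pass SQL NULL through unchanged
            }
            Text str = oi.getPrimitiveWritableObject(value);
            // Text.getBytes() returns the backing array, which can be longer than
            // the value; only the first getLength() bytes are valid.
            String s = new String(str.getBytes(), 0, str.getLength(), "GBK");
            // Text(String) stores the string as UTF-8.
            return new Text(s);
        } catch (Exception e) {
            return new Text("Charset conversion failed.");
        }
    }

    @Override
    public String getDisplayString(String[] children) {
        return "GBKToUTF8(" + children[0] + ")";
    }
}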
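The core of the conversion can be tried outside Hive. The following is only a minimal sketch, assuming a JDK that ships the GBK charset; the class name GbkDecodeSketch and its two helper methods are made up for illustration and are not part of Hive or of OmnitureDataFileInputFormat. It decodes only the valid prefix of a buffer, which mirrors pairing Text.getBytes() with Text.getLength().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class GbkDecodeSketch {
    // Assumes the running JDK provides the GBK charset.
    private static final Charset GBK = Charset.forName("GBK");

    // Decode only the first `length` bytes of `buffer` as GBK.
    static String decodeGbk(byte[] buffer, int length) {
        return new String(buffer, 0, length, GBK);
    }

    // Re-encode the valid GBK prefix of `buffer` as UTF-8 bytes.
    static byte[] gbkToUtf8(byte[] buffer, int length) {
        return decodeGbk(buffer, length).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        byte[] gbk = original.getBytes(GBK);

        // Simulate a reused buffer that is longer than the valid data.
        byte[] padded = Arrays.copyOf(gbk, gbk.length + 4);

        String decoded = decodeGbk(padded, gbk.length);
        System.out.println("round trip ok: " + decoded.equals(original));
        System.out.println("UTF-8 byte count: " + gbkToUtf8(padded, gbk.length).length);
    }
}
```

Decoding the whole padded buffer instead of its valid prefix would append stray characters to the result, which is why the length is passed explicitly.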