Python C API extension ignores open(errors="ignore") and keeps raising encoding exceptions anyway



Given a file /myfiles/file_with_invalid_encoding.txt that contains invalid UTF-8:

parse this correctly
Føö»BÃ¥r
also parse this correctly

I am calling the built-in Python open function through the C API as follows (the C Python setup boilerplate is omitted):

const char* filepath = "/myfiles/file_with_invalid_encoding.txt";
PyObject* iomodule = PyImport_ImportModule( "builtins" );
if( iomodule == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* openfunction = PyObject_GetAttrString( iomodule, "open" );
if( openfunction == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* openfile = PyObject_CallFunction( openfunction, 
       "s", filepath, "s", "r", "i", -1, "s", "UTF8", "s", "ignore" );
if( openfile == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* iterfunction = PyObject_GetAttrString( openfile, "__iter__" );
Py_DECREF( openfunction );
if( iterfunction == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* openfileresult = PyObject_CallObject( iterfunction, NULL );
Py_DECREF( iterfunction );
if( openfileresult == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* fileiterator = PyObject_GetAttrString( openfile, "__next__" );
Py_DECREF( openfileresult );
if( fileiterator == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* readline;
std::cout << "Here 1!" << std::endl;
while( ( readline = PyObject_CallObject( fileiterator, NULL ) ) != NULL ) {
    std::cout << "Here 2!" << std::endl;
    std::cout << PyUnicode_AsUTF8( readline ) << std::endl;
    Py_DECREF( readline );
}
PyErr_PrintEx(100);
PyErr_Clear();
PyObject* closefunction = PyObject_GetAttrString( openfile, "close" );
if( closefunction == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* closefileresult = PyObject_CallObject( closefunction, NULL );
Py_DECREF( closefunction );
if( closefileresult == NULL ) {
    PyErr_PrintEx(100); return;
}
Py_XDECREF( closefileresult );
Py_XDECREF( iomodule );
Py_XDECREF( openfile );
Py_XDECREF( fileiterator );

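I call open with the ignore argument so that encoding errors are skipped, but Python disregards it and keeps raising an encoding exception whenever it hits an invalid UTF-8 byte: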

Here 1!
Traceback (most recent call last):
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 26: invalid start byte
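For reference, here is a rough sketch of the behavior that errors="ignore" is expected to give: well-formed UTF-8 sequences are kept and bytes that cannot be decoded are silently dropped. This is a simplified standalone model, not CPython's decoder, and the names valid_sequence_length and decode_ignoring_errors are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the length (1 to 4) of the well-formed UTF-8
// sequence that starts at bytes[i], or 0 if no well-formed sequence starts there.
static std::size_t valid_sequence_length(const std::string& bytes, std::size_t i) {
    const unsigned char lead = static_cast<unsigned char>(bytes[i]);
    std::size_t length = 0;
    unsigned int code_point = 0;

    if (lead < 0x80) {
        return 1;                                   // plain ASCII
    } else if ((lead & 0xE0) == 0xC0) {
        length = 2; code_point = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
        length = 3; code_point = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
        length = 4; code_point = lead & 0x07;
    } else {
        return 0;                                   // e.g. 0xbb: not a valid start byte
    }

    if (i + length > bytes.size()) {
        return 0;                                   // sequence is cut off by end of input
    }
    for (std::size_t k = 1; k < length; ++k) {
        const unsigned char next = static_cast<unsigned char>(bytes[i + k]);
        if ((next & 0xC0) != 0x80) {
            return 0;                               // missing continuation byte
        }
        code_point = (code_point << 6) | (next & 0x3F);
    }

    // Reject overlong encodings, UTF-16 surrogates and out-of-range values.
    static const unsigned int minimum[] = {0, 0, 0x80, 0x800, 0x10000};
    if (code_point < minimum[length]) return 0;
    if (code_point >= 0xD800 && code_point <= 0xDFFF) return 0;
    if (code_point > 0x10FFFF) return 0;
    return length;
}

// Hypothetical model of errors="ignore": copy every well-formed sequence,
// skip one byte whenever no well-formed sequence starts at the current position.
std::string decode_ignoring_errors(const std::string& bytes) {
    std::string result;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const std::size_t length = valid_sequence_length(bytes, i);
        if (length == 0) {
            ++i;                                    // drop the undecodable byte
            continue;
        }
        result.append(bytes, i, length);
        i += length;
    }
    return result;
}
```

With this model, a lone 0xbb byte in the middle of a line simply disappears from the result, which is what the open call was supposed to achieve instead of raising UnicodeDecodeError.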

As you can see above, and again below, I pass the ignore argument when calling builtins.open(), but it has no effect. I also tried changing ignore to replace, but C Python raises the encoding exception anyway:

PyObject* openfile = PyObject_CallFunction( openfunction, 
       "s", filepath, "s", "r", "i", -1, "s", "UTF8", "s", "ignore" );

PyObject_CallFunction (like Py_BuildValue and related functions) takes a single format string that describes all of the arguments. When you write

PyObject* openfile = PyObject_CallFunction( openfunction, 
   "s", filepath, "s", "r", "i", -1, "s", "UTF8", "s", "ignore" );

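you are saying "one string argument", so every argument after filepath is ignored and open is called with the path only. Instead, you should write: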

PyObject* openfile = PyObject_CallFunction( openfunction, 
   "ssiss", filepath, "r", -1, "UTF8", "ignore" );

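which says "5 arguments: two strings, an int, and two more strings". Even if you choose to use one of the other PyObject_Call* functions, you will find Py_BuildValue easier to use this way as well.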
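To see why the extra arguments vanish, here is a minimal standalone model of a format-driven variadic function. It is only an illustration under stated assumptions, not how CPython implements PyObject_CallFunction: the names collect_args, broken_call and fixed_call are hypothetical, and only the 's' and 'i' format characters are handled.

```cpp
#include <cstdarg>
#include <string>
#include <vector>

// Hypothetical stand-in for a format-driven varargs API. It reads exactly one
// variadic argument per format character and never looks at anything beyond
// what the format string announces. 's' means const char*, 'i' means int;
// any other character is skipped without consuming an argument.
std::vector<std::string> collect_args(const char* format, ...) {
    std::vector<std::string> collected;
    va_list args;
    va_start(args, format);
    for (const char* f = format; *f != '\0'; ++f) {
        if (*f == 's') {
            collected.emplace_back(va_arg(args, const char*));
        } else if (*f == 'i') {
            collected.push_back(std::to_string(va_arg(args, int)));
        }
    }
    va_end(args);
    return collected;
}

// Mirrors the call from the question: the format is just "s", so only the
// path is read and the remaining nine values are never touched.
std::vector<std::string> broken_call() {
    return collect_args("s", "/myfiles/file_with_invalid_encoding.txt",
                        "s", "r", "i", -1, "s", "UTF8", "s", "ignore");
}

// Mirrors the corrected call: one format string announcing all five arguments.
std::vector<std::string> fixed_call() {
    return collect_args("ssiss", "/myfiles/file_with_invalid_encoding.txt",
                        "r", -1, "UTF8", "ignore");
}
```

In this model broken_call() yields a single element (the path), while fixed_call() yields all five values, which matches the behavior described above: the original call was effectively open(filepath) with the default error handling.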

I managed to fix it by replacing PyObject_CallFunction with PyObject_CallFunctionObjArgs:

PyObject* openfile = PyObject_CallFunction( openfunction, 
       "s", filepath, "s", "r", "i", -1, "s", "UTF8", "s", "ignore" );
// -->
PyObject* filepathpy = Py_BuildValue( "s", filepath );
PyObject* openmodepy = Py_BuildValue( "s", "r" );
PyObject* buffersizepy = Py_BuildValue( "i", -1 );
PyObject* encodingpy = Py_BuildValue( "s", "UTF-8" );
PyObject* ignorepy = Py_BuildValue( "s", "ignore" );
PyObject* openfile = PyObject_CallFunctionObjArgs( openfunction, 
        filepathpy, openmodepy, buffersizepy, encodingpy, ignorepy, NULL );

The long version, with the error checks C Python expects:

PyObject* filepathpy = Py_BuildValue( "s", filepath );
if( filepathpy == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* openmodepy = Py_BuildValue( "s", "r" );
if( openmodepy == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* buffersizepy = Py_BuildValue( "i", -1 );
if( buffersizepy == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* encodingpy = Py_BuildValue( "s", "UTF-8" );
if( encodingpy == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* ignorepy = Py_BuildValue( "s", "ignore" );
if( ignorepy == NULL ) {
    PyErr_PrintEx(100); return;
}
PyObject* openfile = PyObject_CallFunctionObjArgs( openfunction,
        filepathpy, openmodepy, buffersizepy, encodingpy, ignorepy, NULL );
Py_DECREF( filepathpy );
Py_DECREF( openmodepy );
Py_DECREF( buffersizepy );
Py_DECREF( encodingpy );
Py_DECREF( ignorepy );
if( openfile == NULL ) {
    PyErr_PrintEx(100); return;
}
