Pandas read_json() with encoding='utf-8-sig' does not work for BytesIO (file-like) objects



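When I try to load a JSON Lines file encoded as UTF-8 with BOM directly into a pandas DataFrame as bytes, I get the error "'ValueError' object has no attribute 'message'" (this generic error occurs when the encoding differs). I am reading the data from Azure Data Lake Gen2 with azure.storage.filedatalake.DataLakeFileClient, which gives me bytes, and I am trying to load those bytes directly into a pandas DataFrame. The failing code snippet is given below.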

import pandas as pd
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient
from io import BytesIO, StringIO

def initialize_storage_account_ad(storage_account_name, client_id, client_secret, tenant_id):
    try:
        global service_client
        credential = ClientSecretCredential(tenant_id, client_id, client_secret)
        service_client = DataLakeServiceClient(
            account_url="{}://{}.dfs.core.windows.net".format("https", storage_account_name),
            credential=credential)
    except Exception as e:
        # Python 3 exceptions have no .message attribute, so print the exception itself
        print(e)

initialize_storage_account_ad(storage_account_name, client_id, client_secret, tenant_id)
data_folder = '/raw/data/'
file_system_client = service_client.get_file_system_client(file_system="dls")
paths = file_system_client.get_paths(path=data_folder)
directory_client = file_system_client.get_directory_client(data_folder)
file_client = directory_client.get_file_client('API_COUNTRY.json')
download = file_client.download_file()
downloaded_bytes = download.readall()  # raw bytes, starting with the UTF-8 BOM
df = pd.read_json(BytesIO(downloaded_bytes), lines=True, encoding='utf-8-sig')
display(df)

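The same code works if I use plain UTF-8 encoding, and it also works if I write the UTF-8 with BOM JSON Lines data to a file and load it with df = pd.read_json('country.json', lines=True, encoding='utf-8-sig'). Any help would be much appreciated.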

Error stack trace

ValueError                                Traceback (most recent call last)
<ipython-input-13-b150d9150c5a> in <module>
31 
32 downloaded_bytes = download.readall()
---> 33 df = pd.read_json(BytesIO(downloaded_bytes),lines = True,encoding = 'utf-8-sig')
34 display(df)
C:\Program Files\Python36\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
197                 else:
198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
200 
201         return cast(F, wrapper)
C:\Program Files\Python36\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
294                 )
295                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296             return func(*args, **kwargs)
297 
298         return wrapper
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows)
616         return json_reader
617 
--> 618     result = json_reader.read()
619     if should_close:
620         filepath_or_buffer.close()
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in read(self)
751                 data = ensure_str(self.data)
752                 data = data.split("\n")
--> 753                 obj = self._get_object_parser(self._combine_lines(data))
754         else:
755             obj = self._get_object_parser(self.data)
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in _get_object_parser(self, json)
775         obj = None
776         if typ == "frame":
--> 777             obj = FrameParser(json, **kwargs).parse()
778 
779         if typ == "series" or obj is None:
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in parse(self)
884 
885         else:
--> 886             self._parse_no_numpy()
887 
888         if self.obj is None:
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in _parse_no_numpy(self)
1117         if orient == "columns":
1118             self.obj = DataFrame(
-> 1119                 loads(json, precise_float=self.precise_float), dtype=None
1120             )
1121         elif orient == "split":
ValueError: Expected object or value

Beginning of the byte values:

('b', ['0xef', '0xbb', '0xbf', '0x7b', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x22', '0x3a', '0x22', '0x41', '0x66', '0x67', '0x68', '0x61', '0x6e', '0x69', '0x73', '0x74', '0x61', '0x6e', '0x22', '0x2c', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x5f', '0x49', '0x53', '0x4f', '0x32', '0x22', '0x3a', '0x22', '0x41', '0x46', '0x22', '0x2c', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x5f', '0x49', '0x53', '0x4f', '0x33', '0x22', '0x3a', '0x22', '0x41', '0x46', '0x47', '0x22', '0x2c', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x5f', '0x49', '0x53', '0x4f', '0x5f', '0x4e', '0x55', '0x4d', '0x45', '0x52', '0x49', '0x43', '0x22', '0x3a', '0x22', '0x30', '0x30', '0x34', '0x22', '0x2c', '0x22', '0x4f', '0x46', '0x46', '0x49', '0x43', '0x49', '0x41', '0x4c', '0x5f', '0x53', '0x48', '0x4f', '0x52', '0x54', '0x5f', '0x49', '0x44', '0x45'])

This looks like a bug in older pandas versions. With bb being a minimal JSON Lines byte string encoded in utf-8-sig, I tried:

pd.read_json(io.BytesIO(bb), lines=True, encoding='utf-8-sig') (1)
pd.read_json(io.StringIO(bb.decode('utf-8-sig')), lines=True)  (2)

Both work fine on Python 3.8 with pandas 1.2.2, but on Python 3.6 with pandas 1.0.3 only (2) works, while (1) raises ValueError: Expected object or value
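For anyone who wants to reproduce this, here is a minimal sketch of what bb could look like. The two records are made up for illustration; only the leading BOM bytes (0xEF 0xBB 0xBF) match the dump above. It uses the standard library json module to show that a leftover BOM character is enough to make a JSON parser reject the first line, which is consistent with the "Expected object or value" error, although I have not traced the exact code path in the old pandas version.

```python
import codecs
import io
import json

import pandas as pd

# Hypothetical two-record JSON Lines payload, prefixed with the UTF-8 BOM
# (bytes 0xEF 0xBB 0xBF, the same three bytes seen in the dump above).
bb = codecs.BOM_UTF8 + b'{"ID": 1, "NAME": "Afghanistan"}\n{"ID": 2, "NAME": "Albania"}\n'

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character...
kept = bb.decode("utf-8")
# ...while 'utf-8-sig' strips it.
stripped = bb.decode("utf-8-sig")


def parses_as_json(line):
    """Return True if the standard library JSON parser accepts the line."""
    try:
        json.loads(line)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


bom_line_ok = parses_as_json(kept.splitlines()[0])        # first line still carries U+FEFF
clean_line_ok = parses_as_json(stripped.splitlines()[0])  # BOM already removed

# Variant (2): hand pandas an already-decoded text buffer.
df = pd.read_json(io.StringIO(stripped), lines=True)
```

Here bom_line_ok comes out False and clean_line_ok comes out True, so the only difference between a parseable and an unparseable first line is the BOM.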

That means the workaround is simple: decode the byte string at the Python level and feed read_json a unicode string

...
downloaded_bytes = download.readall()
df = pd.read_json(StringIO(downloaded_bytes.decode('utf-8-sig')),lines = True)
display(df) 
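If this is needed in several places, the workaround could be wrapped in a small helper. This is only a sketch: read_jsonl_bytes is a name I made up, and the sample payloads are invented. Because the 'utf-8-sig' codec strips a BOM when one is present and otherwise behaves like 'utf-8', the same helper should handle both kinds of files.

```python
import io

import pandas as pd


def read_jsonl_bytes(raw: bytes) -> pd.DataFrame:
    """Hypothetical helper: decode JSON Lines bytes and load them into a DataFrame.

    'utf-8-sig' drops a leading UTF-8 BOM if there is one and is otherwise
    equivalent to 'utf-8', so pandas only ever sees clean text.
    """
    text = raw.decode("utf-8-sig")
    return pd.read_json(io.StringIO(text), lines=True)


# Invented sample payloads: identical records, with and without a BOM.
with_bom = b'\xef\xbb\xbf{"ISO2": "AF"}\n{"ISO2": "AL"}\n'
without_bom = b'{"ISO2": "AF"}\n{"ISO2": "AL"}\n'

df_bom = read_jsonl_bytes(with_bom)
df_plain = read_jsonl_bytes(without_bom)
```

With the Azure code from the question, the call would then be df = read_jsonl_bytes(download.readall()).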
