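When I try to load a JSON Lines file encoded as UTF-8 with BOM directly into a pandas DataFrame as bytes, I get the error 'ValueError' object has no attribute 'message' (a generic error that appears when the encoding is different). I am reading the data from Azure Data Lake Gen2 with azure.storage.filedatalake.DataLakeFileClient, which gives me bytes, and I am trying to load those bytes directly into a pandas DataFrame. The failing code snippet is given below: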
from io import BytesIO, StringIO

import pandas as pd
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient


def initialize_storage_account_ad(storage_account_name, client_id, client_secret, tenant_id):
    try:
        global service_client
        credential = ClientSecretCredential(tenant_id, client_id, client_secret)
        service_client = DataLakeServiceClient(
            account_url="{}://{}.dfs.core.windows.net".format("https", storage_account_name),
            credential=credential)
    except Exception as e:
        # Python 3 exceptions have no .message attribute; print the exception itself.
        print(e)


initialize_storage_account_ad(storage_account_name, client_id, client_secret, tenant_id)
data_folder = '/raw/data/'
file_system_client = service_client.get_file_system_client(file_system="dls")
paths = file_system_client.get_paths(path=data_folder)
directory_client = file_system_client.get_directory_client(data_folder)
file_client = directory_client.get_file_client('API_COUNTRY.json')
download = file_client.download_file()
downloaded_bytes = download.readall()  # raw bytes, starting with the UTF-8 BOM
df = pd.read_json(BytesIO(downloaded_bytes), lines=True, encoding='utf-8-sig')  # this call fails
display(df)
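The same code works if the file is encoded as plain UTF-8. It also works if I write the UTF-8 with BOM JSON Lines data to a file and load it with df = pd.read_json('country.json', lines=True, encoding='utf-8-sig'). Any help is much appreciated.
Error stack trace: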
ValueError Traceback (most recent call last)
<ipython-input-13-b150d9150c5a> in <module>
31
32 downloaded_bytes = download.readall()
---> 33 df = pd.read_json(BytesIO(downloaded_bytes),lines = True,encoding = 'utf-8-sig')
34 display(df)
C:\Program Files\Python36\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
197 else:
198 kwargs[new_arg_name] = new_arg_value
--> 199 return func(*args, **kwargs)
200
201 return cast(F, wrapper)
C:\Program Files\Python36\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
294 )
295 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296 return func(*args, **kwargs)
297
298 return wrapper
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows)
616 return json_reader
617
--> 618 result = json_reader.read()
619 if should_close:
620 filepath_or_buffer.close()
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in read(self)
751 data = ensure_str(self.data)
752 data = data.split("\n")
--> 753 obj = self._get_object_parser(self._combine_lines(data))
754 else:
755 obj = self._get_object_parser(self.data)
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in _get_object_parser(self, json)
775 obj = None
776 if typ == "frame":
--> 777 obj = FrameParser(json, **kwargs).parse()
778
779 if typ == "series" or obj is None:
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in parse(self)
884
885 else:
--> 886 self._parse_no_numpy()
887
888 if self.obj is None:
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in _parse_no_numpy(self)
1117 if orient == "columns":
1118 self.obj = DataFrame(
-> 1119 loads(json, precise_float=self.precise_float), dtype=None
1120 )
1121 elif orient == "split":
ValueError: Expected object or value
The beginning of the byte value:
('b', ['0xef', '0xbb', '0xbf', '0x7b', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x22', '0x3a', '0x22', '0x41', '0x66', '0x67', '0x68', '0x61', '0x6e', '0x69', '0x73', '0x74', '0x61', '0x6e', '0x22', '0x2c', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x5f', '0x49', '0x53', '0x4f', '0x32', '0x22', '0x3a', '0x22', '0x41', '0x46', '0x22', '0x2c', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x5f', '0x49', '0x53', '0x4f', '0x33', '0x22', '0x3a', '0x22', '0x41', '0x46', '0x47', '0x22', '0x2c', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x5f', '0x49', '0x53', '0x4f', '0x5f', '0x4e', '0x55', '0x4d', '0x45', '0x52', '0x49', '0x43', '0x22', '0x3a', '0x22', '0x30', '0x30', '0x34', '0x22', '0x2c', '0x22', '0x4f', '0x46', '0x46', '0x49', '0x43', '0x49', '0x41', '0x4c', '0x5f', '0x53', '0x48', '0x4f', '0x52', '0x54', '0x5f', '0x49', '0x44', '0x45'])
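The first three bytes of that dump (0xef, 0xbb, 0xbf) are the UTF-8 byte order mark, and the bytes after them spell the start of the first JSON record. As a small sketch using only the standard library, the snippet below checks for the BOM and shows how the 'utf-8' and 'utf-8-sig' codecs treat it. The variable name head is illustrative and holds only the first few bytes copied from the dump.

```python
import codecs

# First bytes copied from the dump above: EF BB BF, then '{"IDENTIFIER"'.
head = bytes([0xEF, 0xBB, 0xBF, 0x7B, 0x22, 0x49, 0x44, 0x45, 0x4E, 0x54,
              0x49, 0x46, 0x49, 0x45, 0x52, 0x22])

# codecs.BOM_UTF8 is the byte sequence b'\xef\xbb\xbf'.
has_bom = head.startswith(codecs.BOM_UTF8)

# 'utf-8' keeps the BOM as the character U+FEFF; 'utf-8-sig' strips it.
as_utf8 = head.decode("utf-8")
as_utf8_sig = head.decode("utf-8-sig")

print(has_bom)
print(repr(as_utf8))
print(repr(as_utf8_sig))
```

If the BOM survives decoding, the text handed to the JSON parser starts with U+FEFF instead of `{`, which is consistent with an "Expected object or value" failure, although the exact code path inside pandas is not confirmed here.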
It looks like a bug in older pandas versions. Using bb, a minimal JSON Lines byte string encoded in utf-8-sig, I tried:
pd.read_json(io.BytesIO(bb), lines=True, encoding='utf-8-sig') (1)
pd.read_json(io.StringIO(bb.decode('utf-8-sig')), lines=True) (2)
Both work fine on Python 3.8 with pandas 1.2.2. On Python 3.6 with pandas 1.0.3, (2) works but (1) raises ValueError: Expected object or value.
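For anyone who wants to check their own environment, the comparison can be reproduced roughly as follows. This is only a sketch: the contents of bb are made up for illustration (the original test string is not shown), and attempt (1) is wrapped in try/except because it is expected to fail only on older pandas versions.

```python
import io

import pandas as pd

# Hypothetical two-record JSON Lines payload, encoded with a UTF-8 BOM.
text = '{"ISO2": "AF", "NAME": "Afghanistan"}\n{"ISO2": "AL", "NAME": "Albania"}'
bb = text.encode("utf-8-sig")

# (1) Pass the raw bytes and let pandas handle the decoding.
try:
    df1 = pd.read_json(io.BytesIO(bb), lines=True, encoding="utf-8-sig")
except ValueError as exc:
    df1 = None
    print("(1) failed:", exc)

# (2) Decode in Python first, then pass text to pandas.
df2 = pd.read_json(io.StringIO(bb.decode("utf-8-sig")), lines=True)

print("pandas", pd.__version__)
print("(1) succeeded:", df1 is not None)
print("(2) shape:", df2.shape)
```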
This means the workaround is simple: decode the byte string at the Python level and feed read_json a unicode string:
...
downloaded_bytes = download.readall()
df = pd.read_json(StringIO(downloaded_bytes.decode('utf-8-sig')),lines = True)
display(df)
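If the same pattern is needed in several places, the decode-then-parse step could be wrapped in a small helper. The function name read_jsonl_bytes and the sample payloads are hypothetical, not part of pandas or the Azure SDK. Because the 'utf-8-sig' codec only strips a BOM when one is present, the same helper should also accept plain UTF-8 input.

```python
from io import StringIO

import pandas as pd


def read_jsonl_bytes(raw: bytes) -> pd.DataFrame:
    """Decode JSON Lines bytes (with or without a UTF-8 BOM) into a DataFrame."""
    # Decode in Python so pandas never sees the BOM.
    text = raw.decode("utf-8-sig")
    return pd.read_json(StringIO(text), lines=True)


# Illustrative payloads: identical records, one with a BOM and one without.
with_bom = b'\xef\xbb\xbf{"ISO2": "AF"}\n{"ISO2": "AL"}'
without_bom = b'{"ISO2": "AF"}\n{"ISO2": "AL"}'

df_a = read_jsonl_bytes(with_bom)
df_b = read_jsonl_bytes(without_bom)
print(df_a.equals(df_b))
```

In the code from the question, this would be called as read_jsonl_bytes(downloaded_bytes).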