I am running into a strange issue when storing large numbers (18 digits) in a Parquet file and reading them back: I get different values. Digging further, the problem only shows up when the input list is a mix of None and actual values. When the list contains no None values, the values are read back as expected.
I don't think it is a display issue. I tried viewing the values with unix commands such as cat, the vi editor, etc., so it does not look like a display problem.
The code has two parts:
- Creating a parquet file from a list that mixes None values with large numbers. This is where the problem occurs: for example, the value 235313013750949476 changes to 235313013750949472, as the output below shows.
- Creating a parquet file from a list of only large numbers, with no None values.
Code
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def get_row_list():
    row_list = []
    row_list.append(None)
    row_list.append(235313013750949476)
    row_list.append(None)
    row_list.append(135313013750949496)
    row_list.append(935313013750949406)
    row_list.append(835313013750949456)
    row_list.append(None)
    row_list.append(None)
    return row_list

def get_row_list_with_no_none():
    row_list = []
    row_list.append(235313013750949476)
    row_list.append(135313013750949496)
    row_list.append(935313013750949406)
    row_list.append(835313013750949456)
    return row_list

def create_parquet(row_list, col_list, parquet_filename):
    df = pd.DataFrame(row_list, columns=col_list)
    schema_field_list = [('tree_id', pa.int64())]
    pa_schema = pa.schema(schema_field_list)
    table = pa.Table.from_pandas(df, pa_schema)
    pq_writer = pq.ParquetWriter(parquet_filename, schema=pa_schema)
    pq_writer.write_table(table)
    pq_writer.close()
    print("Parquet file [%s] created" % parquet_filename)

def main():
    col_list = ['tree_id']
    # Row list without any None
    row_list = get_row_list_with_no_none()
    print(row_list)
    create_parquet(row_list, col_list, 'without_none.parquet')
    # Row list with None
    row_list = get_row_list()
    print(row_list)
    create_parquet(row_list, col_list, 'with_none.parquet')

# ==== Main code execution =====
if __name__ == '__main__':
    main()
[Execution]
python test-parquet.py
[235313013750949476, 135313013750949496, 935313013750949406, 835313013750949456]
Parquet file [without_none.parquet] created
[None, 235313013750949476, None, 135313013750949496, 935313013750949406, 835313013750949456, None, None]
Parquet file [with_none.parquet] created
[Versions]
pyarrow 5.0.0
pandas 1.1.5
python -V
Python 3.6.6
[Testing the parquet files with Spark]
>>> dfwithoutnone = spark.read.parquet("s3://some-bucket/without_none.parquet/")
>>> dfwithoutnone.count()
4
>>> dfwithoutnone.printSchema()
root
|-- tree_id: long (nullable = true)
>>> dfwithoutnone.show(10, False)
+------------------+
|tree_id |
+------------------+
|235313013750949476|
|135313013750949496|
|935313013750949406|
|835313013750949456|
+------------------+
>>> df_with_none = spark.read.parquet("s3://some-bucket/with_none.parquet/")
>>> df_with_none.count()
8
>>> df_with_none.printSchema()
root
|-- tree_id: long (nullable = true)
>>> df_with_none.printSchema()
root
|-- tree_id: long (nullable = true)
>>> df_with_none.show(10, False)
+------------------+
|tree_id |
+------------------+
|null |
|235313013750949472|
|null |
|135313013750949504|
|935313013750949376|
|835313013750949504|
|null |
|null |
+------------------+
I did search StackOverflow but could not find anything relevant. Can you provide some suggestions? Appreciate it.

This issue is not related to Parquet, but to the initial conversion of your row_list into a pandas DataFrame:
row_list = get_row_list()
col_list = ['tree_id']
df = pd.DataFrame(row_list, columns=col_list)
>>> df
tree_id
0 NaN
1 2.353130e+17
2 NaN
3 1.353130e+17
4 9.353130e+17
5 8.353130e+17
6 NaN
7 NaN
Because of the missing values, pandas creates a float64 column (the default int64 dtype cannot hold NaN). It is this int -> float conversion that loses precision for such large integers: a float64 has only a 53-bit significand, so not every integer above 2**53 can be represented exactly.
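You can see this pure int -> float rounding in isolation, without pandas at all (a standalone illustration using the first value from the question):

n = 235313013750949476
print(n.bit_length())            # 58 bits, wider than float64's 53-bit significand
print(int(float(n)))             # 235313013750949472: rounded to the nearest float64
print(float(n) == float(n + 1))  # True: at this magnitude floats are 32 apart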
Converting the floats back to integers afterwards (which happens when creating the pyarrow Table with a schema that forces an integer column) then gives slightly different values, as you can reproduce manually in Python:
>>> row_list[1]
235313013750949476
>>> df.loc[1, "tree_id"]
2.3531301375094947e+17
>>> int(df.loc[1, "tree_id"])
235313013750949472
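So the corrupted values are already present in the Arrow table built from that DataFrame, before anything is written to Parquet. A quick check, reusing df from above and recreating the int64 schema from create_parquet:

>>> pa_schema = pa.schema([('tree_id', pa.int64())])
>>> table = pa.Table.from_pandas(df, pa_schema)
>>> table.column('tree_id').to_pylist()
[None, 235313013750949472, None, 135313013750949504, 935313013750949376, 835313013750949504, None, None]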
A possible solution is to avoid the temporary DataFrame altogether. This will of course depend on your exact (actual) use case, but if you start from a Python list of values as in the reproducible example above, you can also create the pyarrow Table directly from that list (pa.table({"tree_id": row_list}, schema=..)), which will preserve the exact values in the Parquet file.
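A minimal sketch of that fix, reusing row_list and the int64 schema from the question (the output filename is just an example):

import pyarrow as pa
import pyarrow.parquet as pq

row_list = get_row_list()  # mixes None with 18-digit ints
pa_schema = pa.schema([('tree_id', pa.int64())])
# Build the table straight from the Python list: the ints stay int64 and
# None becomes a null value, so no float round-trip ever happens.
table = pa.table({'tree_id': row_list}, schema=pa_schema)
pq.write_table(table, 'with_none_exact.parquet')
# Reading it back returns the original 18-digit values, with None preserved.
print(pq.read_table('with_none_exact.parquet').column('tree_id').to_pylist())

If your real pipeline does need pandas in between, the nullable Int64 dtype (pd.array(row_list, dtype='Int64')) is also worth a look: it stores missing values without falling back to float64, and pyarrow converts such a column to int64 directly.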