Problem storing and retrieving very large numbers in Parquet format



I am running into a strange problem when storing large numbers (18 digits) in Parquet and reading them back: I get different values. Digging further, the problem only seems to occur when the input list is a mix of None and actual values. When the list contains no None values, the values are read back as expected.

I don't think this is a display issue. I also tried viewing the files with unix commands such as cat and the vi editor, so it does not look like a display problem.

There are two parts in the code:

  1. Create a Parquet file from a list mixing None and large numbers. This is where the problem occurs. For example, the value 235313013750949476 changes to 235313013750949472, as shown in the output.

  2. Create a Parquet file from a list containing only large numbers and no None values.

[Code]

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def get_row_list():
    row_list = []
    row_list.append(None)
    row_list.append(235313013750949476)
    row_list.append(None)
    row_list.append(135313013750949496)
    row_list.append(935313013750949406)
    row_list.append(835313013750949456)
    row_list.append(None)
    row_list.append(None)
    return row_list


def get_row_list_with_no_none():
    row_list = []
    row_list.append(235313013750949476)
    row_list.append(135313013750949496)
    row_list.append(935313013750949406)
    row_list.append(835313013750949456)
    return row_list


def create_parquet(row_list, col_list, parquet_filename):
    df = pd.DataFrame(row_list, columns=col_list)
    schema_field_list = [('tree_id', pa.int64())]
    pa_schema = pa.schema(schema_field_list)
    table = pa.Table.from_pandas(df, schema=pa_schema)
    pq_writer = pq.ParquetWriter(parquet_filename, schema=pa_schema)
    pq_writer.write_table(table)
    pq_writer.close()
    print("Parquet file [%s] created" % parquet_filename)


def main():
    col_list = ['tree_id']

    # Row list without any None
    row_list = get_row_list_with_no_none()
    print(row_list)
    create_parquet(row_list, col_list, 'without_none.parquet')

    # Row list with None
    row_list = get_row_list()
    print(row_list)
    create_parquet(row_list, col_list, 'with_none.parquet')


# ==== Main code execution =====
if __name__ == '__main__':
    main()

[Execution]

python test-parquet.py
[235313013750949476, 135313013750949496, 935313013750949406, 835313013750949456]
Parquet file [without_none.parquet] created
[None, 235313013750949476, None, 135313013750949496, 935313013750949406, 835313013750949456, None, None]
Parquet file [with_none.parquet] created
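
As an additional check that the wrong values are actually stored in the file (and not merely shown incorrectly by a particular reader), a minimal read-back with pyarrow alone can be used, assuming the file written above is still in the working directory:

import pyarrow.parquet as pq

# Read the column back with pyarrow itself (no Spark involved); if the
# values were altered during writing, they show up altered here as well.
print(pq.read_table('with_none.parquet').column('tree_id'))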

[Library versions]

pyarrow                  5.0.0
pandas                   1.1.5
python -v
Python 3.6.6

[Testing the Parquet files with Spark]

>>> dfwithoutnone = spark.read.parquet("s3://some-bucket/without_none.parquet/")
>>> dfwithoutnone.count()
4
>>> dfwithoutnone.printSchema()
root
|-- tree_id: long (nullable = true)
>>> dfwithoutnone.show(10, False)
+------------------+                                                            
|tree_id           |
+------------------+
|235313013750949476|
|135313013750949496|
|935313013750949406|
|835313013750949456|
+------------------+
>>> df_with_none = spark.read.parquet("s3://some-bucket/with_none.parquet/")
>>> df_with_none.count()
8                                                                               
>>> df_with_none.printSchema()
root
|-- tree_id: long (nullable = true)
>>> df_with_none.show(10, False)
+------------------+
|tree_id           |
+------------------+
|null              |
|235313013750949472|
|null              |
|135313013750949504|
|935313013750949376|
|835313013750949504|
|null              |
|null              |
+------------------+

I did search StackOverflow but could not find anything that fits. Can you offer any suggestions?

Thanks

This problem is not related to Parquet, but to the initial conversion of row_list to a pandas DataFrame:

row_list = get_row_list()
col_list = ['tree_id']
df = pd.DataFrame(row_list, columns=col_list)

>>> df
        tree_id
0           NaN
1  2.353130e+17
2           NaN
3  1.353130e+17
4  9.353130e+17
5  8.353130e+17
6           NaN
7           NaN

Because of the missing values, pandas creates a float64 column, and it is this int -> float conversion that loses precision for such large integers.
Converting the floats back to integers afterwards (when the pyarrow Table is created with a schema that forces an integer column) then produces slightly different values, as you can reproduce manually in python:

>>> row_list[1]
235313013750949476
>>> df.loc[1, "tree_id"]
2.3531301375094947e+17
>>> int(df.loc[1, "tree_id"])
235313013750949472
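
A minimal illustration of why the round trip changes the value: a float64 only has a 53-bit significand, and these 18-digit integers are larger than 2**53, so the nearest representable float is not the original integer:

value = 235313013750949476
print(value > 2**53)      # True: too large to be represented exactly in a float64
print(int(float(value)))  # 235313013750949472, the nearest float64-representable integer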

One possible solution is to avoid the intermediate DataFrame. This will of course depend on your exact (actual) use case, but if you start from a python list as in the reproducible example above, you can also create the pyarrow Table directly from this list of values (pa.table({"tree_id": row_list}, schema=..)), which preserves the exact values in the Parquet file.
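
A minimal sketch of that approach, reusing the row_list and file name from the question and building the Table with the pa.table call mentioned above:

import pyarrow as pa
import pyarrow.parquet as pq

row_list = [None, 235313013750949476, None, 135313013750949496,
            935313013750949406, 835313013750949456, None, None]

# Build the Table directly from the Python list: None becomes a null in the
# int64 column and the integers never pass through float64.
pa_schema = pa.schema([('tree_id', pa.int64())])
table = pa.table({'tree_id': row_list}, schema=pa_schema)
pq.write_table(table, 'with_none.parquet')

# Reading the file back shows the exact original values.
print(pq.read_table('with_none.parquet').column('tree_id'))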
