With these versions of vaex and pyarrow:
>>> vaex.__version__
{'vaex': '4.12.0',
'vaex-core': '4.12.0',
'vaex-viz': '0.5.3',
'vaex-hdf5': '0.12.3',
'vaex-server': '0.8.1',
'vaex-astro': '0.9.1',
'vaex-jupyter': '0.8.0',
'vaex-ml': '0.18.0'}
>>> pyarrow.__version__
8.0.0
When I read a TSV file and export it to Arrow, pa.parquet.read_table() cannot load the resulting table. For example, given a file s2t.tsv:
$ printf "test-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\n" > s
$ printf "1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n" > t
$ paste s t > s2t.tsv
The file looks like this:
test-1 1-best
foobar poo bear
test-1 1-best
foobar poo bear
test-1 1-best
foobar poo bear
test-1 1-best
foobar poo bear
When I try exporting the TSV to Arrow and then reading it back:
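import vaex
import pyarrow as pa
import pyarrow.parquet  # ensures the pa.parquet submodule is loaded

# The file is tab-separated, so the separator must be '\t'
df = vaex.from_csv('s2t.tsv', sep='\t', header=None)
df.export_arrow('s2t.parquet')
pa.parquet.read_table('s2t.parquet')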
It throws the following error:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
/tmp/ipykernel_17/3649263967.py in <module>
1 import pyarrow as pa
2
----> 3 pa.parquet.read_table('s2t.parquet')
/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties)
2746 ignore_prefixes=ignore_prefixes,
2747 pre_buffer=pre_buffer,
-> 2748 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
2749 )
2750 except ImportError:
/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, **kwargs)
2338
2339 self._dataset = ds.FileSystemDataset(
-> 2340 [fragment], schema=schema or fragment.physical_schema,
2341 format=parquet_format,
2342 filesystem=fragment.filesystem
/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()
/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Could not open Parquet input source 's2t.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
Are there additional args/kwargs that need to be passed when exporting or reading the Parquet file?
Or is the export to Arrow somehow bugged or broken?
According to https://github.com/vaexio/vaex/issues/2228:
df.export_parquet("file.parquet")
# or
df.export("file.parquet")
will export to the correct format, which can then be read with:
pa.parquet.read_table("file.parquet")
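The error message complains about missing Parquet magic bytes, which suggests that the file written by export_arrow is not a Parquet file, whatever its extension says. A minimal sketch for checking this with the standard library only is shown here. The helper name `sniff_format` and its return labels are hypothetical, not part of vaex or pyarrow. It relies on Parquet files starting and ending with `PAR1`, and Arrow IPC files (the random-access "file" format) starting and ending with `ARROW1`.

```python
import os

PARQUET_MAGIC = b"PAR1"        # Parquet files start and end with these 4 bytes
ARROW_FILE_MAGIC = b"ARROW1"   # Arrow IPC *file* format starts and ends with these 6 bytes


def sniff_format(path):
    """Guess a file's on-disk format from its magic bytes (hypothetical helper)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(6)
        # Jump to the last 6 bytes (or to the start, for very small files)
        f.seek(max(size - 6, 0))
        tail = f.read(6)
    if head.startswith(PARQUET_MAGIC) and tail.endswith(PARQUET_MAGIC):
        return "parquet"
    if head.startswith(ARROW_FILE_MAGIC) and tail.endswith(ARROW_FILE_MAGIC):
        return "arrow-ipc-file"
    # Anything else, including the Arrow IPC *stream* format, which has no magic bytes
    return "unknown"
```

Calling `sniff_format('s2t.parquet')` on the file from the question should return something other than `"parquet"` if the diagnosis above is right, while a file written with `df.export_parquet(...)` should be reported as `"parquet"`.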