How to read multiple Parquet files or a directory with Apache Arrow in C++

I am new to the Apache Arrow C++ API. I want to read multiple Parquet files with the Apache Arrow C++ API, similar to what is possible with the Apache Arrow Python API (reading them as a single table), but I have not found any examples of this. I know I can read a single Parquet file using:
#include <arrow/api.h>
#include <arrow/filesystem/localfs.h>
#include <parquet/arrow/reader.h>

arrow::Status st;
arrow::MemoryPool* pool = arrow::default_memory_pool();
arrow::fs::LocalFileSystem file_system;
std::shared_ptr<arrow::io::RandomAccessFile> input =
    file_system.OpenInputFile("/tmp/data.parquet").ValueOrDie();
// Open Parquet file reader
std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
st = parquet::arrow::OpenFile(input, pool, &arrow_reader);
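
For reference, the opened reader can then materialize the whole file as a single table; a minimal sketch continuing the snippet above:

// Read the entire file into one arrow::Table
std::shared_ptr<arrow::Table> single_table;
st = arrow_reader->ReadTable(&single_table);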

Please let me know if you have any questions. Thanks in advance.

This feature is called "datasets".

There is a fairly complete example here: https://github.com/apache/arrow/blob/apache-arrow-5.0.0/cpp/examples/arrow/dataset_parquet_scan_example.cc

The C++ documentation for this feature is here: https://arrow.apache.org/docs/cpp/dataset.html

I am working on recipes for the cookbook, but I can post some snippets here. They come from a work in progress: https://github.com/westonpace/arrow-cookbook/blob/feature/basic-dataset-read/cpp/code/datasets.cc

Essentially, you want to create a filesystem and select some files:

// Create a filesystem
std::shared_ptr<arrow::fs::LocalFileSystem> fs =
    std::make_shared<arrow::fs::LocalFileSystem>();
// Create a file selector which describes which files are part of
// the dataset.  This selector performs a recursive search of a base
// directory which is typical with partitioned datasets.  You can also
// create a dataset from a list of one or more paths.
arrow::fs::FileSelector selector;
selector.base_dir = directory_base;
selector.recursive = true;
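
As the comment above notes, a dataset can also be built from an explicit list of paths instead of a selector. A minimal sketch of that variant, reusing the `format` and `options` objects created in the next snippet (the file paths here are hypothetical):

// Alternative: list the files explicitly instead of using a selector.
// FileSystemDatasetFactory::Make also accepts a vector of paths.
std::vector<std::string> paths = {"/tmp/dataset/data1.parquet",   // hypothetical
                                  "/tmp/dataset/data2.parquet"};  // hypothetical
ASSERT_OK_AND_ASSIGN(
    std::shared_ptr<arrow::dataset::DatasetFactory> factory_from_paths,
    arrow::dataset::FileSystemDatasetFactory::Make(fs, paths, format, options));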

Then you need to create a dataset factory and a dataset:

// Create a file format which describes the format of the files.
// Here we specify we are reading parquet files.  We could pick a different format
// such as Arrow-IPC files or CSV files or we could customize the parquet format with
// additional reading & parsing options.
std::shared_ptr<arrow::dataset::ParquetFileFormat> format =
    std::make_shared<arrow::dataset::ParquetFileFormat>();
// Create a partitioning factory.  A partitioning factory will be used by a dataset
// factory to infer the partitioning schema from the filenames.  All we need to specify
// is the flavor of partitioning which, in our case, is "hive".
//
// Alternatively, we could manually create a partitioning scheme from a schema.  This is
// typically not necessary for hive partitioning as inference works well.
std::shared_ptr<arrow::dataset::PartitioningFactory> partitioning_factory =
    arrow::dataset::HivePartitioning::MakeFactory();
arrow::dataset::FileSystemFactoryOptions options;
options.partitioning = partitioning_factory;
// Create a dataset factory
ASSERT_OK_AND_ASSIGN(
    std::shared_ptr<arrow::dataset::DatasetFactory> dataset_factory,
    arrow::dataset::FileSystemDatasetFactory::Make(fs, selector, format, options));
// Create the dataset, this will scan the dataset directory to find all of the files
// and may scan some file metadata in order to determine the dataset schema.
ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::dataset::Dataset> dataset,
                     dataset_factory->Finish());
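
At this point the dataset's unified schema is available, which can be handy as a sanity check; a small sketch (not part of the original snippets):

// Print the schema inferred from the file metadata and the
// hive partitioning (if any).
std::cout << dataset->schema()->ToString() << std::endl;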

Finally, you will need to "scan" the dataset to get the data:

// Create a scanner
arrow::dataset::ScannerBuilder scanner_builder(dataset);
ASSERT_OK(scanner_builder.UseAsync(true));
ASSERT_OK(scanner_builder.UseThreads(true));
ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::dataset::Scanner> scanner,
                     scanner_builder.Finish());
// Scan the dataset.  There are a variety of other methods available on the scanner as
// well
ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::Table> table, scanner->ToTable());
std::cout << "Read in a table with " << table->num_rows() << " rows and "
          << table->num_columns() << " columns" << std::endl;
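
The scanner builder can also push projection and filtering down into the scan. A sketch, assuming the dataset has columns named "x" and "y" (hypothetical names); these calls go before scanner_builder.Finish():

// Materialize only columns "x" and "y", and only rows where x > 3.
// Filters on partitioning columns let the scanner skip whole files.
ASSERT_OK(scanner_builder.Project({"x", "y"}));
ASSERT_OK(scanner_builder.Filter(arrow::compute::greater(
    arrow::compute::field_ref("x"), arrow::compute::literal(3))));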
