Setting up a standalone Hive Metastore service for Presto and AWS S3

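I'm working in an environment where I use S3 as a data lake, but without AWS Athena. I'm trying to set up Presto to query the data in S3, and I know I need to define the data structure as Hive tables through the Hive Metastore service. I'm deploying each component in Docker, so I'd like to keep the container size as small as possible. Which Hive components do I need in order to run the Metastore service? I don't actually care about running Hive itself, just the Metastore. Can I trim down what's needed, or is there already a pre-configured package for just that? I haven't been able to find anything online that doesn't involve downloading all of Hadoop and Hive. Is what I'm trying to do possible?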

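There is a workaround that lets you run Presto without Hive. I haven't tried it with a distributed file system like S3, but the code suggests it should work (at least with HDFS). I think it's worth trying, because you don't need any new Docker image at all.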

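The idea is to use the built-in FileHiveMetastore. It is neither documented nor recommended for production use, but you can play with it. Schema information is stored next to the data in the file system. Obviously, this has its pros and cons. I don't know the details of your use case, so I can't say whether it fits your needs.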

Configuration:

connector.name=hive-hadoop2
hive.metastore=file
hive.metastore.catalog.dir=file:///tmp/hive_catalog
hive.metastore.user=cox
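
The configuration above keeps the catalog in a local directory. Since the question is about S3, here is a minimal sketch of how the same catalog file might be generated with an S3 location instead. This is an untested assumption, in line with the caveat above: the bucket name, the `CATALOG_DIR` and `S3_BUCKET` variables, and the `CHANGE_ME` placeholders are all hypothetical, and whether the file metastore behaves well on S3 would need to be verified.

```shell
#!/bin/sh
set -e

# All names below are illustrative assumptions, not values from the answer.
CATALOG_DIR="${CATALOG_DIR:-./etc/catalog}"   # directory Presto reads catalog files from
S3_BUCKET="${S3_BUCKET:-example-bucket}"      # hypothetical bucket name

mkdir -p "$CATALOG_DIR"

# Unquoted heredoc so the shell variables are expanded into the file.
cat > "$CATALOG_DIR/hive.properties" <<EOF
connector.name=hive-hadoop2
hive.metastore=file
# Assumption: an S3 location instead of the local file:/// directory (untested)
hive.metastore.catalog.dir=s3://${S3_BUCKET}/hive_catalog
hive.metastore.user=cox
# S3 credentials for the Hive connector, taken from the environment
hive.s3.aws-access-key=${AWS_ACCESS_KEY_ID:-CHANGE_ME}
hive.s3.aws-secret-key=${AWS_SECRET_ACCESS_KEY:-CHANGE_ME}
EOF

echo "wrote $CATALOG_DIR/hive.properties"
```

Generating the file at container start keeps credentials out of the Docker image; the local `file:///tmp/hive_catalog` variant remains the only one actually demonstrated here.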

Demo:

presto:tiny> create schema hive.default;
CREATE SCHEMA
presto:tiny> use hive.default;
USE
presto:default> create table t (t bigint);
CREATE TABLE
presto:default> show tables;
 Table
-------
 t
(1 row)
Query 20180223_202609_00009_iuchi, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:00 [1 rows, 18B] [11 rows/s, 201B/s]
presto:default> insert into t (values 1);
INSERT: 1 row
Query 20180223_202616_00010_iuchi, FINISHED, 1 node
Splits: 51 total, 51 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
presto:default> select * from t;
 t
---
 1
(1 row)

After the above, I was able to find the following on my machine:

/tmp/hive_catalog/
/tmp/hive_catalog/default
/tmp/hive_catalog/default/t
/tmp/hive_catalog/default/t/.prestoPermissions
/tmp/hive_catalog/default/t/.prestoPermissions/user_cox
/tmp/hive_catalog/default/t/.prestoPermissions/.user_cox.crc
/tmp/hive_catalog/default/t/.20180223_202616_00010_iuchi_79dee041-58a3-45ce-b86c-9f14e6260278.crc
/tmp/hive_catalog/default/t/.prestoSchema
/tmp/hive_catalog/default/t/20180223_202616_00010_iuchi_79dee041-58a3-45ce-b86c-9f14e6260278
/tmp/hive_catalog/default/t/..prestoSchema.crc
/tmp/hive_catalog/default/.prestoSchema
/tmp/hive_catalog/default/..prestoSchema.crc

A standalone Hive Metastore is now available in the Apache Hive distribution, under hive-standalone-metastore-3.0.0/.

Starting with Hive 3.0, the Metastore is released as a separate package and can be run without the rest of Hive. This is referred to as standalone mode.

By default the Metastore is configured for use with Hive, so a few configuration parameters have to be changed for standalone mode:

metastore.task.threads.always -> org.apache.hadoop.hive.metastore.events.EventCleanerTask,org.apache.hadoop.hive.metastore.MaterializationsCacheCleanerTask
metastore.expression.proxy -> org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy

Link to the documentation
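
The two parameter changes listed above could be written into the Metastore's site file roughly as follows. This is a sketch under assumptions: the `CONF_DIR` variable and its default are hypothetical, and the file name `metastore-site.xml` and the start-up commands in the trailing comments reflect my understanding of the standalone package layout, so check them against the documentation for your release.

```shell
#!/bin/sh
set -e

# Hypothetical location; in a real install this would be the conf/ directory
# of the unpacked standalone Metastore package.
CONF_DIR="${CONF_DIR:-./conf}"
mkdir -p "$CONF_DIR"

# Quoted heredoc: the XML is written verbatim, with no shell expansion.
cat > "$CONF_DIR/metastore-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>metastore.task.threads.always</name>
    <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask,org.apache.hadoop.hive.metastore.MaterializationsCacheCleanerTask</value>
  </property>
  <property>
    <name>metastore.expression.proxy</name>
    <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/metastore-site.xml"

# Assumed next steps (not run here; verify against your release):
#   bin/schematool -initSchema -dbType derby
#   bin/start-metastore
```

Presto would then point at the running service through `hive.metastore.uri` (for example `thrift://<metastore-host>:9083`) in its Hive catalog file, rather than using `hive.metastore=file`.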

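Setting up Hive just for the Metastore does seem like a lot of hassle. Have you considered using the AWS Glue Data Catalog? That way you don't have to manage anything. You can find the details here: https://docs.aws.amazon.com/emr/latest/releaseguide/emr-presto-glue.html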
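
For a self-managed (non-EMR) Presto, a Glue-backed catalog might look like the sketch below. Treat it as an assumption to verify: the `hive.metastore=glue` and `hive.metastore.glue.region` properties are only available in Presto builds that include Glue support, the catalog file name `glue.properties` and the `AWS_REGION_NAME` variable are hypothetical, and on EMR the linked page describes its own way of enabling Glue.

```shell
#!/bin/sh
set -e

# Illustrative assumptions: output directory and region are placeholders.
CATALOG_DIR="${CATALOG_DIR:-./etc/catalog}"
AWS_REGION_NAME="${AWS_REGION_NAME:-us-east-1}"

mkdir -p "$CATALOG_DIR"

# The catalog name in Presto is derived from the file name ("glue" here).
cat > "$CATALOG_DIR/glue.properties" <<EOF
connector.name=hive-hadoop2
# Assumption: this Presto build ships the Glue metastore implementation
hive.metastore=glue
hive.metastore.glue.region=${AWS_REGION_NAME}
EOF

echo "wrote $CATALOG_DIR/glue.properties"
```

With this approach there is no Metastore container to run at all; table definitions live in the Glue Data Catalog and AWS credentials are resolved from the environment of the Presto process.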

I was able to integrate Presto SQL and HMS 3.0 with AWS S3. I wrote an article about it, in case it helps: https://www.linkedin.com/pulse/presto-sql-s3-abhishek-gupta
