I'm confused about how to use Terraform to connect Athena to my Glue Catalog database.
I use
resource "aws_glue_catalog_database" "catalog_database" {
  name = "${var.glue_db_name}"
}

resource "aws_glue_crawler" "datalake_crawler" {
  database_name = "${var.glue_db_name}"
  name          = "${var.crawler_name}"
  role          = "${aws_iam_role.crawler_iam_role.name}"
  description   = "${var.crawler_description}"
  table_prefix  = "${var.table_prefix}"
  schedule      = "${var.schedule}"

  s3_target {
    path = "s3://${var.data_bucket_name[0]}"
  }

  s3_target {
    path = "s3://${var.data_bucket_name[1]}"
  }
}
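to create a Glue DB and a crawler that crawls S3 buckets (only two here), but I don't know how to link the Athena query service to the Glue DB. In the Terraform documentation for Athena, there doesn't appear to be a way to connect Athena to a Glue catalog, only to an S3 bucket. Clearly, however, Athena can be integrated with Glue.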
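Our current basic setup for having Glue crawl one S3 bucket and create/update a table in a Glue DB, which can then be queried in Athena, looks like this:
Crawler role and role policy:
- The assume_role_policy of the IAM role only needs Glue as principal
- The IAM role policy allows actions for Glue, S3, and logs
- The Glue actions and resources can probably be narrowed down to the ones that are really needed
- The S3 actions are limited to those needed by the crawler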
resource "aws_iam_role" "glue_crawler_role" {
  name = "analytics_glue_crawler_role"

  # Trust policy: only the Glue service may assume this role.
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}
resource "aws_iam_role_policy" "glue_crawler_role_policy" {
  name = "analytics_glue_crawler_role_policy"
  role = "${aws_iam_role.glue_crawler_role.id}"

  # Permissions for the crawler: Glue, the data bucket, and CloudWatch Logs.
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetBucketAcl",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::analytics-product-data",
        "arn:aws:s3:::analytics-product-data/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:/aws-glue/*"
      ]
    }
  ]
}
EOF
}
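As a rough illustration of the "narrow the Glue actions" point above, the `glue:*` on `*` statement could be swapped for a scoped one. This is only a sketch in Terraform 0.12+ syntax: the data source name `glue_crawler_catalog_access` is hypothetical, and the action list is an untested guess at what a crawler needs, so expect to adjust it after checking the crawler's error logs.

```hcl
# Hypothetical, untested sketch: scope Glue permissions to one database.
data "aws_caller_identity" "current" {}

locals {
  account_id = data.aws_caller_identity.current.account_id
  glue_db    = aws_glue_catalog_database.analytics_db.name
}

data "aws_iam_policy_document" "glue_crawler_catalog_access" {
  statement {
    effect = "Allow"

    # Assumed set of catalog actions; may be incomplete for your crawler.
    actions = [
      "glue:GetDatabase",
      "glue:GetTable",
      "glue:GetTables",
      "glue:CreateTable",
      "glue:UpdateTable",
      "glue:DeleteTable",
      "glue:GetPartition",
      "glue:GetPartitions",
      "glue:BatchGetPartition",
      "glue:BatchCreatePartition",
      "glue:UpdatePartition",
      "glue:BatchDeletePartition",
    ]

    # Catalog, the single database, and every table inside it.
    resources = [
      "arn:aws:glue:*:${local.account_id}:catalog",
      "arn:aws:glue:*:${local.account_id}:database/${local.glue_db}",
      "arn:aws:glue:*:${local.account_id}:table/${local.glue_db}/*",
    ]
  }
}
```

The rendered JSON is available as `data.aws_iam_policy_document.glue_crawler_catalog_access.json` and could be attached to the crawler role through an additional `aws_iam_role_policy`.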
S3 bucket, Glue database and crawler:
resource "aws_s3_bucket" "product_bucket" {
  bucket = "analytics-product-data"
  acl    = "private"
}

resource "aws_glue_catalog_database" "analytics_db" {
  name = "inventory-analytics-db"
}

resource "aws_glue_crawler" "product_crawler" {
  database_name = "${aws_glue_catalog_database.analytics_db.name}"
  name          = "analytics-product-crawler"
  role          = "${aws_iam_role.glue_crawler_role.arn}"

  # Run once a day at 00:00 UTC.
  schedule = "cron(0 0 * * ? *)"

  # JSON string; the inner quotes must be escaped.
  configuration = "{\"Version\": 1.0, \"CrawlerOutput\": { \"Partitions\": { \"AddOrUpdateBehavior\": \"InheritFromTable\" }, \"Tables\": {\"AddOrUpdateBehavior\": \"MergeNewColumns\" } } }"

  schema_change_policy {
    delete_behavior = "DELETE_FROM_DATABASE"
  }

  s3_target {
    path = "s3://${aws_s3_bucket.product_bucket.bucket}/products"
  }
}
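Nothing in this setup links Athena to Glue explicitly, because Athena reads table definitions from the Glue Data Catalog in the same account and region. If you also want the Athena side in Terraform, a minimal sketch (Terraform 0.12+ syntax) might look like the following. The results bucket, the workgroup, the named query, and the table name `products` are all assumptions for illustration; the crawler decides the real table name.

```hcl
# Hypothetical bucket for Athena query results (not for the source data).
resource "aws_s3_bucket" "athena_results" {
  bucket = "analytics-athena-query-results" # assumed name; must be globally unique
}

# Workgroup that writes query output to the results bucket.
resource "aws_athena_workgroup" "analytics" {
  name = "analytics"

  configuration {
    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_results.bucket}/results/"
    }
  }
}

# Saved query that runs against the Glue database created above.
resource "aws_athena_named_query" "sample_products" {
  name      = "sample-products"
  workgroup = aws_athena_workgroup.analytics.id
  database  = aws_glue_catalog_database.analytics_db.name

  # "products" is an assumed table name.
  query = "SELECT * FROM products LIMIT 10;"
}
```

The only connection between the two services here is the `database` argument, which simply reuses the Glue database name.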
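I had several things wrong in my Terraform code. To start with:
- The S3 bucket argument of aws_athena_database refers to the bucket for query output, not to the data the table should be built from.
- I had set up aws_glue_crawler to write to a Glue database rather than an Athena DB. Indeed, as Martin suggested above, once everything was set up correctly, Athena could see the tables in the Glue DB.
- I did not have the correct policies attached. Initially, the only policy attached to the crawler role was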
resource "aws_iam_role_policy_attachment" "crawler_attach" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
  role       = "${aws_iam_role.crawler_iam_role.name}"
}
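Once I set up a second policy that explicitly allows all S3 access to all of the buckets I wanted to crawl, and attached that policy to the same crawler role, the crawler ran successfully and updated the tables.
The second policy: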
resource "aws_iam_policy" "crawler_bucket_policy" {
  name        = "crawler_bucket_policy"
  path        = "/"
  description = "Gives crawler access to buckets"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1553807998309",
      "Action": "*",
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Sid": "Stmt1553808056033",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket0"
    },
    {
      "Sid": "Stmt1553808078743",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket1"
    },
    {
      "Sid": "Stmt1553808099644",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket2"
    },
    {
      "Sid": "Stmt1553808114975",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket3"
    },
    {
      "Sid": "Stmt1553808128211",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket4"
    }
  ]
}
EOF
}
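The attachment of this second policy to the crawler role is not shown here. A sketch of what it presumably looks like, mirroring the first attachment in Terraform 0.12+ syntax (the resource name `crawler_bucket_attach` is made up):

```hcl
# Attach the bucket-access policy to the same role the crawler runs as.
resource "aws_iam_role_policy_attachment" "crawler_bucket_attach" {
  policy_arn = aws_iam_policy.crawler_bucket_policy.arn
  role       = aws_iam_role.crawler_iam_role.name
}
```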
I'm confident I can get rid of the hard-coding in this policy, but I don't yet know how to do that.
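One possible way to drop the hard-coded bucket ARNs, assuming Terraform 0.12 or later and that `var.data_bucket_name` is the list of bucket names used for the crawler targets, is to build the policy with the `aws_iam_policy_document` data source and a `for` expression. This is a sketch rather than a tested replacement. It keeps only the per-bucket S3 statements, leaves out the blanket `"Action": "*"` statement, and adds `/*` ARNs so that objects inside the buckets are covered; both changes are assumptions about the intent. The data source name `crawler_bucket_access` is hypothetical.

```hcl
# Generate one bucket ARN and one object ARN per entry in the list.
data "aws_iam_policy_document" "crawler_bucket_access" {
  statement {
    effect  = "Allow"
    actions = ["s3:*"]

    resources = concat(
      [for b in var.data_bucket_name : "arn:aws:s3:::${b}"],
      [for b in var.data_bucket_name : "arn:aws:s3:::${b}/*"]
    )
  }
}

# Would replace the heredoc version of the policy.
resource "aws_iam_policy" "crawler_bucket_policy" {
  name        = "crawler_bucket_policy"
  path        = "/"
  description = "Gives crawler access to buckets"
  policy      = data.aws_iam_policy_document.crawler_bucket_access.json
}
```

With this in place, adding a bucket to `var.data_bucket_name` extends the policy without editing any JSON.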