How do I install Spark and Hadoop on EMR using Terraform?


resource "aws_emr_cluster" "cluster" {
  name          = "emr-test-arn"
  release_label = "emr-4.6.0"
  applications  = ["Spark", "Hadoop"]
  # ...
}

Given that "Spark" and "Hadoop" are listed in the applications argument above, will Hadoop and Spark actually be installed on the EMR cluster?

Or does it merely "prepare" the cluster to work with Hadoop and Spark (i.e., would we need to perform additional steps to install Hadoop and Spark on the EMR cluster)?

Yes — the applications argument tells EMR which applications to install when it provisions the cluster, so listing "Spark" and "Hadoop" is enough; no additional installation steps are required. You can refer to the example in the official documentation, shown below, which works well: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/emr_cluster

resource "aws_emr_cluster" "cluster" {
  name          = "emr-test-arn"
  release_label = "emr-4.6.0"
  applications  = ["Spark"]

  additional_info = <<EOF
{
  "instanceAwsClientConfiguration": {
    "proxyPort": 8099,
    "proxyHost": "myproxy.example.com"
  }
}
EOF

  termination_protection            = false
  keep_job_flow_alive_when_no_steps = true

  ec2_attributes {
    subnet_id                         = aws_subnet.main.id
    emr_managed_master_security_group = aws_security_group.sg.id
    emr_managed_slave_security_group  = aws_security_group.sg.id
    instance_profile                  = aws_iam_instance_profile.emr_profile.arn
  }

  master_instance_group {
    instance_type = "m4.large"
  }

  core_instance_group {
    instance_type  = "c4.large"
    instance_count = 1

    ebs_config {
      size                 = "40"
      type                 = "gp2"
      volumes_per_instance = 1
    }

    bid_price = "0.30"

    autoscaling_policy = <<EOF
{
  "Constraints": {
    "MinCapacity": 1,
    "MaxCapacity": 2
  },
  "Rules": [
    {
      "Name": "ScaleOutMemoryPercentage",
      "Description": "Scale out if YARNMemoryAvailablePercentage is less than 15",
      "Action": {
        "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "CHANGE_IN_CAPACITY",
          "ScalingAdjustment": 1,
          "CoolDown": 300
        }
      },
      "Trigger": {
        "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "LESS_THAN",
          "EvaluationPeriods": 1,
          "MetricName": "YARNMemoryAvailablePercentage",
          "Namespace": "AWS/ElasticMapReduce",
          "Period": 300,
          "Statistic": "AVERAGE",
          "Threshold": 15.0,
          "Unit": "PERCENT"
        }
      }
    }
  ]
}
EOF
  }

  ebs_root_volume_size = 100

  tags = {
    role = "rolename"
    env  = "env"
  }

  bootstrap_action {
    path = "s3://elasticmapreduce/bootstrap-actions/run-if"
    name = "runif"
    args = ["instance.isMaster=true", "echo running on master node"]
  }

  configurations_json = <<EOF
[
  {
    "Classification": "hadoop-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
        }
      }
    ],
    "Properties": {}
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
        }
      }
    ],
    "Properties": {}
  }
]
EOF

  service_role = aws_iam_role.iam_emr_service_role.arn
}
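If all you need is Hadoop and Spark, a much smaller definition is enough — the applications list alone triggers the installation at provision time. Below is a minimal sketch; the referenced subnet, instance profile, and IAM role (aws_subnet.main, aws_iam_instance_profile.emr_profile, aws_iam_role.iam_emr_service_role) are assumed placeholders you would define elsewhere in your own configuration, just as in the full example above:

```hcl
# Minimal EMR cluster: listing the applications is all it takes —
# EMR installs Hadoop and Spark on the nodes when the cluster comes up.
resource "aws_emr_cluster" "minimal" {
  name          = "emr-spark-hadoop"
  release_label = "emr-4.6.0"
  applications  = ["Spark", "Hadoop"] # installed automatically, no bootstrap action needed

  ec2_attributes {
    subnet_id        = aws_subnet.main.id                        # placeholder: your subnet
    instance_profile = aws_iam_instance_profile.emr_profile.arn  # placeholder: your EC2 instance profile
  }

  master_instance_group {
    instance_type = "m4.large"
  }

  core_instance_group {
    instance_type  = "m4.large"
    instance_count = 1
  }

  service_role = aws_iam_role.iam_emr_service_role.arn # placeholder: your EMR service role
}
```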