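I am trying to write a log-based alert policy in Terraform.
I want to generate an alert in near real time whenever a certain message appears in the logs. Specifically, I want to know when a Composer DAG fails.
I successfully set up a log-based alert in the console with the following query filter: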
resource.type="cloud_composer_environment"
severity="ERROR"
log_name="projects/my_project/logs/airflow-scheduler"
resource.labels.project_id="project-id"
textPayload=~"my_dag_name"
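However, I am having trouble translating this log-based alert policy into a "google_monitoring_alert_policy" resource.
I tried adding the following filter condition to the Terraform google_monitoring_alert_policy: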
filter = "resource.type=cloud_composer_environment AND resource.label.project_id=${var.project} AND log_name=projects/${var.project}/logs/airflow-scheduler AND severity=ERROR AND textPayload=~my_dag_name"
But when I run terraform apply, I get the following error:
build 10-Nov-2022 12:21:00 Error: Error creating AlertPolicy: googleapi: Error 400: Field alert_policy.conditions[0].condition_threshold.filter had an invalid value of "resource.type=cloud_composer_environment AND resource.labels.project_id=my_project AND log_name=projects/my_project/logs/airflow-scheduler AND severity=ERROR AND textPayload=my_dag_name": The lefthand side of each expression must be prefixed with one of {group, metadata, metric, project, resource}.
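I have two questions:
Can "log-based" alerts be configured in Terraform at all?
How do I set up an alert in Terraform that filters on a particular string in the log 'textPayload' field?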
It looks like you want to create a log-based metric.
In that case, you first need to create this log-based metric with Terraform.
Example of metrics configured in a JSON file, logging_metrics.json:
{
  "metrics": {
    "composer_dags_tasks_bigquery_errors": {
      "name": "composer_dags_tasks_bigquery_errors",
      "filter": "severity=ERROR AND resource.type=\"cloud_composer_environment\" AND textPayload =~ \"{taskinstance.py:.*} ERROR -.*bigquery.googleapis.com/bigquery/v2/projects\"",
      "description": "Metric for Cloud Composer Bigquery tasks errors.",
      "metric_descriptor": {
        "metric_kind": "DELTA",
        "value_type": "INT64",
        "labels": [
          {
            "key": "task_id",
            "value_type": "STRING",
            "description": "Task ID of current Airflow task",
            "extractor": "EXTRACT(labels.\"task-id\")"
          },
          {
            "key": "execution_date",
            "value_type": "STRING",
            "description": "Execution date of the current Airflow task",
            "extractor": "EXTRACT(labels.\"execution-date\")"
          }
        ]
      }
    }
  }
}
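This metric filters BigQuery errors in the Composer logs. I used label extractors on the DAG task_id and the task execution_date so that the metric is unique per combination of these parameters.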
Retrieve the metrics in the locals.tf file:
locals {
  # Decode the JSON file and keep the map stored under its "metrics" key.
  logging_metrics = jsondecode(file("${path.module}/resource/logging_metrics.json"))["metrics"]
}

# One google_logging_metric is created per entry of the decoded map.
resource "google_logging_metric" "logging_metrics" {
  for_each    = local.logging_metrics
  project     = var.project_id
  name        = each.value["name"]
  filter      = each.value["filter"]
  description = each.value["description"]

  metric_descriptor {
    metric_kind = each.value["metric_descriptor"]["metric_kind"]
    value_type  = each.value["metric_descriptor"]["value_type"]

    dynamic "labels" {
      for_each = try(each.value["metric_descriptor"]["labels"], [])
      content {
        key         = try(labels.value["key"], null)
        value_type  = try(labels.value["value_type"], null)
        description = try(labels.value["description"], null)
      }
    }
  }

  # Map each label key to its extractor expression from the JSON file.
  label_extractors = { for label in try(each.value["metric_descriptor"]["labels"], []) : label.key => label.extractor }
}
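For the single DAG from the question, the same idea can be written inline without the JSON file. The following is a minimal sketch, not a tested configuration: the resource name `dag_failure` and the metric name `composer_dag_my_dag_name_errors` are hypothetical, and the filter simply reuses the conditions from the console query with the inner quotes escaped for HCL.

```hcl
# Sketch only: an inline log-based metric counting ERROR entries for one DAG.
# "dag_failure" and "composer_dag_my_dag_name_errors" are hypothetical names.
resource "google_logging_metric" "dag_failure" {
  project     = var.project_id
  name        = "composer_dag_my_dag_name_errors"
  description = "Counts airflow-scheduler ERROR entries that mention my_dag_name."

  # Same conditions as the console query in the question.
  # Quotes inside each condition must be escaped in an HCL string.
  filter = join(" AND ", [
    "resource.type=\"cloud_composer_environment\"",
    "severity=\"ERROR\"",
    "log_name=\"projects/${var.project_id}/logs/airflow-scheduler\"",
    "resource.labels.project_id=\"${var.project_id}\"",
    "textPayload=~\"my_dag_name\"",
  ])

  # A plain counter: each matching log entry increments the metric by one.
  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}
```

Note that a logging filter such as `textPayload=~...` is valid here, in the metric definition, but not in `condition_threshold.filter`, which is a Monitoring filter and only accepts the prefixes listed in the error message.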
Then create the alert resource based on the previous log-based metric:
resource "google_monitoring_alert_policy" "alert_policy" {
  project      = var.project_id
  display_name = "alert_name"
  combiner     = "..."
  conditions {
    display_name = "alert_name"
    condition_threshold {
      filter = "metric.type=\"logging.googleapis.com/user/composer_dags_tasks_bigquery_errors\" AND resource.type=\"cloud_composer_environment\""
      ...........
    }
  }
}
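The parts elided in the block above depend on how you want the alert to behave. As one possible way to fill them in, the sketch below fires as soon as the metric counts at least one error within a one-minute window. The threshold, the alignment settings, the `OR` combiner, the resource name `bigquery_errors`, the display names and the variable `var.notification_channel_ids` are all assumptions for illustration, not values from the original answer.

```hcl
# Sketch only: a threshold condition on the log-based metric defined earlier.
resource "google_monitoring_alert_policy" "bigquery_errors" {
  project      = var.project_id
  display_name = "Composer BigQuery task errors" # hypothetical name
  combiner     = "OR"                            # assumption: a single condition

  conditions {
    display_name = "BigQuery task errors > 0" # hypothetical name

    condition_threshold {
      # Referencing the metric resource (instead of hard-coding its name)
      # makes Terraform create the metric before the alert policy.
      filter = "metric.type=\"logging.googleapis.com/user/${google_logging_metric.logging_metrics["composer_dags_tasks_bigquery_errors"].name}\" AND resource.type=\"cloud_composer_environment\""

      comparison      = "COMPARISON_GT"
      threshold_value = 0    # assumption: alert on the first error
      duration        = "0s" # assumption: no waiting period

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_SUM" # sum the DELTA counter per minute
      }
    }
  }

  # Hypothetical variable: a list of notification channel IDs.
  notification_channels = var.notification_channel_ids
}
```

Because the metric carries the `task_id` and `execution_date` labels, each label combination is its own time series and is evaluated separately unless you add a cross-series reducer.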
The alert policy resource references the previously created log-based metric through metric.type.
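As for the first question: as far as I know, recent versions of the Google provider also support log-match conditions directly through a `condition_matched_log` block, which accepts a Logging filter like the one from the console. This is an alternative to the metric-based approach above, not part of it, and you should check it against your provider version. In the sketch below, the resource name `dag_log_match`, the display names, the 300-second rate limit and `var.notification_channel_ids` are assumptions.

```hcl
# Sketch only: a log-match alert that needs no log-based metric.
resource "google_monitoring_alert_policy" "dag_log_match" {
  project      = var.project_id
  display_name = "Composer DAG failure (log match)" # hypothetical name
  combiner     = "OR"

  conditions {
    display_name = "my_dag_name error in airflow-scheduler log" # hypothetical name

    condition_matched_log {
      # A Logging filter, so textPayload and log_name are allowed here.
      filter = "resource.type=\"cloud_composer_environment\" AND severity=\"ERROR\" AND log_name=\"projects/${var.project_id}/logs/airflow-scheduler\" AND textPayload=~\"my_dag_name\""
    }
  }

  # Log-match policies need a notification rate limit.
  alert_strategy {
    notification_rate_limit {
      period = "300s" # assumption: at most one notification every 5 minutes
    }
  }

  # Hypothetical variable: a list of notification channel IDs.
  notification_channels = var.notification_channel_ids
}
```

This form is closest to the alert you built in the console; the metric-based form is more useful when you also want to chart or aggregate the error counts.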