Elasticsearch ingest pipeline: how to recursively modify values in a HashMap



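Using an ingest pipeline, I want to iterate over a HashMap and remove underscores from all string values that contain them, while leaving underscores in the keys intact. Some values are arrays, which must be iterated in turn to apply the same modification.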

In the pipeline, I use a function to traverse and modify the values in the HashMap's Collection view.

PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          void findReplace(Collection collection) {
            collection.forEach(element -> {
              if (element instanceof String) {
                element.replace('_',' ');
              } else {
                findReplace(element);
              }
              return true;
            })
          }
          Collection samples = ctx.samples;
          samples.forEach(sample -> { //sample.sample_tags is a HashMap
            Collection sample_tags = sample.sample_tags.values();
            findReplace(sample_tags);
            return true;
          })
        """
      }
    }
  ]
}

When I simulate the pipeline, the string values come back unmodified. Where am I going wrong?

POST /_ingest/pipeline/samples/_simulate
{
  "docs": [
    {
      "_index": "samples",
      "_id": "xUSU_3UB5CXFr25x7DcC",
      "_source": {
        "samples": [
          {
            "sample_tags": {
              "Entry_A": [
                "A_hyphentated-sample",
                "sample1"
              ],
              "Entry_B": "A_multiple_underscore_example",
              "Entry_C": [
                "sample2",
                "another_example_with_underscores"
              ],
              "Entry_E": "last_example"
            }
          }
        ]
      }
    }
  ]
}
Result:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "samples",
        "_type" : "_doc",
        "_id" : "xUSU_3UB5CXFr25x7DcC",
        "_source" : {
          "samples" : [
            {
              "sample_tags" : {
                "Entry_E" : "last_example",
                "Entry_C" : [
                  "sample2",
                  "another_example_with_underscores"
                ],
                "Entry_B" : "A_multiple_underscore_example",
                "Entry_A" : [
                  "A_hyphentated-sample",
                  "sample1"
                ]
              }
            }
          ]
        },
        "_ingest" : {
          "timestamp" : "2020-12-01T17:29:52.3917165Z"
        }
      }
    }
  ]
}

Here is a modified version of your script that works with the data you provided:

PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          // returns a new string; the caller must store the result
          String replaceString(String value) {
            return value.replace('_',' ');
          }

          void findReplace(Map map) {
            map.keySet().forEach(key -> {
              if (map[key] instanceof String) {
                // write the new string back under the same key
                map[key] = replaceString(map[key]);
              } else {
                // assumes every non-string value is a list of strings
                map[key] = map[key].stream().map(this::replaceString).collect(Collectors.toList());
              }
            });
          }
          ctx.samples.forEach(sample -> {
            findReplace(sample.sample_tags);
            return true;
          });
        """
      }
    }
  ]
}

The result looks like this:

{
  "samples" : [
    {
      "sample_tags" : {
        "Entry_E" : "last example",
        "Entry_C" : [
          "sample2",
          "another example with underscores"
        ],
        "Entry_B" : "A multiple underscore example",
        "Entry_A" : [
          "A hyphentated-sample",
          "sample1"
        ]
      }
    }
  ]
}

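You were on the right track, but you are working on copies of the values and never set the modified values back onto the document context ctx, which is what the pipeline ultimately returns. (String.replace returns a new string rather than changing the existing one, so its result has to be stored somewhere.) This means you need to keep track of the current iteration index, for array lists as well as hash maps and everything in between, so that you can target each field's position in the deeply nested context.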
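To make the point about discarded results concrete, here is a minimal sketch in plain Java (not Painless, and not pipeline code). The class and method names are hypothetical and exist only for illustration; the first method mirrors the approach in the question, the second stores the new string back under its key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; plain Java used to illustrate the semantics only.
final class DiscardedReplaceDemo {

    // Mirrors the question: replace() is called, but its result is thrown away.
    static Map<String, Object> withoutWriteBack(Map<String, Object> tags) {
        tags.values().forEach(value -> {
            if (value instanceof String) {
                ((String) value).replace('_', ' '); // new String is created, then lost
            }
        });
        return tags;
    }

    // Stores each new string back into the map under the same key.
    static Map<String, Object> withWriteBack(Map<String, Object> tags) {
        tags.replaceAll((key, value) ->
            value instanceof String ? ((String) value).replace('_', ' ') : value);
        return tags;
    }

    public static void main(String[] args) {
        Map<String, Object> tags = new HashMap<>();
        tags.put("Entry_E", "last_example");
        System.out.println(withoutWriteBack(tags)); // {Entry_E=last_example}
        System.out.println(withWriteBack(tags));    // {Entry_E=last example}
    }
}
```

The key "Entry_E" keeps its underscore in both cases because only the values are touched.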

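Below is an example that handles strings and array lists (of strings only). You will need to extend it to handle hash maps (and other types), and then perhaps extract the whole process into a separate function. AFAIK you cannot return multiple data types from a Java method, so that may be challenging, although Painless's dynamic def type may help here.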

PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          ArrayList samples = ctx.samples;

          for (int i = 0; i < samples.size(); i++) {
            def sample = samples.get(i).sample_tags;

            for (def entry : sample.entrySet()) {
              def key = entry.getKey();
              def val = entry.getValue();
              // default to the original value so unhandled types are left untouched
              def replaced_val = val;

              if (val instanceof String) {
                replaced_val = val.replace('_',' ');
              } else if (val instanceof ArrayList) {
                replaced_val = new ArrayList();
                for (int j = 0; j < val.size(); j++) {
                  replaced_val.add(val[j].replace('_',' '));
                }
              }
              // else if (val instanceof HashMap) {
              // do your thing
              // }

              // crucial part: write the new value back into the document context
              ctx.samples[i].sample_tags[key] = replaced_val;
            }
          }
        """
      }
    }
  ]
}
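As a rough sketch of how the extension to nested maps and lists could look once it is pulled into one function, here is a recursive version in plain Java (again not Painless, so it would need adapting before it could run in a script processor). The class name UnderscoreCleaner and the method clean are hypothetical. Returning Object sidesteps the "multiple return types" concern, which is the role a def return type would presumably play in Painless.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; plain Java sketch of the recursive idea.
final class UnderscoreCleaner {

    // Strings are rewritten; lists and maps are cleaned in place and returned;
    // any other value (numbers, booleans, null) is passed through unchanged.
    @SuppressWarnings("unchecked")
    static Object clean(Object value) {
        if (value instanceof String) {
            return ((String) value).replace('_', ' ');
        }
        if (value instanceof List) {
            ((List<Object>) value).replaceAll(UnderscoreCleaner::clean);
            return value;
        }
        if (value instanceof Map) {
            // only values are rewritten, so underscores in keys survive
            ((Map<Object, Object>) value).replaceAll((key, nested) -> clean(nested));
            return value;
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, Object> tags = new HashMap<>();
        tags.put("Entry_B", "A_multiple_underscore_example");
        tags.put("Entry_C", new ArrayList<Object>(
            Arrays.asList("sample2", "another_example_with_underscores")));
        clean(tags);
        System.out.println(tags.get("Entry_B")); // A multiple underscore example
        System.out.println(tags.get("Entry_C")); // [sample2, another example with underscores]
    }
}
```

Both replaceAll calls write each cleaned value back into its container, which is the same "set it back" step the loop above performs by hand.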
