Pig UDF -作为一组字段返回动态模式(不是元组)

在GROUP BY + FLATTEN之后，我有一个具有命名空间的数据:

DESCRIBE users;
users: {user_id: int, group_id: int, registration_timestamp: int}
users_with_namespace = FOREACH (GROUP users BY group_id) {
    first_to_latest = ORDER users BY registration_timestamp ASC;
    first_user = LIMIT first_to_latest 1;
    GENERATE FLATTEN(first_user);
};
DESCRIBE users_with_namespace;
users_with_namespace: {first_user::user_id: int, first_user::group_id: int, first_user::registration_timestamp: int}

我希望能够做这样的事情:

users = myudf.strip_namespace(users_with_namespace);

或者(因为这似乎是不可能的):

users = FOREACH (GROUP users_with_namespaceALL) 
GENERATE myudf.strip_namespace(users_with_namespace);

结果是:

> DESCRIBE users;
users: {user_id: int, registration_timestamp: int}

我已经编写了一个Jython Pig UDF，它应该删除任何名称空间的字段名，但是我似乎无法从我的UDF返回一组字段。只有一个Bag/Tuple/Single字段是可能的，这给我留下了这样的结果:

DESCRIBE users;
users: {t: (user_id: int, registration_timestamp: int)}

是否有任何方法可以省略't'并返回字段列表/集合?我的UDF是这样的:

@outputSchemaFunction("tupleSchema")
def strip_namespace(input):
    return input

@schemaFunction("tupleSchema")
def tupleSchema(input):
    fields = []
    dt = []
    for i in input.getField(0).schema.getFields():
        for field in i.schema.getFields():
            fields.append(field.alias.split("::")[-1])
            dt.append(field.type)
    return SchemaUtil.newTupleSchema(fields, dt)

到目前为止我一直用

FOREACH .. GENERATE namespace::field as field, ...

来剥离名称空间，但是这种方法对于包含许多字段的数据集来说非常繁琐。

不幸的是，你不能，至少现在不能。问题正是您所说的:现在您只能返回Tuple、Bag或单个字段。2个月前，我创建了一个JIRA问题，允许为这个场景返回多个字段，但是还没有回复…

我真的希望他们在未来实现这个，因为当你必须执行许多连接时，你最终会有更多的FOREACH语句来重命名字段，而不是实际的代码。

相关内容

最新更新

热门标签：