Removing duplicate join columns in Hive joins



I am running the following query in Hive:

select * from
  (select * from 
      (select * from A join B on A.x = B.x) t1
  join C on t1.y = C.y) t2
join D on t2.x = D.x

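Hive cannot resolve column x, because both A and B contain a column named x. How should I use qualified names here, or is there a way to drop the duplicate columns in Hive?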

Because tables A and B both have a column x, you have to give that column an alias. Instead of

select * from A join B on A.x = B.x

use something like this:

select A.x as x1, B.x as x2, ...
from A join B on A.x = B.x

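You can do something like the following. With hive.support.quoted.identifiers set to none, backquoted names in the select list are interpreted as regular expressions, which means you can no longer use special characters in column names.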

set hive.support.quoted.identifiers=none;

select * from
  (select C.*,t1.`(y)?+.+` from 
      (select A.*,B.`(x)?+.+` from A join B on A.x = B.x) t1
  join C on t1.y = C.y) t2
join D on t2.x = D.x

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification
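
To see why a pattern such as `(x)?+.+` selects every column except x, it helps to try it outside Hive. The Hive manual describes the regex column specification as Java regex syntax, so the sketch below uses java.util.regex directly. The class name, the helper selectColumns, and the column list are made up for illustration, and the matching is case-sensitive here, which may differ from how Hive treats column names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RegexColumnDemo {
    // Returns the names that match the regex in full, roughly how a
    // regex column specification picks columns from a table.
    static List<String> selectColumns(List<String> columns, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<>();
        for (String column : columns) {
            if (pattern.matcher(column).matches()) {
                selected.add(column);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical column list for table B.
        List<String> bColumns = Arrays.asList("x", "y", "xy", "b1");
        // (x)?+ is possessive: once it has consumed "x" it will not give it
        // back, so .+ finds nothing left to match and the name "x" is rejected.
        System.out.println(selectColumns(bColumns, "(x)?+.+")); // prints [y, xy, b1]
    }
}
```

Note that a name like xy still matches, so only the column named exactly x is dropped.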

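I ran into exactly the same problem. My solution was to rename the duplicate columns by re-creating the DataFrame with a modified schema. Here is some sample code: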

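  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.types.StructType
  import scala.collection.mutable

  // Renames every column whose name occurs more than once by appending "__<n>",
  // where n counts the occurrences of that name from left to right, starting at 0.
  def renameDuplicatedColumns(df: DataFrame): DataFrame = {
    // Names that appear more than once in the DataFrame
    val duplicatedColumns = df.columns
      .groupBy(identity)
      .filter(_._2.length > 1)
      .keys
      .toSet
    // Next suffix to hand out for each duplicated name
    val newIndexes = mutable.Map[String, Int]().withDefaultValue(0)
    val schema: StructType = StructType(
      df.schema.map {
        case field if duplicatedColumns.contains(field.name) =>
          val idx = newIndexes(field.name)
          newIndexes.update(field.name, idx + 1)
          field.copy(name = field.name + "__" + idx)
        case field =>
          field
      }
    )
    // Rebuild the DataFrame over the same rows with the de-duplicated schema
    df.sqlContext.createDataFrame(df.rdd, schema)
  }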
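
To make the naming scheme concrete, here is a minimal sketch of the same renaming rule applied to a plain list of column names, without Spark. It is only an illustration of which names come out: the class DuplicateColumnNames, the method renameDuplicates, and the sample column list are hypothetical and not part of the answer above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateColumnNames {
    // Appends "__<n>" to every name that occurs more than once, numbering
    // the occurrences from left to right starting at 0. Unique names are kept.
    static List<String> renameDuplicates(List<String> names) {
        // First pass: count how often each name occurs.
        Map<String, Integer> counts = new HashMap<>();
        for (String name : names) {
            counts.merge(name, 1, Integer::sum);
        }
        // Second pass: hand out suffixes to the duplicated names only.
        Map<String, Integer> nextIndex = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (counts.get(name) > 1) {
                int idx = nextIndex.getOrDefault(name, 0);
                nextIndex.put(name, idx + 1);
                result.add(name + "__" + idx);
            } else {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical columns of a join result where x appears twice.
        List<String> columns = Arrays.asList("x", "y", "x", "z");
        System.out.println(renameDuplicates(columns)); // prints [x__0, y, x__1, z]
    }
}
```

After the rename, both copies of x can be referenced unambiguously as x__0 and x__1.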
