Python/PySpark: rearranging DataFrame columns

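I have a DataFrame in Python/PySpark with the columns id, time, city, zip, and so on.

I have now added a new column, name, to this DataFrame.

Now I need to arrange the columns so that name comes right after id.

This is what I tried: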

change_cols = ['id', 'name']
cols = ([col for col in change_cols if col in df] 
        + [col for col in df if col not in change_cols])
df = df[cols]

I get this error:

pyspark.sql.utils.AnalysisException: u"Reference 'id' is ambiguous, could be: id#609, id#1224.;"

Why does this error occur, and how can I fix it?

You can use select to change the order of the columns:

df.select("id","name","time","city")
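Selecting by name only works when each name is unique. The error in the question lists two candidates for id (id#609 and id#1224), which suggests the DataFrame holds two columns called id, possibly left over from a join. A minimal sketch for checking this, working on a plain list of names such as the one returned by df.columns (the helper name duplicated_names and the sample list are made up for illustration):

```python
from collections import Counter


def duplicated_names(columns):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in counts if counts[name] > 1]


# Hypothetical column list, shaped like the result of a join that kept both id columns.
columns = ["id", "time", "city", "zip", "id", "name"]
print(duplicated_names(columns))  # ['id']
```

With a real DataFrame this would be called as duplicated_names(df.columns). If it returns anything, one of the duplicates has to be dropped or renamed before reordering. When the duplicate comes from a join, joining on the column name, as in df1.join(df2, "id"), keeps a single id column.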

If you are working with a large number of columns and alphabetical order is acceptable:

df.select(sorted(df.columns))

If you only want to move some of them to the front and keep the rest without caring about their order:

def get_cols_to_front(df, columns_to_front):
    original = df.columns
    # Keep only the requested columns that actually exist in df
    columns_to_front = [c for c in columns_to_front if c in original]
    # Collect the remaining columns and sort them alphabetically for a consistent order
    columns_other = sorted(set(original) - set(columns_to_front))
    # Apply the new order
    return df.select(*columns_to_front, *columns_other)
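If the remaining columns should stay in their original positions, which is closer to what the question asks (name directly after id, everything else untouched), the new order can be computed on the list of names first. This is only a sketch: move_column_after is a hypothetical helper, not part of PySpark, and it assumes the column names are unique.

```python
def move_column_after(columns, column, after):
    """Return a new list of names with `column` placed right after `after`.

    All other names keep their original relative order. If either name is
    missing, or both are the same, the order is returned unchanged.
    """
    if column not in columns or after not in columns or column == after:
        return list(columns)
    # Remove the column to move, then re-insert it right behind the anchor.
    rest = [c for c in columns if c != column]
    position = rest.index(after) + 1
    return rest[:position] + [column] + rest[position:]


columns = ["id", "time", "city", "zip", "name"]
new_order = move_column_after(columns, "name", "id")
print(new_order)  # ['id', 'name', 'time', 'city', 'zip']
```

On a DataFrame the result would be applied with df = df.select(*move_column_after(df.columns, "name", "id")).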
