Creating a new schema or column names on a PySpark DataFrame

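I came across this post, which was somewhat helpful, except that I need to change the DataFrame's header using a list: the header is long and changes with every dataset I load, so I can't really write out or hard-code the new column names.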

Example:

df = sqlContext.read.load("./assets/"+filename, 
                          format='com.databricks.spark.csv', 
                          header='false', 
                          inferSchema='false')
devices = df.first()
metrics = df.take(2)[1]
# Adding the two header rows together as one as a way of later searching through and sorting rows
# delimiter is "..." since it doesn't occur anywhere in the data and we don't have to worry about multiple splits
header = [str(devices[i]) +"..."+ str(metrics[i]) for i in range(len(devices))]
df2 = df.toDF(header)

Then, of course, I get this error:

IllegalArgumentException: u"requirement failed: The number of columns doesn't match.\nOld column names (278):

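The length of header is 278, and the number of columns is the same. So the real question is: when I have a list of new names, how do I rename the DataFrame's headers without hard-coding them?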

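I suspect I have to pass the names in some form other than an actual list object, but how do I do that without iterating over every column (with selectExpr or alias) and spinning up several new DataFrames (they are immutable), one newly renamed column at a time? (yuck)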

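I tried a different approach. Since I wanted to simulate a hard-coded list of names (rather than an actual list object), I used an exec() statement with a string built from all of the concatenated headers.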

Note: this is limited to 255 columns. If you need more than that, you will have to break it up.

for i in range(len(header)):
    # For the first of the column names, need to initiate the string header_str
    if i == 0:
        header_str = "'" + str(header[i])+"',"
    # For the last of the names, need a different string to close it without a comma
    elif i == len(header)-1:
        header_str = header_str + "'" + header[i] + "'"
    #For everything in the middle: just add it all together the same way
    else:
        header_str = header_str + "'" + header[i] + "',"
exec("df2 = df.toDF("+ header_str +")")
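The exec() string above exists only to turn the list into separate arguments. Python's argument unpacking should do the same thing directly: toDF accepts the names as individual positional arguments, so df.toDF(*header) passes each element of the list as its own name. A minimal sketch, where rename_all is a hypothetical helper name (not part of PySpark) and df is assumed to be an existing DataFrame:

```python
def rename_all(df, new_names):
    """Return a new DataFrame whose columns are renamed by position.

    Hypothetical helper for illustration; `df` is assumed to be a
    PySpark DataFrame and `new_names` any sequence of names.
    """
    new_names = [str(name) for name in new_names]

    # Fail early with a readable message if the counts differ.
    if len(new_names) != len(df.columns):
        raise ValueError(
            "expected %d names, got %d" % (len(df.columns), len(new_names))
        )

    # The * unpacks the list so each name is its own positional argument.
    # df.toDF(new_names) without the * passes the whole list as ONE argument.
    return df.toDF(*new_names)
```

With the variables from the question this would be called as df2 = rename_all(df, header). Because the names are unpacked at call time rather than written out in source, the 255-argument limit on written-out calls in older Python versions should not come into play.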

You can iterate over the old column names and assign your new column names as aliases. A good way to do this is with Python's zip function.

First, let's create the lists of column names:

old_cols = df.columns
new_cols = [str(d) + "..." + str(m) for d, m in zip(devices, metrics)]

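Although I'm going to assume that "..." stands in for some other Python object, because "..." is not a good character sequence to have in a column name.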

Finally:

df2 = df.select([df[oc].alias(nc) for oc, nc in zip(old_cols, new_cols)])
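On the separator concern raised above: a dot inside a Spark column name has to be backtick-quoted whenever the column is later referenced by name in an expression, so a dot-free separator tends to be less trouble. The sketch below builds the combined names with "__" instead, on the assumption that "__" does not occur in either header row. make_names is a hypothetical helper, not a PySpark function:

```python
from collections import Counter


def make_names(devices, metrics, sep="__"):
    """Join two header rows pairwise into one name per column.

    Hypothetical helper; `sep` is assumed not to occur in the data.
    """
    if len(devices) != len(metrics):
        raise ValueError("header rows have different lengths")

    names = ["%s%s%s" % (d, sep, m) for d, m in zip(devices, metrics)]

    # Duplicate names make later lookups by name ambiguous, so fail early.
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError("duplicate column names: %s" % duplicates)

    return names
```

The result can be used in place of new_cols in the select above, for example new_cols = make_names(devices, metrics).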
