I have this DataFrame:
val df = Seq(
("LeBron", 36, 18, 12),
("Kevin", 42, 8, 9),
("Russell", 44, 5, 14)).
toDF("player", "points", "rebounds", "assists")
df.show()
+-------+------+--------+-------+
| player|points|rebounds|assists|
+-------+------+--------+-------+
| LeBron| 36| 18| 12|
| Kevin| 42| 8| 9|
|Russell| 44| 5| 14|
+-------+------+--------+-------+
I want to add "season_high" to every column name except player. I also want to do this with a function, because my real dataset has 250 columns.

I came up with the approach below, which gets me the output I want, but I'm wondering whether there is a way to pass a rule to the renamedColumns map function so that the column name player doesn't get switched to season_high_player and then switched back to player with the extra .withColumnRenamed call.
val renamedColumns = df.columns.map(name => col(name).as(s"season_high_$name"))
val df2 = df.select(renamedColumns : _*).
withColumnRenamed("season_high_player", "player")
df2.show()
+-------+------------------+--------------------+-------------------+
| player|season_high_points|season_high_rebounds|season_high_assists|
+-------+------------------+--------------------+-------------------+
| LeBron| 36| 18| 12|
| Kevin| 42| 8| 9|
|Russell| 44| 5| 14|
+-------+------------------+--------------------+-------------------+
One of the other answers has the right idea, but it leaves out how to actually use that "formula", so here it is:
import org.apache.spark.sql.Column

val selection: Seq[Column] = Seq(col("player")) ++ df.columns.filter(_ != "player")
  .map(name => col(name).as(s"season_high_$name"))
df.select(selection: _*).show
// +-------+------------------+--------------------+-------------------+
// | player|season_high_points|season_high_rebounds|season_high_assists|
// +-------+------------------+--------------------+-------------------+
// | LeBron| 36| 18| 12|
// | Kevin| 42| 8| 9|
// |Russell| 44| 5| 14|
// +-------+------------------+--------------------+-------------------+
So what we do here is filter out the column name we don't want to rename (that part is plain Scala), and then map over the names we keep to turn them into the renamed columns.
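Since the real dataset has 250 columns, the same filter-and-map pattern can be wrapped in a small helper. A minimal sketch, assuming you want to keep an arbitrary list of columns untouched; the helper name prefixAllExcept is invented here for illustration:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical helper: keep the columns listed in `keep` as-is (and first),
// and add `prefix` to every other column name.
def prefixAllExcept(df: DataFrame, prefix: String, keep: Seq[String]): DataFrame = {
  val kept: Seq[Column] = keep.map(col)
  val renamed: Seq[Column] = df.columns.toSeq
    .filterNot(keep.contains)
    .map(name => col(name).as(s"$prefix$name"))
  df.select(kept ++ renamed: _*)
}

// Usage:
// prefixAllExcept(df, "season_high_", Seq("player")).show()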
You can also do it by making the one column you don't want to rename the first column, and then applying the following logic:
import org.apache.spark.sql.functions._
val columnsRenamed = col(df.columns.head) +: df.columns.tail.map(name => col(name).as(s"season_high_$name"))
df.select(columnsRenamed :_*).show(false)
You should get this output:
+-------+------------------+--------------------+-------------------+
|player |season_high_points|season_high_rebounds|season_high_assists|
+-------+------------------+--------------------+-------------------+
|LeBron |36 |18 |12 |
|Kevin |42 |8 |9 |
|Russell|44 |5 |14 |
+-------+------------------+--------------------+-------------------+
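Note that this relies on the column you want to keep already being in the first position. If it is not, one option (a sketch, not part of the original answer) is to move it to the front first and then apply the same logic:

// Assumption: "player" might not be the first column, so reorder it to the front first.
val reordered = df.select("player", df.columns.filter(_ != "player"): _*)
val renamedCols = col(reordered.columns.head) +:
  reordered.columns.tail.map(name => col(name).as(s"season_high_$name"))
reordered.select(renamedCols: _*).show(false)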
A variant that does not depend on the field position:
scala> val df = Seq(
| ("LeBron", 36, 18, 12),
| ("Kevin", 42, 8, 9),
| ("Russell", 44, 5, 14)).
| toDF("player", "points", "rebounds", "assists")
df: org.apache.spark.sql.DataFrame = [player: string, points: int ... 2 more fields]
scala> val newColumns = df.columns.map {
     |   case "player" => col("player")
     |   case x        => col(x).as(s"season_high_$x")
     | }
newColumns: Array[org.apache.spark.sql.Column] = Array(player, points AS `season_high_points`, rebounds AS `season_high_rebounds`, assists AS `season_high_assists`)
scala> df.select(newColumns:_*).show(false)
+-------+------------------+--------------------+-------------------+
|player |season_high_points|season_high_rebounds|season_high_assists|
+-------+------------------+--------------------+-------------------+
|LeBron |36 |18 |12 |
|Kevin |42 |8 |9 |
|Russell|44 |5 |14 |
+-------+------------------+--------------------+-------------------+
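The pattern match also generalizes cleanly when more than one column should keep its name. A minimal sketch, assuming the names to leave untouched are held in a set called keep (a name invented here):

// Hypothetical: leave every column listed in `keep` unchanged, prefix the rest.
val keep = Set("player")
val newColumns = df.columns.map {
  case name if keep(name) => col(name)
  case name               => col(name).as(s"season_high_$name")
}
df.select(newColumns: _*).show(false)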