How to move values from one column to another in a Spark DataFrame



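I have a DataFrame with 6 columns. I need to assign the values of one column to another: the values in the ROW column should go into the ItemData column. All of these columns are struct types, not plain strings.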

+-----+--------------------+--------------------+-------------------+--------------------------+--------------------+
|index|                 ROW|        Document    |ItemData           | noNamespaceSchemaLocation|                _xsi|
+-----+--------------------+--------------------+-------------------+--------------------------+--------------------+
|    0|[1,1,1018,17.0...   |[[,2001-12-17T09:...|            [,,,,,]|      GetItemMasterSupp...|http://www.w3.org...|
+-----+--------------------+--------------------+-------------------+--------------------------+--------------------+

I tried registering the DataFrame as a temporary table and then swapping the columns, but that did not help.

The final output should look like this:
+--------------------+-------------------+--------------------------+--------------------+
|        Document   |ItemData           | noNamespaceSchemaLocation|                _xsi|
+--------------------+-------------------+--------------------------+--------------------+
|[[,2001-12-17T09:...|  [1,1,1018,17.0...|      GetItemMasterSupp...|http://www.w3.org...|
+--------------------+-------------------+--------------------------+--------------------+

This is the schema from df.printSchema():

root
|-- index: long (nullable = false)
|-- ROW: struct (nullable = true)
|    |-- CLTRP: long (nullable = true)
|    |-- CORP: long (nullable = true)
|    |-- CORP_ITEM_CD: long (nullable = true)
|    |-- CTIV: double (nullable = true)
|    |-- CTLFAC: string (nullable = true)
|    |-- CTLI: long (nullable = true)
|-- DocData: struct (nullable = true)
|    |-- Document: struct (nullable = true)
|    |    |-- AltementID: string (nullable = true)
|    |    |-- Creat: string (nullable = true)
|    |    |-- DataClasion: struct (nullable = true)
|    |    |    |-- BusinessSeel: struct (nullable = true)
|    |    |    |    |-- Code: string (nullable = true)
|    |    |    |    |-- Description: string (nullable = true)
|    |    |    |-- DataCLevel: struct (nullable = true)
|    |    |    |    |-- Code: string (nullable = true)
|    |    |    |    |-- Description: string (nullable = true)
|    |    |    |-- PCaInd: string (nullable = true)
|    |    |    |-- PHtaInd: string (nullable = true)
|    |    |    |-- PPnd: string (nullable = true)
|    |-- DocumentAction: struct (nullable = true)
|    |    |-- ActionTypeCd: string (nullable = true)
|    |    |-- RecordTypeCd: string (nullable = true)
|-- ItemData: struct (nullable = true)
|    |-- CorpCd: string (nullable = true)
|    |-- CorId: string (nullable = true)
|    |-- DepId: string (nullable = true)
|    |-- DisrId: string (nullable = true)
|    |-- DivId: string (nullable = true)
|    |-- WarId: string (nullable = true)
|-- _noNamespaceSchemaLocation: string (nullable = true)
|-- _xsi: string (nullable = true)

**Edit 1:** Updated to show how the DataFrame is created

import com.databricks.spark.xml.XmlReader
import org.apache.spark.sql.functions.monotonically_increasing_id

// XML data reader
val supplyData = "Input_File/SCI_Input.xml"
val booksFileTag1 = "ROWSET"
val dataDF = (new XmlReader()).withRowTag(booksFileTag1).xmlFile(sqlContext, supplyData).toDF()
// Add a generated index column to join on
val dataFrame1 = dataDF.withColumn("index", monotonically_increasing_id())
// XML schema reader
val suppySchema = "Input_File/Supply_sample.xml"
val booksFileTag = "GetItemMaster"
val schemaDf = (new XmlReader()).withRowTag(booksFileTag).xmlFile(sqlContext, suppySchema).toDF()
val dataFrame2 = schemaDf.withColumn("index", monotonically_increasing_id())
// Join the data and schema DataFrames on the generated index
val finalDf = dataFrame1.join(dataFrame2, "index")
finalDf.show()

Output for reference for @JXC
|-- ItemData: struct (nullable = true)
|    |-- CLTRP: long (nullable = true)
|    |-- CORP: long (nullable = true)
|    |-- CORP_ITEM_CD: long (nullable = true)
|    |-- CTIV: double (nullable = true)
|    |-- CTLFAC: string (nullable = true)
|    |-- CTLI: long (nullable = true)

You can simply drop the old ItemData column and then rename the ROW column to ItemData.

There are several ways to rename a column: https://sparkbyexamples.com/rename-a-column-on-spark-dataframes/

Try this:

from pyspark.sql import functions as F
df = df.withColumn("ItemData", F.col("ROW")).drop("ROW")

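First of all, swapping is not the same as renaming (which has already been answered here).

If you want to swap the values of two columns, say col_A and col_B, do the following: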

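import spark.implicits._  // enables the 'col symbol syntax for columns

df.withColumn("col_A_", 'col_B)  // stash col_B's values in a temporary column
  .withColumn("col_B", 'col_A)   // col_B now holds col_A's values
  .withColumn("col_A", 'col_A_)  // col_A now holds the stashed col_B values
  .drop("col_A_")                // remove the temporary column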
