我需要将pyspark DF列类型从数组转换为字符串,还需要删除方括号。这是数据框的架构。需要处理的列是货币代码和TicketAmount
>>> plan_queryDF.printSchema()
root
|-- event_type: string (nullable = true)
|-- publishedDate: string (nullable = true)
|-- plannedCustomerChoiceID: string (nullable = true)
|-- assortedCustomerChoiceID: string (nullable = true)
|-- CurrencyCode: array (nullable = true)
| |-- element: string (containsNull = true)
|-- TicketAmount: array (nullable = true)
| |-- element: string (containsNull = true)
|-- currentPlan: boolean (nullable = true)
|-- originalPlan: boolean (nullable = true)
|-- globalId: string (nullable = true)
|-- PlanJsonData: string (nullable = true)
来自DataFrame的示例数据
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
| event_type| publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan| globalId| PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...| 087d1ff1-5f3a-496...| 2539cc4a-37e5-4f3...| [GBP]| [0]| false| false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...| 0a1af217-d1e8-4ab...| 61bc5fda-0160-484...| [CNY]| [329]| true| false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...| 1028b477-f93e-47f...| c6d5b761-94f2-454...| [JPY]| [3400]| true| false|000576058003|{"httpStatus":200...|
我该怎么做?目前,我正在用Regexp_replace替换正方形括号。但是,当我处理大量数据时,这种方法会失败。
我有其他方法可以做吗?
这就是我想要的。
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
| event_type| publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan| globalId| PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...| 087d1ff1-5f3a-496...| 2539cc4a-37e5-4f3...| GBP| 0| false| false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...| 0a1af217-d1e8-4ab...| 61bc5fda-0160-484...| CNY| 329| true| false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...| 1028b477-f93e-47f...| c6d5b761-94f2-454...| JPY| 3400| true| false|000576058003|{"httpStatus":200...|
您可以尝试getItem(0)
:
df
.withColumn("CurrencyCode", df["CurrencyCode"].getItem(0).cast("string"))
.withColumn("TicketAmount", df["TicketAmount"].getItem(0).cast("string"))
最终演员串是可选的。