I have a DataFrame of the format
+-------------------------------------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------------------------------------+
|datalake-performance/raw/bamboohr/bamboohr_custom_turnover_data/2020/12/10/11:15.csv |
|datalake-performance/raw/gitlab/002429d9-908c-497b-96ba-67794b31f0cd |
|datalake-performance/processed/bamboohr/employee/04-08-2020/16:23.csv |
|datalake-performance/raw/zoom/user/year=2020/month=09/day=22/a329affc-b1f5-45d1-932a-fbb13d9873d6|
+-------------------------------------------------------------------------------------------------+
I want to generate a new column in the following format:
newcol
[bamboohr_custom_turnover_data, 2020, 12, 10]
[]
[employee, 04-08-2020]
[user, year=2020, month=09, day=22]
In Pandas this is done as follows:
df['value'].str.split('/').str[3:-1]
I have tried the following in PySpark, but I get an error:
df = df.withColumn("list", (split(col("value"), "/")))
df.select(slice(df["list"], 3, size(df["list"]) - (3 + 1)))
TypeError: Column is not iterable
How do I get the [3:-1] slice in PySpark?
You can do the slicing with the Spark SQL functions slice and size. Note that Spark SQL arrays are indexed from 1, not 0.
df2 = df.selectExpr("slice(split(value, '/'), 4, size(split(value, '/')) - 4) newcol")
df2.show(truncate=False)
+---------------------------------------------+
|newcol |
+---------------------------------------------+
|[bamboohr_custom_turnover_data, 2020, 12, 10]|
|[] |
|[employee, 04-08-2020] |
|[user, year=2020, month=09, day=22] |
+---------------------------------------------+
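The index arithmetic behind that expression can be checked in plain Python: the 0-based Pandas slice [3:-1] corresponds to a 1-based start of 4 and a length of size - 4. A minimal sketch, where spark_like_slice is a hypothetical helper mimicking Spark's slice for positive starts:

```python
def spark_like_slice(arr, start, length):
    # Mimic Spark SQL's 1-based slice(arr, start, length) for a positive start.
    i = start - 1  # convert 1-based start to a 0-based index
    return arr[i:i + max(length, 0)]

paths = [
    "datalake-performance/raw/bamboohr/bamboohr_custom_turnover_data/2020/12/10/11:15.csv",
    "datalake-performance/raw/gitlab/002429d9-908c-497b-96ba-67794b31f0cd",
]
for p in paths:
    parts = p.split("/")
    # start=4, length=size-4 reproduces Python's parts[3:-1]
    assert spark_like_slice(parts, 4, len(parts) - 4) == parts[3:-1]
```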
The slice function can also take a negative start index to count from the end. You want 4 parts, ignoring the last one, so start at -5 and take 4 (as the output below shows, this only matches [3:-1] when every path has the same number of segments):
from pyspark.sql.functions import col, split, slice
df = df.withColumn("newcol", slice(split(col("value"), "/"), -5, 4))
df.select("newcol").show(truncate=False)
#+---------------------------------------------+
#|newcol |
#+---------------------------------------------+
#|[bamboohr_custom_turnover_data, 2020, 12, 10]|
#|[] |
#|[processed, bamboohr, employee, 04-08-2020] |
#|[user, year=2020, month=09, day=22] |
#+---------------------------------------------+
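The divergence on the third row can be checked in plain Python: slice(arr, -5, 4) starts at the 5th element from the end, which equals arr[3:-1] only for 8-segment paths. Here spark_slice_neg is a hypothetical helper, valid only when the negative start falls inside the array:

```python
def spark_slice_neg(arr, start, length):
    # Mimic Spark SQL slice() with a negative 1-based start counted from the end;
    # only valid when -start <= len(arr).
    i = len(arr) + start
    return arr[i:i + length]

p1 = "datalake-performance/raw/bamboohr/bamboohr_custom_turnover_data/2020/12/10/11:15.csv".split("/")
p3 = "datalake-performance/processed/bamboohr/employee/04-08-2020/16:23.csv".split("/")

assert spark_slice_neg(p1, -5, 4) == p1[3:-1]  # 8 segments: matches [3:-1]
assert spark_slice_neg(p3, -5, 4) != p3[3:-1]  # 6 segments: diverges
```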
You can also try a Python UDF that applies the [3:-1] slice directly -
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

# UDF that splits the path and applies Python's [3:-1] slice
slice_udf = F.udf(lambda v: v.split("/")[3:-1], ArrayType(StringType()))
df_updated = df.withColumn("new value", slice_udf(F.col("value")))
Another reference - here