PySpark: concatenate strings by partition



I have a dataframe like this:

+----+----------+
|id  | device   |
+----+----------+
| 123| phone    |
| 124| phone    |
| 555| phone    |
| 898| tablet   |
| 999| tablet   |
|1111| tv       |
+----+----------+

I'm looking for a new column in which rows sharing the same device value are numbered sequentially:

+----+----------+--------------+
|id  | device   | device_id    |
+----+----------+--------------+
| 123| phone    | phone_00001  |
| 124| phone    | phone_00002  |
| 555| phone    | phone_00003  |
| 898| tablet   | tablet_00001 |
| 999| tablet   | tablet_00002 |
|1111| tv       | tv_00001     |
+----+----------+--------------+

In R, it would look like:

df %>% group_by(device) %>% mutate(device_id = paste0(device, '_', sprintf("%05d", row_number())))

I'm looking for the same thing in PySpark.

Similar to the R approach: assign a row number within each device partition, then use format_string to build the desired output format:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'device_id',
    F.format_string(
        '%s_%05d',
        F.col('device'),
        # number rows within each device partition, ordered by id
        F.row_number().over(Window.partitionBy('device').orderBy('id'))
    )
)
df2.show()
+----+------+------------+
|  id|device|   device_id|
+----+------+------------+
| 123| phone| phone_00001|
| 124| phone| phone_00002|
| 555| phone| phone_00003|
|1111|    tv|    tv_00001|
| 898|tablet|tablet_00001|
| 999|tablet|tablet_00002|
+----+------+------------+
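If you want to sanity-check the numbering logic without a Spark session, here is a minimal plain-Python sketch of the same pattern (per-group row_number ordered by id, zero-padded to five digits). The function name `add_device_id` is mine, not from any library:

```python
from collections import defaultdict

def add_device_id(rows):
    """Append a device_id like 'phone_00001' to each (id, device) row.

    Mirrors row_number() over a partition by device, ordered by id.
    """
    counters = defaultdict(int)  # running row number per device
    out = []
    for id_, device in sorted(rows, key=lambda r: (r[1], r[0])):
        counters[device] += 1
        out.append((id_, device, f"{device}_{counters[device]:05d}"))
    return out

rows = [(123, 'phone'), (124, 'phone'), (555, 'phone'),
        (898, 'tablet'), (999, 'tablet'), (1111, 'tv')]
for r in add_device_id(rows):
    print(r)  # e.g. (123, 'phone', 'phone_00001')
```

This is only for illustration; on real data the Spark window-function version above is what you want, since it runs distributed and avoids collecting rows to the driver.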
