I have a DataFrame like this:
+----+----------+
|id | device |
+----+----------+
| 123| phone |
| 124| phone |
| 555| phone |
| 898| tablet |
| 999| tablet |
|1111| tv |
+----+----------+
I'm looking to add a new column where a sequential, zero-padded id is assigned to rows sharing the same device value:
+----+----------+--------------+
|id | device | device_id |
+----+----------+--------------+
| 123| phone | phone_00001 |
| 124| phone | phone_00002 |
| 555| phone | phone_00003 |
| 898| tablet | tablet_00001 |
| 999| tablet | tablet_00002 |
|1111| tv | tv_00001 |
+----+----------+--------------+
In R it would look like:
df %>% group_by(device) %>% mutate(device_id = paste0(device, '_', sprintf("%05d", row_number())))
I'm looking for the same thing in PySpark.
Similar to the R approach, assign a row number within each device partition and use format_string
to get the desired output format:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'device_id',
    F.format_string(
        '%s_%05d',
        F.col('device'),
        # number rows 1, 2, ... within each device partition, ordered by id
        F.row_number().over(Window.partitionBy('device').orderBy('id'))
    )
)
df2.show()
+----+------+------------+
| id|device| device_id|
+----+------+------------+
| 123| phone| phone_00001|
| 124| phone| phone_00002|
| 555| phone| phone_00003|
|1111| tv| tv_00001|
| 898|tablet|tablet_00001|
| 999|tablet|tablet_00002|
+----+------+------------+
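To see what `row_number().over(Window.partitionBy('device').orderBy('id'))` computes without a running Spark session, here is a minimal plain-Python sketch of the same logic, with the sample data hard-coded (illustration only; in practice Spark does this distributed):

```python
from itertools import groupby

# Sample rows as (id, device), mirroring the question's DataFrame.
rows = [(123, 'phone'), (124, 'phone'), (555, 'phone'),
        (898, 'tablet'), (999, 'tablet'), (1111, 'tv')]

result = []
# Sort by (device, id) so groupby sees one contiguous run per device,
# then number the rows 1..n inside each run -- the same effect as
# row_number() over a window partitioned by device and ordered by id.
for device, group in groupby(sorted(rows, key=lambda r: (r[1], r[0])),
                             key=lambda r: r[1]):
    for n, (id_, dev) in enumerate(group, start=1):
        # '%s_%05d' matches format_string('%s_%05d', ...) above
        result.append((id_, dev, '%s_%05d' % (dev, n)))

for row in result:
    print(row)
```

This produces `phone_00001` through `phone_00003`, `tablet_00001`, `tablet_00002`, and `tv_00001`, matching the desired output.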