我有一个csv日志文件,其中有一列包含由逗号分隔的文件路径列表。我想使用pyspark(或excel)将这些文件路径拆分为新行。原始数据如下:
+----------+----------------------------------------------------------------------------+
|time |message |
+----------+----------------------------------------------------------------------------+
|4-19 20:00|[info] Delete object in ['03-26/abc/123.jpg', '03-26/abc/456.jpg'] |
+----------+----------------------------------------------------------------------------+
|4-19 21:00|[info] Delete object in ['03-27/def/789.jpg', '03-27/def/012.jpg'] |
+----------+----------------------------------------------------------------------------+
我想把它转换成这样:
+-----------------+
|path |
+-----------------+
|03-26/abc/123.jpg|
+-----------------+
|03-26/abc/456.jpg|
+-----------------+
|03-27/def/789.jpg|
+-----------------+
|03-27/def/012.jpg|
+-----------------+
从message
中提取这些路径并解析
from pyspark.sql import functions as F
(df
.withColumn('paths', F.explode(F.from_json(F.regexp_extract('message', '['[^]]+]', 0), 'array<string>')))
.show(10, False)
)
+----------+------------------------------------------------------------------+-----------------+
|time |message |paths |
+----------+------------------------------------------------------------------+-----------------+
|4-19 20:00|[info] Delete object in ['03-26/abc/123.jpg', '03-26/abc/456.jpg']|03-26/abc/123.jpg|
|4-19 20:00|[info] Delete object in ['03-26/abc/123.jpg', '03-26/abc/456.jpg']|03-26/abc/456.jpg|
|4-19 21:00|[info] Delete object in ['03-27/def/789.jpg', '03-27/def/012.jpg']|03-27/def/789.jpg|
|4-19 21:00|[info] Delete object in ['03-27/def/789.jpg', '03-27/def/012.jpg']|03-27/def/012.jpg|
+----------+------------------------------------------------------------------+-----------------+