从列中提取特定字符串,并将它们按顺序放置



我有一个这样的数据帧:

df = [{'id': 1, 'id1': '859A;'},
{'id': 2, 'id1': '209A/229A/509A;'},
{'id': 3, 'id1': '(105A/111A/121A/131A/201A/205A/211A/221A/231A/509A/801A/805A/811A/821A)+TZ+-494;'},
{'id': 4, 'id1': '111A/114A/121A/131A/201A/211A/221A/231A/651A+-(Y05/U17)/801A/804A/821A;'},
{'id': 5, 'id1': '(651A/851A)+U17/861A;'},
]
df = spark.createDataFrame(df)

我想把";id1";列分成两列。一列只需要提取以"0"结尾的字符串;A";并将它们按"/"字符串之间。另一列需要提取剩余的字符串,并将它们放在一个单独的列中,如下所示。

取";id3"id5";以及";id2";例如,期望的输出应该是:

第1列

(105A1,11A,121A,131A/201A,205A,211A,221A,231A/509A/801A,805A,811A,821A)   
(651A/851A,861A)                                                          
(209A,229A/509A)

第2列

+TZ+-494;
+U17;
blank

所有系列都以";1〃;并以";A";应在一组中,用逗号分隔。每一个这样的系列都应该用"/&";。

您最好使用regex。regexp_extract_all在Python API中还没有直接可用,但您可以使用expr来访问它。您还需要几个连续的聚合。

from pyspark.sql import functions as F
cols = df.columns
df = df.withColumn('_vals', F.explode(F.expr(r"regexp_extract_all(id1, '\d+A', 0)")))
df = (df
.groupBy(*cols, F.substring('_vals', 1, 1)).agg(
F.array_join(F.array_sort(F.collect_list('_vals')), ',').alias('_vals')
).groupBy(cols).agg(
F.array_join(F.array_sort(F.collect_list('_vals')), '/').alias('newcolumn1')
).withColumn('newcolumn1', F.format_string('(%s)', 'newcolumn1')
).withColumn('newcolumn2', F.regexp_replace('id1', r'd+A|/|(|)', ''))
)
df.show(truncate=0)
# +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+
# |id |id1                                                                             |newcolumn1                                                             |newcolumn2|
# +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+
# |3  |(105A/111A/121A/131A/201A/205A/211A/221A/231A/509A/801A/805A/811A/821A)+TZ+-494;|(105A,111A,121A,131A/201A,205A,211A,221A,231A/509A/801A,805A,811A,821A)|+TZ+-494; |
# |5  |(651A/851A)+U17/861A;                                                           |(651A/851A,861A)                                                       |+U17;     |
# |2  |209A/229A/509A;                                                                 |(209A,229A/509A)                                                       |;         |
# |4  |111A/114A/121A/131A/201A/211A/221A/231A/651A+-(Y05/U17)/801A/804A/821A;         |(111A,114A,121A,131A/201A,211A,221A,231A/651A/801A,804A,821A)          |+-Y05U17; |
# |1  |859A;                                                                           |(859A)                                                                 |;         |
# +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+

最新更新