Step 1: I created a DataFrame df with two columns, 'Column A' and 'Column B', both of type String.
Step 2: I created new columns from the 'B' column based on their index positions.
My requirement: I also need another column, a6, which is not built from an index position but from whatever matches any of xxx, yyy, or zzz.
val extractedDF = df
  .withColumn("a1", regexp_extract($"_raw", """\[(.*?)\] \[(.*?)\]""", 2))  // 2nd bracketed field
  .withColumn("a2", regexp_extract($"_raw", """\[(.*?)\] \[(.*?)\] \[(.*?)\]""", 3))  // 3rd bracketed field
  .withColumn("a3", regexp_extract($"_raw", """\[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\]""", 5))  // 5th bracketed field
  .withColumn("a4", regexp_extract($"_raw", """(?<=uvwx: )(.*?)(?=,)""", 1))  // value between "uvwx: " and the next comma
  .withColumn("a5", regexp_extract($"_raw", """\[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\] \[(.*?)\]""", 13))  // 13th bracketed field
Please help me!
You can use regexp_replace() with xxx|yyy|zzz as an alternation:
scala> val df = Seq(("abcdef"),("axxx"),("byyypp"),("czzzr")).toDF("_raw")
df: org.apache.spark.sql.DataFrame = [_raw: string]
scala> df.show(false)
+------+
|_raw |
+------+
|abcdef|
|axxx |
|byyypp|
|czzzr |
+------+
scala> df.withColumn("a6",regexp_replace($"_raw",""".*(xxx|yyy|zzz).*""","OK")===lit("OK")).show(false)
+------+-----+
|_raw |a6 |
+------+-----+
|abcdef|false|
|axxx |true |
|byyypp|true |
|czzzr |true |
+------+-----+
scala>
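As a side note, if all you need is the true/false flag, a plain regex test with rlike may read more directly than the replace-and-compare above. A minimal sketch, assuming the same df:

// rlike is true when the value matches the regex anywhere in the string
df.withColumn("a6", $"_raw".rlike("xxx|yyy|zzz")).show(false)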
If you want to extract the matched value instead:
scala> df.withColumn("a6",regexp_extract($"_raw",""".*(xxx|yyy|zzz).*""",1)).show(false)
+------+---+
|_raw |a6 |
+------+---+
|abcdef| |
|axxx |xxx|
|byyypp|yyy|
|czzzr |zzz|
+------+---+
scala>
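Note that regexp_extract returns an empty string when nothing matches (the abcdef row above). If you would rather see null there, one possible sketch, again assuming the same df:

import org.apache.spark.sql.functions.{regexp_extract, when}

// re-run the extraction, then map the empty-string "no match" result to null
df.withColumn("a6", regexp_extract($"_raw", """.*(xxx|yyy|zzz).*""", 1))
  .withColumn("a6", when($"a6" =!= "", $"a6"))  // no otherwise() => null when empty
  .show(false)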
edit1:
scala> val df = Seq((""" [2019-03-18T02:13:20.988-05:00] [svc4_prod2_bpel_ms14] [NOTIFICATION] [] [oracle.soa.mediator.serviceEngine] [tid: [ACTIVE].ExecuteThread: '57' for queue: 'weblogic.kernel.Default (self-tuning)'] [userId: <anonymous>] [ecid: 7e05e8d3-8d20-475f-a414-cb3295151c3e-0054c6b8,1:84559] [APP: soa-infra] [partition-name: DOMAIN] [tenant-name: GLOBAL] [oracle.soa.tracking.FlowId: 14436421] [oracle.soa.tracking.InstanceId: 363460793] [oracle.soa.tracking.SCAEntityId: 50139] [composite_name: DFOLOutputRouting] """)).toDF("_raw")
df: org.apache.spark.sql.DataFrame = [_raw: string]
scala> df.withColumn("a6",regexp_extract($"_raw",""".*(composite_name|compositename|composites|componentDN):s+(S+)]""",2)).select("a6").show(false)
+-----------------+
|a6 |
+-----------------+
|DFOLOutputRouting|
+-----------------+
scala>
edit2:
scala> val df = Seq((""" [2019-03-18T02:13:20.988-05:00] [svc4_prod2_bpel_ms14] [NOTIFICATION] [] [oracle.soa.mediator.serviceEngine] [tid: [ACTIVE].ExecuteThread: '57' for queue: 'weblogic.kernel.Default (self-tuning)'] [userId: <anonymous>] [ecid: 7e05e8d3-8d20-475f-a414-cb3295151c3e-0054c6b8,1:84559] [APP: soa-infra] [partition-name: DOMAIN] [tenant-name: GLOBAL] [oracle.soa.tracking.FlowId: 14436421] [oracle.soa.tracking.InstanceId: 363460793] [oracle.soa.tracking.SCAEntityId: 50139] [composite_name: DFOLOutputRouting!3.20.0202.190103.1116_19] """)).toDF("_raw")
df: org.apache.spark.sql.DataFrame = [_raw: string]
scala> df.withColumn("a6",regexp_extract($"_raw",""".*(composite_name|compositename|composites|componentDN):s+([a-zA-Z]+)""",2)).select("a6").show(false)
+-----------------+
|a6 |
+-----------------+
|DFOLOutputRouting|
+-----------------+
scala>
If I understand correctly, you just want the result of matching against the strings above; you can use the following code:
df.withColumn("a6", col("colname").contains("yyy") || col("colname").contains("xxx"))