如何从字符串列中提取数字



我的要求是从comment列中的注释列中检索订单号,始终以R开头。订单号应作为新列添加到表中。

输入数据:

code,id,mode,location,status,comment 
AS-SD,101,Airways,hyderabad,D,order got delayed R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged
TY-OP,103,Airways,Pune,D,Order number R5463 not received

预期输出:

AS-SD,101,Airways,hyderabad,D,order got delayed R1657,R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged,R7856 
TY-OP,103,Airways,Pune,D,Order number R5463 not received,R5463 

我已经在Spark-SQL中尝试过,我正在使用的查询如下:

val r = sqlContext.sql("select substring(comment, PatIndex('%[0-9]%',comment, length(comment))) as number from A")

但是,我会收到以下错误:

org.apache.spark.sql.AnalysisException: undefined function PatIndex; line 0 pos 0

您可以使用具有定义的regexp_extract

def regexp_extract(e: Column, exp: String, groupIdx: Int): Column

(R\d{4})表示R,其次是4位。您可以通过使用有效的正则

轻松容纳任何其他情况
df.withColumn("orderId", regexp_extract($"comment", "(R\d{4})" , 1 )).show
+-----+---+-------+---------+------+--------------------+-------+
| code| id|   mode| location|status|             comment|orderId|
+-----+---+-------+---------+------+--------------------+-------+
|AS-SD|101|Airways|hyderabad|     D|order got delayed...|  R1657|
|FY-YT|102|Airways|    Delhi|    ND|R7856 package dam...|  R7856|
|TY-OP|103|Airways|     Pune|     D|Order number R546...|  R5463|
+-----+---+-------+---------+------+--------------------+-------+

您可以使用udf函数如下

import org.apache.spark.sql.functions._
def extractString = udf((comment: String) => comment.split(" ").filter(_.startsWith("R")).head)
df.withColumn("newColumn", extractString($"comment")).show(false)

comment列是 split ted,ted ted space filter r 开头的单词。head将采用从R开始过滤的第一个单词。

更新

要确保返回的字符串为订单号从r开始,其余字符串为 digits ,您可以添加其他filter

import scala.util.Try
def extractString = udf((comment: String) => comment.split(" ").filter(x => x.startsWith("R") && Try(x.substring(1).toDouble).isSuccess).head)

您可以根据需要编辑filter

相关内容

  • 没有找到相关文章

最新更新