I have a DataFrame with the following columns:
df =
itemType count
it_shampoo 5
it_books 5
it_mm 5
{it_mm} 5
it_books it_books 5
{=it_books} it_books 5
I need to get:
itemType count
it_shampoo 5
it_books 5
it_mm 5
it_mm 5
it_books 5
it_books 5
How can I replace values such as "it_books it_books" and "{=it_books} it_books" with just it_books? The item type always starts with it_.
Try the regular expression ^.*?(it_[\w]+).*$ on the itemType column and replace each match with the first captured group, $1.
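As a quick sanity check outside Spark, the suggested pattern can be tried against the sample values in plain Scala. This is only a minimal sketch: the object name ItemTypeRegexCheck and the helper normalize are made up for illustration, and String.replaceAll stands in for Spark's regexp_replace (both take Java regex syntax and accept $1 as a reference to the first group).

```scala
object ItemTypeRegexCheck {
  // Lazily skip any prefix, capture the first "it_" token, then consume the rest.
  val pattern: String = """^.*?(it_[\w]+).*$"""

  // Hypothetical helper: replaces the whole string with the captured token.
  // A string with no "it_" token does not match and is returned unchanged.
  def normalize(itemType: String): String =
    itemType.replaceAll(pattern, "$1")

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "it_shampoo",
      "it_books",
      "it_mm",
      "{it_mm}",
      "it_books it_books",
      "{=it_books} it_books"
    )
    samples.foreach(s => println(s"$s -> ${normalize(s)}"))
  }
}
```

In Spark the same pattern string would be passed to regexp_replace on the itemType column, with "$1" as the replacement.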
The following regular expression also works:
scala> val df = Seq(("it_shampoo",5),
| ("it_books",5),
| ("it_mm",5),
| ("{it_mm}",5),
| ("it_books it_books",5),
| ("{=it_books} it_books",5)).toDF("itemType","count")
df: org.apache.spark.sql.DataFrame = [itemType: string, count: int]
scala> df.select( regexp_replace('itemType,""".*\b(\S+)\b(.*)$""", "$1").as("replaced"),'count).show
+----------+-----+
|  replaced|count|
+----------+-----+
|it_shampoo|    5|
|  it_books|    5|
|     it_mm|    5|
|     it_mm|    5|
|  it_books|    5|
|  it_books|    5|
+----------+-----+
scala>
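The two patterns agree on the sample data but select different things: the first keeps the first token starting with it_, while the second keeps the last whitespace-delimited token that is bounded by word boundaries. The sketch below, in plain Scala with String.replaceAll standing in for regexp_replace, shows the difference on a hypothetical input that is not part of the question's data. The names RegexComparison, firstItem and lastToken are invented for illustration.

```scala
object RegexComparison {
  // Pattern from the first answer: first "it_" token in the string.
  def firstItem(s: String): String =
    s.replaceAll("""^.*?(it_[\w]+).*$""", "$1")

  // Pattern from the second answer: the greedy prefix pushes the capture
  // to the last word-bounded run of non-space characters.
  def lastToken(s: String): String =
    s.replaceAll(""".*\b(\S+)\b(.*)$""", "$1")

  def main(args: Array[String]): Unit = {
    // Hypothetical value with trailing text that is not an item type.
    val input = "it_books extra"
    println(firstItem(input)) // it_books
    println(lastToken(input)) // extra
  }
}
```

If values can carry trailing text that is not an item type, the it_-anchored pattern is probably the safer choice; for data shaped like the question's, either should do.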