How do I extract the value after a specific string in Scala (Spark)?



I have a DataFrame with the following columns:

df =

itemType                   count
it_shampoo                  5
it_books                    5
it_mm                       5
{it_mm}                     5
it_books it_books           5
{=it_books} it_books        5

I need to get:

itemType                   count
it_shampoo                  5
it_books                    5
it_mm                       5
it_mm                       5
it_books                    5
it_books                    5

How can I replace "it_books it_books" and "{=it_books} it_books" with "it_books"? The item type always starts with it_.

Try the regular expression ^.*?(it_[\w]+).*$ on itemType and replace the match with the first captured group, $1.
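As a minimal sketch of that suggestion, the pattern can be checked in plain Scala before wiring it into Spark. The object and helper names below (ItemTypeExtract, firstItemType) are hypothetical and only for illustration; String.replaceAll uses the same java.util.regex engine as Spark's regexp_replace.

```scala
object ItemTypeExtract {
  // Lazily skip any prefix, capture the first token that starts with "it_",
  // then consume the rest of the string so the whole value is replaced.
  val FirstItemType: String = """^.*?(it_\w+).*$"""

  // Hypothetical helper: returns the first it_... token,
  // or the input unchanged when no such token is present.
  def firstItemType(s: String): String = s.replaceAll(FirstItemType, "$1")
}
```

In Spark the same pattern string would presumably be passed as regexp_replace(col("itemType"), pattern, "$1"), with col and regexp_replace imported from org.apache.spark.sql.functions.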


The following regular expression also works:

scala> val df = Seq(("it_shampoo",5),
| ("it_books",5),
| ("it_mm",5),
| ("{it_mm}",5),
| ("it_books it_books",5),
| ("{=it_books} it_books",5)).toDF("itemType","count")
df: org.apache.spark.sql.DataFrame = [itemType: string, count: int]
scala> df.select( regexp_replace('itemType,""".*\b(\S+)\b(.*)$""", "$1").as("replaced"),'count).show
+----------+-----+
|  replaced|count|
+----------+-----+
|it_shampoo|    5|
|  it_books|    5|
|     it_mm|    5|
|     it_mm|    5|
|  it_books|    5|
|  it_books|    5|
+----------+-----+

scala>
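One point worth keeping in mind is that this second pattern keeps the last word-bounded token in the value, whether or not it starts with it_. The sketch below checks that behaviour in plain Scala; the names LastTokenCheck and lastToken are hypothetical, and the input "it_books foo" is a made-up value that does not appear in the question's data.

```scala
object LastTokenCheck {
  // Greedy .* backtracks to the last position where a run of
  // non-whitespace characters is delimited by word boundaries.
  val LastToken: String = """.*\b(\S+)\b(.*)$"""

  // Hypothetical helper: replaces the whole value with that last token.
  def lastToken(s: String): String = s.replaceAll(LastToken, "$1")
}
```

For the sample rows this gives the same result as the it_ based pattern, but for a value such as "it_books foo" it would return "foo", so the it_ based pattern may be the safer choice if other words can follow the item type.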
