使用spark.sql parse_URL()从包含大括号或管道的URL中提取HOST



我需要从数百万个URL中提取HOST。有些URL格式不正确,返回NULL。在许多情况下,我会看到大括号({}(或管道(|(导致问题,而在其他情况下,则会看到多个哈希(#(字符导致问题。

以下是我的代码,其中包含我需要解析的URL:

val b = Seq(
("https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}"),
("https://example.com/test.aspx?names=John|Peter"),
("https://example.com/#/test.aspx?help=John#top"),
("https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12"),
).toDF("url_col")
b.createOrReplaceTempView("temp")
spark.sql("SELECT parse_url(`url_col`, 'HOST') as HOST, url_col from temp").show(false)

预期输出:

+-----------+------------------------------------------------------------------------+
|HOST       |url_col                                                                 |
+-----------+------------------------------------------------------------------------+
|example.com|https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
|example.com|https://example.com/test.aspx?names=John|Peter                          |
|example.com|https://example.com/#/test.aspx?help=John#top                           |
|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12  |
+-----------+------------------------------------------------------------------------+

电流输出:

+-----------+------------------------------------------------------------------------+
|HOST       |url_col                                                                 |
+-----------+------------------------------------------------------------------------+
|null       |https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
|null       |https://example.com/test.aspx?names=John|Peter                          |
|null       |https://example.com/#/test.aspx?help=John#top                           |
|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12  |
+-----------+------------------------------------------------------------------------+

当url包含无效字符或格式错误时,有没有办法强制parse_url返回主机?或者有更好的方法吗?

您可以使用regexp_extract函数(正则表达式的示例(提取域:

spark.sql("""
SELECT  regexp_extract(url_col, "^(?:https?://)?(?:[^@n]+@)?(?:www.)?([^:/n?]+)", 1) as HOST, 
url_col 
FROM  temp
""").show(false)
//+-----------+------------------------------------------------------------------------+
//|HOST       |url_col                                                                 |
//+-----------+------------------------------------------------------------------------+
//|example.com|https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
//|example.com|https://example.com/test.aspx?names=John|Peter                          |
//|example.com|https://example.com/#/test.aspx?help=John#top                           |
//|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12  |
//+-----------+------------------------------------------------------------------------+

最新更新