Extracting the domain from a URL using Scala



I am trying to extract the domain from a URL.

Input:

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
("example.buzz"),
("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
var c = b.withColumn("host", callUDF("parse_url", $"raw_url", lit("HOST"))).show()

Expected result:

+--------------------------------+---------------+
| raw_url                        | host          |
+--------------------------------+---------------+
| subdomain.example.com/test.php | example.com   |
| example.com                    | example.com   | 
| example.buzz                   | example.buzz  |
| test.example.buzz              | example.buzz  |
| subdomain.example.co.uk        | example.co.uk |
+--------------------------------+---------------+

Any suggestions would be greatly appreciated.

Edit: Following a hint from @AlexOtt, I got a few steps closer.

import com.google.common.net.InternetDomainName
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
("example.buzz"),
("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()

However, I am clearly not using it correctly with withColumn. The error is:

error: not found: value topPrivateDomain
var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()

Edit 2:

With some good pointers from @sarveshseri, and after cleaning up a few syntax errors, the following code removes the subdomain from most URLs.

import org.apache.spark.sql.functions.udf
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import com.google.common.net.InternetDomainName
import java.net.URL
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
//("example.buzz"),
//("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
val hostExtractUdf = org.apache.spark.sql.functions.udf {
  (urlString: String) =>
    val url = new URL("https://" + urlString)
    val host = url.getHost
    InternetDomainName.from(host).topPrivateDomain().name()
}
var c = b.select("raw_url").withColumn("HOST",
  hostExtractUdf(col("raw_url")))
  .show(false)

However, it still does not work as expected. Newer suffixes such as .buzz, .site, and .today cause the following error:

Caused by: java.lang.IllegalStateException: Not under a public suffix: example.buzz
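
As an aside on the host-extraction step used in the UDF above: the Java URI/URL classes only report a host when the string carries a scheme, which is why "https://" is prepended to the raw value. The following is a minimal stdlib-only sketch of that step. hostOf and HostSketch are hypothetical names introduced here for illustration, and java.net.URI is used in place of java.net.URL so that malformed input can be turned into None.

```scala
import java.net.URI
import scala.util.Try

object HostSketch {
  // hostOf is a hypothetical helper, not part of Spark or Guava.
  // Assumption: inputs are either full URLs or scheme-less "host/path" strings.
  def hostOf(raw: String): Option[String] = {
    // Without a scheme, URI treats the whole string as a relative path and getHost is null.
    val withScheme = if (raw.contains("://")) raw else "https://" + raw
    Try(new URI(withScheme).getHost).toOption // a parse failure becomes None
      .flatMap(Option(_))                      // getHost may return null
      .map(_.toLowerCase)
  }
}
```

Returning an Option instead of throwing keeps a single malformed row from failing the whole job if a helper like this is wrapped in a UDF.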

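First, add a recent Guava to the dependencies in build.sbt. The error most likely comes from an older Guava on the classpath (for example the one shipped with Spark), whose built-in public suffix list predates newer TLDs such as .buzz.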

libraryDependencies += "com.google.guava" % "guava" % "31.0.1-jre"

Now you can extract the host as follows:

import com.google.common.net.InternetDomainName
import org.apache.spark.sql.functions._
import java.net.URL
import spark.implicits._ // `spark` is the SparkSession provided by spark-shell

// Prepend a scheme so java.net.URL can parse the host,
// then reduce the host to its top private (registrable) domain.
val hostExtractUdf = udf { (urlString: String) =>
  val url = new URL("https://" + urlString)
  val host = url.getHost
  InternetDomainName.from(host).topPrivateDomain().toString
}
val b = sc.parallelize(Seq(
("a.b.com/c.php"),
("a.b.site/c.php"),
("a.b.buzz/c.php"),
("a.b.today/c.php"),
("b.com"),
("b.site"),
("b.buzz"),
("b.today"),
("a.b.buzz"),
("a.b.co.uk"),
("a.b.site")
)).toDF("raw_url")
val c = b.withColumn("HOST", hostExtractUdf(col("raw_url")))
c.show()

Output of c.show():

+---------------+-------+
|        raw_url|   HOST|
+---------------+-------+
|  a.b.com/c.php|  b.com|
| a.b.site/c.php| b.site|
| a.b.buzz/c.php| b.buzz|
|a.b.today/c.php|b.today|
|          b.com|  b.com|
|         b.site| b.site|
|         b.buzz| b.buzz|
|        b.today|b.today|
|       a.b.buzz| b.buzz|
|      a.b.co.uk|b.co.uk|
|       a.b.site| b.site|
+---------------+-------+
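
To illustrate what a public-suffix lookup does conceptually, here is a rough stdlib-only sketch: find the longest known suffix of the host and keep one extra label to its left. It is not Guava's implementation. RegistrableDomainSketch, registrableDomain, and sampleSuffixes are hypothetical names, and the suffix set is a tiny hand-picked sample rather than the real Public Suffix List.

```scala
object RegistrableDomainSketch {
  // ASSUMPTION: a tiny illustrative sample; the real Public Suffix List has thousands of entries.
  val sampleSuffixes: Set[String] = Set("com", "uk", "co.uk", "buzz", "site", "today")

  // Returns the longest matching suffix plus one label, or None if the host
  // is itself a suffix or sits under no known suffix.
  def registrableDomain(host: String, suffixes: Set[String] = sampleSuffixes): Option[String] = {
    val labels = host.toLowerCase.split('.').toVector
    if (suffixes.contains(labels.mkString("."))) None
    else
      (1 until labels.length).iterator          // i = number of labels dropped from the left
        .map(i => (i, labels.drop(i).mkString(".")))
        .find { case (_, suffix) => suffixes.contains(suffix) } // first hit is the longest suffix
        .map { case (i, _) => labels.drop(i - 1).mkString(".") }
  }
}
```

With a suffix set that lacks "buzz", registrableDomain("example.buzz", Set("com")) yields None, which is the analogue of the "Not under a public suffix" error in the question.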

You could perhaps use a regex with Spark's built-in regexp_extract and regexp_replace functions. Here is an example:

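val c = b.withColumn(
  "HOST",
  // Step 1: full hostname, skipping an optional scheme, user info, and "www."
  regexp_extract(col("raw_url"), raw"^(?:https?://)?(?:[^@\n]+@)?(?:www\.)?([^:/\n?]+)", 1)
).withColumn(
  "sub_domain",
  // Step 2: the leading label, if at least two more dot-separated parts follow it
  regexp_extract(col("HOST"), raw"(.*?)\.(?=[^/]*\..{2,5})/?.*", 1)
).withColumn(
  "HOST",
  // Step 3: remove the subdomain and the dot left behind
  expr("trim(LEADING '.' FROM regexp_replace(HOST, sub_domain, ''))")
).drop("sub_domain")
c.show(false)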
//+-----------------------------------+-------------+
//|raw_url                            |HOST         |
//+-----------------------------------+-------------+
//|subdomain.example.com/test.php     |example.com  |
//|example.com                        |example.com  |
//|example.buzz                       |example.buzz |
//|test.example.buzz                  |example.buzz |
//|https://www.subdomain.example.co.uk|example.co.uk|
//|subdomain.domain.buzz              |domain.buzz  |
//|dev.example.today                  |example.today|
//+-----------------------------------+-------------+

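The first regexp_extract pulls the full hostname (including any subdomain) out of the URL. Then, using the regex from this answer, we find the subdomain and replace it with an empty string.

This has not been tested against every possible case, but it works for the examples given in your question.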
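
The same two-step idea can be tried outside Spark to probe edge cases quickly. The sketch below is an approximation in plain Scala, with the dots and newlines in the patterns escaped as \. and \n. RegexDomainSketch and domainOf are hypothetical names, and the subdomain is removed with a literal stripPrefix rather than a regex replace, which is a deliberate deviation from the Spark expression.

```scala
import scala.util.matching.Regex

object RegexDomainSketch {
  // Hostname: skip an optional scheme, user info, and "www.", then capture up to ':', '/', or '?'.
  private val hostPattern: Regex = raw"^(?:https?://)?(?:[^@\n]+@)?(?:www\.)?([^:/\n?]+)".r
  // Leading label, matched only if at least two more dot-separated parts follow it.
  private val subdomainPattern: Regex = raw"(.*?)\.(?=[^/]*\..{2,5})/?.*".r

  // Mimics regexp_extract: group 1 of the first match, or "" when nothing matches.
  private def extract(re: Regex, s: String): String =
    re.findFirstMatchIn(s).map(_.group(1)).getOrElse("")

  def domainOf(rawUrl: String): String = {
    val host = extract(hostPattern, rawUrl)
    val sub = extract(subdomainPattern, host)
    host.stripPrefix(sub).dropWhile(_ == '.') // drop the subdomain and the dot left behind
  }
}
```

One limitation shows up immediately: a bare host such as example.co.uk, with no subdomain, is reduced to co.uk, because the pattern cannot distinguish a multi-part suffix from a subdomain. A public-suffix lookup handles that case, which a pure regex does not.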
