I am trying to extract the domain from a URL.
Input:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
("example.buzz"),
("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
var c = b.withColumn("host", callUDF("parse_url", $"raw_url", lit("HOST"))).show()
Expected result:
+--------------------------------+---------------+
| raw_url | host |
+--------------------------------+---------------+
| subdomain.example.com/test.php | example.com |
| example.com | example.com |
| example.buzz | example.buzz |
| test.example.buzz | example.buzz |
| subdomain.example.co.uk | example.co.uk |
+--------------------------------+---------------+
Any suggestions are much appreciated.
Edit: Following a hint from @AlexOtt, I have gotten a few steps closer.
import com.google.common.net.InternetDomainName
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
("example.buzz"),
("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()
However, I am clearly not using it correctly with withColumn. The error is:
error: not found: value topPrivateDomain
var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()
Edit 2:
I got some good pointers from @sarveshseri, and after cleaning up a few syntax errors, the following code removes the subdomain from most URLs.
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import com.google.common.net.InternetDomainName
import java.net.URL
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
//("example.buzz"),
//("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
val hostExtractUdf = org.apache.spark.sql.functions.udf {
(urlString: String) =>
val url = new URL("https://" + urlString)
val host = url.getHost
InternetDomainName.from(host).topPrivateDomain().name()
}
var c = b.select("raw_url").withColumn("HOST",
hostExtractUdf(col("raw_url")))
.show(false)
However, it still does not work as expected. Newer suffixes such as .buzz, .site, and .today cause the following error:
Caused by: java.lang.IllegalStateException: Not under a public suffix: example.buzz
First, you need to add guava to the dependencies in build.sbt.
libraryDependencies += "com.google.guava" % "guava" % "31.0.1-jre"
Now you can extract the host as follows:
import com.google.common.net.InternetDomainName
import org.apache.sedona.core.serde.SedonaKryoRegistrator
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import java.net.URL
import spark.implicits._
val hostExtractUdf = org.apache.spark.sql.functions.udf { (urlString: String) =>
val url = new URL("https://" + urlString)
val host = url.getHost
InternetDomainName.from(host).topPrivateDomain().toString
}
val b = sc.parallelize(Seq(
("a.b.com/c.php"),
("a.b.site/c.php"),
("a.b.buzz/c.php"),
("a.b.today/c.php"),
("b.com"),
("b.site"),
("b.buzz"),
("b.today"),
("a.b.buzz"),
("a.b.co.uk"),
("a.b.site")
)).toDF("raw_url")
val c = b.withColumn("HOST", hostExtractUdf(col("raw_url")))
c.show()
Output:
+---------------+-------+
| raw_url| HOST|
+---------------+-------+
| a.b.com/c.php| b.com|
| a.b.site/c.php| b.site|
| a.b.buzz/c.php| b.buzz|
|a.b.today/c.php|b.today|
| b.com| b.com|
| b.site| b.site|
| b.buzz| b.buzz|
| b.today|b.today|
| a.b.buzz| b.buzz|
| a.b.co.uk|b.co.uk|
| a.b.site| b.site|
+---------------+-------+
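The UDF above throws an IllegalStateException as soon as one row has a host that Guava does not recognise as being under a public suffix. A minimal sketch of a defensive variant follows, assuming the same Guava dependency is on the classpath. The object name SafeDomain and the choice to return None for such rows are illustrative assumptions, not part of the original answer.

```scala
import java.net.URL
import scala.util.Try
import com.google.common.net.InternetDomainName

// Hypothetical helper: the name and the Option-based return type are assumptions.
object SafeDomain {
  // Returns None instead of throwing when the URL cannot be parsed, the host is
  // not a syntactically valid domain name, or the host is not under a public
  // suffix known to the Guava version in use.
  def topPrivateDomain(rawUrl: String): Option[String] =
    Try(new URL("https://" + rawUrl).getHost).toOption
      .filter(host => InternetDomainName.isValid(host))
      .map(host => InternetDomainName.from(host))
      .filter(_.isUnderPublicSuffix)
      .map(_.topPrivateDomain().toString)
}
```

If this is wrapped with udf in the same way as hostExtractUdf, rows that cannot be resolved should come back as null in the HOST column rather than failing the whole job.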
Perhaps you could use a regex with Spark's built-in regexp_extract and regexp_replace functions. Here is an example:
val c = b.withColumn(
"HOST",
regexp_extract(col("raw_url"), raw"^(?:https?://)?(?:[^@\n]+@)?(?:www\.)?([^:/\n?]+)", 1)
).withColumn(
"sub_domain",
regexp_extract(col("HOST"), raw"(.*?)\.(?=[^/]*\..{2,5})/?.*", 1)
).withColumn(
"HOST",
expr("trim(LEADING '.' FROM regexp_replace(HOST, sub_domain, ''))")
).drop("sub_domain")
c.show(false)
//+-----------------------------------+-------------+
//|raw_url |HOST |
//+-----------------------------------+-------------+
//|subdomain.example.com/test.php |example.com |
//|example.com |example.com |
//|example.buzz |example.buzz |
//|test.example.buzz |example.buzz |
//|https://www.subdomain.example.co.uk|example.co.uk|
//|subdomain.domain.buzz |domain.buzz |
//|dev.example.today |example.today|
//+-----------------------------------+-------------+
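The first step extracts the full hostname from the URL (including the subdomain). Then, using the regex from this answer, we search for the subdomain and replace it with an empty string.
This has not been tested against every possible case, but it works for the examples given in your question.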
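To check the two regex steps quickly without starting a Spark session, they can be tried in plain Scala. This is only a sketch: the object and method names are made up for illustration, and it strips the subdomain as a literal prefix instead of calling regexp_replace, so it is a simplification of the Spark expression, not an exact equivalent.

```scala
// Hypothetical helper for experimenting with the two patterns outside Spark.
object RegexDomainSketch {
  // Step 1: capture the host, skipping an optional scheme, user-info part, and "www." prefix.
  private val HostPattern = raw"^(?:https?://)?(?:[^@\n]+@)?(?:www\.)?([^:/\n?]+)".r
  // Step 2: capture the leading label(s) that are still followed by at least two dot-separated parts.
  private val SubdomainPattern = raw"(.*?)\.(?=[^/]*\..{2,5})/?.*".r

  def registeredDomain(rawUrl: String): String = {
    val host = HostPattern.findFirstMatchIn(rawUrl).map(_.group(1)).getOrElse("")
    val sub = SubdomainPattern.findFirstMatchIn(host).map(_.group(1)).getOrElse("")
    // Simplification: drop the subdomain as a literal prefix, then the dot that followed it.
    if (sub.isEmpty) host else host.stripPrefix(sub).stripPrefix(".")
  }
}

val samples = Seq(
  "subdomain.example.com/test.php",
  "example.com",
  "test.example.buzz",
  "https://www.subdomain.example.co.uk"
)
samples.foreach(s => println(s + " -> " + RegexDomainSketch.registeredDomain(s)))
```

Because the second pattern only looks at the shape of the host, a bare host such as example.co.uk is reduced to co.uk, which is one of the untested cases the caveat above refers to.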