I have an RDD containing an edge list, comma-separated (source_url,destination_url). I need to extract the source host from source_url. I tried the following code:
val edges = links.flatMap { case (src, dst) =>
  if (!src.startsWith("http://") || !src.startsWith("https://")) {
    val src_url = "http://" + src
    val url = new java.net.URL(src_url)
    val uri = url.getHost
    scala.util.Try {
      Some(uri, dst)
    }.getOrElse(None)
  } else {
    val src_url = src
    val url = new java.net.URL(src_url)
    val uri = url.getHost
    scala.util.Try {
      Some(uri, dst)
    }.getOrElse(None)
  }
}
Sample input:
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/content.php/coID/299/kundenmeinungen.html
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/weingueter
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/filter/cID/10/country/suedafrika.137.html
Desired output:
www.belvini.de,http://www.belvini.de/content.php/coID/299/kundenmeinungen.html
www.belvini.de,http://www.belvini.de/weingueter
www.belvini.de,http://www.belvini.de/filter/cID/10/country/suedafrika.137.html
When I run the code, I get an exception:
Job aborted due to stage failure: Task 935 in stage 3.0 failed 4 times, most recent failure: Lost task 935.3 in stage 3.0 (TID 1883, node27.ib, executor 248):
java.net.MalformedURLException: For input string: "RC-a-shops.de"
at java.net.URL.<init>(URL.java:627)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
The RDD has about 1 million edges, and I am running this on a cluster. Can anyone suggest how to get rid of this exception?
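Edit: the question was edited so that the MalformedURLException now shows a URL that looks well-formed. My answer stands regardless. The documentation for URL indicates that it throws MalformedURLException only when the URL is invalid in some way. More complete output would help in debugging this.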
MalformedURLException - if no protocol is specified, or an unknown protocol is found, or spec is null.
It looks like your src does not include a protocol for the URL. You need http://whatever.com/nlp-agm.php, not just nlp-agm.php.
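To make the point above concrete, here is a minimal sketch that parses both strings from this answer. The object and helper names (`ProtocolCheck`, `hostOf`) are made up for illustration; only `java.net.URL` and `scala.util.Try` come from the standard libraries.

```scala
import java.net.URL
import scala.util.Try

object ProtocolCheck {
  // Returns the host if the string parses as a URL, None if the constructor throws.
  def hostOf(s: String): Option[String] = Try(new URL(s).getHost).toOption

  def main(args: Array[String]): Unit = {
    println(hostOf("nlp-agm.php"))                     // no protocol, so parsing fails
    println(hostOf("http://whatever.com/nlp-agm.php")) // protocol present, host is whatever.com
  }
}
```

Wrapping the constructor call itself in `Try` is what turns the exception into a `None`.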
A URL must have the form
<scheme>://<authority><path>?<query>#<fragment>
The <scheme> is required. If the scheme is invalid or missing, new java.net.URL throws MalformedURLException. See: https://docs.oracle.com/javase/7/docs/api/java/net/URL.html#URL(java.lang.String)
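Applied to the code in the question, two things seem worth changing. The condition `!src.startsWith("http://") || !src.startsWith("https://")` is true for every string, so "http://" is always prepended; `&&` is probably what was intended. Also, the `Try` there only wraps `Some(uri, dst)`, which cannot throw, so the exception from the constructor escapes and fails the task. The following is a rough sketch, not a definitive fix: a plain `Seq` stands in for the RDD so the example runs without Spark, the names `SourceHosts`, `withScheme` and `sourceHost` are hypothetical, and the input `foo:RC-a-shops.de` is a made-up value (the real offending record is not shown in the question) in which the text after the colon is not a valid port.

```scala
import java.net.URL
import scala.util.Try

object SourceHosts {
  // Prepend "http://" only when neither scheme is present (note && rather than ||).
  def withScheme(src: String): String =
    if (!src.startsWith("http://") && !src.startsWith("https://")) "http://" + src
    else src

  // The URL constructor runs inside Try, so a malformed source yields None.
  def sourceHost(src: String): Option[String] =
    Try(new URL(withScheme(src)).getHost).toOption.filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    // A plain Seq stands in for the RDD of (src, dst) pairs.
    val links = Seq(
      ("http://www.belvini.de/weingut/mID/2530/max-markert.html", "http://www.belvini.de/weingueter"),
      ("foo:RC-a-shops.de", "http://www.belvini.de/weingueter") // hypothetical malformed source
    )
    // Edges whose source cannot be parsed are dropped instead of failing the job.
    val edges = links.flatMap { case (src, dst) => sourceHost(src).map(host => (host, dst)) }
    edges.foreach { case (host, dst) => println(s"$host,$dst") }
  }
}
```

The same `flatMap` body should carry over to an RDD of pairs. Whether silently dropping malformed edges is acceptable depends on the job; counting or logging them first may be the safer choice.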
java.net.MalformedURLException is also thrown when the string itself contains quotation marks:
new URL("\"http:www.example.com\"")
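If quoted values are a possibility in the input, one option is to strip the surrounding quotes before parsing. This is only a sketch under that assumption; `QuotedUrl` and `unquote` are illustrative names, and the sample string is invented.

```scala
import java.net.URL
import scala.util.Try

object QuotedUrl {
  // Remove surrounding whitespace, then one leading and one trailing double quote if present.
  def unquote(s: String): String = s.trim.stripPrefix("\"").stripSuffix("\"")

  def main(args: Array[String]): Unit = {
    val raw = "\"http://www.example.com\""   // the quotes are part of the string
    println(Try(new URL(raw)).isFailure)     // the quoted string does not parse
    println(new URL(unquote(raw)).getHost)   // host once the quotes are stripped
  }
}
```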