Extracting part of a URL (the blog name) with Groovy



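I am working with the following URL: http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2

I am trying to extract the blog name from it (stephania-bell).

I have implemented the following function to extract the expected value from the URL: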

def getBlogName( def decodeUrl )
{
    def urlParams = this.paramsParser.parseURIToMap( URI.create( decodeUrl ) )
    def temp = decodeUrl.replace( "http://www.espn.com", "" )
            .replaceAll( "(/_/|\\?).*", "" )
            .replace( "/index", "" )
            .replace( "/insider", "" )
            .replace( "/post", "" )
            .replace( "/tag", "" )
            .replace( "/category", "" )
            .replace( "/", "" )
            .replace( "/blog/", "" )
    def blogName = temp.replace( "/", "" )
    return blogName
}

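However, I am missing something: the function returns blogstephania-bell. Can you help me understand what is wrong with my implementation? Or is there perhaps a better way to do the same thing?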
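
The unexpected result comes from the order of the calls: replace( "/", "" ) runs before replace( "/blog/", "" ), so by the time the second call executes there are no slashes left and "/blog/" can never match. Below is a minimal sketch of a reordered version. The name getBlogNameFixed is made up for illustration, and the unused paramsParser call is left out because that object is not shown in the question.

```groovy
// Sketch only: getBlogNameFixed is a hypothetical name, not part of the question's code.
def getBlogNameFixed( String decodeUrl )
{
    return decodeUrl.replace( "http://www.espn.com", "" )
            .replaceAll( "(/_/|\\?).*", "" )   // drop everything from "/_/" or "?" onward
            .replace( "/blog/", "" )           // strip the prefix while its slashes still exist
            .replace( "/index", "" )
            .replace( "/insider", "" )
            .replace( "/post", "" )
            .replace( "/tag", "" )
            .replace( "/category", "" )
            .replace( "/", "" )                // finally remove any remaining slashes
}

def url = "http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2"
assert getBlogNameFixed( url ) == "stephania-bell"
```

This keeps the original chain of replacements and only moves the "/blog/" step forward; the approaches in the answers below avoid the long chain altogether.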

Not what you asked for, but just for fun (I initially thought this was what you wanted):

@Grab('org.jsoup:jsoup:1.11.3')
import static org.jsoup.Jsoup.connect

def name = connect('http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2')
        .get()
        .select('.sticky-header h1 a')
        .text()
assert name == 'Stephania Bell Blog'

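It may be more useful to work with the URL as it is, through the Java class URL. Then:

  1. Use getPath() to extract the path as a string
  2. Split the path into segments on the path separator with split("/")
  3. Use the array index pathSegments[2] to pick out the relevant path segment

String plainText="http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2";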

def url = plainText.toURL();
def fullPath = url.getPath();
def pathSegments = fullPath.split("/")
assert "stephania-bell" == pathSegments[2]
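
The same steps can be wrapped in a small helper that does not rely on a fixed index. This is only a sketch: blogNameFromPath is a hypothetical name, and it assumes the blog name is always the path segment that directly follows "blog".

```groovy
// Sketch only: blogNameFromPath is a hypothetical helper name.
// Assumption: the blog name is the path segment right after "blog".
String blogNameFromPath( String rawUrl )
{
    // tokenize('/') splits on "/" and drops empty segments, so there is no leading ""
    def segments = new URI( rawUrl ).path.tokenize( '/' )
    int i = segments.indexOf( 'blog' )
    // return null when there is no "blog" segment or nothing follows it
    return ( i >= 0 && i + 1 < segments.size() ) ? segments[i + 1] : null
}

assert blogNameFromPath( 'http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2' ) == 'stephania-bell'
assert blogNameFromPath( 'http://www.espn.com/nfl/story' ) == null
```

Returning null for URLs without a blog segment is a design choice made here for illustration; throwing an exception would work just as well if a missing name should be treated as an error.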

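This kind of task is easily handled with a regular expression. If we want to extract the part of the URL between http://www.espn.com/blog/ and the next /, the following code will do: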

import java.util.regex.Pattern
def url = 'http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2'
def pattern = Pattern.compile('^https?://www\\.espn\\.com/blog/([^/]+)/.*$')
def (_, blog) = (url =~ pattern)[0]
assert blog == 'stephania-bell'
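
Indexing a matcher with [0] fails when the URL does not match at all. A possible variant, sketched below, returns null in that case and also accepts a URL that ends right after the blog name. The helper name blogNameByRegex is hypothetical, and stopping the capture at "?" or "#" is an assumption about what such URLs might look like.

```groovy
// Sketch only: blogNameByRegex is a hypothetical helper name.
String blogNameByRegex( String url )
{
    // Capture everything after "/blog/" up to the next "/", "?" or "#" (assumed delimiters)
    def matcher = url =~ '^https?://www\\.espn\\.com/blog/([^/?#]+)'
    // find() returns false instead of throwing when there is no match
    return matcher.find() ? matcher.group( 1 ) : null
}

assert blogNameByRegex( 'http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2' ) == 'stephania-bell'
assert blogNameByRegex( 'https://www.espn.com/blog/stephania-bell' ) == 'stephania-bell'
assert blogNameByRegex( 'http://www.espn.com/nfl/story' ) == null
```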
