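I am working with the following URL: http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2
I am trying to extract the name of the blog (stephania-bell).
I have implemented the following function to extract the expected value from the URL: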
def getBlogName( def decodeUrl )
{
    def urlParams = this.paramsParser.parseURIToMap( URI.create( decodeUrl ) )
    def temp = decodeUrl.replace( "http://www.espn.com", "" )
                        .replaceAll( "(/_/|\\?).*", "" )
                        .replace( "/index", "" )
                        .replace( "/insider", "" )
                        .replace( "/post", "" )
                        .replace( "/tag", "" )
                        .replace( "/category", "" )
                        .replace( "/", "" )
                        .replace( "/blog/", "" )
    def blogName = temp.replace( "/", "" )
    return blogName
}
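But I am missing something: it returns blogstephania-bell.
Can you help me understand what I am missing in the implementation? Or is there perhaps a better way to do the same thing?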
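As for why the function returns blogstephania-bell: the chain strips every "/" before it tries to remove "/blog/", so by the time that last replace runs the text is already "blogstephania-bell" and "/blog/" can no longer match. A minimal sketch of the same chain with the two steps swapped is shown below. It assumes the paramsParser call can be dropped (its result was never used) and is only meant to illustrate the ordering problem, not to be a robust parser.

```groovy
// Sketch: same replacements as in the question, but "/blog/" is removed
// BEFORE the bare "/" is stripped. paramsParser is omitted (result was unused).
String getBlogName(String decodeUrl) {
    decodeUrl.replace('http://www.espn.com', '')
             .replaceAll('(/_/|\\?).*', '')   // drop everything from "/_/" or "?" onward
             .replace('/index', '')
             .replace('/insider', '')
             .replace('/post', '')
             .replace('/tag', '')
             .replace('/category', '')
             .replace('/blog/', '')           // must run while the slashes still exist
             .replace('/', '')
}

assert getBlogName('http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2') == 'stephania-bell'
```

The chain is still fragile (it only knows one host and a fixed set of path words), which is why parsing the path is probably the better route.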
Not what you asked, but just for fun (I initially thought this was what you wanted):
@Grab('org.jsoup:jsoup:1.11.3')
import static org.jsoup.Jsoup.connect
def name = connect('http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2')
    .get()
    .select('.sticky-header h1 a')
    .text()
assert name == 'Stephania Bell Blog'
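It may be more useful to treat the URL as a URL, using the Java class URL. Then:
- use getPath() to extract the path as a string
- use split("/") to break the path into segments at the path separator
- use the array index pathSegments[2] to pick out the relevant segment (the path starts with "/", so index 0 is an empty string and index 1 is "blog")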
String plainText="http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2";
def url = plainText.toURL();
def fullPath = url.getPath();
def pathSegments = fullPath.split("/")
assert "stephania-bell" == pathSegments[2]
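The hard-coded index 2 only works while "blog" is the first path segment. The steps above can be sketched as a small helper that looks for the "blog" segment instead and returns whatever follows it. The name blogNameFromPath is hypothetical, and returning null when there is no blog segment is an assumption about what the caller wants.

```groovy
// Hypothetical helper: returns the path segment right after "blog", or null.
String blogNameFromPath(String rawUrl) {
    // tokenize('/') splits on "/" and drops empty segments, so there is
    // no leading empty string as there is with split("/")
    def segments = new URI(rawUrl).path.tokenize('/')
    def i = segments.indexOf('blog')
    (i >= 0 && i + 1 < segments.size()) ? segments[i + 1] : null
}

assert blogNameFromPath('http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2') == 'stephania-bell'
assert blogNameFromPath('http://www.espn.com/nfl/') == null
```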
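This kind of job can also be handled easily with a regular expression. If we want to extract the part of the URL between http://www.espn.com/blog/ and the next /, the following code will do: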
import java.util.regex.Pattern
def url = 'http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2'
def pattern = Pattern.compile('^https?://www\\.espn\\.com/blog/([^/]+)/.*$')
def (_, blog) = (url =~ pattern)[0]
assert blog == 'stephania-bell'
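One thing to watch with the snippet above: indexing the matcher with [0] throws an exception when the URL does not match at all. A hedged variant that returns null instead could look like this. The function name blogNameByRegex is hypothetical, and the pattern is slightly relaxed compared with the one above (it does not require a trailing "/" after the blog name), which is an assumption about which URLs should be accepted.

```groovy
// Hypothetical helper: regex-based extraction that returns null on no match.
String blogNameByRegex(String url) {
    // Capture everything after "/blog/" up to the next "/", "?" or "#"
    def matcher = url =~ '^https?://www\\.espn\\.com/blog/([^/?#]+)'
    matcher.find() ? matcher.group(1) : null
}

assert blogNameByRegex('http://www.espn.com/blog/stephania-bell/post/_/id/3563/key-fantasy-football-injury-updates-for-week-4-2') == 'stephania-bell'
assert blogNameByRegex('http://www.espn.com/nfl/story') == null
```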