How to extract only the first matching link from a string

I have converted a web page to a string, and the string contains the following lines (among other code):

<div class="r"><a href="https://www.apple.com/ca/"
<div class="r"><a href="https://www.facebook.com/ca/"
<div class="r"><a href="https://www.utorrent.com/ca/"

But I only want to pull out the link in the first line (https://www.apple.com/ca/) and ignore the rest of the HTML and code. How can I do that?

A simple approach:

String url = input.replaceAll("(?s).*?href=\"(.*?)\".*", "$1");

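Key points on why this works:

The regex matches the whole input but captures only the target. The replacement is the captured group (group 1), so the call effectively extracts the target.
(?s) means "dot also matches newlines"
.*? matches reluctantly (consuming as little input as possible) up to href="
(.*?) captures (reluctantly) everything up to the next "
.* matches the rest greedily (as much as possible), across lines thanks to the (?s) above
The replacement is $1, the first (and only) group in the match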
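The points above can be tried out with a small self-contained sketch. The class name `FirstHrefDemo` and the helper `firstHref` are hypothetical names chosen for illustration. One assumption is added here: because `replaceAll` returns the input unchanged when the pattern does not match, the helper treats an unchanged result as "no link found" and returns `null`.

```java
public class FirstHrefDemo {

    // Hypothetical helper: returns the value of the first href attribute,
    // or null when the input contains no href="..." at all.
    public static String firstHref(String input) {
        // The pattern consumes the whole input and keeps only group 1.
        String result = input.replaceAll("(?s).*?href=\"(.*?)\".*", "$1");
        // replaceAll leaves the input untouched when nothing matches.
        return result.equals(input) ? null : result;
    }

    public static void main(String[] args) {
        String input = "<div class=\"r\"><a href=\"https://www.apple.com/ca/\"\n"
                + "<div class=\"r\"><a href=\"https://www.facebook.com/ca/\"\n"
                + "<div class=\"r\"><a href=\"https://www.utorrent.com/ca/\"\n";
        System.out.println(firstHref(input)); // prints https://www.apple.com/ca/
    }
}
```

Without the `null` check, a page with no links would hand the entire page text back as the "URL", which is easy to miss.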

Using the regex mentioned in the answer, here is a solution that uses the Java regex API:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "<div class=\"r\"><a href=\"https://www.apple.com/ca/\">Hello</a>\n"
                + "<div class=\"r\"><a href=\"https://www.facebook.com/ca/\">Hello</a>\n"
                + "<div class=\"r\"><a href=\"https://www.utorrent.com/ca/\">Hello</a>";
        // Matches http, https, ftp, and file URLs
        String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(str);
        // Print every URL found in the string
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

https://www.apple.com/ca/
https://www.facebook.com/ca/
https://www.utorrent.com/ca/
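
The loop above prints every URL, while the question asks only for the first one. A minimal sketch of that variant, reusing the same regex and calling `find()` once instead of looping; the class name `FirstUrl` and the method `firstUrl` are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstUrl {

    // Same URL regex as in the answer above, compiled once and reused.
    private static final Pattern URL = Pattern.compile(
            "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Hypothetical helper: returns the first URL in the text, or null if there is none.
    public static String firstUrl(String text) {
        Matcher matcher = URL.matcher(text);
        // A single find() stops at the first match; no loop is needed.
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String str = "<div class=\"r\"><a href=\"https://www.apple.com/ca/\">Hello</a>\n"
                + "<div class=\"r\"><a href=\"https://www.facebook.com/ca/\">Hello</a>\n"
                + "<div class=\"r\"><a href=\"https://www.utorrent.com/ca/\">Hello</a>";
        System.out.println(firstUrl(str)); // prints https://www.apple.com/ca/
    }
}
```

Note that this variant finds the first URL anywhere in the text, not only inside an href attribute, so it fits best when the first URL on the page is known to be the wanted link.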
