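I'm using HtmlCleaner from ColdFusion. In the code below, I traverse the node tree and look for content nodes. What I want is to be able to modify a node's text content.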
node.traverse(new TagNodeVisitor() {
    public boolean visit(TagNode tagNode, HtmlNode htmlNode) {
        if (htmlNode instanceof ContentNode) {
            ContentNode content = ((ContentNode) htmlNode);
            String textContent = content.getContent();
        }
        // tells visitor to continue traversing the DOM tree
        return true;
    }
});
The example I'm working from is:
// traverse whole DOM and update images to absolute URLs
node.traverse(new TagNodeVisitor() {
    public boolean visit(TagNode tagNode, HtmlNode htmlNode) {
        if (htmlNode instanceof TagNode) {
            TagNode tag = (TagNode) htmlNode;
            String tagName = tag.getName();
            if ("img".equals(tagName)) {
                String src = tag.getAttributeByName("src");
                if (src != null) {
                    tag.setAttribute("src", Utils.fullUrl(siteUrl, src));
                }
            }
        } else if (htmlNode instanceof CommentNode) {
            CommentNode comment = ((CommentNode) htmlNode);
            comment.getContent().append(" -- By HtmlCleaner");
        }
        // tells visitor to continue traversing the DOM tree
        return true;
    }
});
I'm not familiar with HtmlCleaner. Does it only do "cleaning"? I couldn't find any method for setting the text value. http://htmlcleaner.sourceforge.net/doc/index.html
jsoup is a full HTML parser (written in Java) that lets you work with DOM elements much as you would with jQuery. I use its text() setter method to update text nodes. http://jsoup.org/cookbook/modifying-data/set-text
// initial: <div></div>
Element div = doc.select("div").first();
div.text("five > four");
div.prepend("First ");
div.append(" Last");
// now: <div>First five &gt; four Last</div>
More on jsoup (and ColdFusion):
- http://jsoup.org/
- http://www.bennadel.com/blog/2358-parsing-traversing-and-mutating-html-with-coldfusion-and-jsoup.htm
- http://www.raymondcamden.com/index.cfm/2012/4/6/jsoup-adds-jQuerylike-parsing-in-Java
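What I want to do is grab the content between the HTML tags so that I can translate it into another language without disturbing the HTML tags, images, etc…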
node.traverse(new TagNodeVisitor() {
    public boolean visit(TagNode tagNode, HtmlNode htmlNode) {
        if (htmlNode instanceof ContentNode) {
            ContentNode content = (ContentNode) htmlNode;
            // toString() works whether getContent() returns a String or a StringBuilder
            String text = content.getContent().toString();
            try {
                // the #arguments...# tokens look like ColdFusion placeholders; substitute real language codes
                String strUrl = "http://translate.google.com/translate_a/t?client=t&sl=#arguments.FromLanguage#&tl=#arguments.ToLanguage#&hl=#arguments.ToLanguage#&sc=2&ie=UTF-8&oe=UTF-8&oc=1&otf=1&ssel=0&tsel=0&q=" + URLEncoder.encode(text, "UTF-8");
                URLConnection urlConn = new URL(strUrl).openConnection();
                urlConn.addRequestProperty("User-Agent",
                        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
                Reader reader = new InputStreamReader(urlConn.getInputStream(), "utf-8");
                JsonArray gRet = new Gson().fromJson(reader, JsonArray.class);
                reader.close();
                StringBuilder newContent = new StringBuilder(1000);
                // collect the first entry of each segment in the first element of the response
                for (JsonElement el : gRet.get(0).getAsJsonArray()) {
                    newContent.append(el.getAsJsonArray().get(0).getAsString());
                }
                // swap the original text node for the translated one
                tagNode.insertChildAfter(htmlNode, new ContentNode(newContent.toString()));
                tagNode.removeChild(htmlNode);
            } catch (IOException e) {
                // visit() cannot throw checked exceptions, so rethrow unchecked
                throw new RuntimeException(e);
            }
        }
        // tells visitor to continue traversing the DOM tree
        return true;
    }
});
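For anyone who only wants to see the idea in isolation, here is a minimal sketch of "change the text between tags, leave the tags alone" using nothing but the Java standard library. It is not HtmlCleaner or jsoup code: it swaps in a simple regex tokenizer as a stand-in for a real parser, and it assumes simple, well-formed markup (no `<` inside attribute values, scripts, or comments). The class name `TextBetweenTags`, the method `rewriteText`, and the `translate` callback are hypothetical names made up for this illustration. For real pages a proper parser is still the safer route.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextBetweenTags {
    // Matches either one whole tag (group 1) or one run of text between tags (group 2).
    // Assumption: simple, well-formed markup; this is not a real HTML parser.
    private static final Pattern TOKEN = Pattern.compile("(<[^>]*>)|([^<]+)");

    // Applies `translate` to every non-blank text run and copies tags through unchanged.
    public static String rewriteText(String html, UnaryOperator<String> translate) {
        StringBuilder out = new StringBuilder(html.length());
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String tag = m.group(1);
            if (tag != null) {
                out.append(tag);                       // markup is left as-is
            } else {
                String text = m.group(2);
                // whitespace-only runs are kept verbatim so layout is preserved
                out.append(text.trim().isEmpty() ? text : translate.apply(text));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <img src=\"a.png\"> world</p>";
        // Upper-casing is a stand-in for a real translation call.
        System.out.println(rewriteText(html, String::toUpperCase));
        // prints <p>HELLO <img src="a.png"> WORLD</p>
    }
}
```

The callback is where a translation request would go; keeping it as a parameter means the tag-preserving logic can be checked without any network access.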