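My application downloads a website as an HTML file on first launch. Of course, the HTML is very messy, so I want to clean it with HtmlCleaner so that I can then parse it with Jsoup. But how do I get the new, cleaned HTML once the cleaning is done?
I did some research, and this is what I found: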
HtmlCleaner htmlCleaner = new HtmlCleaner();
// url is the location of the downloaded page
TagNode root = htmlCleaner.clean(url);
// getInnerHtml is an instance method; wrap its result in the root tag to rebuild the markup
String html = "<" + root.getName() + ">" + htmlCleaner.getInnerHtml(root) + "</" + root.getName() + ">";
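But I can't see where this code writes a new file. If it doesn't, how can I implement that, so that the old file is deleted and a new, cleaned HTML file is created?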
You can do the following:
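import java.net.URL;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

HtmlCleaner cleaner = new HtmlCleaner();
final String siteUrl = "http://www.themoscowtimes.com/";
// clean(URL) fetches the page and returns the root of the cleaned tree
TagNode node = cleaner.clean(new URL(siteUrl));
// serialize to xml file, using the cleaner's own properties
new PrettyXmlSerializer(cleaner.getProperties()).writeToFile(
    node, "cleaned.xml", "utf-8"
);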
Or:
// serialize to html file
SimpleHtmlSerializer serializer = new SimpleHtmlSerializer(cleaner.getProperties());
serializer.writeToFile(node, "c:/temp/cleaned.html", "utf-8");
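As for deleting the old file and ending up with only the cleaned one, a minimal sketch using just the JDK might look like the following. It assumes you already hold the cleaned markup as a String, however you obtained it; the class name CleanedFileWriter and the method replaceWithCleaned are hypothetical names made up for this illustration and are not part of HtmlCleaner or Jsoup. The cleaned text goes to a temporary file in the same directory first and is then moved over the original, so a failure halfway through should not leave you with a half-written page.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of HtmlCleaner or Jsoup.
final class CleanedFileWriter {
    private CleanedFileWriter() {}

    // Replaces the file at target with cleanedHtml, encoded as UTF-8.
    static void replaceWithCleaned(Path target, String cleanedHtml) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // Write next to the target so the final move stays on one file system.
        Path temp = Files.createTempFile(dir, "cleaned-", ".tmp");
        try {
            Files.write(temp, cleanedHtml.getBytes(StandardCharsets.UTF_8));
            // REPLACE_EXISTING removes the old file as part of the move.
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up if something failed.
            Files.deleteIfExists(temp);
        }
    }
}
```

After this call the path you downloaded to holds the cleaned HTML, so the later Jsoup step can keep reading from the same location. If you would rather keep the original for debugging, writing to a second path as in the answer above is the simpler choice.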