I am currently using jsoup in my application to parse and analyse web pages. But I want to make sure that I adhere to the robots.txt rules and only visit pages which are allowed.
I am pretty sure that jsoup is not made for this and it is all about web scraping and parsing instead. So I planned to have a function/module which reads the robots.txt of the domain/site and determines whether the URL I am going to visit is allowed or not (a rough sketch of what I mean follows the links below).
I did some research and found the following. But I am not sure about these, so it would be great if anybody who has done the same kind of project involving robots.txt parsing could share their thoughts and ideas.
http://sourceforge.net/projects/jrobotx/
https://code.google.com/p/crawler-commons/
http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12
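To make it a bit more concrete, this is roughly the kind of check I have in mind (RobotsChecker and isAllowed are just placeholder names of mine, not taken from any of the libraries above):

public interface RobotsChecker {
    /** Should return true if the given URL may be fetched on behalf of userAgent. */
    boolean isAllowed(String userAgent, String url);
}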
A late answer, just in case you - or somebody else - are still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example from the code I use:
// imports used below (crawler-commons 0.2, Apache HttpClient 4.2.1, commons-io):
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.io.IOUtils;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.entity.BufferedHttpEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
// cache key: protocol + host (+ port, if any)
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
        + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// in real code, keep this map (and the client) around between calls so each
// host's robots.txt is fetched and parsed only once
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
HttpClient httpclient = new DefaultHttpClient();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        // no robots.txt: everything is allowed
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        // buffer the entity so its content can be read as a byte array
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);
Obviously this is not related to Jsoup in any way; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. For fetching the robots.txt I use Apache HttpClient in version 4.2.1, but this could be replaced by java.net as well (see the sketch below).
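In case you want to drop the HttpClient dependency, a minimal java.net variant might look like this (fetchRules is my own hypothetical helper name, not part of crawler-commons):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.commons.io.IOUtils;
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

// hypothetical helper: fetch and parse robots.txt with plain java.net
static BaseRobotRules fetchRules(String hostId, String userAgent) {
    try {
        HttpURLConnection conn = (HttpURLConnection) new URL(hostId + "/robots.txt").openConnection();
        if (conn.getResponseCode() == 404) {
            // no robots.txt means everything is allowed
            return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        }
        InputStream in = conn.getInputStream();
        try {
            return new SimpleRobotRulesParser().parseContent(hostId,
                    IOUtils.toByteArray(in), "text/plain", userAgent);
        } finally {
            in.close();
        }
    } catch (Exception e) {
        // be conservative when the fetch fails for any other reason
        return new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);
    }
}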
Please note that this code only checks for allowance or disallowance and does not take other robots.txt features such as "Crawl-delay" into account. But since crawler-commons provides this feature as well, it can easily be added to the code above.
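For example (waitCrawlDelay is a hypothetical name of mine; if I read the crawler-commons source correctly, the delay is reported in milliseconds):

// honor the parsed Crawl-delay; BaseRobotRules.UNSET_CRAWL_DELAY marks an absent directive
static void waitCrawlDelay(BaseRobotRules rules) throws InterruptedException {
    long delayMs = rules.getCrawlDelay();
    if (delayMs != BaseRobotRules.UNSET_CRAWL_DELAY) {
        Thread.sleep(delayMs);
    }
}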
The above did not work for me, so I managed to put this together. It's the first time I've done Java in 4 years, so I'm sure this can be improved.
public static boolean robotSafe(URL url)
{
    final String DISALLOW = "Disallow:"; // directive we look for (case-sensitive, as commonly written)
    String strHost = url.getHost();
    String strRobot = "http://" + strHost + "/robots.txt";
    URL urlRobot;
    try {
        urlRobot = new URL(strRobot);
    } catch (MalformedURLException e) {
        // something weird is happening, so don't trust it
        return false;
    }
    String strCommands = "";
    try
    {
        // read the whole robots.txt file into one string
        InputStream urlRobotStream = urlRobot.openStream();
        byte[] b = new byte[1000];
        int numRead;
        while ((numRead = urlRobotStream.read(b)) != -1)
        {
            strCommands += new String(b, 0, numRead);
        }
        urlRobotStream.close();
    }
    catch (IOException e)
    {
        return true; // if there is no robots.txt file, it is OK to search
    }
    if (strCommands.contains(DISALLOW)) // if there are no "Disallow" lines, nothing is blocked
    {
        String[] split = strCommands.split("\n");
        ArrayList<RobotRule> robotRules = new ArrayList<>();
        String mostRecentUserAgent = null;
        for (int i = 0; i < split.length; i++)
        {
            String line = split[i].trim();
            if (line.toLowerCase().startsWith("user-agent"))
            {
                int start = line.indexOf(":") + 1;
                int end = line.length();
                mostRecentUserAgent = line.substring(start, end).trim();
            }
            else if (line.startsWith(DISALLOW))
            {
                if (mostRecentUserAgent != null)
                {
                    RobotRule r = new RobotRule();
                    r.userAgent = mostRecentUserAgent;
                    int start = line.indexOf(":") + 1;
                    int end = line.length();
                    r.rule = line.substring(start, end).trim();
                    robotRules.add(r);
                }
            }
        }
        // note: this checks the rules of ALL user agents, not just our own
        for (RobotRule robotRule : robotRules)
        {
            String path = url.getPath();
            if (robotRule.rule.length() == 0) return true; // allows everything if BLANK
            if ("/".equals(robotRule.rule)) return false; // allows nothing if /
            if (robotRule.rule.length() <= path.length())
            {
                // Disallow rules act as path prefixes
                String pathCompare = path.substring(0, robotRule.rule.length());
                if (pathCompare.equals(robotRule.rule)) return false;
            }
        }
    }
    return true;
}
You will need the helper class:
/**
*
* @author Namhost.com
*/
public class RobotRule
{
public String userAgent;
public String rule;
RobotRule() {
}
@Override public String toString()
{
StringBuilder result = new StringBuilder();
String NEW_LINE = System.getProperty("line.separator");
result.append(this.getClass().getName() + " Object {" + NEW_LINE);
result.append(" userAgent: " + this.userAgent + NEW_LINE);
result.append(" rule: " + this.rule + NEW_LINE);
result.append("}");
return result.toString();
}
}
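For completeness, here is roughly how the check can sit in front of a jsoup fetch; the URL and user agent below are just placeholders:

import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

URL url = new URL("http://example.com/some/page.html"); // placeholder
if (robotSafe(url)) {
    // only fetch and parse pages that robots.txt does not disallow
    Document doc = Jsoup.connect(url.toString()).userAgent("WhateverBot").get();
    System.out.println(doc.title());
}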