Parsing robots.txt with Java and determining whether a URL is allowed



I am currently using jsoup in an application to parse and analyze web pages. But I want to make sure that I adhere to the robots.txt rules and only visit pages that are allowed.

I am pretty sure jsoup is not made for this; it is all about web scraping and parsing. So I plan to have a function/module that reads the robots.txt of a domain/site and determines whether the URL I want to visit is allowed.

I did some research and found the links below. But I am not sure about them, so it would be great if someone who has worked on a similar project involving robots.txt parsing could share their thoughts and ideas.

http://sourceforge.net/projects/jrobotx/

https://code.google.com/p/crawler-commons/

http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12
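For illustration, the function/module described above could boil down to something like the following sketch (the interface name RobotsChecker is made up here; it is not part of any of the linked projects):

import java.net.URL;

/**
 * Hypothetical interface for the planned module: given a URL and a
 * user-agent token, decide whether the site's robots.txt allows fetching it.
 */
public interface RobotsChecker {
    boolean isAllowed(URL url, String userAgent);
}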

A late answer, just in case you - or someone else - are still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example based on the code I use:

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.io.IOUtils;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.entity.BufferedHttpEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

DefaultHttpClient httpclient = new DefaultHttpClient(); // Apache HttpClient 4.2.x
String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
                + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// cache of parsed rules per host; in real code this map lives across calls
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);

Obviously this has nothing to do with Jsoup; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. For fetching the robots.txt I used Apache HttpClient in version 4.2.1, but this could also be replaced by java.net classes.
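If the HttpClient dependency is not wanted, a rough sketch of the same fetch with plain java.net could look like this (the method name fetchRules is just for illustration; the crawler-commons calls are the same ones used above):

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

public static BaseRobotRules fetchRules(String hostId, String userAgent) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(hostId + "/robots.txt").openConnection();
    conn.setRequestProperty("User-Agent", userAgent);
    if (conn.getResponseCode() == 404) {
        // no robots.txt at all -> everything is allowed
        return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
    }
    // read the response body into a byte array
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (InputStream in = conn.getInputStream()) {
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
    }
    return new SimpleRobotRulesParser().parseContent(hostId, buffer.toByteArray(),
            "text/plain", userAgent);
}

The result can then be stored in the same per-host cache as above and queried with rules.isAllowed(url).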

Note that this code only checks allow or disallow and does not consider other robots.txt features such as "Crawl-delay". But since crawler-commons provides this feature as well, it can easily be added to the code above.
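For instance, the parsed rules object also carries the crawl delay, so honoring it could look roughly like this (a sketch; I believe getCrawlDelay() returns milliseconds and a negative/unset value when robots.txt defines no delay, but that is worth double-checking against the crawler-commons version in use):

import crawlercommons.robots.BaseRobotRules;

// Sleep for the host's Crawl-delay before the next request, if one was specified.
// Assumption: getCrawlDelay() returns the delay in milliseconds and a value <= 0
// (or a sentinel) when robots.txt did not set one.
public static void respectCrawlDelay(BaseRobotRules rules) throws InterruptedException {
    long crawlDelay = rules.getCrawlDelay();
    if (crawlDelay > 0) {
        Thread.sleep(crawlDelay);
    }
}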

The above did not work for me. I managed to put this together instead. It's the first time I have done Java in 4 years, so I'm sure this can be improved.

import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;

private static final String DISALLOW = "Disallow:";

public static boolean robotSafe(URL url)
{
    String strHost = url.getHost();
    String strRobot = "http://" + strHost + "/robots.txt";
    URL urlRobot;
    try {
        urlRobot = new URL(strRobot);
    } catch (MalformedURLException e) {
        // something weird is happening, so don't trust it
        return false;
    }

    String strCommands;
    try
    {
        InputStream urlRobotStream = urlRobot.openStream();
        StringBuilder commands = new StringBuilder();
        byte b[] = new byte[1000];
        int numRead;
        while ((numRead = urlRobotStream.read(b)) != -1)
        {
            commands.append(new String(b, 0, numRead));
        }
        urlRobotStream.close();
        strCommands = commands.toString();
    }
    catch (IOException e)
    {
        return true; // if there is no robots.txt file, it is OK to search
    }

    if (strCommands.contains(DISALLOW)) // if there are no "Disallow" lines, nothing is blocked
    {
        String[] split = strCommands.split("\n");
        ArrayList<RobotRule> robotRules = new ArrayList<>();
        String mostRecentUserAgent = null;
        for (int i = 0; i < split.length; i++)
        {
            String line = split[i].trim();
            if (line.toLowerCase().startsWith("user-agent"))
            {
                int start = line.indexOf(":") + 1;
                int end   = line.length();
                mostRecentUserAgent = line.substring(start, end).trim();
            }
            else if (line.startsWith(DISALLOW))
            {
                if (mostRecentUserAgent != null)
                {
                    RobotRule r = new RobotRule();
                    r.userAgent = mostRecentUserAgent;
                    int start = line.indexOf(":") + 1;
                    int end   = line.length();
                    r.rule = line.substring(start, end).trim();
                    robotRules.add(r);
                }
            }
        }
        // Note: this applies the Disallow rules of every user-agent section,
        // not only the sections matching this crawler's own user agent.
        for (RobotRule robotRule : robotRules)
        {
            String path = url.getPath();
            if (robotRule.rule.length() == 0) return true;  // a blank Disallow allows everything
            if (robotRule.rule.equals("/")) return false;   // "Disallow: /" blocks everything
            if (robotRule.rule.length() <= path.length())
            {
                String pathCompare = path.substring(0, robotRule.rule.length());
                if (pathCompare.equals(robotRule.rule)) return false;
            }
        }
    }
    return true;
}

You will need the helper class:

/**
 *
 * @author Namhost.com
 */
public class RobotRule 
{
    public String userAgent;
    public String rule;
    RobotRule() {
    }
    @Override public String toString() 
    {
        StringBuilder result = new StringBuilder();
        String NEW_LINE = System.getProperty("line.separator");
        result.append(this.getClass().getName() + " Object {" + NEW_LINE);
        result.append("   userAgent: " + this.userAgent + NEW_LINE);
        result.append("   rule: " + this.rule + NEW_LINE);
        result.append("}");
        return result.toString();
    }    
}
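For completeness, a minimal usage sketch of robotSafe (the URL is only a placeholder):

import java.net.URL;

public static void main(String[] args) throws Exception {
    URL target = new URL("http://www.example.com/some/page.html"); // placeholder URL
    if (robotSafe(target)) {
        // safe to fetch, e.g. Document doc = Jsoup.connect(target.toString()).get();
        System.out.println("Allowed by robots.txt: " + target);
    } else {
        System.out.println("Blocked by robots.txt: " + target);
    }
}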
