How to get header information from a web directory in Java



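I want to extract header information from a web page using plain Java. For example, if the page is www.stackoverflow.com and the path is /questions, the program should return the HTTP header information from www.stackoverflow.com/questions. So far I have this method: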

private static String queryWeb(String page, String path) throws IOException {
        InetAddress requestedWebIP = InetAddress.getByName(page);
        if ((path == null) || (path.equals(""))) {
            path = "/";
        }
        try (
                Socket toWebSocket = new Socket(requestedWebIP, 80);
                BufferedOutputStream outPutStream = new BufferedOutputStream(toWebSocket.getOutputStream());
                BufferedReader inputStream = new BufferedReader(new InputStreamReader(toWebSocket.getInputStream()))
        ) {
            String request = "HEAD " + path + " HTTP/1.1\r\n\r\n";
            outPutStream.write(request.getBytes());
            outPutStream.flush();
            String input;
            String result = "";
            while (!(input = inputStream.readLine()).equals("")) {
                System.out.println(input);
                result = result + input + "\n";
            }
            return result;
        } catch (IOException e) {
            System.out.println("An error occurred during IO");
            e.printStackTrace();
        }
        return null;
    }

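This works for pages without any additional path, i.e. www.stackoverflow.com. However, whenever I try something like www.stackoverflow.com/questions, I get a NullPointerException in the while loop. Poking around with the debugger shows that the input stream is empty, but again only when a path is specified. So this works: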

HEAD / HTTP/1.1\r\n\r\n

But this does not:

HEAD /questions HTTP/1.1\r\n\r\n

So I assume the input stream is empty because the HEAD command failed, but why does it not accept this format?

You are missing the Host header:

A client MUST send a Host header field in all HTTP/1.1 request messages.

I have modified your code to send Host:

private static String queryWeb(String host, String path) throws IOException {
    InetAddress requestedWebIP = InetAddress.getByName(host);
    if ((path == null) || (path.equals(""))) {
        path = "/";
    }
    try (
            Socket toWebSocket = new Socket(requestedWebIP, 80);
            BufferedOutputStream outPutStream = new BufferedOutputStream(toWebSocket.getOutputStream());
            BufferedReader inputStream = new BufferedReader(new InputStreamReader(toWebSocket.getInputStream()))
    ) {
        String request = "HEAD " + path + " HTTP/1.1\r\n" +
                "Host: " + host + "\r\n\r\n";
        outPutStream.write(request.getBytes());
        outPutStream.flush();
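        String input;
        StringBuilder result = new StringBuilder();
        // readLine() returns null at end of stream, so check for null before testing for the blank line
        while ((input = inputStream.readLine()) != null && !input.isEmpty()) {
            System.out.println(input);
            result.append(input).append("\n");
        }
        return result.toString();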
    } catch (IOException e) {
        System.out.println("An error occurred during IO");
        e.printStackTrace();
    }
    return null;
}

The following call

queryWeb("example.com", "/");

returns 200 OK, while

queryWeb("example.com", "/questions");

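returns 404 Not Found (as expected).

www.stackoverflow.com works as well (it returns a redirect to the https version).

So nothing is actually failing on the HEAD command itself; there is only the ugly exception. BufferedReader.readLine() returns null at end of stream, so if the server closes the connection without sending a usable reply, calling .equals("") on that null throws the NullPointerException.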
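To make the null case concrete, here is a minimal sketch of reading the response head without risking that exception. The class name HeaderReader, the method readHeaders and the ":status" key are made up for this illustration; they are not part of the code above or of any library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class HeaderReader {

    /**
     * Reads the status line and the header lines up to the first blank line
     * or the end of the stream, whichever comes first.
     * The status line is stored under the hypothetical key ":status";
     * header names are lower-cased because HTTP header names are case-insensitive.
     */
    static Map<String, String> readHeaders(BufferedReader in) throws IOException {
        Map<String, String> headers = new LinkedHashMap<>();

        String line = in.readLine();
        if (line == null) {
            // The server closed the connection without sending anything.
            return headers;
        }
        headers.put(":status", line);

        // Stop on null (end of stream) as well as on the blank separator line.
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                String value = line.substring(colon + 1).trim();
                headers.put(name, value);
            }
        }
        return headers;
    }
}
```

Inside queryWeb this could replace the while loop, e.g. HeaderReader.readHeaders(inputStream). For a redirect such as the one from www.stackoverflow.com, the target would then be available as the "location" entry. Note that this sketch keeps only the last value when a header is repeated.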

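Also note that

  1. The path must be %-escaped (I omitted this).
  2. In general, using a library (such as Apache HttpComponents HttpClient, google-http-client, etc.) is much easier (and safer). Even the standard URL().openConnection() spares you a lot of dirty work and mistakes.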
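Both notes can be sketched with the standard library alone. The following is an illustration under assumptions rather than a drop-in replacement: the class name HeadRequestSketch, the method names and the 5-second timeouts are invented here. It relies on the multi-argument java.net.URI constructor to percent-escape the path and on HttpURLConnection to send the HEAD request (including the Host header).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.Map;

final class HeadRequestSketch {

    /**
     * Percent-escapes characters that are not legal in a URI path.
     * Note: the multi-argument URI constructor always escapes '%',
     * so the input is expected to be an unescaped path.
     */
    static String encodePath(String path) throws URISyntaxException {
        String p = (path == null || path.isEmpty()) ? "/" : path;
        if (!p.startsWith("/")) {
            p = "/" + p;
        }
        // URI(scheme, host, path, fragment): only the path is supplied here.
        return new URI(null, null, p, null).toASCIIString();
    }

    /**
     * Sends a HEAD request over plain HTTP and returns the response headers.
     * The status line is stored under the null key of the returned map.
     */
    static Map<String, List<String>> fetchHeaders(String host, String path)
            throws IOException, URISyntaxException {
        URI uri = URI.create("http://" + host + encodePath(path));
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setInstanceFollowRedirects(false); // report the redirect instead of following it
            conn.setConnectTimeout(5000);           // assumed timeouts, in milliseconds
            conn.setReadTimeout(5000);
            return conn.getHeaderFields();
        } finally {
            conn.disconnect();
        }
    }
}
```

A call such as HeadRequestSketch.fetchHeaders("example.com", "/questions") would return a map whose null key holds the status line. Redirect following is switched off here so that the result resembles what the raw socket version reports.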
