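I want to retrieve header information from a web page using pure Java. For example, if the page is www.stackoverflow.com
and the path is /questions,
the program should return the HTTP header information from www.stackoverflow.com/questions.
So far I have this method: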
private static String queryWeb(String page, String path) throws IOException {
    InetAddress requestedWebIP = InetAddress.getByName(page);
    // Default to the root path when none is given
    if ((path == null) || (path.equals(""))) {
        path = "/";
    }
    try (
        Socket toWebSocket = new Socket(requestedWebIP, 80);
        BufferedOutputStream outPutStream = new BufferedOutputStream(toWebSocket.getOutputStream());
        BufferedReader inputStream = new BufferedReader(new InputStreamReader(toWebSocket.getInputStream()))
    ) {
        String request = "HEAD " + path + " HTTP/1.1\r\n\r\n";
        outPutStream.write(request.getBytes());
        outPutStream.flush();
        String input;
        String result = "";
        // Read header lines until the blank line that ends the header block
        while (!(input = inputStream.readLine()).equals("")) {
            System.out.println(input);
            result = result + input + "\n";
        }
        return result;
    } catch (IOException e) {
        System.out.println("An error occurred during IO");
        e.printStackTrace();
    }
    return null;
}
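This works for pages without an additional path, i.e. www.stackoverflow.com.
However, whenever I try anything like www.stackoverflow.com/questions,
I get a NullPointerException in the while loop.
Poking around with the debugger shows that the input stream is empty, but again only when a path is specified. So this works:
HEAD / HTTP/1.1\r\n\r\n
But this does not (?):
HEAD /questions HTTP/1.1\r\n\r\n
So I assume the input stream is empty because the HEAD command fails, but why does it not accept this format?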
You are missing the Host header:
A Host header field must be sent in all HTTP/1.1 request messages.
I have modified your code to send Host:
private static String queryWeb(String host, String path) throws IOException {
    InetAddress requestedWebIP = InetAddress.getByName(host);
    // Default to the root path when none is given
    if ((path == null) || (path.equals(""))) {
        path = "/";
    }
    try (
        Socket toWebSocket = new Socket(requestedWebIP, 80);
        BufferedOutputStream outPutStream = new BufferedOutputStream(toWebSocket.getOutputStream());
        BufferedReader inputStream = new BufferedReader(new InputStreamReader(toWebSocket.getInputStream()))
    ) {
        // Request line, then the mandatory Host header, then a blank line
        String request = "HEAD " + path + " HTTP/1.1\r\n" +
                         "Host: " + host + "\r\n\r\n";
        outPutStream.write(request.getBytes());
        outPutStream.flush();
        String input;
        String result = "";
        // Read header lines until the blank line that ends the header block
        while (!(input = inputStream.readLine()).equals("")) {
            System.out.println(input);
            result = result + input + "\n";
        }
        return result;
    } catch (IOException e) {
        System.out.println("An error occurred during IO");
        e.printStackTrace();
    }
    return null;
}
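The following call
queryWeb("example.com", "/");
returns 200 OK, while
queryWeb("example.com", "/questions");
returns 404 Not Found (as expected).
www.stackoverflow.com also works (it returns a redirect to the https version).
Nothing actually fails; there is only the nasty exception.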
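The NullPointerException itself comes from calling equals on the result of readLine(), which returns null once the end of the stream is reached. As a minimal sketch of a read loop that tolerates this, assuming a hypothetical helper named readHeaderBlock (not part of the code above) and a canned response in place of a real socket:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class HeaderReader {
    // Hypothetical helper: collects lines up to the blank line that ends the
    // header block, or up to end of stream, whichever comes first.
    static String readHeaderBlock(BufferedReader in) throws IOException {
        StringBuilder result = new StringBuilder();
        String line;
        // readLine() returns null at end of stream, so check for null first
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            result.append(line).append('\n');
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        // Canned response used for illustration only
        String response = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\nbody";
        System.out.print(readHeaderBlock(new BufferedReader(new StringReader(response))));
        // An empty stream yields an empty string instead of an exception
        System.out.println(readHeaderBlock(new BufferedReader(new StringReader(""))).isEmpty());
    }
}
```

With a loop like this, a connection that the server closes without sending anything yields an empty result rather than an exception, which makes the real problem easier to see.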
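Also note that:
- the path must be %-escaped (I omitted this)
- in general, it is much easier (and safer) to use a library (such as Apache HttpComponents HttpClient, google-http-client, etc.). Even the standard
URL().openConnection()
saves a lot of dirty work and mistakes.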
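Both notes can be sketched with the standard library alone. The class and method names below (HeadRequestSketch, escapePath, fetchHeaders) are hypothetical, and the sketch assumes the input is an unescaped path without a query string. It relies on the multi-argument java.net.URI constructors, which quote illegal characters, and on HttpURLConnection, which sends the Host header on its own:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.Map;

public class HeadRequestSketch {
    // Hypothetical helper: percent-escapes characters that are illegal in a URI path.
    static String escapePath(String path) {
        try {
            // The multi-argument URI constructors quote illegal characters such as spaces
            return new URI(null, null, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot escape path: " + path, e);
        }
    }

    // Hypothetical helper: sends a HEAD request and returns the response headers.
    static Map<String, List<String>> fetchHeaders(String host, String path)
            throws IOException, URISyntaxException {
        URI uri = new URI("http", host, path, null);
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestMethod("HEAD");
        try {
            // The status line is stored under the null key
            return conn.getHeaderFields();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(escapePath("/questions/tagged/c sharp")); // /questions/tagged/c%20sharp
        fetchHeaders("example.com", "/")
                .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

The escaped path from escapePath could be passed to the socket-based queryWeb above, while fetchHeaders avoids the hand-written request entirely.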