在第09行下面有这样一行:WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 01: WARC/1.0
Line 02: WARC-Type: request
Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
Line 04: Content-Type: application/http;msgtype=request
Line 05: WARC-Date: 2018-11-03T17:20:02Z
Line 06: WARC-Record-ID: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
Line 07: WARC-IP-Address: 54.230.195.16
Line 08: WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
Line 09: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 10: Content-Length: 141
Line 11:
Line 12: GET /vital-signs/carbon-dioxide/ HTTP/1.1
Line 13: User-Agent: Wget/1.15 (linux-gnu)
Line 14: Accept: */*
Line 15: Host: climate.nasa.gov
Line 16: Connection: Keep-Alive
WARC的规格说明The WARC-Block-Digest is an optional parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record.
我一直想弄清楚full block of the record
指的是什么。是第11到16行吗?或者12到16行?还是第1行到第16行(不包括第9行)?我试过散列这些可能性,但无法得到上面的sha1 (base 32)值。
一个HTTP GET请求的WARC记录有三个部分(参考WARC规范):
- WARC报头
- HTTP请求头
- 空的有效载荷(注意:POST请求将包含非空的有效载荷)
记录的有效负载摘要是空字符串的base32编码的SHA-1。使用Linux命令行工具的证明:
$> echo -n "" | openssl dgst -binary -sha1 | base32
3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC记录的格式为:
warc-record = header CRLF
block CRLF CRLF
(见WARC规范:记录模型)
"full"块应该包括所有直到最后的rnrn
。这意味着第11行到第17行。注意:HTTP GET请求也以rnrn
(后面的空行)结束:
$> cat request
GET /vital-signs/carbon-dioxide/ HTTP/1.1
User-Agent: Wget/1.15 (linux-gnu)
Accept: */*
Host: climate.nasa.gov
Connection: Keep-Alive
$> tail -n2 request | hexdump -C
00000000 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20 4b 65 65 70 |Connection: Keep|
00000010 2d 41 6c 69 76 65 0d 0a 0d 0a |-Alive....|
0000001a
$> cat request | openssl dgst -binary -sha1 | base32
CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ