Which block does the WARC-Block-Digest cover?

The record below has this header on line 09: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ

Line 01: WARC/1.0
Line 02: WARC-Type: request
Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
Line 04: Content-Type: application/http;msgtype=request
Line 05: WARC-Date: 2018-11-03T17:20:02Z
Line 06: WARC-Record-ID: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
Line 07: WARC-IP-Address: 54.230.195.16
Line 08: WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
Line 09: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 10: Content-Length: 141
Line 11:
Line 12: GET /vital-signs/carbon-dioxide/ HTTP/1.1
Line 13: User-Agent: Wget/1.15 (linux-gnu)
Line 14: Accept: */*
Line 15: Host: climate.nasa.gov
Line 16: Connection: Keep-Alive

The WARC specification says: "The WARC-Block-Digest is an optional parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record."

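I have been trying to work out what "the full block of the record" refers to. Is it lines 11 to 16? Lines 12 to 16? Or lines 01 to 16 (excluding line 09)? I tried hashing each of these candidates but could not reproduce the SHA-1 (base32) value above.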

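A WARC record of an HTTP GET request has three parts (see the WARC specification):

  1. The WARC header
  2. The HTTP request headers
  3. An empty payload (note: a POST request would contain a non-empty payload)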

The payload digest of this record is the base32-encoded SHA-1 of the empty string. Proof using Linux command-line tools:

$> echo -n "" | openssl dgst -binary -sha1 | base32
3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
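
For readers without openssl or base32 at hand, the same check can be sketched in Python with only the standard library. The helper name warc_sha1_digest is made up for this example and is not part of any WARC library.

```python
import base64
import hashlib


def warc_sha1_digest(data: bytes) -> str:
    """Return the SHA-1 of `data`, base32-encoded, in the 'sha1:...' form used by WARC digest headers."""
    raw = hashlib.sha1(data).digest()  # 20 raw bytes, not the hex string
    return "sha1:" + base64.b32encode(raw).decode("ascii")


# Digest of the empty payload of a GET request
print(warc_sha1_digest(b""))  # sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```

Note that the base32 encoding is applied to the raw digest bytes, which is why the shell pipeline passes -binary to openssl.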

The format of a WARC record is:

warc-record  = header CRLF
block CRLF CRLF
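
Read as code, the grammar above could be sketched roughly as follows. This is a simplified illustration, not a full WARC parser: the function name split_warc_record is hypothetical, and the sketch assumes a single uncompressed record with no folded or repeated header fields.

```python
def split_warc_record(raw: bytes):
    """Split one serialized WARC record into (header_fields, block)."""
    # header CRLF: the header ends at the first empty line
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line after the WARC header")

    lines = head.decode("utf-8").split("\r\n")
    fields = {}
    for line in lines[1:]:  # lines[0] is the version line, e.g. WARC/1.0
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()

    # block: exactly Content-Length bytes after the header
    length = int(fields["content-length"])
    block = rest[:length]
    if len(block) != length:
        raise ValueError("record is shorter than Content-Length")

    # CRLF CRLF: the record terminator, which is not part of the block
    if rest[length:length + 4] != b"\r\n\r\n":
        raise ValueError("record is not terminated by CRLF CRLF")
    return fields, block


# Tiny made-up record, used only to exercise the function
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
fields, block = split_warc_record(record)
print(fields["warc-type"], block)
```

The point of the sketch is that the block is delimited by Content-Length, so the blank line after the header and the two CRLFs that close the record fall outside it.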

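(See the WARC specification: Record model.)

The "full" block should include everything up to and including the final \r\n\r\n of the HTTP request. That means lines 12 to 17: line 11 is the blank line that ends the WARC header, and line 17 is the empty line that ends the HTTP request (it is not shown in the listing above). Together these lines are 141 bytes, which matches the Content-Length on line 10. Note: an HTTP GET request also ends with \r\n\r\n (a trailing empty line):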

$> cat request 
GET /vital-signs/carbon-dioxide/ HTTP/1.1
User-Agent: Wget/1.15 (linux-gnu)
Accept: */*
Host: climate.nasa.gov
Connection: Keep-Alive
$> tail -n2 request | hexdump -C
00000000  43 6f 6e 6e 65 63 74 69  6f 6e 3a 20 4b 65 65 70  |Connection: Keep|
00000010  2d 41 6c 69 76 65 0d 0a  0d 0a                    |-Alive....|
0000001a
$> cat request | openssl dgst -binary -sha1 | base32
CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
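
The same check can be sketched in Python by rebuilding the request bytes from the listing. This assumes the request consists of exactly the five lines shown, each terminated by CRLF, followed by one empty line. If the bytes match the original capture exactly, the printed digest should equal the value on line 09.

```python
import base64
import hashlib

request_lines = [
    "GET /vital-signs/carbon-dioxide/ HTTP/1.1",
    "User-Agent: Wget/1.15 (linux-gnu)",
    "Accept: */*",
    "Host: climate.nasa.gov",
    "Connection: Keep-Alive",
]

# Each line ends in CRLF, plus the empty line that terminates the HTTP headers.
block = ("\r\n".join(request_lines) + "\r\n\r\n").encode("ascii")

digest = base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")

print(len(block))         # 141, the Content-Length on line 10
print("sha1:" + digest)   # compare with WARC-Block-Digest on line 09
```

If the length comes out as anything other than 141, the line endings are the first thing to check: a file saved with bare LF instead of CRLF gives a different digest.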
