zip: the one bundled with the system. Zip 3.0 (July 5th 2008), by Info-ZIP. Compiled with gcc 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31) for Unix (Mac OS X) on Oct 6 2017.
md5: the one bundled with the system.
Steps to reproduce

    echo hahahaha > xx
    rm -f a.zip b.zip
    zip a.zip xx
    zip b.zip xx
    md5 a.zip b.zip

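Run these commands by hand, one at a time; do not paste and run them all at once. You will see that a.zip and b.zip, both built from the same xx file, have different md5 values.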

    local file header signature     4 bytes  (0x04034b50)
    version needed to extract       2 bytes
    general purpose bit flag        2 bytes
    compression method              2 bytes
    last mod file time              2 bytes
    last mod file date              2 bytes
    crc-32                          4 bytes
    compressed size                 4 bytes
    uncompressed size               4 bytes
    file name length                2 bytes
    extra field length              2 bytes

    file name (variable size)
    extra field (variable size)

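The layout above maps directly onto a fixed 30-byte record followed by two variable-length parts. Here is a minimal Python sketch of reading it. The helper name `read_local_header` and the dictionary keys are made up for illustration, and the archive is built with Python's `zipfile` rather than the system zip, so the extra field it contains will generally differ from what Info-ZIP writes.

```python
import io
import struct
import zipfile

# Fixed part of the local file header, in the order listed above.
# "<" selects little-endian; I = 4 bytes, H = 2 bytes; 30 bytes in total.
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")
LOCAL_HEADER_SIGNATURE = 0x04034B50


def read_local_header(data: bytes) -> dict:
    """Decode the local file header found at the start of `data`."""
    (signature, version, flags, method, mod_time, mod_date, crc32,
     compressed_size, uncompressed_size, name_len, extra_len) = (
        LOCAL_HEADER.unpack_from(data, 0))
    if signature != LOCAL_HEADER_SIGNATURE:
        raise ValueError("data does not start with a local file header")
    name_end = LOCAL_HEADER.size + name_len
    return {
        "version_needed": version,
        "flags": flags,
        "compression_method": method,
        "last_mod_time": mod_time,
        "last_mod_date": mod_date,
        "crc32": crc32,
        "compressed_size": compressed_size,
        "uncompressed_size": uncompressed_size,
        "file_name": data[LOCAL_HEADER.size:name_end].decode("ascii"),
        "extra_field": data[name_end:name_end + extra_len],
    }


# Build a one-entry archive in memory (name and content are sample values).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
    archive.writestr("xx", b"hahahaha\n")

header = read_local_header(buffer.getvalue())
for field, value in header.items():
    print(f"{field}: {value!r}")
```

To inspect a real archive, pass the bytes of a.zip to `read_local_header` instead of the in-memory buffer.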
A zip file ends with the central directory records:

    [central directory header n]
    [zip64 end of central directory record]
    [zip64 end of central directory locator]
    [end of central directory record]

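As a rough illustration of that trailing structure, the sketch below finds the end of central directory record in a small archive and reads where the central directory starts. The 22-byte field order is the one from the ZIP specification. The function name is hypothetical, and searching backwards for the signature is a simplification that assumes the signature bytes do not also occur inside a trailing archive comment.

```python
import io
import struct
import zipfile

# End of central directory record: signature, disk numbers, entry counts,
# central directory size and offset, comment length (22 bytes in total).
EOCD = struct.Struct("<IHHHHIIH")
EOCD_SIGNATURE = b"PK\x05\x06"  # 0x06054b50 stored little-endian


def read_end_of_central_directory(data: bytes) -> dict:
    """Locate and decode the last end of central directory record."""
    position = data.rfind(EOCD_SIGNATURE)
    if position < 0:
        raise ValueError("no end of central directory record found")
    (_signature, _disk, _cd_disk, _entries_on_disk, entries_total,
     cd_size, cd_offset, comment_len) = EOCD.unpack_from(data, position)
    return {
        "record_offset": position,
        "entries": entries_total,
        "central_directory_size": cd_size,
        "central_directory_offset": cd_offset,
        "comment_length": comment_len,
    }


# Sample archive built in memory; the name and content are placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
    archive.writestr("xx", b"hahahaha\n")

eocd = read_end_of_central_directory(buffer.getvalue())
print(eocd)
```

In a single-entry archive without zip64 records, the record sits right after the central directory, so its offset equals the directory offset plus the directory size.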
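Searching for clues
Reading a standards document is about as dry as it gets, though, and I could not be bothered to read the whole thing carefully. My guess was that the zip file changes because something time-related gets written during archiving, so I searched for the keywords time, date, and stamp. I downloaded the document first, then searched it with the ag command.
Sorting through the hits leaves three leads: last mod file time, last access time, and creation time.
So let's check whether zip modifies any of the file's timestamps. I used the gstat (GNU stat) command. (To use the GNU tools on a Mac, install them with brew install coreutils.)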

    echo hahahaha > xx
    gstat xx
    rm -f a.zip b.zip
    zip a.zip xx
    gstat xx
    zip b.zip xx
    gstat xx

You will find that the access time has changed. Going back to the document and searching for sentences that mention access time turns up three places.

    The following is the layout of the UNIX "extra" block.
    Note: all fields are stored in Intel low-byte/high-byte order.

            Value       Size          Description
            -----       ----          -----------
    (UNIX)  0x000d      2 bytes       Tag for this "extra" block type
            TSize       2 bytes       Size for the following data block
            Atime       4 bytes       File last access time
            Mtime       4 bytes       File last modification time
            Uid         2 bytes       File user ID
            Gid         2 bytes       File group ID
            (var)       variable      Variable length data field


    local file header signature     4 bytes  (0x04034b50)
    version needed to extract       2 bytes
    general purpose bit flag        2 bytes
    compression method              2 bytes
    last mod file time              2 bytes
    last mod file date              2 bytes
    crc-32                          4 bytes
    compressed size                 4 bytes
    uncompressed size               4 bytes
    file name length                2 bytes
    extra field length              2 bytes

    file name (variable size)
    extra field (variable size)

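The zip on the Mac is Apple's own build of Info-ZIP, so it may well fill in fields that the standard document does not describe in detail. 0x5455 is one example: it takes the place of 0x000d and 0x000a from the standard document.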

     -Extended Timestamp Extra Field:
      ==============================

      The following is the layout of the extended-timestamp extra block.
      (Last Revision 19970118)

      Local-header version:

              Value         Size        Description
              -----         ----        -----------
      (time)  0x5455        Short       tag for this extra block type ("UT")
              TSize         Short       total data size for this block
              Flags         Byte        info bits
              (ModTime)     Long        time of last modification (UTC/GMT)
              (AcTime)      Long        time of last access (UTC/GMT)
              (CrTime)      Long        time of original creation (UTC/GMT)

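To make the layout concrete, here is a small Python sketch that walks the blocks of an extra field and decodes a 0x5455 block. The function names and the sample timestamps are invented for illustration. The meaning of the Flags bits (bit 0 for ModTime, bit 1 for AcTime, bit 2 for CrTime) comes from the Info-ZIP extra field notes and is not spelled out in the excerpt above.

```python
import struct

UT_TAG = 0x5455
FLAG_MTIME, FLAG_ATIME, FLAG_CTIME = 0x01, 0x02, 0x04


def iter_extra_blocks(extra: bytes):
    """Yield (tag, data) for each block in a raw extra field."""
    position = 0
    while position + 4 <= len(extra):
        tag, size = struct.unpack_from("<HH", extra, position)
        yield tag, extra[position + 4:position + 4 + size]
        position += 4 + size


def parse_extended_timestamp(data: bytes) -> dict:
    """Decode the data part of a 0x5455 block into named Unix times."""
    flags = data[0]
    times = {}
    position = 1
    for bit, name in ((FLAG_MTIME, "mtime"), (FLAG_ATIME, "atime"),
                      (FLAG_CTIME, "ctime")):
        # A time is only stored when its flag is set and bytes remain.
        if flags & bit and position + 4 <= len(data):
            (times[name],) = struct.unpack_from("<i", data, position)
            position += 4
    return times


def make_ut_block(mtime: int, atime: int) -> bytes:
    """Build a local-header style UT block holding mtime and atime."""
    return struct.pack("<HHBii", UT_TAG, 9, FLAG_MTIME | FLAG_ATIME,
                       mtime, atime)


# Same modification time, access times a few seconds apart (sample values).
first = make_ut_block(1530345789, 1530345789)
second = make_ut_block(1530345789, 1530345800)

for block in (first, second):
    for tag, data in iter_extra_blocks(block):
        if tag == UT_TAG:
            print(hex(tag), parse_extended_timestamp(data))
print("blocks identical:", first == second)
```

The two blocks differ only in the four AcTime bytes, and that is enough to change the checksum of the whole archive.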
Here is how the names in the Size column map to byte counts.
All integer fields in the descriptions below are in little-endian (Intel) format unless otherwise specified. Note that "Short" means two bytes, "Long" means four bytes, and "Long-Long" means eight bytes, regardless of their native sizes. Unless specifically noted, all integer fields should be interpreted as unsigned (non-negative) numbers.
Byte: 1 byte
Short: 2 bytes
Long: 4 bytes
Long-Long: 8 bytes
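For anyone decoding these fields in Python, the four names line up with `struct` format codes. The mapping below is my own shorthand, not something the document defines.

```python
import struct

# Spec name -> struct format code, always read with "<" (little-endian).
SIZE_CODES = {"Byte": "B", "Short": "H", "Long": "I", "Long-Long": "Q"}

for name, code in SIZE_CODES.items():
    print(f"{name}: {struct.calcsize('<' + code)} byte(s)")

# A Short holding 0x5455 is stored low byte first: 0x55 ('U'), 0x54 ('T').
ut_tag_bytes = struct.pack("<H", 0x5455)
print(ut_tag_bytes)
```

This is also where the "UT" nickname of the 0x5455 tag comes from: those are the two bytes as they appear in the file.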
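From this we can work out the following diagram:
Conclusion
The variable really is the access time. It is stored in the Extended Timestamp Extra Field, one of the extra fields of the local file header.
The journey has finally reached its destination.
Finally: how to make zip output identical
If you want zip to produce the same bytes every time, use the -X or --no-extra option, which keeps the extra fields out of the archive.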

    echo hahahaha > xx
    rm -f a.zip b.zip
    zip -X a.zip xx
    zip -X b.zip xx
    md5 a.zip b.zip

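The same idea carries over to other tools: an archive is reproducible once every time-dependent field is pinned or left out. As an analogy only, here is a sketch using Python's `zipfile` module instead of the system zip; the helper name `build_archive` and the fixed date are arbitrary choices for the example.

```python
import hashlib
import io
import zipfile


def build_archive(date_time: tuple) -> bytes:
    """Build a one-entry archive whose only timestamp is `date_time`."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        # Passing an explicit ZipInfo stops zipfile from using the current time.
        info = zipfile.ZipInfo("xx", date_time=date_time)
        archive.writestr(info, b"hahahaha\n")
    return buffer.getvalue()


FIXED_TIME = (2018, 6, 30, 16, 3, 10)  # arbitrary pinned timestamp
first = build_archive(FIXED_TIME)
second = build_archive(FIXED_TIME)

print(hashlib.md5(first).hexdigest())
print(hashlib.md5(second).hexdigest())
print("identical:", first == second)
```

Changing `FIXED_TIME` by as little as two seconds changes the header bytes and therefore the checksum.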
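Loop 100 times: for i in {1..100}; do ./d hahahaha; echo ""; done.
Tallying the results shows that the access time stored in the zip is only precise to the second: if two zip runs fall within the same second (the same YYYY-MM-DD HH:mm:SS), the two zip files come out identical.
(Strictly speaking, there was no need to run the experiment just to show off. The ZIP documentation gives the format of the access time as "The time values are in standard Unix signed-long format, indicating the number of seconds since 1 January 1970 00:00:00", which means only whole seconds are recorded.)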
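A tiny worked example of that one-second granularity, using made-up timestamps and assuming the fractional part is simply dropped when the time is stored:

```python
import struct


def stored_time(timestamp: float) -> bytes:
    """Pack a Unix time as a 4-byte little-endian count of whole seconds."""
    return struct.pack("<i", int(timestamp))


same_second_a = stored_time(1530345789.10)
same_second_b = stored_time(1530345789.95)
next_second = stored_time(1530345790.01)

print(same_second_a == same_second_b)  # both fall within the same second
print(same_second_a == next_second)    # the next second packs differently
```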
LG is the helper that handles the little-endian byte order conversion. The definitions of EB_UT_FL_ATIME and EB_HEADSIZE can both be found in zip.h. ef_buf comes from the extra field of zlist.

    /* Structures for in-memory file information */
    struct zlist {
      /* See central header in zipfile.c for what vem..off are */
      ush vem, ver, flg, how;
      ulg tim, crc, siz, len;
      extent nam, ext, cext, com;   /* offset of ext must be >= LOCHEAD */
      ush dsk, att, lflg;           /* offset of lflg must be >= LOCHEAD */
      ulg atx, off;
      char *name;                   /* File name in zip file */
      char *extra;                  /* Extra field (set only if ext != 0) */
      char *cextra;                 /* Extra in central (set only if cext != 0) */
      char *comment;                /* Comment (set only if com != 0) */
      char *iname;                  /* Internal file name after cleanup */
      char *zname;                  /* External version of internal name */
      int mark;                     /* Marker for files to operate on */
      int trash;                    /* Marker for files to delete */
      int dosflag;                  /* Set to force MSDOS file attributes */
      struct zlist far *nxt;        /* Pointer to next header in list */
    };

zipinfo
Along the way I stumbled on the zipinfo command, which turns out to be very handy. For example:

    𝕬 zipinfo a.zip

    Archive:  a.zip
    Zip file size: 161 bytes, number of entries: 1
    -rw-r--r--  3.0 unx        7 tx stor 18-Jun-30 16:03 xx
    1 file, 7 bytes uncompressed, 7 bytes compressed:  0.0%

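Here the x in tx is a flag indicating that the file has an extra field (unx only names the originating file system, Unix). For the full explanation see man zipinfo and search for the keyword extra field: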

    The second character may also take on four values, depending on whether
    there is an extended local header and/or an ``extra field'' associated
    with the file (fully explained in PKWare's APPNOTE.TXT, but basically
    analogous to pragmas in ANSI C--i.e., they provide a standard way to
    include non-standard information in the archive). If neither exists, the
    character will be a hyphen (`-'); if there is an extended local header
    but no extra field, `l'; if the reverse, `x'; and if both exist, `X'.
    Thus the file in this example is (probably) a text file, is not
    encrypted, and


    Zip archive file size:                       161 (00000000000000A1h)
    Actual end-cent-dir record offset:           139 (000000000000008Bh)
    Expected end-cent-dir record offset:         139 (000000000000008Bh)
    (based on the length of the central directory and its expected offset)

    This zipfile constitutes the sole disk of a single-part archive; its
    central directory contains 1 entry.
    The central directory is 72 (0000000000000048h) bytes long,
    and its (expected) offset in bytes from the beginning of the zipfile
    is 67 (0000000000000043h).

    Central directory entry #1:
    ---------------------------

      xx

      offset of local header from start of archive:   0 (0000000000000000h) bytes
      file system or operating system of origin:      Unix
      version of encoding software:                   3.0
      minimum file system compatibility required:     MS-DOS, OS/2 or NT FAT
      minimum software version required to extract:   1.0
      compression method:                             none (stored)
      file security status:                           not encrypted
      extended local header:                          no
      file last modified on (DOS date/time):          2018 Jun 30 16:03:10
      file last modified on (UT extra field modtime): 2018 Jun 30 16:03:09 local
      file last modified on (UT extra field modtime): 2018 Jun 30 08:03:09 UTC
      32-bit CRC value (hex):                         16b28489
      compressed size:                                7 bytes
      uncompressed size:                              7 bytes
      length of filename:                             2 characters
      length of extra field:                          24 bytes
      length of file comment:                         0 characters
      disk number on which file begins:               disk 1
      apparent file type:                             text
      Unix file attributes (100644 octal):            -rw-r--r--
      MS-DOS file attributes (00 hex):                none

      The central-directory extra field contains:
      - A subfield with ID 0x5455 (universal time) and 5 data bytes.
        The local extra field has UTC/GMT modification/access times.
      - A subfield with ID 0x7875 (Unix UID/GID (any size)) and 11 data bytes:
        01 04 f5 01 00 00 04 14 00 00 00.

      There is no file comment.

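Note that the file last modified on (DOS date/time) value here does not come from the extra field. It comes from the last mod file time and last mod file date header fields, which the central directory entry carries alongside the local file header.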
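Those two header fields use the MS-DOS packed format: each is a 16-bit value, and seconds are stored divided by two. A short Python sketch of the bit layout follows (the function names are mine). It also hints at why the DOS line above reads 16:03:10 while the UT line reads 16:03:09: an odd second cannot be represented, so the tool has presumably rounded it up to the next even one.

```python
def encode_dos_datetime(year, month, day, hour, minute, second):
    """Pack a local date and time into the two 16-bit MS-DOS fields."""
    dos_date = ((year - 1980) << 9) | (month << 5) | day
    dos_time = (hour << 11) | (minute << 5) | (second // 2)
    return dos_date, dos_time


def decode_dos_datetime(dos_date: int, dos_time: int) -> tuple:
    """Unpack the MS-DOS date and time fields into calendar values."""
    year = 1980 + (dos_date >> 9)
    month = (dos_date >> 5) & 0x0F
    day = dos_date & 0x1F
    hour = dos_time >> 11
    minute = (dos_time >> 5) & 0x3F
    second = (dos_time & 0x1F) * 2  # only even seconds can be stored
    return (year, month, day, hour, minute, second)


dos_date, dos_time = encode_dos_datetime(2018, 6, 30, 16, 3, 10)
print(hex(dos_date), hex(dos_time))
print(decode_dos_datetime(dos_date, dos_time))

# Plain truncation of an odd second loses it: 16:03:09 decodes as 16:03:08.
print(decode_dos_datetime(*encode_dos_datetime(2018, 6, 30, 16, 3, 9)))
```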
zipinfo does not decode the contents of the extra field, so you cannot see the access time value. It only gives a brief summary:

    The central-directory extra field contains:
    - A subfield with ID 0x5455 (universal time) and 5 data bytes.
      The local extra field has UTC/GMT modification/access times.
    - A subfield with ID 0x7875 (Unix UID/GID (any size)) and 11 data bytes:
      01 04 f5 01 00 00 04 14 00 00 00.

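The raw bytes zipinfo prints for the 0x7875 subfield can be decoded by hand. The sketch below assumes the layout given in the Info-ZIP extra field notes for this block (a version byte, then a size-prefixed UID and a size-prefixed GID, both little-endian). That layout is not quoted anywhere in this article, and the function name is made up.

```python
def parse_unix_uid_gid(data: bytes) -> tuple:
    """Decode the data bytes of a 0x7875 block into (version, uid, gid)."""
    version = data[0]
    uid_size = data[1]
    uid = int.from_bytes(data[2:2 + uid_size], "little")
    gid_size = data[2 + uid_size]
    gid_start = 3 + uid_size
    gid = int.from_bytes(data[gid_start:gid_start + gid_size], "little")
    return version, uid, gid


# The 11 data bytes exactly as printed by zipinfo above.
raw = bytes.fromhex("01 04 f5 01 00 00 04 14 00 00 00")
version, uid, gid = parse_unix_uid_gid(raw)
print(f"version={version} uid={uid} gid={gid}")
```

Under that assumption the bytes decode to UID 501 and GID 20, values that stay the same between runs, so this subfield does not contribute to the changing checksum.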