Hello,
while working on Microsoft Cabinet files I noticed that the Wikipedia page
about that file format,
https://en.wikipedia.org/wiki/Cabinet_(file_format), mentions
Microsoft Publisher documents created with the "Pack and Go" feature and
carrying the PUZ extension.
When I run TrID on such PUZ files created by Microsoft Publisher
2003, they are identified too generally as "Microsoft Cabinet Archive"
with the wrong "CAB" extension by ark-cab.trid.xml (see appended
output/trid-old.txt).
So I ran tridscan and manually tuned the generated ark-cab-puz.trid.xml.
No official documentation or complete specification exists
for such packed files. The best information about PUZ files is found
on the fileformats.archiveteam.org wiki, so I used that page as the reference
URL:
<RefURL>
http://fileformats.archiveteam.org/wiki/PUZ</RefURL>
Because the file format description is incomplete, I keep as many
file type characteristics as possible.
Publisher 2003 does not register a file type for the PUZ extension. Because PUZ
files are cabinet archives, I use the cabinet MIME type in the line:
<Mime>application/vnd.ms-cab-compressed</Mime>
According to the Microsoft Cabinet Format specification found at
https://msdn.microsoft.com/en-us/library/bb267310.aspx, cabinet archives start
with the file signature and the reserved1 area. Reserved areas are set to zero.
This is expressed by the first XML construct:
<Pattern>
<Bytes>4D53434600000000</Bytes>
<ASCII> M S C F</ASCII>
<Pos>0</Pos>
</Pattern>
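The same check can be sketched in Python. This is only an illustration of what the pattern tests, not part of the TrID definition; the function name is made up for this example.

```python
# "MSCF" signature followed by the four zero bytes of reserved1,
# exactly the 8 bytes listed in the pattern at position 0.
CAB_MAGIC = bytes.fromhex("4D53434600000000")

def starts_like_cabinet(data: bytes) -> bool:
    """Return True if data begins with the cabinet signature and a zero reserved1."""
    return data[:8] == CAB_MAGIC
```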
At offset 30 the cabinet flags are stored as a little-endian short with value 0.
Values 1 and 2 indicate additional header bytes used for building cabinet
chains (for example PRECOPY1.CAB->PRECOPY2.CAB->PRECOPY3.CAB). Obviously
this is not used for PUZ files. Value 4 indicates that additional reserved
bytes are present in the header. This was not found in the observed PUZ files.
At position 32 the set ID is stored as a short. For all inspected examples this
was 0000h.
iCabinet at offset 34 is the number of the cabinet file within a set, where 0
is the first cabinet. Apparently this is 0 for PUZ files.
These 3 facts are now expressed by the third XML construct:
<Pattern>
<Bytes>000000000000</Bytes>
<Pos>30</Pos>
</Pattern>
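As a rough illustration, the three shorts covered by this pattern could be read as follows. The flag constant names follow the cfhdr* names of the cabinet specification; the function name and the returned dictionary keys are invented for this sketch.

```python
import struct

# Flag bits of the cabinet header (cfhdrPREV_CABINET, cfhdrNEXT_CABINET,
# cfhdrRESERVE_PRESENT in the cabinet specification).
FLAG_PREV_CABINET = 0x0001
FLAG_NEXT_CABINET = 0x0002
FLAG_RESERVE_PRESENT = 0x0004

def read_set_fields(data: bytes) -> dict:
    """Read flags, setID and iCabinet (three little-endian shorts at offset 30)."""
    flags, set_id, i_cabinet = struct.unpack_from("<HHH", data, 30)
    return {
        "flags": flags,
        "chained": bool(flags & (FLAG_PREV_CABINET | FLAG_NEXT_CABINET)),
        "reserve_present": bool(flags & FLAG_RESERVE_PRESENT),
        "set_id": set_id,
        "i_cabinet": i_cabinet,
    }
```

For the inspected PUZ files all three values are zero, which is what the six zero bytes of the pattern express.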
At offset 24 the cabinet file format version is stored. Currently only
versionMajor = 1 and versionMinor = 3 exist.
At offset 26 the number of CFFOLDER entries is stored as the short "cFolders".
Only the value 1 was found.
A flag value of 0 also means there are no optional bytes; in other words the
header is minimal (36 bytes), the CFFOLDER structure is minimal (8 bytes) and
the CFFILE structure is minimal (16 bytes + name bytes).
The offset of the first CFFILE entry, stored as the long "coffFiles" at offset
16, should be equal to the sum of the header size (36) and the size of the
folder entries (8). Indeed, this is true (44 = 2Ch) when there is only 1 folder
entry.
These 5 variables (reserved2, coffFiles, reserved3, version and cFolders) are
now expressed by the second XML construct:
<Pattern>
<Bytes>000000002C0000000000000003010100</Bytes>
<Pos>12</Pos>
</Pattern>
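The arithmetic behind the value 2Ch can be sketched like this. The sketch assumes a header without optional fields (flags = 0) and is slightly looser than the pattern, since it accepts any folder count as long as coffFiles is consistent with it; the function and constant names are made up.

```python
import struct

HEADER_SIZE = 36  # CFHEADER without optional fields
FOLDER_SIZE = 8   # one CFFOLDER entry without reserved bytes

def check_header_core(data: bytes) -> bool:
    """Check the 16 bytes at offsets 12..27 of a minimal cabinet header."""
    # reserved2, coffFiles, reserved3 (longs), versionMinor, versionMajor
    # (bytes) and cFolders (short), all little-endian.
    (reserved2, coff_files, reserved3,
     ver_minor, ver_major, c_folders) = struct.unpack_from("<IIIBBH", data, 12)
    return (reserved2 == 0 and reserved3 == 0
            and (ver_major, ver_minor) == (1, 3)
            and coff_files == HEADER_SIZE + FOLDER_SIZE * c_folders)
```

With one folder this gives 36 + 8 = 44 = 2Ch, the value found in the pattern bytes.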
At position 36 the CFFOLDER structure starts with the offset of the first
CFDATA block, stored as the long "coffCabStart". Normally this value is low.
That was expressed by the construct:
<Pattern>
<Bytes>000000</Bytes>
<Pos>37</Pos>
</Pattern>
But the value grows if the archive contains more files and member names are
longer, so I removed that pattern.
At position 40 the number of CFDATA blocks is stored as the short "cCFData".
For the small examples inspected this was only a few blocks, so the upper byte
was zero.
At position 42 the compression type is stored as the short "typeCompress". For
the inspected examples this was always MSZIP (=0001h). This is now expressed by
the XML construct:
<Pattern>
<Bytes>0100</Bytes>
<Pos>42</Pos>
</Pattern>
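A small sketch of reading the first CFFOLDER entry, assuming the minimal 36-byte header described above. The compression names follow the cabinet specification, where the low four bits of typeCompress select the method; the function name is hypothetical.

```python
import struct

# Compression methods selected by the low four bits of typeCompress.
COMPRESSION_NAMES = {0: "none", 1: "MSZIP", 2: "Quantum", 3: "LZX"}

def read_first_folder(data: bytes) -> tuple:
    """Return (coffCabStart, cCFData, compression name) of the first CFFOLDER."""
    # One long and two shorts, little-endian, starting at offset 36.
    coff_cab_start, c_cf_data, type_compress = struct.unpack_from("<IHH", data, 36)
    method = type_compress & 0x000F
    return coff_cab_start, c_cf_data, COMPRESSION_NAMES.get(method, "unknown")
```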
At position 44 the CFFILE structure starts. At position 48 the uncompressed
offset of the file is stored as the long "uoffFolderStart". For the first file,
this value will usually be zero. At position 52 the index into the CFFOLDER
area is stored as the short "iFolder". A value of zero indicates the first
folder in this cabinet file. These 2 facts are now expressed by the XML
construct:
<Pattern>
<Bytes>000000000000</Bytes>
<Pos>48</Pos>
</Pattern>
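For illustration, the start of the first CFFILE entry could be read as below, again assuming the minimal layout with one folder so that the entry begins at offset 44. The function name and dictionary keys are made up for this sketch.

```python
import struct

def read_first_file_entry(data: bytes) -> dict:
    """Read cbFile, uoffFolderStart and iFolder of the CFFILE entry at offset 44."""
    # Two longs followed by one short, little-endian.
    cb_file, uoff_folder_start, i_folder = struct.unpack_from("<IIH", data, 44)
    return {
        "cb_file": cb_file,                      # uncompressed size, offset 44
        "uoff_folder_start": uoff_folder_start,  # offset 48
        "i_folder": i_folder,                    # offset 52
    }
```

The pattern only fixes the last two fields; cbFile at offset 44 is the member size and naturally varies.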
At position 54 short values for date and time are stored. These of course
differ between files, all the more since I also created
packed01-postcard7.puz (1 Jan 1980) with the help of a hex editor.
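The two shorts use the usual DOS date and time packing, as described in the cabinet specification. A minimal decoding sketch follows; the function name is hypothetical, and a plain tuple is returned so that out-of-range values in hand-edited files do not raise an error.

```python
def decode_dos_datetime(date: int, time: int) -> tuple:
    """Decode the CFFILE date and time shorts into (Y, M, D, h, m, s)."""
    year = 1980 + (date >> 9)       # bits 9-15: years since 1980
    month = (date >> 5) & 0x0F      # bits 5-8
    day = date & 0x1F               # bits 0-4
    hour = time >> 11               # bits 11-15
    minute = (time >> 5) & 0x3F     # bits 5-10
    second = (time & 0x1F) * 2      # bits 0-4, stored in 2-second steps
    return year, month, day, hour, minute, second
```

For example, a date value of 0021h with a time value of 0000h decodes to 1 Jan 1980, 00:00:00.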
At position 58 the member attributes are stored as the short "attribs". If we
trust Microsoft's CAB specification, where the highest bit is given by
_A_NAME_IS_UTF with value 0x80, the high byte of the attributes is never used.
For the inspected examples I always found 0020h, which means the _A_ARCH flag
is set because the file was modified since the last backup. With the help of a
hex editor I created the example packed01-postcard7.puz with attribs=0. This
example was accepted by Microsoft Office Publisher PNG Unpack.exe. So only the
unused high byte of attribs is expressed by the XML construct:
<Pattern>
<Bytes>00</Bytes>
<Pos>59</Pos>
</Pattern>
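The attribute bits named in the cabinet specification can be decoded as in the following sketch; the function names are made up for illustration.

```python
# Attribute bits of a CFFILE entry as named in the cabinet specification.
ATTRIB_FLAGS = {
    0x01: "_A_RDONLY",
    0x02: "_A_HIDDEN",
    0x04: "_A_SYSTEM",
    0x20: "_A_ARCH",
    0x40: "_A_EXEC",
    0x80: "_A_NAME_IS_UTF",
}

def describe_attribs(attribs: int) -> list:
    """Return the names of all attribute flags set in the attribs short."""
    return [name for bit, name in ATTRIB_FLAGS.items() if attribs & bit]

def high_byte_is_zero(attribs: int) -> bool:
    """The condition the pattern at position 59 tests."""
    return (attribs >> 8) == 0
```

The value 0020h seen in the examples yields only _A_ARCH, and both 0020h and the hand-edited value 0 keep the high byte at zero.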
Afterwards the first archive member name is stored. Unfortunately the
documentation is incomplete. The only fact mentioned is that an original
Publisher document named "MyPublication.pub" becomes a member named
"MyPublicationPNG.pub" after the "Pack and Go" feature is used. Looking at the
output of 7zip (see output/7z-l.txt), the PUB document is sometimes the first
member and sometimes the second member. So this fact is expressed by the XML
construct:
<GlobalStrings>
<String>PNG.PUB</String>
</GlobalStrings>
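Why a global string is used instead of a fixed-position pattern can be sketched as follows. The sketch assumes the minimal layout, where the first member name starts at offset 60 and is NUL-terminated, and it assumes that the string is matched anywhere in the file without regard to case; that matching behaviour is an assumption about TrID here, not something verified. Both function names are hypothetical.

```python
def first_member_name(data: bytes) -> str:
    """Return the NUL-terminated name of the first CFFILE entry (offset 60)."""
    end = data.index(b"\x00", 60)
    return data[60:end].decode("ascii", errors="replace")

def mentions_png_pub(data: bytes) -> bool:
    """Case-insensitive search for the string PNG.PUB anywhere in the file."""
    return b"PNG.PUB" in data.upper()
```

If the PUB document is the second member, first_member_name returns some other file name, while mentions_png_pub still finds the document name further into the file.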
Furthermore I summarize all observed facts in a remark line like
<Rem>
packed01.puz with 1 CFFOLDER, no extra bytes and MSZIP compression contains
Microsoft Publisher document *PNG.pub and associated files created by
"Pack and Go" feature of Publisher.
</Rem>
With the new definition file all inspected PUZ files are now described more
precisely (see appended output/trid-new.txt).
The TrID definition, some examples and the output are stored in the archive
puz.zip.
I hope that my XML file can be used in a future version of triddefs.
With best wishes
Joerg Jenderek