Mark0's Forum
		Software => TrID File Identifier => Topic started by: jenderek on August 26, 2017, 03:17:00 PM
		
			
			- 
				Hello,
 
 while looking into Microsoft Cabinet files I noticed that the Wikipedia page
 on that format, https://en.wikipedia.org/wiki/Cabinet_(file_format), mentions
 Microsoft Publisher documents created by the "Pack and Go" feature, which use
 the PUZ extension.
 
 When I run TrID on such PUZ files created by Microsoft Publisher 2003, they
 are identified too generally as "Microsoft Cabinet Archive" with the wrong
 "CAB" extension by ark-cab.trid.xml (see appended output/trid-old.txt).
 
 So I ran tridscan and then manually tuned the generated ark-cab-puz.trid.xml.
 
 There is no official documentation or complete specification for such packed
 files. The best information about PUZ files is found on the
 fileformats.archiveteam.org wiki, so I used that page as the reference URL:
 <RefURL>http://fileformats.archiveteam.org/wiki/PUZ</RefURL>
 
 Because the file format description is incomplete, I keep as many of the
 file type characteristics as possible.
 
 Publisher 2003 does not register a file type for the PUZ extension. Because
 PUZ files are cabinet archives, I use that MIME type with the line:
 <Mime>application/vnd.ms-cab-compressed</Mime>
 
 According to the Microsoft Cabinet Format specification found at
 https://msdn.microsoft.com/en-us/library/bb267310.aspx, cabinet archives start
 with the file signature followed by the reserved1 field. Reserved fields are
 set to zero. This is expressed by the first XML construct:
 <Pattern>
 <Bytes>4D53434600000000</Bytes>
 <ASCII> M S C F</ASCII>
 <Pos>0</Pos>
 </Pattern>
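 
 The same check can be sketched in Python. This is only an illustration of what
 the pattern at position 0 tests; the function name has_cab_signature is made
 up for this example and is not part of TrID.

```python
import struct

def has_cab_signature(data: bytes) -> bool:
    """True if data starts with 'MSCF' followed by a zero reserved1 field."""
    if len(data) < 8:
        return False
    # 4-byte signature, then reserved1 as a little-endian 32-bit value.
    signature, reserved1 = struct.unpack_from("<4sI", data, 0)
    return signature == b"MSCF" and reserved1 == 0

# The eight bytes used in the pattern above.
sample = bytes.fromhex("4D53434600000000")
print(has_cab_signature(sample))
```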
 
 At offset 30 the cabinet flags are stored as a little-endian short; for PUZ
 files the value is 0. Values 1 and 2 indicate additional header bytes for
 building cabinet chains (for example PRECOPY1.CAB -> PRECOPY2.CAB ->
 PRECOPY3.CAB). Obviously this is not used for PUZ files. Value 4 reserves
 additional bytes in the header. This was not found in the observed PUZ files.
 At position 32 the set ID is stored as a short. For all inspected examples
 this was 0000h.
 iCabinet at offset 34 is the number of this cabinet file within a set, with 0
 for the first cabinet. Apparently this is 0 for PUZ files.
 These 3 facts are expressed by the third XML construct:
 <Pattern>
 <Bytes>000000000000</Bytes>
 <Pos>30</Pos>
 </Pattern>
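 
 As a rough sketch, the three shorts at offsets 30, 32 and 34 can be read like
 this in Python. The constant and function names are chosen for this example
 only; the bit values 1, 2 and 4 are the ones described above.

```python
import struct

# Flag bits as described above (names are local to this sketch).
FLAG_PREV_CABINET = 0x0001     # a previous cabinet exists in the chain
FLAG_NEXT_CABINET = 0x0002     # a next cabinet exists in the chain
FLAG_RESERVE_PRESENT = 0x0004  # extra reserved bytes follow the header

def read_flags_block(header: bytes):
    """Return (flags, setID, iCabinet) read at offsets 30, 32 and 34."""
    return struct.unpack_from("<HHH", header, 30)

def is_standalone(header: bytes) -> bool:
    """True if all six bytes at offset 30 are zero, as the pattern requires."""
    flags, set_id, i_cabinet = read_flags_block(header)
    return flags == 0 and set_id == 0 and i_cabinet == 0
```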
 
 At offset 24 the cabinet file format version is stored. Currently only
 versionMajor = 1 and versionMinor = 3 exist.
 At offset 26 the number of CFFOLDER entries is stored as the short "cFolders".
 Only the value 1 was found.
 
 Flag value 0 also means there are no optional bytes, in other words the header
 is minimal (36 bytes), the CFFOLDER structure is minimal (8 bytes) and the
 CFFILE structure is minimal (16 bytes + name bytes).
 The offset of the first CFFILE entry, stored as the long "coffFiles" at offset
 16, should equal the sum of the header size (36) and the folder entries size
 (8). Yes, this is true (44 = 2Ch) if there is only 1 folder entry.
 
 These 5 variables (reserved2, coffFiles, reserved3, version and cFolders) are
 expressed by the second XML construct:
 <Pattern>
 <Bytes>000000002C0000000000000003010100</Bytes>
 <Pos>12</Pos>
 </Pattern>
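 
 To double-check the arithmetic, the 16 pattern bytes can be decoded field by
 field. This is just a small Python sketch; decode_pattern is a hypothetical
 helper, and the sizes 36 and 8 are the minimal header and folder sizes
 mentioned above.

```python
import struct

CFHEADER_SIZE = 36  # minimal header, no optional bytes
CFFOLDER_SIZE = 8   # minimal folder entry

def decode_pattern(pattern: bytes) -> dict:
    """Split the 16 bytes at position 12 into the five header variables."""
    reserved2, coff_files, reserved3, ver_minor, ver_major, c_folders = (
        struct.unpack("<IIIBBH", pattern)
    )
    return {
        "reserved2": reserved2,
        "coffFiles": coff_files,
        "reserved3": reserved3,
        "version": (ver_major, ver_minor),
        "cFolders": c_folders,
    }

fields = decode_pattern(bytes.fromhex("000000002C0000000000000003010100"))
# coffFiles should equal header size plus all folder entries.
expected = CFHEADER_SIZE + fields["cFolders"] * CFFOLDER_SIZE
print(fields["coffFiles"], expected)
```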
 
 At position 36 the CFFOLDER structure starts with the offset of the first
 CFDATA block, stored as the long "coffCabStart". Normally this value is low.
 That was expressed by the construct:
 <Pattern>
 <Bytes>000000</Bytes>
 <Pos>37</Pos>
 </Pattern>
 But the value grows if the archive contains more files and the member names
 are longer, so I removed that pattern.
 
 At position 40 the number of CFDATA blocks is stored as the short "cCFData".
 The inspected small examples had only a few blocks, so the upper byte was
 null. At position 42 the compression type is stored as the short
 "typeCompress". For the inspected examples this was always MSZIP (= 0001h).
 This is expressed by the XML construct:
 <Pattern>
 <Bytes>0100</Bytes>
 <Pos>42</Pos>
 </Pattern>
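 
 A minimal Python sketch for reading the 8-byte CFFOLDER entry follows. It
 assumes the minimal layout described above (entry at offset 36) and masks the
 low four bits of typeCompress to get the compression type, as the Cabinet
 specification describes; read_cffolder is a name made up for this example.

```python
import struct

# Compression types in the low four bits of typeCompress.
COMPRESSION_NAMES = {0: "none", 1: "MSZIP", 2: "Quantum", 3: "LZX"}

def read_cffolder(data: bytes, offset: int = 36):
    """Return (coffCabStart, cCFData, compression name) of one CFFOLDER."""
    coff_cab_start, c_cf_data, type_compress = struct.unpack_from(
        "<IHH", data, offset
    )
    name = COMPRESSION_NAMES.get(type_compress & 0x000F, "unknown")
    return coff_cab_start, c_cf_data, name
```

 Note that the TrID pattern is stricter than this sketch: it requires both
 bytes at position 42 to be exactly 01 00.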
 
 At position 44 the CFFILE structure starts. At position 48 the uncompressed
 offset of the file is stored as the long "uoffFolderStart". For the first
 file, this value will usually be zero. At position 52 the index into the
 CFFOLDER area is stored as the short "iFolder". A value of zero indicates this
 is the first folder in this cabinet file. These 2 facts are expressed by the
 XML construct:
 <Pattern>
 <Bytes>000000000000</Bytes>
 <Pos>48</Pos>
 </Pattern>
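 
 The start of the first CFFILE entry can be read in the same way. This sketch
 assumes the entry begins at offset 44 (one minimal folder entry); the field at
 44 itself is the uncompressed file size "cbFile" from the specification, which
 the pattern does not test. The function name is hypothetical.

```python
import struct

def read_first_cffile(data: bytes, offset: int = 44):
    """Return (cbFile, uoffFolderStart, iFolder) of the first CFFILE entry."""
    return struct.unpack_from("<IIH", data, offset)

def matches_pos48_pattern(data: bytes) -> bool:
    """True if the six bytes at position 48 are all zero."""
    _, uoff_folder_start, i_folder = read_first_cffile(data)
    return uoff_folder_start == 0 and i_folder == 0
```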
 
 At position 54 short values for date and time are stored. These of course
 differ from file to file, also in packed01-postcard7.puz (1 Jan 1980), which
 I created with the help of a hex editor.
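 
 For reference, the two shorts use the MS-DOS date and time encoding given in
 the Cabinet specification. A small Python sketch of the decoding
 (decode_dos_datetime is a name invented here) could look like this:

```python
import datetime

def decode_dos_datetime(date: int, time: int) -> datetime.datetime:
    """Decode the CFFILE date and time shorts (MS-DOS encoding)."""
    year = 1980 + (date >> 9)       # bits 9-15: years since 1980
    month = (date >> 5) & 0x0F      # bits 5-8
    day = date & 0x1F               # bits 0-4
    hour = time >> 11               # bits 11-15
    minute = (time >> 5) & 0x3F     # bits 5-10
    second = (time & 0x1F) * 2      # bits 0-4, stored in 2-second steps
    # Raises ValueError for out-of-range fields such as month 0.
    return datetime.datetime(year, month, day, hour, minute, second)

# Under this encoding 1 Jan 1980, 00:00:00 is date=0x0021, time=0x0000.
print(decode_dos_datetime(0x0021, 0x0000))
```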
 
 At position 58 the member attributes are stored as the short "attribs". If we
 trust Microsoft's CAB specification, where the highest bit is given by
 _A_NAME_IS_UTF with value 0x80, the high byte of the attributes is never used.
 For the inspected examples I always found 0020h, which means the _A_ARCH flag
 is set, because the file was modified since the last backup. With the help of
 a hex editor I created the example packed01-postcard7.puz with attribs=0.
 This example was accepted by Microsoft Office Publisher PNG Unpack.exe. So
 the attribs value is expressed by the XML construct:
 <Pattern>
 <Bytes>00</Bytes>
 <Pos>59</Pos>
 </Pattern>
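 
 A short Python sketch of how the attribs value breaks down into the flags of
 the Cabinet specification. The helper names are made up for this example; the
 second function mirrors the pattern, which only tests the high byte at
 position 59.

```python
import struct

# Attribute bits from the Cabinet specification.
ATTRIB_FLAGS = {
    0x01: "_A_RDONLY",
    0x02: "_A_HIDDEN",
    0x04: "_A_SYSTEM",
    0x20: "_A_ARCH",
    0x40: "_A_EXEC",
    0x80: "_A_NAME_IS_UTF",
}

def describe_attribs(attribs: int) -> list:
    """List the names of all flags set in attribs."""
    return [name for bit, name in ATTRIB_FLAGS.items() if attribs & bit]

def high_byte_is_zero(data: bytes, offset: int = 58) -> bool:
    """True if the high byte of the little-endian attribs short is zero."""
    (attribs,) = struct.unpack_from("<H", data, offset)
    return (attribs >> 8) == 0

print(describe_attribs(0x0020))
```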
 
 The first archive member name follows. Unfortunately the documentation is
 incomplete. The only fact mentioned is that an original Publisher document
 named "MyPublication.pub" becomes a member named "MyPublicationPNG.pub"
 after the "Pack and Go" feature is used. Looking at the output of 7zip (see
 output/7z-l.txt), the PUB document is sometimes the first member and sometimes
 the second. So this fact is expressed by the XML construct:
 <GlobalStrings>
 <String>PNG.PUB</String>
 </GlobalStrings>
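 
 Because the position of the PUB member varies, a position-independent search
 is needed. The following Python sketch assumes that the string may occur
 anywhere in the file and that case is ignored (the member name ends in
 "PNG.pub" while the definition uses "PNG.PUB"); that is an assumption about
 the matching, not a statement of how TrID implements it.

```python
def contains_global_string(data: bytes, needle: bytes = b"PNG.PUB") -> bool:
    """Case-insensitive (ASCII) search for needle anywhere in data."""
    return needle.upper() in data.upper()

# Hypothetical fragment of a CFFILE entry holding a member name.
fragment = b"\x20\x00MyPublicationPNG.pub\x00"
print(contains_global_string(fragment))
```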
 
 Furthermore I summarize all observed facts in a remark line like
 <Rem>
 packed01.puz with 1 CFFOLDER, no extra bytes and MSZIP compression contains
 Microsoft Publisher document *PNG.pub and associated files created by
 "Pack and Go" feature of Publisher.
 </Rem>
 
 With the new definition file all inspected PUZ files are now described more
 precisely (see appended output/trid-new.txt).
 
 The TrID definition, some examples and the output are stored in the archive
 puz.zip. I hope that my XML file can be used in a future version of triddefs.
 
 With best wishes
 Joerg Jenderek
 
- 
				Thanks Joerg!