Topic: variant sz.trid.xml replacements kwaj*.trid.xml for Microsoft compressed *.??$

jenderek

Hello,

when I run TrID on hundreds of old Microsoft compressed files, some are not
recognized (see appended output/trid-old.txt).

Information about the MS-DOS installation compression formats can be found at
http://fileformats.archiveteam.org, so I add this page as a reference URL to
the TrID definition files with the line:
<RefURL>http://fileformats.archiveteam.org/wiki/MS-DOS_installation_compression</RefURL>

The link there to the description of the SZDD and KWAJ formats
(szdd_kwaj_format.html) is dead, but I found the document at
https://hwiegman.home.xs4all.nl/fileformats/compress/

According to the documentation, files with names ending in $ are a third
variant of MS compressed archives. Examples of this third variant can be
found at http://www.qbasic.net/de/qbasic-downloads/compiler/qbasic-compiler.htm,
for instance.


In these files the shared characteristic pattern '88F027' is found not at
offset 4 but at offset 3, so such examples are not recognized by TrID.
Furthermore, the start pattern is now the string "SZ ". This is expressed by
the XML construct
   <Bytes>535A2088F02733D1</Bytes>
   <ASCII> S Z   . . ' 3</ASCII>
   <Pos>0</Pos>
in the new sz.trid.xml. The file name extension is expressed by the line
   <Ext>??$</Ext>
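
As a rough illustration of why a definition that expects '88F027' at offset 4 misses these files, the Python sketch below checks the 8-byte signature from the <Bytes> line above and locates the shared bytes. The helper name and the sample header are made up for the example; only the signature bytes come from the definition.

```python
# Signature bytes from the <Bytes> line above: "SZ " followed by 88 F0 27 33 D1.
SZ_SIGNATURE = bytes.fromhex("535A2088F02733D1")
SHARED = bytes.fromhex("88F027")


def is_sz_variant(header: bytes) -> bool:
    """Return True if the data starts with the "SZ " signature (hypothetical helper)."""
    return header.startswith(SZ_SIGNATURE)


# Synthetic header for illustration only: signature plus four arbitrary bytes.
sample = SZ_SIGNATURE + bytes([0x10, 0x27, 0x00, 0x00])
print(is_sz_variant(sample))  # True
print(sample.find(SHARED))    # 3
```

The shared bytes sit at offset 3 because "SZ " is one byte shorter than "SZDD".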

Extraction tools are a problem. A universal unpacker does not exist, and
EXPAND does not work for this variant. Extraction succeeds with UNPACK.EXE,
which I found on the first QBasic installation floppy, so I mention this in
the remark line. To help users I also add a specific user-defined MIME type
with the line:

   <Mime>application/x-ms-compress-sz</Mime>

In the definition file generated by tridscan the byte at offset 11 was null,
because the inspected original files are not very big (presumably it is the
most significant byte of the stored original size). I mention this in the
remark line as well.

With the new sz.trid.xml, examples like PWBBASIC.MX$ or RAMDRIVE.SY$ are now
recognized (see appended output/trid-new.txt).

The second variant, which starts with the string "SZDD", is detected by
szdd.trid.xml. I add the reference URL there as well.

At offset 9 the last character of the original file name extension is
sometimes stored (otherwise the byte is null). I mention this in the remark
line, so one can easily see that ARIAL.TT_ is the MS compressed TrueType
font ARIAL.TTf.
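
The name recovery described above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, the sample header is synthetic, and the full 8-byte SZDD signature used to build it (ending in 33) is an assumption, since the post only quotes "SZDD" and '88F027'.

```python
def restore_name(compressed_name: str, header: bytes) -> str:
    """Replace the trailing '_' with the character stored at offset 9, if any."""
    last_char = header[9]
    if last_char == 0 or not compressed_name.endswith("_"):
        return compressed_name  # nothing stored, keep the name unchanged
    return compressed_name[:-1] + chr(last_char)


# Synthetic SZDD header: assumed signature, mode "A", stored character "f",
# and an arbitrary 4-byte little-endian size.
header = bytes.fromhex("535A444488F02733") + b"A" + b"f" + (12345).to_bytes(4, "little")
print(restore_name("ARIAL.TT_", header))  # ARIAL.TTf
```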

For these compressed archives the file name extension has an underscore as
its last character. This is now expressed by the line
   <Ext>??_</Ext>
instead of the imprecise expression:
   <Ext>EX_</Ext>

The size of the original uncompressed file is stored as a little-endian long
value at offset 10. I mention this in the remark line as well.

It is very annoying to install and run an emulator like DOSBox just to
extract such compressed archives with the DOS expand tool. 7z can extract
such archives as its MsLZ format type, so I mention this in the remark line
too. I also add a specific user-defined MIME type with the line:
   <Mime>application/x-ms-compress-szdd</Mime>

The compression mode used is stored at offset 8. In most cases this is "A"
(0x41). According to https://www.betaarchive.com/forum/viewtopic.php?t=26161,
"B" is found in Windows 3.1 builds 026 and 034e, so such archives are not
detected by szdd.trid.xml. I mention the compression mode in the remark line.
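
To make the SZDD header fields above concrete, here is a minimal Python sketch that reads the compression mode (offset 8) and the original size (offset 10). It assumes that the "long" is an unsigned 32-bit value and that the full signature is "SZDD" followed by 88 F0 27 33; the function name and the sample bytes are invented for the example.

```python
import struct

# Assumed full signature; the post itself only quotes "SZDD" and 88F027.
SZDD_SIGNATURE = bytes.fromhex("535A444488F02733")


def parse_szdd_header(header: bytes) -> dict:
    """Read mode, stored extension character and original size (hypothetical helper)."""
    if len(header) < 14 or not header.startswith(SZDD_SIGNATURE):
        raise ValueError("not an SZDD header")
    # Offset 8: compression mode, offset 9: last extension character,
    # offset 10: original size as a little-endian 32-bit value.
    mode, last_char, size = struct.unpack_from("<BBI", header, 8)
    return {
        "mode": chr(mode),
        "last_char": chr(last_char) if last_char else "",
        "size": size,
    }


# Synthetic header for illustration only.
sample = SZDD_SIGNATURE + b"A" + b"f" + struct.pack("<I", 65536)
info = parse_szdd_header(sample)
print(info["mode"], info["size"])  # A 65536
```

A file using mode "B" would parse the same way here; it only fails to match a definition whose pattern fixes the byte at offset 8 to 0x41.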


The remaining variant starts with the string "KWAJ". Unfortunately some
archives like WINWORD6.IN_ are not recognized by kwaj.trid.xml, although the
newest version of the file(1) command describes them correctly (see appended
output/file.txt).

According to the documentation, for this variant the compression method
(range 0-4) is stored at offset 8. That means that with the construct

   <Bytes>4B57414A88F027D10300</Bytes>
   <ASCII> K W A J . . '</ASCII>
   <Pos>0</Pos>

only the variant with method number 3 (that is, LZ+Huffman) is recognized. So
the current TrID definition now becomes kwaj-v3.trid.xml. Again I add the
reference URL mentioned above, and a specific user-defined MIME type with the line:
   <Mime>application/x-ms-compress-kwaj</Mime>

The header flags are stored as a little-endian short value at offset 12. If
they are 0, as in the example CORELSHW.RE_, the compressed data starts
immediately afterwards at offset 0xe; this data offset is itself stored as an
LE short at offset 10. For examples with flags 0x19, like AEGYPTEN.BM_, the
original file size (4 bytes) and the original file name with extension
(13 bytes, AEGYPTEN.BMP with trailing \0) are stored after the flags, which
gives a data offset of 0x1f. I mention these facts in the remark line.
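
The KWAJ header layout described above can be sketched in Python like this. It is a minimal illustration, assuming the fixed part of the header is 14 bytes (which follows from the data offset 0xe quoted for CORELSHW.RE_); the function name and the synthetic header are not taken from any real file.

```python
import struct

# Signature from the <Bytes> pattern above: "KWAJ" followed by 88 F0 27 D1.
KWAJ_SIGNATURE = bytes.fromhex("4B57414A88F027D1")
FIXED_HEADER_LEN = 14  # offsets 0x0 to 0xd


def parse_kwaj_header(header: bytes) -> tuple:
    """Return (method, data_offset, flags) from a KWAJ header (hypothetical helper)."""
    if len(header) < FIXED_HEADER_LEN or not header.startswith(KWAJ_SIGNATURE):
        raise ValueError("not a KWAJ header")
    # Three little-endian shorts at offsets 8, 10 and 12.
    return struct.unpack_from("<HHH", header, 8)


# Synthetic header modelled on the CORELSHW.RE_ description: method 3, flags 0.
plain = KWAJ_SIGNATURE + struct.pack("<HHH", 3, 0x0E, 0x0000)
print(parse_kwaj_header(plain))  # (3, 14, 0)

# Arithmetic for the AEGYPTEN.BM_ case (flags 0x19):
# 14 fixed bytes + 4-byte original size + 13 bytes of file name data.
print(hex(FIXED_HEADER_LEN + 4 + 13))  # 0x1f
```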

For examples like WINWORD6.IN_ I create kwaj-v4.trid.xml. The pattern now looks like
   <Bytes>4B57414A88F027D10400</Bytes>
   <ASCII> K W A J . . '</ASCII>
   <Pos>0</Pos>

That means compression method number 4 (that is, MS-ZIP) is used here. I
mention this fact in the remark line.

I found no examples for methods 0, 1 and 2, although I could create three
more kwaj-v?.trid.xml definitions for them. Finally I create a generic
kwaj.trid.xml for all KWAJ variants with a pattern construct like:

   <Bytes>4B57414A88F027D1</Bytes>
   <ASCII> K W A J . . '</ASCII>
   <Pos>0</Pos>
   
   <Bytes>00</Bytes>
   <Pos>9</Pos>
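
To show why this pair of patterns covers every method number, here is a simplified Python model of fixed-position matching. It is not TrID's actual matching code, only an illustration; the names are hypothetical and the two test headers reuse the method 3 and method 4 byte strings quoted earlier.

```python
# (position, bytes) pairs copied from the two <Bytes>/<Pos> constructs above.
GENERIC_KWAJ = [
    (0, bytes.fromhex("4B57414A88F027D1")),
    (9, bytes.fromhex("00")),
]


def matches(data: bytes, patterns=GENERIC_KWAJ) -> bool:
    """True if every byte sequence occurs at its fixed position (simplified model)."""
    return all(data[pos:pos + len(seq)] == seq for pos, seq in patterns)


method3 = bytes.fromhex("4B57414A88F027D10300")
method4 = bytes.fromhex("4B57414A88F027D10400")
print(matches(method3), matches(method4))  # True True
```

The byte at offset 8 is left unconstrained, so any method in the range 0-4 matches, while the high byte of the little-endian method value at offset 9 must still be zero.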
   
With the new TrID definition files all inspected Microsoft compressed
archives are now recognized (see appended output/trid-new.txt). The TrID
definitions, some examples and the output are stored in the archive XY_.zip.
I hope that the 5 XML files can be used in a future version of triddefs.

With best wishes
Jörg Jenderek

Mark0

Many thanks!
P.S.
For the extension I have to avoid using wildcards, because it is simply appended to the file name when renaming (at least for the moment).
So either I put a sample/typical extension there (like EX_) or maybe I'll leave it empty.