Topic: updated tscomp.trid.xml for TSComp compressed data + 2 variants

jenderek

« on: December 02, 2023, 04:01:20 PM »
Hello trid users,

A few days ago I had to look through some old software samples. Unfortunately
these were packed in compressed archives, so it took me some hours to find
out how to extract such archives and what the archives I inspected contain.

So I ran the TrID utility on these TSComp compressed archives. The samples are
recognized and described as "TSComp compressed data" by tscomp.trid.xml, with
the generic MIME type application/octet-stream. No file name suffixes are
listed (see trid-tscomp-v.txt in the appended output).

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). It does "recognize" the LIB
archives, which are described as "Generic Library File" with PUID
x-fmt/425. This detection is based only on the unreliable file name suffix LIB.

For comparison I also ran the file command (version 5.45) on these samples.
Here they are also recognized, and described as "TSComp archive data" (see
file-5.45.txt in the appended output). The MIME type here is
application/octet-stream (see file-i-5.45.txt in the appended output). No
file name suffix is listed here (see file-ext-5.45.txt in the appended output).

With a newer version (archive,v 1.195 2023/12/02) more details (like
wildcard style, member names and time stamps) are listed (see file.txt in
the appended output). The MIME type is now application/x-tscomp-compressed
(see file-i.txt in the appended output). For the single variant two possible
suffixes ??$/??! are listed, and for the "multi" variant 6 possibilities
/lib/cmp/$$$/tsc/pak are listed (see file-ext.txt in the appended output).

In principle, what all these tools use for recognition is a characteristic
byte sequence at the beginning of the file. In the TrID definition that is
expressed inside the front block by an XML construct like:
   <Bytes>655D138C0801</Bytes>
   <ASCII> e ]</ASCII>
   <Pos>0</Pos>
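
To illustrate what such a front-block pattern means, here is a minimal
sketch in Python. It only shows the idea of comparing the first bytes of a
file; it is not how TrID itself is implemented, and the function names are
made up for this example.

```python
# Signature taken from the <Bytes> element above, expected at <Pos>0</Pos>.
TSCOMP_MAGIC = bytes.fromhex("655D138C0801")


def looks_like_tscomp(data: bytes) -> bool:
    """Return True if data starts with the 6-byte TSComp signature."""
    return data.startswith(TSCOMP_MAGIC)


def file_looks_like_tscomp(path: str) -> bool:
    """Read only as many bytes as the signature needs and test them."""
    with open(path, "rb") as handle:
        return looks_like_tscomp(handle.read(len(TSCOMP_MAGIC)))
```

Calling file_looks_like_tscomp("MAKERRES.DL$") on one of the samples should
then give True, as long as the file really starts with that signature.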

Unfortunately no reference is listed in the TrID definition. With the help of
these tools I found pages about TSComp on the "file formats" web site of
Archive Team. Samples to download and unpacking software like deark are also
listed there. That is now expressed inside the definition by a line like:
   <RefURL>http://fileformats.archiveteam.org/wiki/TSComp</RefURL>

According to that reference the file name suffix depends on the sub
classification. For single-file archives, often the last letter of the
filename extension is changed to "$", but I also found samples where an
exclamation mark is used instead of the dollar sign (like BUILD3.BM!). For
multi-file archives, the most common extensions seem to be '.lib' and '.cmp',
but I also found other names, like SAMPMIF$ (no file name suffix), OTDATA.$$$,
TWOFILES.TSC (obviously an abbreviation for tscomp) and WIN.PAK (obviously an
abbreviation for packed). Luckily the decompressing software deark can
extract the archive contents with a command like:
    deark -m tscomp -d2 MAKERRES.DL$

I am no C programmer, but if I interpret the source correctly, then in my
"multi" file samples the filename style value is 2, which means "with
wildcards". For single samples the style is 1, which means no wildcard.
Unfortunately I found no "old" examples with style value 0. This style byte
is apparently stored at offset 8.
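
As a small sketch of how that byte could be inspected, assuming the offset
and the meanings described above are right (the names STYLE_NAMES and
filename_style are only made up for this example):

```python
STYLE_OFFSET = 8

# Meanings as understood in this post; value 0 was never seen in a sample.
STYLE_NAMES = {
    0: "old format (no samples seen)",
    1: "no wildcard (single)",
    2: "with wildcards (multi)",
}


def filename_style(data: bytes) -> str:
    """Describe the file name style byte found at offset 8."""
    if len(data) <= STYLE_OFFSET:
        raise ValueError("data too short to contain the style byte")
    value = data[STYLE_OFFSET]
    return STYLE_NAMES.get(value, f"unknown ({value})")
```

Feeding it the first bytes of an archive is enough, so only the first 9
bytes of a file have to be read.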

So I ran tridscan on the samples with no wildcard to generate the variant
tscomp-single.trid.xml.

According to the documentation, for single-file archives often (127 of 145 in
my examples) the last letter of the filename extension is changed to $. Then
there are samples without suffix where the last character of the archive name
is a $ character (like SHELDLL$). Then there are 15 of 145 examples
(BUILD1.BM! BUILD2.BM! BUILD3.BM! DOCONLIN.BM! FLWHAND1.BM! FLWHAND2.BM!
FLWHAND3.BM! QKCOLLG.BM! QKDUCTS.BM! QKHOUSE.BM! QKPOWER.BM! QKQUAD.BM!
QKROOT.BM! QKSAVE.BM! QKTEMP.BM!) where an exclamation mark (!) is used
instead of the dollar sign. Apparently this is often done when an archive
with $ at the end already exists. So I mention these facts in a remark line.
Then I found a few (3 of 145: LOWP____.PF1 LOWP____.TTT MERCLET2.STA) samples
with a "normal" suffix. To me it looks as if the archive creator deliberately
chose another name to avoid a name collision, but I see no rule for how the
suffix in these samples was chosen.
So for executables I get EX$, for Windows bitmaps BM$ and so on, which gives
a whole bunch of suffixes in the TrID definition. If you do not like this,
you can use an empty suffix list like in tscomp.trid.xml.
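
The naming patterns described above can be tallied with a few lines of
Python. This is only a sketch of how such a count could be done; the
function name and the category labels are invented for this example, and the
names in the list are just four of the samples mentioned in this post.

```python
from collections import Counter


def classify_archive_name(name: str) -> str:
    """Sort an archive file name into one of the naming patterns seen."""
    stem, dot, suffix = name.rpartition(".")
    if not dot:
        # No dot at all, so there is no file name suffix.
        if name.endswith("$"):
            return "no suffix, name ends with $"
        return "no suffix"
    if suffix.endswith("$"):
        return "suffix ends with $"
    if suffix.endswith("!"):
        return "suffix ends with !"
    return "other suffix"


names = ["MAKERRES.DL$", "SHELDLL$", "BUILD3.BM!", "MERCLET2.STA"]
tally = Counter(classify_archive_name(name) for name in names)
for pattern, count in tally.items():
    print(f"{count:3d}  {pattern}")
```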

Then I ran tridscan on the "multi" samples with wildcard to generate the
variant tscomp-multi.trid.xml.

According to the documentation, for multi-file archives the most common
extensions seem to be lib and cmp. But in my 33 examples I get only 4 and 2.
Most (25, like SAMPMML$) have no suffix but a $ sign as the last file name
character. Then I found 2 examples (OTDATA.$$$ OTUPDATE.$$$) with 3 dollar
signs as suffix. Then I found one example (DEMO.DO$) whose name looks like a
single-file archive. Then I found a few more samples with a "normal" suffix,
like TWOFILES.TSC (obviously an abbreviation for tscomp) and WIN.PAK
(obviously an abbreviation for packed). So I mention this fact in a remark
line. So here I get only 6 file name suffixes. That is expressed by a line
like:
   <Ext>CMP/LIB/$$$/PAK/TSC/DO$</Ext>

If you believe that the few unusual samples are accidents, then you can
shorten this line. In most cases the "multi" variant contains more than 1
archive member, but for a few samples this is not true (like CRW3.LIB
PSP1.CMP DICT2$ TMPNWSL$). The meaning of style byte value 2 is "with
wildcard". So in most cases this means more than one file, but this is not
always true. So I mention these facts in the remark line.

In variant tscomp-multi.trid.xml the first XML construct looks like:
   <Bytes>655D138C08010300020000000012</Bytes>
   <ASCII> e ]</ASCII>
   <Pos>0</Pos>

The deark source installshld.c checks for the 4 bytes 655d138c at the
beginning. According to tscomp.trid.xml these are followed by 0801, and the
file command assumes that this is followed by 0300. But I do not know if this
is always true, and I was not able to understand the deark source at this
point because I am no C programmer. The next byte, with value 2, is the file
name style, where 2 means "with wildcards". This is the characteristic of the
"multi" variant samples. It is followed by 4 zero bytes and a byte with value
12h. I do not know what these mean or if they are always the same. So I kept
the above construct.

According to the deark source and debug output, the compressed length of the
first member is apparently stored at offset 14 as a 4-byte little-endian
value. In my samples I get "low" values like 642159 (0009CC6Fh, CRW3.LIB), so
the upper byte is zero in all of them. That was expressed by lines like:
   <Bytes>00</Bytes>
   <Pos>17</Pos>
Assuming that the size can reach the 32-bit limit, I dropped the above
construct.
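
The arithmetic behind this can be checked with Python's struct module. The
sketch below just packs the CRW3.LIB value as a 4-byte little-endian number
and looks at the byte that would land at offset 17; the variable names are
only for illustration.

```python
import struct

compressed_length = 642159  # value reported for CRW3.LIB, 0009CC6Fh

# The four bytes as they would be stored at offsets 14..17.
raw = struct.pack("<I", compressed_length)
print(raw.hex())  # 6fcc0900

# Offset 17 holds the most significant byte of the length.
top_byte_is_zero = raw[3] == 0

# That byte is zero exactly when the length is below 2**24 (16 MiB).
assert top_byte_is_zero == (compressed_length < 2**24)
```

So a pattern with 00 at position 17 would miss any archive whose first
member is compressed to 16 MiB or more.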

According to the deark source and debug output, the offset of the next
(second) member structure is apparently stored at offset 18 as a 4-byte
little-endian value. In my samples I get "low" values, so the 2 upper bytes
are zero in all of them. That was expressed by lines like:
   <Bytes>0000</Bytes>
   <Pos>20</Pos>
Assuming that this offset can reach the 32-bit limit, I dropped this
construct too.
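
A minimal sketch of reading both fields in Python, assuming the offsets 14
and 18 and the little-endian layout described above are right (the function
name first_member_fields is invented for this example):

```python
import struct


def first_member_fields(data: bytes) -> tuple:
    """Return (compressed_length, next_member_offset) of the first member.

    Both values are read as unsigned 32-bit little-endian integers,
    the length from offset 14 and the next-member offset from offset 18.
    """
    if len(data) < 22:
        raise ValueError("need at least 22 bytes of the archive")
    return struct.unpack_from("<II", data, 14)
```

The two bytes at positions 20 and 21 are the upper half of the second value,
so they are zero only while the next member starts below 65536.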

The last construct looks like:
   <Bytes>0000</Bytes>
   <Pos>26</Pos>

I do not know what this means, but it seems to be constant. So I kept it.

Then I did the same procedure for tscomp-single.trid.xml. The difference is
that the file name style byte has value 1, which means no wildcard. Because
the archive contains only 1 file, the offset of the second member is of
course zero. That is expressed by an XML construct like:
   <Bytes>00000000</Bytes>
   <Pos>18</Pos>

The TrID definitions and output are stored in the archive tscomp_trid.zip. I
hope that my definitions can be used in a future version of triddefs.

With best wishes
Jörg Jenderek

jsummers

« Reply #1 on: December 02, 2023, 09:10:00 PM »
I've never seen any documentation of this format. I only figured out just enough to decompress the files I had.

I don't really know what names have been used for this format, so there might be more information about it out there than there seems to be.



Regarding the "filename style"(?) byte at offset 8... If you create an archive without using wildcards, like "tscomp a.dat a.da$", the byte's value is 1. If you use wildcards, like "tscomp *.dat example.lib", its value is 2. If you use "tscomp -l" to examine the archive files, it will tell you either "This file was NOT created using a wildcard and must be assigned a name during compression", or "This file was created using a wildcard and will be decompressed using the original file names."

If you hex edit one of the files, and set the byte at offset 8 to 0, then "tscomp -l" says "Old version of compress does not have file listings". So I guess there is an older version of the format. But I've never seen such a file.



Deark doesn't read the field at offset 18 (really, offset 5 of a member file segment). But I think you've decoded it. Looks like the absolute file position of the next member file (or 0).
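
If that reading is right, the member segments could be listed by following the chain. A rough Python sketch, with everything here being an assumption based on the offsets in this thread: the first segment starts at offset 13 (18 minus 5), each segment holds the absolute position of the next one at its own offset 5, and 0 ends the chain. The function name is made up and none of this is checked against real files.

```python
import struct

FIRST_SEGMENT_OFFSET = 13   # assumption: 18 - 5
NEXT_FIELD_IN_SEGMENT = 5   # assumption: "offset 5 of a member file segment"


def member_segment_offsets(data: bytes) -> list:
    """Follow the assumed next-member chain and return each segment offset."""
    offsets = []
    seen = set()
    pos = FIRST_SEGMENT_OFFSET
    while pos != 0:
        field_end = pos + NEXT_FIELD_IN_SEGMENT + 4
        if pos in seen or field_end > len(data):
            raise ValueError("next-member chain is broken or loops")
        seen.add(pos)
        offsets.append(pos)
        # Read the 4-byte little-endian position of the next segment.
        (pos,) = struct.unpack_from("<I", data, pos + NEXT_FIELD_IN_SEGMENT)
    return offsets
```

For a single-file archive this should return just [13], since the field at offset 18 is 0 there.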

Mark0

« Reply #2 on: December 04, 2023, 02:10:16 AM »
Thanks to both for all the info!