Hello TrID users,
some weeks ago I ran TrID on some Tape ARchive files (*.tar) which are
not recognised. Examples like longname99-nobody-gnu.tar are described
as "Unknown!" (see appended ustar/gnutar/output/trid-old.txt).
The newest patched file command, on the other hand, identifies such
examples correctly as "tar archive" (see appended
ustar/gnutar/output/file-new.txt).
Looking at the current TrID definition ark-tar.trid.xml shows what is
wrong. It contains an XML construct like:
<Pattern>
<Bytes>00000000...</Bytes>
<Pos>54</Pos>
</Pattern>
This is only true for TAR examples where the name of the first archive
member has at most 54 characters. For examples like
longname99-nobody-gnu.tar, where the name of the first member has 100
characters (which the tar specification allows), this pattern is wrong
and must be removed.
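The effect can be reproduced with a small Python sketch. It is not part
of the attached files; it builds headers with the standard tarfile module
and assumes that the pattern tests 4 null bytes from offset 54 (the real
length is elided in the definition above). The helper names are
hypothetical.

```python
import tarfile

NAME_FIELD_LEN = 100  # the member name occupies header bytes 0..99
PATTERN_POS = 54      # offset tested by the old ark-tar definition


def pattern_at_54_matches(header: bytes, length: int = 4) -> bool:
    """True if `length` null bytes start at offset 54 (length is assumed)."""
    return header[PATTERN_POS:PATTERN_POS + length] == b"\0" * length


def header_for(name: str) -> bytes:
    """Build one 512-byte GNU-format header for an empty member `name`."""
    return tarfile.TarInfo(name=name).tobuf(format=tarfile.GNU_FORMAT)


short_header = header_for("short.txt")
long_header = header_for("n" * NAME_FIELD_LEN)  # fills the whole name field

print(pattern_at_54_matches(short_header))  # True
print(pattern_at_54_matches(long_header))   # False
```

A short name leaves the rest of the name field null, so the pattern
matches; a name that fills the field puts name characters at offset 54.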
Some implementations use space as the field padding character and some
use null. So a pattern looking for a null at the end of the 8-byte file
mode field is not always true either. In general, more patterns like the
following must also be removed:
<Pattern>
<Bytes>00</Bytes>
<Pos>107</Pos>
</Pattern>
In the end only one characteristic pattern survived: the one that
contains the magic word "ustar".
So I created 3 ark-tar-*.trid.xml files describing the variants.
The first, ark-tar-gnu.trid.xml, describes the GNU tar variant as
"TAR - Tape ARchive (GNU)". This type can be created with the
--format=gnu option of the GNU tar command, which I mention in the
remark line. The file formats are described in a section of the GNU tar
manual, given by the reference URL line:
<RefURL>
https://www.gnu.org/software/tar/manual/html_node/Standard.html </RefURL>
According to the manual that variant uses OLDGNU_MAGIC, which is the
7-character string "ustar  " (with two trailing spaces) terminated by
null. This is expressed by the pattern construct:
<Bytes>7573746172202000</Bytes>
<ASCII> u s t a r</ASCII>
<Pos>257</Pos>
I mention this fact in the remark line as well. To distinguish such tar
examples from others, the extension "gtar" can be used in addition to
the usual "tar". This is expressed by the line:
<Ext>TAR/GTAR</Ext>
Furthermore, a more specific MIME type is given by the line:
<Mime>application/x-gtar</Mime>
With that definition the file longname99-nobody-gnu.tar is now recognized
(see ustar/gnutar/output/trid-new.txt).
The second definition file, ark-tar-posix.trid.xml, has a slightly
different specific pattern, described by the XML pattern:
<Bytes>7573746172003030</Bytes>
<ASCII> u s t a r . 0 0</ASCII>
<Pos>257</Pos>
Here the null-terminated TMAGIC "ustar" followed by the 2-byte TVERSION
string "00" is used. This is also described on the Wikipedia page about
the tar file format, so I add that URL as a reference by the line:
<RefURL>
https://en.wikipedia.org/wiki/Tar_(computing)</RefURL>
I mention this fact in the remark line as well. To distinguish such
POSIX tar examples from others, the extension "ustar" can be used in
addition to the usual "tar". This is expressed by the line:
<Ext>TAR/USTAR</Ext>
Furthermore, a more specific MIME type is given by the line:
<Mime>application/x-ustar</Mime>
Examples like long.ustar can be created by GNU tar with the "posix"
format option. With this second definition, previously unrecognized
POSIX tar archives are now described precisely (see
ustar/posix/output/trid-new.txt).
The third case with "ustar" magic is handled by ark-tar-ustar.trid.xml.
Here the specific pattern looks like:
<Bytes>7573746172000000</Bytes>
<ASCII> u s t a r</ASCII>
<Pos>257</Pos>
Apparently that tar implementation encodes the version wrongly as 2
null bytes (hexadecimal 00) instead of 2 ASCII "0" characters. Such tar
examples are found embedded inside Android backups (*.ab). Only one file
name extension is given for that type, by the line:
<Ext>TAR</Ext>
And the MIME type is described by the line:
<Mime>application/x-tar</Mime>
With that definition, examples like com.bigbuttons.ab.tar and
org.adblockplus.android.ab.tar are described more precisely as "TAR -
Tape ARchive (ustar)" (see ustar/ustar-other/output/trid-new.txt).
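The three "ustar" variants differ only in the 8 bytes at offset 257. A
minimal Python sketch of that distinction follows; it is an illustration
only, the function name is hypothetical, and the two sample headers are
generated with the standard tarfile module rather than taken from the
attached examples.

```python
import tarfile
from typing import Optional

MAGIC_POS = 257  # offset of the magic field inside the 512-byte header

# Eight bytes at offset 257 -> definition file that carries this pattern.
USTAR_DEFINITIONS = {
    b"ustar  \0": "ark-tar-gnu.trid.xml",           # OLDGNU_MAGIC
    b"ustar\0" + b"00": "ark-tar-posix.trid.xml",    # TMAGIC + TVERSION "00"
    b"ustar\0" + b"\0\0": "ark-tar-ustar.trid.xml",  # version as null bytes
}


def classify_ustar(header: bytes) -> Optional[str]:
    """Return the matching definition file name, or None without a match."""
    return USTAR_DEFINITIONS.get(bytes(header[MAGIC_POS:MAGIC_POS + 8]))


gnu_header = tarfile.TarInfo("a.txt").tobuf(format=tarfile.GNU_FORMAT)
posix_header = tarfile.TarInfo("a.txt").tobuf(format=tarfile.USTAR_FORMAT)

print(classify_ustar(gnu_header))    # ark-tar-gnu.trid.xml
print(classify_ustar(posix_header))  # ark-tar-posix.trid.xml
```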
The situation reminds me of the old weak encryption used in web browsers
or servers. Normal users and even developers do not know about the
weaknesses or bugs and use predefined libraries.
And things become even worse. Before the standard POSIX tar there
existed other tar file formats like V7 and older tar variants. None of
these types has the "ustar" magic pattern. The first fields are the same
as in POSIX, but even the padding is different.
Some implementations add new fields to the blank area at the end of the
header record. For example DOS TAR (version 3.21 delta 1997
Tim V.Shapore) does so with the -j option for storing file comments, as
in example TAR3214-j.TAR. And some implementations like STAR add their
own 4-byte magic 'tar\0' at the end of the header, as in examples like
gtarfail2.tar found at
https://sourceforge.net/projects/s-tar/files/testscripts/
Because of the absence of "ustar", such examples are described by TrID as
"Unknown!" (see no-ustar/output/trid-old.txt).
After some inspection, 2 characteristic null patterns seem to be shared
by all inspected examples:
<Pattern>
<Bytes>0000000000000000</Bytes>
<Pos>500</Pos>
</Pattern>
<Pattern>
<Bytes>00</Bytes>
<Pos>511</Pos>
</Pattern>
Unfortunately the above patterns are too generic and also match other
files like "DOS 2.0-3.2 backup" (see example PGPK.EXE) or ISO 9660
CD-ROM (see example test-iso.iso).
I add the Wikipedia page about the tar file format by the reference URL
line:
<RefURL>
https://en.wikipedia.org/wiki/Tar_(computing)</RefURL>
According to Wikipedia, the type flag field at offset 156 describes the
type of the first tar archive member. There exist some dozen types.
Luckily some need not be considered. The upper-case 'A'-'Z' flags are
only used in POSIX tar files, and FIFOs do not exist on old systems.
So only a few flags must be considered.
So I created a variant ark-tar-link.trid.xml with the additional pattern:
<Bytes>31</Bytes>
<ASCII> 1</ASCII>
<Pos>156</Pos>
The extension and MIME type are described by the lines:
<Ext>TAR</Ext>
<Mime>application/x-tar</Mime>
With that TrID definition file and its unique patterns, examples like
hardlinkPart-v7.tar are now described as "TAR - Tape ARchive (hard
link)". I then did the same for 3 more types, as described by the
following table:

trid definition             flag byte   member type description
ark-tar-link.trid.xml       '1'         hard link
ark-tar-symlink.trid.xml    '2'         symbolic link
ark-tar-dir.trid.xml        '5'         directory
ark-tar-file.trid.xml       '0'         normal file

With the above 4 TrID definition files such examples are now recognised
(see no-ustar/output/trid-new.txt).
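The combination of the two null patterns with the type flag at offset 156
can be sketched in Python as follows. This is only an illustration under
the assumption that these are the patterns that matter; the real
definition files may contain further patterns, and the function name is
hypothetical.

```python
from typing import Optional

TYPEFLAG_POS = 156  # type flag of the first archive member

# Flag byte -> definition file, as in the table above.
OLD_TAR_DEFINITIONS = {
    b"1": "ark-tar-link.trid.xml",     # hard link
    b"2": "ark-tar-symlink.trid.xml",  # symbolic link
    b"5": "ark-tar-dir.trid.xml",      # directory
    b"0": "ark-tar-file.trid.xml",     # normal file
}


def classify_old_tar(header: bytes) -> Optional[str]:
    """Classify a header without "ustar" magic by its type flag."""
    if len(header) < 512:
        return None
    if header[257:262] == b"ustar":
        return None  # left to the ustar definitions
    # The two shared null patterns: 8 bytes at 500 and 1 byte at 511.
    if header[500:508] != b"\0" * 8 or header[511] != 0:
        return None
    return OLD_TAR_DEFINITIONS.get(header[TYPEFLAG_POS:TYPEFLAG_POS + 1])


# Hand-made header: a name, then the hard link flag '1' at offset 156.
sample = bytearray(512)
sample[0:8] = b"hardlink"
sample[TYPEFLAG_POS:TYPEFLAG_POS + 1] = b"1"
print(classify_old_tar(bytes(sample)))  # ark-tar-link.trid.xml
```

Bytes 508 to 510 are deliberately not tested, so a header that carries
the STAR magic 'tar\0' at its end still passes the null patterns.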
But life is not easy. I do not know if it is a bug or a feature: with
the option '--format=v7' the current GNU tar program creates an archive
that is compatible with Unix V7 tar, even if the tar archive contains
GNU-specific extensions. So I created 3 additional definition files,
described by the table:

trid definition             flag byte   member type description
ark-tar-longname.trid.xml   'L'         longname
ark-tar-multivol.trid.xml   'M'         multi volume
ark-tar-vol.trid.xml        'V'         volume

Because these are GNU tar extensions, the extension and MIME type are
given by the lines:
<Ext>TAR/GTAR</Ext>
<Mime>application/x-gtar</Mime>
With these 3 XML files, examples like v119-gnu.tar are also recognised
(see no-ustar/gnutar/output/trid-new.txt).
Unfortunately a normal file as first member can also be coded by a
null byte (hexadecimal 00) instead of ASCII "0". So an additional
definition must be created, as described by the table:

trid definition             flag byte   member type description
ark-tar-null.trid.xml       '\0'        normal file

This definition file then has only null patterns, so it is not very
unique. Assuming that tar examples with file comments do not use the
maximal length, I add another null pattern like:
<Pattern>
<Bytes>00000000...</Bytes>
<Pos>464</Pos>
</Pattern>
With this last definition file, examples like file-5.32.tar are now
recognized. Unfortunately other examples like the "DOS 2.0-3.2 backup"
PGPK.EXE, the ISO 9660 CD-ROM test-iso.iso or the Virtual PC Virtual HD
image win98se.vhd.bin are still misidentified, with a low rate, as "TAR -
Tape ARchive (null file)" (see no-ustar/null-file/output/trid-new.txt).
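Why a definition made only of null patterns is so weak can be shown with
a short Python sketch. It assumes the patterns are the null flag byte at
156, 4 null bytes at 464 (the real length is elided above), 8 null bytes
at 500 and 1 null byte at 511; the names are hypothetical and the two
blocks are made up, not taken from the attached examples.

```python
# (offset, length) of each null pattern; the length at 464 is an assumption.
NULL_PATTERNS = [(156, 1), (464, 4), (500, 8), (511, 1)]


def matches_null_definition(block: bytes) -> bool:
    """True if every assumed null pattern is present in the block."""
    if len(block) < 512:
        return False
    return all(block[pos:pos + n] == b"\0" * n for pos, n in NULL_PATTERNS)


# A tar-like header: member name and octal mode, everything else null.
tar_like = bytearray(512)
tar_like[0:6] = b"README"
tar_like[100:107] = b"0000644"

# An unrelated block that merely happens to be null at the tested offsets.
not_tar = bytearray(512)
not_tar[0:17] = b"NOT A TAR ARCHIVE"

print(matches_null_definition(bytes(tar_like)))  # True
print(matches_null_definition(bytes(not_tar)))   # True
```

Any file whose first 512 bytes are sparse enough passes, which fits the
misidentifications listed above.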
Some people may say "do not care about such old tar file formats, they
are history". Unfortunately this is wrong. As in web browsers, old
formats still exist today, maybe for compatibility reasons. So even today
examples like file-5.32.tar use the old tar format. Here the problem is
also not visible at first glance because the tar archive itself is gzip
compressed.
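To look at the tar header behind the compression, the first 512-byte
block can be read through gzip. The following Python sketch uses only the
standard library; the function name is hypothetical and the demo file is
generated on the fly, it is not one of the attached examples.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream
BLOCK_SIZE = 512          # size of one tar header block


def first_tar_block(path: str) -> bytes:
    """Return the first 512 bytes of a tar file, unpacking gzip if needed."""
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as stream:
        return stream.read(BLOCK_SIZE)


# Demo: compress a made-up header block and read it back.
block = b"demo.txt".ljust(BLOCK_SIZE, b"\0")
with tempfile.TemporaryDirectory() as tmp:
    packed = os.path.join(tmp, "demo.tar.gz")
    with open(packed, "wb") as out:
        out.write(gzip.compress(block))
    recovered = first_tar_block(packed)

print(recovered == block)  # True
```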
The second disturbing case is samples like TAR3214-j.TAR. Most tar
programs interpret the original file comments as a file name prefix
without a warning, although the file name prefix field exists only in
the modern POSIX variant and not in the old tar format. So program
behaviour may be similar to that with tar bombs. Users may therefore be
happy if tar archives are identified and described more precisely.
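A more careful reader would use the prefix field only when the "ustar"
magic is present. The Python sketch below illustrates that guard. It
relies on the POSIX header layout (prefix field of 155 bytes at offset
345) and assumes, for the demo only, that an old-format comment overlaps
that area; the function name and the sample data are hypothetical.

```python
NAME = slice(0, 100)      # member name field
MAGIC = slice(257, 263)   # "ustar" plus terminating null
PREFIX = slice(345, 500)  # POSIX file name prefix field


def _text(field: bytes) -> bytes:
    """Cut a header field at its first null byte."""
    return field.split(b"\0", 1)[0]


def member_name(header: bytes) -> str:
    """Join prefix and name only for headers with "ustar" magic."""
    name = _text(header[NAME])
    if header[MAGIC] == b"ustar\0":
        prefix = _text(header[PREFIX])
        if prefix:
            name = prefix + b"/" + name
    return name.decode("utf-8", "replace")


# Old-style header: no magic, but a comment where POSIX keeps the prefix.
old = bytearray(512)
old[0:8] = b"file.txt"
old[345:355] = b"my comment"

# POSIX-style header with a real prefix.
posix = bytearray(512)
posix[0:8] = b"file.txt"
posix[257:265] = b"ustar\0" + b"00"
posix[345:348] = b"dir"

print(member_name(bytes(old)))    # file.txt
print(member_name(bytes(posix)))  # dir/file.txt
```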
With the new TrID definition files all inspected TAR archives are now
recognized. The TrID definitions, some examples or first tar blocks, and
the output are stored in the archive tar_trid.zip. I hope that the XML
files can be used in a future version of triddefs and that I have not
overlooked an exotic tar format.
With best wishes
Jörg Jenderek