Hello trid users,
A few days ago I worked with some non-Microsoft Office documents. When I ran
TrID on samples with the file name extensions SXC and STC, some samples were
only described as "Zip document container (generic)" by zip-doc-cont.trid.xml
or were misidentified as "XMind Workbook" by xmind.trid.xml. The SXC samples
were identified correctly as "StarOffice Calc spreadsheet" by sxc.trid.xml (see
appended output/trid-v-old.txt).
For comparison I also ran other file identification tools. The file(1)
command identifies most samples as "OpenOffice.org 1.x Calc" (see appended
output/file-5.38.txt). So no TrID definition exists for the Calc templates.
So I ran tridscan to generate stc-staroffice.trid.xml for the Calc template
variant. The format of such StarOffice/OpenOffice.org documents is described,
for example, on the Archive Team file formats wiki. That is expressed by the
reference URL line:
<RefURL>http://fileformats.archiveteam.org/wiki/OpenOffice.org_XML</RefURL>
According to the reference, such template documents have their own MIME
type. That 37-byte string is expressed by the line:
<Mime>application/vnd.sun.xml.calc.template</Mime>
Then I refined the TrID definition file to give it the same structure as the
other StarOffice TrID definitions. The MIME type string is also found near the
beginning of the ZIP container. That is expressed by an XML construct like:
<Bytes>6D696D65747970656170706C69636174696F6E2F766E64
<ASCII> m i m e t y p e a p p l i c a t i o n / v n d
<Pos>30</Pos>
That string is stored as clear text without any compression, so the value of
the compression method field is zero. That is expressed by the XML construct:
<Bytes>0000</Bytes>
<Pos>8</Pos>
Because the string is stored uncompressed, the compressed and uncompressed
sizes of the first archive member have the same value, 37, which is 0x25 in
hexadecimal. Each size is a 4-byte little-endian field. That is expressed by
the XML construct:
<Bytes>2500000025000000</Bytes>
<Pos>18</Pos>
The MIME type is always stored in a file with the ASCII name mimetype, so the
length of this file name is eight. That is expressed by the XML construct:
<Bytes>0800</Bytes>
<Pos>26</Pos>
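
The positions used in these patterns can be checked against the ZIP local
file header layout. The following is only a minimal Python sketch and not
part of TrID: it builds a small in-memory archive whose first member is a
stored mimetype file and reads the fields at positions 8, 18, 26 and 30. The
helper names make_sample and header_fields and the dummy content.xml member
are assumptions made for illustration.

```python
import io
import struct
import zipfile

MIME_STC = b"application/vnd.sun.xml.calc.template"

def make_sample(mime: bytes) -> bytes:
    """Build a tiny ZIP whose first member is an uncompressed 'mimetype' file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", mime, compress_type=zipfile.ZIP_STORED)
        # Dummy second member, only there so the archive has more than one entry.
        zf.writestr("content.xml", b"<dummy/>", compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()

def header_fields(data: bytes) -> dict:
    """Read the fields of the first local file header that the patterns test."""
    (method,) = struct.unpack_from("<H", data, 8)              # Pos 8: compression method
    comp_size, uncomp_size = struct.unpack_from("<II", data, 18)  # Pos 18: both sizes
    name_len, extra_len = struct.unpack_from("<HH", data, 26)  # Pos 26: name length
    name = data[30:30 + name_len]                              # Pos 30: file name
    start = 30 + name_len + extra_len                          # content follows name and extra field
    return {
        "method": method,
        "comp_size": comp_size,
        "uncomp_size": uncomp_size,
        "name_len": name_len,
        "extra_len": extra_len,
        "name": name,
        "content": data[start:start + comp_size],
    }

if __name__ == "__main__":
    for key, value in header_fields(make_sample(MIME_STC)).items():
        print(key, value)
```

The pattern at position 30 only sees the MIME string directly after the file
name when the extra field length of the first member is zero, as it is in
this generated sample.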
Patterns at higher offsets matched only by coincidence in my samples, so I
removed them. I also got lines like
<String>MANIFEST.XMLPK</String>
That is triggered by the file named manifest.xml inside the ZIP archive. If
the header entry contains no extra field, then the next ZIP record, which
starts with the magic string PK, directly follows the stored file name. That
was true for the samples I inspected, but nowhere is it explicitly written
that this is a strict requirement. I also remember finding Java JAR files
with extra fields. So I removed such appended PK parts from all patterns. In
the global strings section the lines now look like:
<String>MANIFEST.XML</String>
Without the appended PK string, however, it is no longer so easy to
distinguish the Calc spreadsheet from the template variant. This is now done
only by the stored length of the MIME type, which is 37 (=25h) for the
template and 28 (=1Ch) for the spreadsheet variant.
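
The length-based distinction can be illustrated with a short Python sketch.
It is not how TrID works internally, only an illustration of the idea. The
function variant_from_header and the helper fake_header, which fabricates a
30-byte local file header for testing, are hypothetical names.

```python
import struct

# Lengths of the two MIME type strings named in the text (28 and 37 bytes).
MIME_SIZES = {
    len("application/vnd.sun.xml.calc"): "StarOffice Calc spreadsheet",
    len("application/vnd.sun.xml.calc.template"): "StarOffice Calc template",
}

def variant_from_header(header: bytes) -> str:
    """Guess the Calc variant from the size fields at position 18."""
    if len(header) < 30 or header[:4] != b"PK\x03\x04":
        return "unknown"
    comp_size, uncomp_size = struct.unpack_from("<II", header, 18)
    if comp_size != uncomp_size:
        # A stored (uncompressed) member has equal sizes; anything else is
        # outside what the patterns described above expect.
        return "unknown"
    return MIME_SIZES.get(uncomp_size, "unknown")

def fake_header(size: int) -> bytes:
    """Hypothetical test helper: local header of a stored 'mimetype' member."""
    return struct.pack(
        "<4sHHHHHIIIHH",
        b"PK\x03\x04",  # local file header signature
        20, 0, 0,       # version needed, flags, compression method (0 = stored)
        0, 0, 0,        # time, date, CRC-32 (not relevant here)
        size, size,     # compressed and uncompressed size
        8, 0,           # file name length, extra field length
    ) + b"mimetype"

if __name__ == "__main__":
    print(variant_from_header(fake_header(37)))
    print(variant_from_header(fake_header(28)))
```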
With the additional stc-staroffice.trid.xml the previously unrecognized STC
samples are now described as "StarOffice Calc template" (see appended
output/trid-v-new.txt).
For a StarOffice Calc spreadsheet like Calc_6.sxc no MIME type is shown, so I
added the following line to sxc.trid.xml:
<Mime>application/vnd.sun.xml.calc</Mime>
The reference URL used so far pointed to the OpenDocument page on Wikipedia.
That format is the successor of the format described here, which is used in
OpenOffice.org 1.x and StarOffice 6 and 7 and is described on the
OpenOffice.org XML page on Wikipedia. That is now expressed by the line:
<RefURL>https://en.wikipedia.org/wiki/OpenOffice.org_XML</RefURL>
With one additional TrID definition and one updated definition, the
previously unrecognized StarOffice Calc variant is now detected, and all
definitions have the correct reference URL and MIME type (see appended
output/trid-v-new.txt). The TrID definitions, the output and some examples
are stored in the archive s_calc.zip. I hope that my 2 XML files can be used
in a future version of triddefs.
There may be many StarOffice Calc samples that are not recognized by these 2
TrID definitions. After handling many StarOffice/OpenOffice documents, I see
some typical errors occur:
1 No mimetype file.
2 The mimetype file is not the first archive member.
3 The content of the mimetype file is packed with a non-zero method like
  deflate.
In these cases file type identification tools that look for characteristic
byte sequences, like TrID, file or DROID, fail. Office software and
spreadsheet programs nevertheless work perfectly with such badly formed
documents, because the mimetype file is just for user information.
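
The three typical errors can be detected with a few lines of Python. This is
a minimal sketch using the standard zipfile module, not a feature of TrID,
file or DROID. The function name mimetype_problems and the helper build are
hypothetical.

```python
import io
import zipfile

def mimetype_problems(data: bytes) -> list:
    """Return which of the three typical mimetype errors apply to a ZIP container."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # Sort by header offset to get the physical order of the members,
        # which is what byte-pattern based tools see.
        infos = sorted(zf.infolist(), key=lambda info: info.header_offset)
    names = [info.filename for info in infos]
    if "mimetype" not in names:
        return ["no mimetype file"]
    problems = []
    if names[0] != "mimetype":
        problems.append("mimetype file is not the first archive member")
    if infos[names.index("mimetype")].compress_type != zipfile.ZIP_STORED:
        problems.append("mimetype content is packed with a non-zero method")
    return problems

def build(members) -> bytes:
    """Hypothetical helper: build a ZIP from (name, content, method) tuples."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content, method in members:
            zf.writestr(name, content, compress_type=method)
    return buf.getvalue()

if __name__ == "__main__":
    mime = b"application/vnd.sun.xml.calc"
    bad = build([
        ("content.xml", b"<dummy/>", zipfile.ZIP_DEFLATED),
        ("mimetype", mime, zipfile.ZIP_DEFLATED),
    ])
    print(mimetype_problems(bad))
```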
With best wishes
Jörg Jenderek