Topic: Updated sxw.trid.xml for StarOffice Writer document + 3 variants for *.stw *.sxg

jenderek

Hello trid users,

A few days ago I handled some non-Microsoft Office documents. When running
TrID on samples with the file name extensions SXW, STW or SXG, some of them
are only described as "Zip document container (generic)" by
zip-doc-cont.trid.xml or are misidentified as "XMind Workbook" by
xmind.trid.xml. The SXW samples are identified correctly as "StarOffice
Writer document" by sxw.trid.xml (see appended output/trid-v-old.txt).

For comparison I also ran other file identification tools. DROID describes
some samples as "OpenDocument Text" (see appended
output/droid-swriter.csv). The file(1) command identifies most samples as
"OpenOffice.org 1.x Writer" (see appended output/file-5.38.txt). So for the
Writer templates and global documents no TrID definitions exist yet.

So I ran tridscan to generate sxg-staroffice.trid.xml for the Writer global
document variant. The format of such StarOffice/OpenOffice.org documents is
described, for example, on the Archive Team file formats wiki. That is
expressed by the reference URL lines:

   <RefURL>
   http://fileformats.archiveteam.org/wiki/OpenOffice.org_XML
   </RefURL>

According to the reference, such global or master documents get their own
MIME type. That 37-byte string is expressed by the line:

   <Mime>application/vnd.sun.xml.writer.global</Mime>

Then I started to refine the TrID definition file to give it the same
structure as the other StarOffice TrID definitions. The MIME type string is
also found near the beginning of the ZIP container, directly after the member
name mimetype. That is expressed by an XML construct like:

   <Bytes>6D696D65747970656170706C69636174696F6E2F766E64</Bytes>
   <ASCII> m i m e t y p e a p p l i c a t i o n / v n d</ASCII>
   <Pos>30</Pos>
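
A minimal sketch in Python (not part of the appended files) of how this
pattern can be checked. It assumes the mimetype member is written first and
stored without an extra field, and builds a tiny test archive in memory with
the standard zipfile module. The function names are made up for illustration.

```python
import io
import zipfile

# Pattern bytes from the definition: "mimetypeapplication/vnd"
PATTERN = bytes.fromhex("6D696D65747970656170706C69636174696F6E2F766E64")
PATTERN_POS = 30  # the fixed part of a ZIP local file header is 30 bytes long


def make_sample(mime: str) -> bytes:
    """Build an in-memory archive whose first member is a stored 'mimetype' file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("mimetype")
        info.compress_type = zipfile.ZIP_STORED
        zf.writestr(info, mime)
    return buf.getvalue()


def has_mimetype_pattern(data: bytes) -> bool:
    """Compare the bytes at offset 30 with the pattern."""
    return data[PATTERN_POS:PATTERN_POS + len(PATTERN)] == PATTERN


sample = make_sample("application/vnd.sun.xml.writer.global")
print(has_mimetype_pattern(sample))
```

The member name ends at offset 38 and the stored content follows directly, so
both parts show up as one continuous string.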

That string is stored as clear text without any compression, so the value of
the compression method field is zero. That is expressed by the XML construct:

   <Bytes>0000</Bytes>
   <Pos>8</Pos>
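
To see the method field change, a small Python sketch (an illustration only,
with a made-up helper name) can write the same member once stored and once
deflated and read the 2-byte little-endian value at offset 8:

```python
import io
import struct
import zipfile


def first_member_method(compress_type: int) -> int:
    """Write one 'mimetype' member and return the method field at offset 8."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("mimetype")
        info.compress_type = compress_type
        zf.writestr(info, "application/vnd.sun.xml.writer.global")
    (method,) = struct.unpack_from("<H", buf.getvalue(), 8)
    return method


print(first_member_method(zipfile.ZIP_STORED))    # 0, stored as bytes 00 00
print(first_member_method(zipfile.ZIP_DEFLATED))  # 8, stored as bytes 08 00
```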

Because the string is stored uncompressed, the compressed and uncompressed
sizes of the first archive member have the same value, 37. That is 0x25 in
hexadecimal. That is expressed by the XML construct:

   <Bytes>2500000025000000</Bytes>
   <Pos>18</Pos>
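
The arithmetic can be double-checked with a few lines of Python (a sketch,
not taken from the definition files). Both sizes are little-endian 32-bit
fields, at offsets 18 and 22, so together they take eight bytes, that is 16
hex digits:

```python
import struct

MIME = "application/vnd.sun.xml.writer.global"

size = len(MIME.encode("ascii"))
# Compressed size, then uncompressed size, each as a little-endian 32-bit value
sizes_field = struct.pack("<II", size, size)

print(size, hex(size))            # 37 0x25
print(sizes_field.hex().upper())  # 2500000025000000
```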

The MIME type is always stored in a file with the ASCII name mimetype, so the
length of this file name is eight. That is expressed by the XML construct:

   <Bytes>0800</Bytes>
   <Pos>26</Pos>
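
The four front patterns can be tried together with a sketch like the
following. It is Python, not how TrID itself works internally, and the helper
names are invented. The header is built by hand for a stored member without
an extra field; the size pattern is written out as the full eight bytes.

```python
import struct

# (offset, hex bytes) pairs for the local file header of the first member
PATTERNS = [
    (8, "0000"),                                             # method: stored
    (18, "2500000025000000"),                                # both sizes = 37
    (26, "0800"),                                            # name length = 8
    (30, "6D696D65747970656170706C69636174696F6E2F766E64"),  # mimetypeapplication/vnd
]


def matches(data: bytes) -> bool:
    """Return True if every pattern is found at its fixed offset."""
    for pos, hex_bytes in PATTERNS:
        raw = bytes.fromhex(hex_bytes)
        if data[pos:pos + len(raw)] != raw:
            return False
    return True


def local_header(name: bytes, payload: bytes) -> bytes:
    """Hand-built local file header plus name and stored payload.

    Timestamp and CRC are left at zero because no pattern covers them.
    """
    fixed = struct.pack(
        "<4sHHHHHIIIHH",
        b"PK\x03\x04", 20, 0, 0, 0, 0, 0,
        len(payload), len(payload), len(name), 0,
    )
    return fixed + name + payload


sxg = local_header(b"mimetype", b"application/vnd.sun.xml.writer.global")
sxw = local_header(b"mimetype", b"application/vnd.sun.xml.writer")
print(matches(sxg), matches(sxw))
```

The SXW header fails only on the size pattern, because its MIME string is 30
bytes long instead of 37.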

Patterns at higher offsets matched only by coincidence in my samples, so I
removed them. I also removed nonsense lines in the global strings section like:

   <String>K'''CON</String>

I also got lines like

   <String>MANIFEST.XMLPK</String>

That is triggered by the file named manifest.xml inside the ZIP archive. If
the header entry contains no extra field, then the stored file name is
directly followed by the next ZIP record, which starts with the magic string
PK. That was true for the samples I inspected, but nowhere is it explicitly
written that this is a strict requirement. And I remember that for some Java
JAR files I found examples with extra fields. So I removed such appended PK
parts in all patterns. In the global strings section the lines now look like:

   <String>MANIFEST.XML</String>
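
A hedged Python sketch of why the trailing PK is unreliable. In a
central directory entry the file name is followed by the optional extra field
and comment, and only then by the next record. The archive below is a
made-up test case, and the extra field (header ID 0xCAFE with an empty body)
is chosen purely for illustration:

```python
import io
import struct
import zipfile


def build(extra: bytes = b"") -> bytes:
    """Write META-INF/manifest.xml as the only member, optionally with an extra field."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("META-INF/manifest.xml")
        info.extra = extra
        zf.writestr(info, "<manifest/>")
    return buf.getvalue()


plain = build()
# Hypothetical extra field: ID 0xCAFE, data length 0 (4 bytes in total)
with_extra = build(struct.pack("<HH", 0xCAFE, 0))

print(b"manifest.xmlPK" in plain)
print(b"manifest.xmlPK" in with_extra)
print(b"manifest.xml" in with_extra)
```

Without the extra field the name is followed directly by the next PK record;
with it the four extra bytes sit in between, so only the shorter string
without PK still matches both archives.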

With the additional sxg-staroffice.trid.xml the previously unrecognized SXG
samples are now described as "StarOffice Master document" (see appended
output/trid-v-new.txt).

In the same way I generated stw-staroffice.trid.xml for "StarOffice Writer
template" and stw-staroffice-web.trid.xml for "StarOffice Web template".

For StarOffice Writer documents like oooxml_embedded.sxw no MIME type is
shown, so I added the following line to sxw.trid.xml:

   <Mime>application/vnd.sun.xml.writer</Mime>

All such StarOffice Writer examples are described as "Zip document container
(generic)" by zip-doc-cont.trid.xml. But no reference URL is listed there,
and "application/octet-stream" is displayed as the MIME type. So I added 2
lines matching ZIP files:

   <Mime>application/zip</Mime>
   <RefURL>http://en.wikipedia.org/wiki/Zip_(file_format)</RefURL>
      
In my latest download of the TrID definitions, dated 14 March 2020,
ark-zip.trid.xml for "ZIP compressed archive" is missing, whereas in the older
download dated 1 February 2020 that definition still exists.

With the 3 additional TrID definitions and the 2 updated definitions, the
previously unrecognized StarOffice Writer variants are now detected, and all
definitions have a reference URL and a MIME type (see appended
output/trid-v-new.txt).

The TrID definitions, output and some examples are stored in the archive
s_writer.zip. I hope that my 5 XML files can be used in a future version of
triddefs.

With best wishes
Jörg Jenderek

Mark0

Thanks for the updates and the new defs!
I think the mime-type at offset 30 is more than enough to identify the files, so the minor details about the Zip format peculiarities can be omitted.