Author Topic: OpenDocument.trid.xml for unrecognized OpenDocuments *.o??  (Read 1771 times)

jenderek

OpenDocument.trid.xml for unrecognized OpenDocuments *.o??
« on: March 17, 2020, 05:49:50 PM »
Hello TrID users,

some days ago I handled some non-Microsoft Office documents. When running
TrID on samples with the file name extension OTP, OTT, ODT or ODS, these
are only described as "ZIP compressed archive" by ark-zip.trid.xml or are
misidentified as "XMind Workbook" by xmind.trid.xml (see appended
ods-other/output/trid-v-old.txt).

At first I thought these samples were accidents, because some of them are
part of test suites.

But I also found 111 samples with the ott extension. Some of these are
recognized as "Zip document container (generic)" by zip-doc-cont.trid.xml
(see appended ott-zip/output/trid-v-old.txt), but they are not described as
"OpenDocument Text Document template" by ott.trid.xml.

When running the DROID identification tool, these examples are described as
"OpenDocument Text" (see appended ott-zip/output/droid-ott.csv).

The first obvious difference from the recognized samples is that the file
named "mimetype" is not stored as the first member of the zip archive. This
is visible in the listing of 7-Zip (see appended
ott-zip/output/7z-l.txt).
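
The member order can also be checked without 7-Zip. The following is only a
sketch using Python's zipfile module; the function name mimetype_position is
made up for illustration. It reads the order of the central directory, which
normally, but not necessarily, matches the physical order at the start of
the file.

```python
import zipfile

def mimetype_position(path_or_file):
    """Return the index of 'mimetype' in the archive listing, or -1 if absent.

    Index 0 means the member is listed first, which is what the existing
    OpenDocument definitions rely on.
    """
    with zipfile.ZipFile(path_or_file) as zf:
        names = zf.namelist()
    return names.index("mimetype") if "mimetype" in names else -1
```

A result of 0 corresponds to the well-behaved case, a positive value to
samples where mimetype is not the first member, and -1 to samples without
any mimetype file.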

That also means that in the front block section of the TrID definition file
the following XML construct vanishes:

   <Bytes>6D696D65747970656170706C69636174696F6E2F766E642E6F617369732E
   <ASCII> m i m e t y p e a p p l i c a t i o n / v n d . o a s i s .
   <Pos>30</Pos>

I also found an example, invalid_ooo3_2_doc3.odt, where the archive contains
no "mimetype" file at all. As a consequence, a line like the following also
vanishes from the global strings section:

   <String>MIMETYPEPK</String>

Then there exist many OpenDocument variants like 01_notes.ott where the file
mimetype is the first archive member and contains the correct value string
application/vnd.oasis.opendocument.text-template, but which are still not
recognized by the TrID definitions for OpenDocuments. I needed some days to
find the reason for this misbehavior.

When inspecting well-behaved examples like 01_notes-stored.ott and looking
at the 7-Zip listing with the show technical information option, we see that
mimetype is the first member and the compression method used is "Store" (see
appended output/7z-l-slt.txt). According to the ZIP APPNOTE the value for
this method is the 16-bit value zero, stored at offset 8 of the local file
header. This is expressed inside ott.trid.xml by a construct like

   <Bytes>0000</Bytes>
   <Pos>8</Pos>

For many unrecognized OpenDocuments like 01_notes.ott we see that mimetype
is the first member and the compression method used is "Deflate" (see
appended ott-zip/output/7z-l-slt.txt). According to the ZIP APPNOTE the value
for this method is the 16-bit value eight, stored in little-endian byte
order. This would be expressed inside a TrID definition by a construct like:

   <Bytes>0800</Bytes>
   <Pos>8</Pos>

Because the content of the file mimetype is now deflated, the string
application/vnd.oasis.opendocument.text-template no longer occurs as clear
text inside the zip archive container. So in the TrID definition the XML
construct in the front block section shrinks to:

   <Bytes>6D696D6574797065
   <ASCII> m i m e t y p e
   <Pos>30</Pos>
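
The offsets used above (method at 8, file name at 30) follow from the fixed
30-byte layout of the zip local file header. As a rough sketch, assuming the
archive starts with a local file header, the fields can be decoded in Python
like this; the function name and the keys of the returned dictionary are my
own choice, not part of TrID:

```python
import struct

# Fixed part of a zip local file header (30 bytes, little-endian):
# signature, version needed, flags, method, mod time, mod date,
# crc32, compressed size, uncompressed size, name length, extra length
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")

def read_first_local_header(data: bytes) -> dict:
    """Decode the local file header at offset 0 of a zip container."""
    if len(data) < LOCAL_HEADER.size:
        raise ValueError("too short for a local file header")
    (sig, _version, _flags, method, _mtime, _mdate,
     _crc, _csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, 0)
    if sig != b"PK\x03\x04":
        raise ValueError("no local file header signature at offset 0")
    name_end = 30 + name_len
    return {
        "method": method,              # 0 = Store, 8 = Deflate
        "name": data[30:name_end],     # e.g. b"mimetype"
        "extra_len": extra_len,        # bytes of extra fields after the name
        # True only if the clear-text MIME prefix directly follows the name
        "mime_follows_name": data.startswith(b"application/vnd.oasis.", name_end),
    }
```

For a stored mimetype without extra fields, mime_follows_name is true; for a
deflated one it is false, which matches the shortened pattern above.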

I was able to generate well-behaved OpenDocuments by extracting unrecognized
samples and repacking the mimetype file with the -0 option of the zip
command, which uses the "store" method without compression. If I do not use
this option, the default is often the same as the -Z deflate zip option.
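
The same kind of repacking can be sketched in Python. This is not the zip
command line used above, just a stand-in built on the zipfile module; the
function name repack_with_stored_mimetype is hypothetical. It writes
mimetype first and uncompressed, and deflates everything else.

```python
import zipfile

def repack_with_stored_mimetype(src, dst):
    """Copy a zip container so that 'mimetype' is the first member and stored.

    src and dst may be file names or file-like objects.
    """
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        if "mimetype" in zin.namelist():
            # Store method (value 0): the MIME string stays readable
            zout.writestr(zipfile.ZipInfo("mimetype"),
                          zin.read("mimetype"),
                          compress_type=zipfile.ZIP_STORED)
        for info in zin.infolist():
            if info.filename == "mimetype":
                continue
            # Remaining members use the archive default (Deflate)
            zout.writestr(info.filename, zin.read(info.filename))
```

Time stamps and other per-member attributes are not preserved by this
sketch, so it is only meant for producing test samples.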

When repacking without the --no-extra zip option, as in example
01_notes-extra.ott, the string mimetype is at the correct offset in the
container, but the mime type value string
application/vnd.oasis.opendocument.graphics-template does not follow it
directly. This text occurs some bytes later, because extra fields, for
example 140 bytes for security on Windows systems, are now stored after the
string mimetype. This is also visible from an additional phrase like
"0x4453 UT" in the characteristics of mimetype in the 7-Zip listing with the
show technical information option (see appended output/7z-l-slt.txt).
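
To see which extra fields push the MIME string away from the file name, the
extra area of the first local file header can be walked record by record.
This is a minimal sketch that assumes a local file header at offset 0; the
function name extra_field_ids is made up for illustration.

```python
import struct

def extra_field_ids(data: bytes):
    """List (header_id, size) pairs of the extra fields of the first member."""
    if data[0:4] != b"PK\x03\x04":
        raise ValueError("no local file header signature at offset 0")
    # Name length and extra length are the last two 16-bit fields (offset 26)
    name_len, extra_len = struct.unpack_from("<HH", data, 26)
    start = 30 + name_len
    extra = data[start:start + extra_len]
    records = []
    pos = 0
    # Each record is: 16-bit header ID, 16-bit data size, then the data
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        records.append((header_id, size))
        pos += 4 + size
    return records
```

An empty list means the member content starts directly after the file name;
any entry means the content, and so the MIME string, is shifted by the total
extra length.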

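According to the ZIP APPNOTE the highest available compression method value
is 99, for AE-x encryption. So every method value fits into the low byte at
offset 8, where any value other than 0 means not stored, and the high byte
at offset 9 is zero for every method. This is expressed inside
OpenDocument.trid.xml by an XML construct like: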

   <Bytes>00</Bytes>
   <Pos>9</Pos>

Of course, the pattern for the zip local file header signature is also found
in the front block section. That is described by an XML construct like:

   <Bytes>504B0304</Bytes>
   <ASCII> P K</ASCII>
   <Pos>0</Pos>

The OpenDocument technical specification is described on Wikipedia. That is
expressed by a reference line like:

   <RefURL>
   https://en.wikipedia.org/wiki/OpenDocument_technical_specification
   </RefURL>
   
According to that page an OpenDocument contains a directory META-INF and
files like manifest.xml and meta.xml. That is expressed inside the global
strings section by lines like:

   <String>MANIFEST.XML</String>
   <String>META-INF</String>
   <String>META.XML</String>
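
Taken together, the patterns quoted from OpenDocument.trid.xml can be
imitated roughly in Python. This is only a sketch of the idea, not how TrID
itself scores definitions. It assumes that the global strings are matched
anywhere in the file without regard to case, and it only uses the patterns
named in this post (signature at 0, zero byte at 9, three strings). The
function name is hypothetical.

```python
GLOBAL_STRINGS = (b"MANIFEST.XML", b"META-INF", b"META.XML")

def looks_like_generic_opendocument(data: bytes) -> bool:
    """Rough imitation of the generic OpenDocument patterns described above."""
    front_ok = (
        data[0:4] == b"PK\x03\x04"      # zip local file header signature
        and data[9:10] == b"\x00"       # high byte of the compression method
    )
    # Assumption: strings are compared case-insensitively over the whole file
    upper = data.upper()
    strings_ok = all(s in upper for s in GLOBAL_STRINGS)
    return front_ok and strings_ok
```

The check does not look at the mimetype member at all, which is why it also
accepts samples where that file is deflated, not first, or missing. Any zip
archive that happens to contain META-INF/manifest.xml and meta.xml would
pass as well.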

With the additional OpenDocument.trid.xml the previously unrecognized
OpenDocuments are now detected (see appended output/trid-v-new.txt). I hope
that the new TrID definition is not so generic that it also catches
non-OpenDocument zip archives.

The TrID definition, output and some examples are stored in the archive
od_other.zip. I hope that my new XML file can be used in a future version of
triddefs.

With best wishes
Jörg Jenderek

Mark0

Re: OpenDocument.trid.xml for unrecognized OpenDocuments *.o??
« Reply #1 on: March 17, 2020, 09:10:57 PM »
Thanks for the new def.
I'm not sure about this one, but I'll surely try and see about it.