Topic: docxf.trid.xml for ONLYOFFICE form template

jenderek

docxf.trid.xml for ONLYOFFICE form template
« on: September 21, 2022, 08:40:19 PM »
Hello trid users,

A few days ago I tried an alternative office suite to get away from
Microsoft. The suite is called ONLYOFFICE and can be found at onlyoffice.com.
It is said to be compatible with Microsoft Office. Modern Word documents use
the file name extension DOCX, and ONLYOFFICE can read and write such files.
docx.trid.xml describes them as "Word Microsoft Office Open XML Format
document".

Besides the standard file types, ONLYOFFICE offers 2 file types of its own.
One has the suffix DOCXF and is called "ONLYOFFICE-Formularvorlage"
(ONLYOFFICE form template) in the German version.

So I ran the trid utility on my DOCXF examples. All are described correctly,
but only generically and with a low rate, as "ZIP compressed archive" by
ark-zip.trid.xml with mime type application/zip. All examples are also
described "wrongly" (suffix DOCX instead of DOCXF) as "Word Microsoft Office
Open XML Format document" by docx.trid.xml (see appended
output/trid-v-old.txt).

For comparison I checked these examples with the file command (version
5.43). Here too all examples are described generically as "Zip archive data"
(see appended output/file-k-5.43.txt) with mime type application/zip (see
appended output/file-ki-5.43.txt). All of my few examples are also described
more specifically as "Microsoft Word 2007+". For these, the same wrong
extension DOCX and the same mime type as with TrID are shown (see appended
file-ext-5.43.txt and file-i-5.43.txt in output).

For comparison I also ran the file format identification utility DROID (see
https://sourceforge.net/projects/droid/). It also describes all examples as
"Microsoft Word for Windows" with version "2007 onwards" by PUID fmt/412,
but the software complains about the file name suffix DOCXF instead of DOCX.

That all tools identify the files as "new" Microsoft Word is not surprising.
As described on a page on the ONLYOFFICE web site, DOCXF is based on DOCX.
The new definition docxf.trid.xml records this with lines like:
 <FileType>ONLYOFFICE form template</FileType>
 <RefURL>https://www.onlyoffice.com/blog/2022/01/7-interesting-facts-about-onlyoffice-forms/</RefURL>

Unfortunately they do not describe what exactly the difference is from
Microsoft DOCX and from ONLYOFFICE OFORM. In the program I used (Windows
variant, version 7.1.1.57) the file format is registered as ASC.Docxf, but
the association with the DOCXF file name suffix is missing. No mime type is
found in the Windows registry for that file type either. In my opinion those
are too many elementary mistakes. Because DOCXF is "different" from
Microsoft DOCX, the DOCX mime type cannot be used. But because DOCXF files
are ZIP archives, at least the ZIP mime type should be applied. So I do this
with a line like:
   <Mime>application/zip</Mime>

After running tridscan to generate the definition docxf.trid.xml I looked at
the XML constructs that were created and tried to understand them. I would
like to reduce their number, but I could not, because the ONLYOFFICE team
does not explain what exactly their additions to, or differences from, the
DOCX format and ONLYOFFICE OFORM are. So I do not know whether the XML
constructs always hold or were just triggered by lucky circumstances. For
the moment I keep all XML constructs.
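
One way to see which constructs survive as more samples turn up is to test
each (position, bytes) pair against every sample. The following Python
sketch does only that; it is not how tridscan works internally, and the
function names and the samples/ directory in the comment are made up for
illustration. A pattern that matches every sample is merely consistent with
the samples seen so far, not guaranteed by the format.

```python
from typing import List, Tuple

# (position, hex string), mirroring the <Pos> and <Bytes> elements.
Pattern = Tuple[int, str]


def pattern_matches(data: bytes, pos: int, hex_bytes: str) -> bool:
    """Return True if the bytes at `pos` equal the given hex sequence."""
    expected = bytes.fromhex(hex_bytes)
    return data[pos:pos + len(expected)] == expected


def split_patterns(samples: List[bytes], patterns: List[Pattern]):
    """Split patterns into those matching every sample and the rest."""
    stable, unstable = [], []
    for pos, hex_bytes in patterns:
        ok = all(pattern_matches(s, pos, hex_bytes) for s in samples)
        (stable if ok else unstable).append((pos, hex_bytes))
    return stable, unstable


# Hypothetical usage, assuming the samples sit in a directory "samples":
#   from pathlib import Path
#   samples = [p.read_bytes() for p in Path("samples").glob("*.docxf")]
#   stable, unstable = split_patterns(
#       samples, [(0, "504B030414000000000000002100")])
```

Patterns that end up in the unstable list after adding new samples are
candidates for removal from the definition.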

Because DOCXF files are ZIP containers, we can inspect them with suitable
unpacking tools such as 7-Zip. There we see that all archive members have a
time stamp of midnight, 1 January 1980 (see appended output/7z-l.txt). I do
not know whether this is a bug or a feature. So the 2 bytes of the
modification time of the first member, at offset 10, are nil, and the 2
bytes of its DOS modification date, at offset 12, are the byte sequence
2100. This is expressed by the first XML construct, which looks like:
   <Bytes>504B030414000000000000002100</Bytes>
   <ASCII> P K . . . . . . . . . . !</ASCII>
   <Pos>0</Pos>
That differs from the construct for a generic ZIP archive in
ark-zip.trid.xml, which looks like:
   <Bytes>504B0304</Bytes>
   <ASCII> P K</ASCII>
   <Pos>0</Pos>
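
To double-check what the 14 bytes of the longer pattern mean, here is a
small Python sketch. It assumes the standard ZIP local file header layout:
a 4-byte signature followed by five 2-byte little-endian fields (version
needed, flags, compression method, DOS modification time, DOS modification
date). The function names are my own, and member_timestamps is just an
alternative to reading the 7-Zip listing.

```python
import io
import struct
import zipfile
from datetime import datetime

# The 14 bytes from the <Bytes> element of the DOCXF pattern.
PATTERN = bytes.fromhex("504B030414000000000000002100")


def dos_to_datetime(dos_date: int, dos_time: int) -> datetime:
    """Convert packed DOS date/time fields to a datetime.

    Raises ValueError for fields that do not form a valid date.
    """
    return datetime(
        ((dos_date >> 9) & 0x7F) + 1980,  # years since 1980
        (dos_date >> 5) & 0x0F,           # month
        dos_date & 0x1F,                  # day
        (dos_time >> 11) & 0x1F,          # hour
        (dos_time >> 5) & 0x3F,           # minute
        (dos_time & 0x1F) * 2,            # seconds, 2-second resolution
    )


def decode_header_prefix(prefix: bytes) -> dict:
    """Decode the first 14 bytes of a ZIP local file header."""
    signature, version, flags, method, dos_time, dos_date = struct.unpack(
        "<4sHHHHH", prefix[:14]
    )
    return {
        "signature": signature,
        "version_needed": version,
        "flags": flags,
        "method": method,
        "modified": dos_to_datetime(dos_date, dos_time),
    }


def member_timestamps(zip_bytes: bytes):
    """List (member name, time stamp) for every member of a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [(i.filename, datetime(*i.date_time)) for i in zf.infolist()]
```

For the pattern bytes, decode_header_prefix(PATTERN) gives version 20, flags
0, method 0 (stored) and a modification time stamp of 1980-01-01 00:00:00,
which agrees with the 7-Zip listing.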

With the new TrID definition all my DOCXF examples are now described more
precisely (see appended output/trid-v-new.txt). The TrID definition and the
outputs are stored in the archive docxf_.zip. I hope that my XML file can be
used in a future version of triddefs.

Unfortunately my definition is based on only a few examples, so it still
contains many short patterns that were probably generated by lucky
circumstances. Because the difference from the Microsoft DOCX format is not
explained, I kept all patterns. Other users are welcome to improve the
definition.

With best wishes
Jörg Jenderek

Mark0

Re: docxf.trid.xml for ONLYOFFICE form template
« Reply #1 on: September 27, 2022, 02:17:31 AM »
Thanks!