Mark0's Forum
		Software => TrID File Identifier => Topic started by: jenderek on September 21, 2022, 08:40:19 PM
		
			
			- 
				Hello TrID users,
 
 A few days ago I tried an alternative office suite to get away from
 Microsoft. The suite is called ONLYOFFICE and can be found at
 onlyoffice.com. It is said to be compatible with Microsoft Office. Modern
 Word documents use the file name extension DOCX, and ONLYOFFICE can read and
 write such files. TrID describes them as "Word Microsoft Office Open XML
 Format document" via docx.trid.xml.
 
 Besides the standard file types, ONLYOFFICE offers 2 file types of its own.
 One has the suffix DOCXF and is called "ONLYOFFICE-Formularvorlage" (form
 template) in the German version.
 
 So I ran the trid utility on my DOCXF examples. All are described correctly,
 with a low rate, as generic "ZIP compressed archive" by ark-zip.trid.xml with
 MIME type application/zip. All examples are also described "wrongly" (suffix
 DOCX instead of DOCXF) as "Word Microsoft Office Open XML Format document" by
 docx.trid.xml (see appended output/trid-v-old.txt).
 
 For comparison I checked these examples with the file command (version
 5.43). Here, too, all examples are described generically as "Zip archive
 data" (see appended output/file-k-5.43.txt) with MIME type application/zip
 (see appended output/file-ki-5.43.txt). All of my few examples are also
 described more specifically as "Microsoft Word 2007+". For these examples the
 same wrong extension DOCX and the same MIME type as with TrID are shown (see
 appended file-ext-5.43.txt and file-i-5.43.txt in output).
 
 For comparison I also ran the file format identification utility DROID (see
 https://sourceforge.net/projects/droid/). It also describes all examples as
 "Microsoft Word for Windows", version "2007 onwards", with PUID fmt/412. But
 the software complains about the file name suffix DOCXF instead of DOCX.
 
 That all tools identify the examples as "new" Microsoft Word is not
 surprising. As described on a page on the ONLYOFFICE web site, DOCXF is
 based on DOCX. This fact is represented in the new definition DOCXF.trid.xml
 by lines like:
 <FileType>ONLYOFFICE form template</FileType>
 <RefURL>https://www.onlyoffice.com/blog/2022/01/7-interesting-facts-about-onlyoffice-forms/</RefURL>
 
 Unfortunately they do not describe exactly how DOCXF differs from Microsoft
 DOCX and from ONLYOFFICE OFORM. In the program I used (Windows variant,
 version 7.1.1.57) the file format is registered as ASC.Docxf, but the
 association with the DOCXF file name suffix is missing. Also, no MIME type is
 found in the Windows registry for that file type. In my opinion those are
 too many elementary errors. Because DOCXF is "different" from Microsoft
 DOCX, that MIME type cannot be used. But because DOCXF files are ZIP
 archives, at least the ZIP MIME type should be applied. So I do this with a
 line like:
 <Mime>application/zip</Mime>
 
 After running tridscan to generate the definition docxf.trid.xml, I looked at
 which XML constructs were created and tried to understand them. I would like
 to reduce the XML constructs, but I was not able to, because the ONLYOFFICE
 team does not explain exactly what their additions to, or differences from,
 the DOCX format and ONLYOFFICE OFORM are. So I do not know whether the XML
 constructs always hold or were just triggered by lucky circumstances. For
 the moment I am keeping all XML constructs.
 
 Because DOCXF files are ZIP containers, we can inspect such examples with
 suitable unpacking tools, 7-Zip for example. There we see that all archive
 members have a time stamp of midnight, 1 January 1980 (see appended
 output/7z-l.txt). I do not know whether this is a bug or a feature. So in
 the local file header of the first member the 2 bytes for the modification
 time at offset 10 are nil, and the 2-byte DOS modification date at offset 12
 is the byte sequence 2100 (little-endian 0x0021, that is 1 January 1980).
 The general purpose flags at offset 6 and the compression method at offset 8
 are nil as well. This is expressed by the first XML construct, which looks
 like:
 <Bytes>504B030414000000000000002100</Bytes>
 <ASCII> P K . . . . . . . . . . !</ASCII>
 <Pos>0</Pos>
 That is different from the construct for generic ZIP archives in
 ark-zip.trid.xml, which looks like:
 <Bytes>504B0304</Bytes>
 <ASCII> P K</ASCII>
 <Pos>0</Pos>
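 
 To make the longer pattern easier to read, here is a minimal Python sketch
 that splits its 14 bytes into the fields of a ZIP local file header
 (signature, version needed, flags, compression method, DOS time, DOS date,
 all little-endian). The function name decode_local_header_prefix is my own
 and only for illustration; it is not part of TrID or tridscan.

```python
import struct

# The 14-byte front pattern, copied from the <Bytes> element of the definition.
PATTERN = bytes.fromhex("504B030414000000000000002100")


def decode_local_header_prefix(data: bytes) -> dict:
    """Decode the first 14 bytes of a ZIP local file header (little-endian)."""
    signature, version, flags, method, dos_time, dos_date = struct.unpack(
        "<4sHHHHH", data[:14]
    )
    return {
        "signature": signature,      # offset 0, 4 bytes
        "version_needed": version,   # offset 4
        "flags": flags,              # offset 6
        "method": method,            # offset 8 (0 = stored)
        # offset 10: hours in bits 11-15, minutes in bits 5-10, seconds/2 in bits 0-4
        "time": (dos_time >> 11, (dos_time >> 5) & 0x3F, (dos_time & 0x1F) * 2),
        # offset 12: years since 1980 in bits 9-15, month in bits 5-8, day in bits 0-4
        "date": (1980 + (dos_date >> 9), (dos_date >> 5) & 0x0F, dos_date & 0x1F),
    }


fields = decode_local_header_prefix(PATTERN)
for name, value in fields.items():
    print(f"{name}: {value}")
```

 For the pattern above the date field decodes to 1 January 1980 and the time
 field to midnight, which agrees with the 7-Zip listing.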
 
 With the new TrID definition all my DOCXF examples are now described more
 precisely (see appended output/trid-v-new.txt). The TrID definition and the
 outputs are stored in the archive docxf_.zip. I hope that my XML file can be
 used in a future version of triddefs.
 
 Unfortunately my definition is based on only a few examples. So it still
 contains many short patterns, which were probably generated by lucky
 circumstances. Because the difference from the Microsoft DOCX format is not
 explained, I kept all patterns. Other users are welcome to improve the
 definition.
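 
 Anyone with more samples could check whether the two observations hold for
 them (the 14-byte front pattern and the 1980 time stamp on every member).
 Below is a rough Python sketch using the standard zipfile module. The names
 check_sample and make_demo_archive are hypothetical, and the demo archive
 with its two member names is only a stand-in so the sketch runs without real
 DOCXF files. For a real test, read the bytes of a sample file instead.

```python
import io
import zipfile

# Front pattern from the definition and the time stamp seen in the 7-Zip listing.
FRONT_PATTERN = bytes.fromhex("504B030414000000000000002100")
DOS_EPOCH = (1980, 1, 1, 0, 0, 0)


def check_sample(data: bytes) -> dict:
    """Report whether a ZIP-based sample shows both observations."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        stamps = [info.date_time for info in archive.infolist()]
    return {
        "front_pattern": data.startswith(FRONT_PATTERN),
        "all_members_1980": all(stamp == DOS_EPOCH for stamp in stamps),
        "members": len(stamps),
    }


def make_demo_archive() -> bytes:
    """Build a small in-memory archive with zeroed time stamps.

    Hypothetical stand-in for a real sample; the member names are assumed.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        for name in ("[Content_Types].xml", "word/document.xml"):
            info = zipfile.ZipInfo(name, date_time=DOS_EPOCH)
            archive.writestr(info, b"<demo/>")
    return buffer.getvalue()


report = check_sample(make_demo_archive())
print(report)
```

 A sample where front_pattern is false but the file is still a valid DOCXF
 would show that the long pattern is too strict for the definition.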
 
 With best wishes
 Jörg Jenderek
 
- 
				Thanks!