Author Topic: wpd-docfile.trid.xml for variant of Microsoft WordPerfect Document *.WPD  (Read 1189 times)

jenderek

Hello trid users,

A few days ago I handled a variant of the WordPerfect document with the WPD
extension. These files are based on the OLE 2 Compound Document format.

For my examples I only got the description "Generic OLE2 / Multistream
Compound" from docfile.trid.xml, which is correct in principle but not very
specific (see appended output/trid-v-old.txt).

For comparison I also ran the file utility (version 5.40). It behaves
similarly. The examples are described generically as "OLE 2 Compound
Document", but no subtype classification is done. Instead only "UNKNOWN" is
shown, but luckily the CLSID in use is shown in hexadecimal form as
ff739851ad2d200219370000929679cd.
With the help of an online Windows GUID converter, like the one on
toolslick.com, I was able to convert this into GUID form with curly braces
and hyphen separators. It now looks like:
   {519873FF-2DAD-0220-1937-0000929679CD}
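
The same conversion can be reproduced offline. This is a minimal Python sketch, assuming that the hexadecimal string shown by file is the raw 16 CLSID bytes in file order, so the first three GUID fields are little endian. The function name clsid_hex_to_guid is made up for this illustration.

```python
import uuid


def clsid_hex_to_guid(raw_hex: str) -> str:
    """Convert 16 raw CLSID bytes (given as hex) to the braced GUID form."""
    raw = bytes.fromhex(raw_hex)
    if len(raw) != 16:
        raise ValueError("a CLSID is exactly 16 bytes")
    # bytes_le: the first three fields are stored little endian,
    # which is how a CLSID is laid out inside a compound file header.
    return "{" + str(uuid.UUID(bytes_le=raw)).upper() + "}"


print(clsid_hex_to_guid("ff739851ad2d200219370000929679cd"))
# {519873FF-2DAD-0220-1937-0000929679CD}
```

The printed value matches the GUID obtained from the online converter.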

The example ole6.wpd, with the WPD file name extension, is found in the test
directory of the sources of the WordPerfect to LaTeX converter "wp2latex".
It can be found at a URL like:
   https://fossies.org/linux/wp2latex/test/ole6.wpd

Information about the WordPerfect WPD format can be found, for example, on
the File Formats Archive Team web site. The mentioned information is now up
to date. For the inspected example I found on GitHub a site with the document
WPFF_DocumentStructure.htm about the WordPerfect file format. That
information is expressed by a line like:
 <RefURL>http://justsolve.archiveteam.org/wiki/WordPerfect</RefURL>

The middle-generation WordPerfect documents (versions 5 and 6) are
characterized by the start pattern \xffWPC. Such examples are described as
"WordPerfect Document (generic)" by wpd-doc-gen.trid.xml and as "WordPerfect
(generic)" by wp-generic.trid.xml, using an XML construct like:
   <Bytes>FF575043</Bytes>
   <ASCII> . W P C</ASCII>
   <Pos>0</Pos>
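
As a small illustration of what this pattern means, the following Python sketch checks the four bytes at position 0. The names WPC_MAGIC and has_wpc_magic are hypothetical helpers and not part of TrID.

```python
WPC_MAGIC = b"\xffWPC"  # the same bytes as <Bytes>FF575043</Bytes>


def has_wpc_magic(data: bytes) -> bool:
    """Return True if the data starts with the WordPerfect start pattern."""
    return data[:4] == WPC_MAGIC


print(WPC_MAGIC.hex().upper())                 # FF575043
print(has_wpc_magic(b"\xffWPC" + bytes(12)))   # True
print(has_wpc_magic(bytes.fromhex("D0CF11E0A1B11AE1")))  # False
```

The last call shows why the classic pattern does not match the compound variant: such a file starts with the OLE 2 signature, and the \xffWPC data only appears inside an embedded stream.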

According to the reference, since version 7 (WP7) this format is embedded as
the PerfectOffice_MAIN stream inside a Microsoft OLE Compound File. This can
be verified by extracting that stream, for example with Michal Mutl's MiTeC
Structured Storage Viewer. For ole6.wpd I got ole6-PerfectOffice_MAIN.wpd
(see appended output/trid-v-old.txt).

According to the reference, a characteristic of this variant is that the
minor version is raised from 1 to 2. So, for example, the complete version is
now 2.2 (see appended output/file-5.40-e-cdf.txt).

I installed the trial version of WordPerfect Office 2021. It is able to
read and write such examples. There this format is called WordPerfect
Compound File. So I mention this fact in a remark line like:
   <Rem>called "WordPerfect Compound File" by program</Rem>

Using the names from docfile.trid.xml and wpd-doc-gen.trid.xml, the new
definition generated by tridscan is called wpd-docfile.trid.xml, and the
description for such WPD examples is expressed by lines like:
   <FileType>
   WordPerfect Document (OLE2 Multistream Compound)
   </FileType>
For this format the file name extension and MIME type are the same as in
older versions. That is expressed by lines like:
   <Ext>WPD</Ext>
   <Mime>application/vnd.wordperfect</Mime>

So in wpd-docfile.trid.xml I get the characteristic Compound starting byte
sequence. That is expressed by an XML construct like:

 <Bytes>
 D0CF11E0A1B11AE1000000000000000000000000000000003E000300FEFF0900060000000000000000000000
 </Bytes>
 <Pos>0</Pos>

Maybe this will become shorter or be split when more examples are inspected.
All my examples are described by the file command as version 3.62 (see
appended output/file-5.40-e-cdf.txt). If I am calculating right, this is
expressed by the byte sequence 3E000300 inside the first pattern.
In the generic docfile.trid.xml the characteristic starting sequence is
expressed by:
 <Bytes>D0CF11E0A1B11AE1</Bytes>
 <Pos>0</Pos>
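
The version calculation mentioned above can be checked with a few lines of Python. This is only a sketch, assuming the usual compound file header layout: 8 signature bytes, 16 CLSID bytes, then the minor and the major version as 16-bit little endian values. The function name cdf_version is made up for this example.

```python
import struct


def cdf_version(header: bytes) -> tuple:
    """Return (major, minor) from the start of a compound file header."""
    if header[:8] != bytes.fromhex("D0CF11E0A1B11AE1"):
        raise ValueError("not an OLE 2 compound file signature")
    # Offset 24 = 8 signature bytes + 16 CLSID bytes.
    minor, major = struct.unpack_from("<HH", header, 24)
    return major, minor


start = bytes.fromhex("D0CF11E0A1B11AE1" + "00" * 16 + "3E000300")
print(cdf_version(start))  # (3, 62)
```

So 3E00 is the minor version 62 (0x3E) and 0300 is the major version 3, which agrees with the 3.62 reported by the file command.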

Characteristic for OLE Compound examples is the UTF-16 little-endian encoded
phrase "Root Entry". It was always found at offset 0x400 (1024 decimal) in
my examples. So I found that phrase inside the byte sequence starting at
offset 512. Maybe with bigger and more complicated examples that part occurs
at other offsets. But it should always be expressed inside the global strings
section by a line like:
   <String>R'O'O'T' 'E'N'T'R'Y</String>
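
To check at which offset the phrase sits in a given file, a plain byte search is enough. The sketch below uses a synthetic buffer instead of a real WPD file, and the helper name find_utf16le is hypothetical.

```python
def find_utf16le(data: bytes, text: str) -> int:
    """Return the first offset of text encoded as UTF-16 LE, or -1."""
    return data.find(text.encode("utf-16-le"))


# Synthetic stand-in for a compound file: 1024 filler bytes, then the name.
fake = bytes(0x400) + "Root Entry".encode("utf-16-le") + bytes(64)
offset = find_utf16le(fake, "Root Entry")
print(offset, hex(offset))  # 1024 0x400
```

For a real example the buffer would be read with open(path, "rb").read(). Note that this search is case sensitive, so the name must be spelled as it is stored in the file.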

According to the documentation, two stream names are characteristic:
PerfectOffice_MAIN and PerfectOffice_OBJECTS, encoded as UTF-16 little
endian. In my examples the first, with the document part, was always found at
offset 1664 (680 hexadecimal) and the second was always found at offset 1792
(700 hexadecimal). Maybe with bigger and more complicated examples that
part occurs at other offsets. But it should always be expressed inside the
global strings section by lines like:
   <String>P'E'R'F'E'C'T'O'F'F'I'C'E'_'O'B'J'E'C'T'S</String>
   <String>P'E'R'F'E'C'T'O'F'F'I'C'E'_'M'A'I'N</String>
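
The relation between a stream name and such a String line can be sketched in Python. This assumes that in these lines the apostrophe stands for a zero byte and that the text is written in upper case, which is how the lines above read. The helper names trid_string and trid_string_bytes are made up for this illustration.

```python
def trid_string(name: str) -> str:
    """Write a UTF-16 LE name in the apostrophe notation used above."""
    return "'".join(name.upper())


def trid_string_bytes(line: str) -> bytes:
    """Turn the notation back into bytes (assumption: ' is a zero byte)."""
    return line.replace("'", "\x00").encode("latin-1")


for name in ("Root Entry", "PerfectOffice_OBJECTS", "PerfectOffice_MAIN"):
    print(trid_string(name))

# The notation equals the upper-cased UTF-16 LE text without the last zero.
utf16 = "PERFECTOFFICE_MAIN".encode("utf-16-le")
print(trid_string_bytes(trid_string("PerfectOffice_MAIN")) == utf16[:-1])  # True
```

The three printed lines are the same as the contents of the three String elements shown in this post.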

Because the definition generated by tridscan is based on only a few
examples, I get inside the front block obviously short nil patterns like:
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>1921</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>1923</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>1925</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>1927</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>1929</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>1931</Pos>
   </Pattern>
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>1933</Pos>
   </Pattern>
So I deleted such patterns.
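
Such a cleanup can also be scripted instead of editing by hand. The following is a rough Python sketch, not the way tridscan or TrID work internally. The sample is a cut-down fragment, and its TrID and FrontBlock element names are assumptions for the illustration; only the Pattern, Bytes and Pos elements are taken from the snippets above.

```python
import xml.etree.ElementTree as ET


def drop_nil_patterns(xml_text: str) -> str:
    """Remove every Pattern whose Bytes content is the single byte 00."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        for pattern in parent.findall("Pattern"):
            hex_bytes = (pattern.findtext("Bytes") or "").strip()
            if hex_bytes == "00":
                parent.remove(pattern)
    return ET.tostring(root, encoding="unicode")


SAMPLE = """<TrID><FrontBlock>
<Pattern><Bytes>D0CF11E0A1B11AE1</Bytes><Pos>0</Pos></Pattern>
<Pattern><Bytes>00</Bytes><Pos>1921</Pos></Pattern>
<Pattern><Bytes>00</Bytes><Pos>1923</Pos></Pattern>
</FrontBlock></TrID>"""

cleaned = drop_nil_patterns(SAMPLE)
print(cleaned.count("<Pattern>"))  # 1
```

Only the single-byte 00 patterns are dropped here. Longer zero runs are kept, because they may still be meaningful for the format.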

With the new definition all WPD Compound examples are now described (see
appended output/trid-v.txt). The TrID definitions, some examples and the
output are stored in wpd.zip. I hope that my XML file can be used in a future
version of triddefs.

With best wishes
Jörg Jenderek

Mark0

Re: wpd-docfile.trid.xml for variant of Microsoft WordPerfect Document *.WPD
« Reply #1 on: September 06, 2021, 10:54:00 PM »
Thanks!