Author Topic: 3 variants pub-v?.trid.xml for Microsoft Publisher document  (Read 368 times)

jenderek

Hello TrID users,

some days ago I had to handle an old CD-ROM. It contains some older Microsoft
Publisher files with the file name suffix pub.

When I run the file format identification utility TrID, a few "very old"
examples (like MSPublisherv1.PUB) are not recognized and are described as
"Unknown!". Most examples are described correctly, but only generically and
with low priority, as "Generic OLE2 / Multistream Compound" with mime type
application/x-ole-storage by docfile.trid.xml. Many of these samples are also
described with highest priority as "Microsoft Publisher document", with the
correct file name suffix (.PUB) but without a mime type, by pub.trid.xml. The
exceptions are the "middle old" samples (like MSPublisher95.PUB; see appended
trid-v-old.txt in output).

For comparison I also ran the file command (version 5.45) on these samples.
Here the "very old" samples (like MSPublisherv1.PUB) are not recognized and
are described as "data". The recognized samples are described generically as
"Composite Document File V2 Document" (see appended file-5.45.txt in
output). When the internal test for Compound Document Files (cdf) is
excluded, the samples are described by the magic files as "OLE 2 Compound
Document" with more details. Most of the samples (the "newest age", from
MSPublisher97.pub to MSPublisher2013-Sample.pub) are then also described
correctly as "Microsoft Publisher" (see appended file-soft-5.45.txt in
output). For these samples the mime type application/vnd.ms-publisher (see
appended file-soft-i-5.45.txt in output) and the correct file name suffix pub
(see appended file-soft-ext-5.45.txt in output) are shown. For the "middle
aged" samples (like MSPublisherv2.PUB and MSPublisher95.PUB) the sub
classification fails and only "Microsoft" is displayed. For these examples
only the generic mime type application/octet-stream is shown, and the pub
file name suffix is not shown either.

On Linux, according to the shared MIME-info database, the "new age" samples
are called "Microsoft Publisher document". Here application/vnd.ms-publisher
is used as mime type and the file name suffix pub is also shown. The samples
are recognized just by looking for the byte sequence
\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1 at the beginning, which is characteristic of
all Compound files, and then for the byte sequence
\x01\x12\x02\x00\x00\x00\x00\x00\x00\xc0\x00\x00\x00\x00\x00\x46 in the
offset range 592-8192. That sequence is the unique CLSID and is also used as
the current pattern inside Magdir/ole2compounddocs. This information can be
seen in the source freedesktop.org.xml.in, found for example on
gitlab.freedesktop.org.
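
The two-step rule described above can be sketched in Python. This is only a
minimal illustration, not the code used by the shared MIME-info database. The
function name looks_like_publisher is made up for this sketch, and the CLSID
bytes are copied exactly as quoted in this post.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# CLSID byte sequence copied as quoted in the post (assumption: it has not
# been checked against a real sample for this sketch).
PUBLISHER_CLSID = bytes.fromhex("011202000000000000C0000000000046")


def looks_like_publisher(data: bytes) -> bool:
    """Return True if data has the OLE2 magic and the CLSID in range 592-8192."""
    if not data.startswith(OLE2_MAGIC):
        return False
    # bytes.find() only reports matches lying completely inside [start, end),
    # so the end is extended by the pattern length to allow a match that
    # starts exactly at offset 8192.
    end = 8192 + len(PUBLISHER_CLSID)
    return data.find(PUBLISHER_CLSID, 592, end) != -1
```

Calling looks_like_publisher(open("MSPublisher97.pub", "rb").read()) would be
one way to try it on a sample.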

For comparison I also ran the file format identification utility DROID
(see https://sourceforge.net/projects/droid/). Here all samples are
recognized. They are described as "Microsoft Publisher" with mime type
application/x-mspublisher. DROID does more sub classification, which is shown
in the version field. Here I get values like 1, 2.0, 95, 97, 98, 2000, 2007,
2010 and 2013. These are assigned by PUID (like fmt/1511 x-fmt/252 x-fmt/253
x-fmt/254 x-fmt/255 x-fmt/256 fmt/1513 fmt/1514 fmt/1515; see appended
droid-pub.csv in output).

Luckily, with the information given by the other tools, I also found a page
about Microsoft Publisher on the Archive Team file formats web site. Links to
downloadable samples are listed there as well. So I use this page as the
reference in the new definitions. That is expressed by a line like:
 <RefURL>http://fileformats.archiveteam.org/wiki/Microsoft_Publisher</RefURL>

As mime type I chose what is shown by the file command and the Linux shared
MIME-info database. This is expressed by a line like:
      <Mime>application/vnd.ms-publisher</Mime>
Unfortunately this type is not officially registered at IANA.

Unfortunately I do not understand how DROID does the sub classification.
Because I have done the soft pattern recognition for the file command, I know
some of the characteristics. So I chose the way the file command does it as
prototype.

The "very old" variant is only recognized by DROID. Unfortunately I myself got
only one example, so I could not use tridscan. This variant is the only one
which is not based on OLE2 / Multistream Compound. According to the reference
URL the first four bytes are constant and characteristic. So I translated
this to an XML construct. Inside pub-v1.trid.xml that looks like:
   <Bytes>E7AC2C00</Bytes>
   <ASCII> . . ,</ASCII>
   <Pos>0</Pos>

For the newest samples I ran tridscan to generate pub-v4.trid.xml. Looking at
samples of different versions, this definition only applies to the "newer"
versions in the range 97-2013. There is a discrepancy with the documentation
about what the version number is. According to the documentation this is the
version range 4.0-11.0. But when looking at the details (see appended debug
output file-soft.tmp) there exists a directory entry "\001CompObj". Because
this name is encoded as UTF-16 little endian and occurs at different offsets,
it is expressed inside the global strings section by a line like:
   <String>C'O'M'P'O'B'J</String>
When I look inside this stream, for example with Michal Mutl's Structured
Storage Viewer (`SSView.exe MSPublisher97.pub`), I see two "version" strings,
"Microsoft Publisher 3.0" and "MSPublisher.3". So I myself would call this
version 3, but according to the mentioned documentation the lowest of these
new samples is called version 4. I do not know whether these strings are
always present, but they are one more characteristic that differs from the
other variants. So I mention my observations in the remark line. These
characteristic strings occur at different offsets and are expressed inside
the global strings section by lines like:
   <String>C'O'M'P'O'B'J</String>
   <String>MSPUBLISHER.3</String>
   <String>MICROSOFT PUBLISHER 3.0</String>

In the debug output I then see more directory entries. By the nature of the
file format their names are encoded as UTF-16. So these observations are
expressed inside the global strings section by lines like:
   <String>R'O'O'T' 'E'N'T'R'Y</String>
   <String>C'O'N'T'E'N'T'S</String>
   <String>I'N'T'E'R'N'A'L</String>
   <String>O'B'J'E'C'T'S</String>

The first is always needed because every directory contains a root entry.
Comparing with the current pub.trid.xml, which is based on 470 samples, the
other 3 apparently are always present as well.
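
How such a global string relates to the directory entry name can be sketched
in Python. The sketch assumes, judging only from the lines shown in this
post, that TrID writes strings in upper case and shows a nil byte as an
apostrophe. trid_string is a hypothetical helper and not part of TrID.

```python
def trid_string(entry_name: str) -> str:
    """Render a directory entry name the way the <String> lines above look."""
    # Leading control characters such as \001 or \005 are left out, as in the
    # lines quoted in the post.
    visible = entry_name.lstrip("\x01\x05")
    raw = visible.encode("utf-16-le")
    # Drop the trailing nil byte so the string ends with the last letter.
    raw = raw.rstrip(b"\x00")
    # latin-1 maps every byte to one character, so nil bytes survive decoding.
    return raw.decode("latin-1").upper().replace("\x00", "'")
```

With this helper, trid_string("Root Entry") gives R'O'O'T' 'E'N'T'R'Y and
trid_string("\x01CompObj") gives C'O'M'P'O'B'J.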

There is also a directory entry with the 19-character name
\005SummaryInformation. Because of UTF-16, 38 bytes of the name field are
occupied. The field has a size of 64 bytes, so the remaining 26 bytes are
filled with nil bytes. Afterwards the size of the name including the string
terminator (28h=40=38+2) is stored as a 2-byte integer. 28h expressed as an
ASCII character is the left parenthesis. So this is always true and is
expressed inside the global strings section by a line like:
 <String>S'U'M'M'A'R'Y'I'N'F'O'R'M'A'T'I'O'N'''''''''''''''''''''''''''(</String>
The aim is to keep the TrID database as small as possible, because its size
influences the used disk space and the search speed. So we want to keep only
patterns which are significant for a unique file type recognition. But this
can really only be done when the full file format specification is known. On
the other hand the pattern should be consistent with the definitions for the
other Publisher variants. So the above line now becomes:
 <String>S'U'M'M'A'R'Y'I'N'F'O'R'M'A'T'I'O'N</String>
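
The arithmetic behind the long pattern (38 name bytes, 26 padding bytes,
length value 28h) can be reproduced with a short Python sketch. The helper
name_field is hypothetical and only rebuilds the 64-byte name field plus the
2-byte length as described in this post.

```python
import struct

NAME_FIELD_SIZE = 64  # size of the name field in a directory entry


def name_field(name: str) -> bytes:
    """Build the 64-byte UTF-16LE name field followed by its 2-byte length."""
    encoded = name.encode("utf-16-le")
    length_with_terminator = len(encoded) + 2  # plus the UTF-16 nil terminator
    if length_with_terminator > NAME_FIELD_SIZE:
        raise ValueError("name too long for a directory entry")
    padded = encoded.ljust(NAME_FIELD_SIZE, b"\x00")
    return padded + struct.pack("<H", length_with_terminator)


field = name_field("\x05SummaryInformation")
name_bytes = len("\x05SummaryInformation".encode("utf-16-le"))  # 19 * 2
padding = NAME_FIELD_SIZE - name_bytes
length_byte = field[64:65]  # low byte of the stored length
```

Here name_bytes is 38, padding is 26 and length_byte is b"(", which is why
the left parenthesis shows up at the end of the original line.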

Then one more line inside the global strings must be considered. It looks like:
   <String>$'''%</String>
Comparing with other definitions, and with my knowledge about TrID, this line
is probably triggered by scanning too few (14) samples. So I delete the above
line.

The root directory entry is located at offset 1024. This is expressed by an
XML construct like:
 <Bytes>52006F006F007400200045006E00740072007900
 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
 16000500FFFFFFFFFFFFFFFF</Bytes>
 <ASCII> R . o . o . t .   . E . n . t . r . y</ASCII>
 <Pos>1024</Pos>

To keep the definition small I cut this down to the human readable UTF-16
string. So this construct now becomes:
 <Bytes>52006F006F007400200045006E00740072007900</Bytes>
 <ASCII> R . o . o . t .   . E . n . t . r . y</ASCII>
 <Pos>1024</Pos>

The 10-character string "Root Entry" occupies 20 bytes because of UTF-16. So
the remaining 44 bytes of the 64-byte field are filled with nil bytes.
Afterwards the size of the name including the string terminator
(16h=22=20+2) is stored as a 2-byte integer. Next comes the entry type byte
with value 05h, which means root storage. Next comes the node colour byte.
The value 0h means red. The value 1 means black and is probably the standard
here. This is followed by two 4-byte integers containing the DirIDs of the
left and right child nodes. The value -1 (FFFFFFFFh) means no child. So the
above construct is probably always true.
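
The field layout just described can be read back with Python's struct module.
This is only a sketch of the fields named in this post (name, name length,
type, colour, left and right DirID); DirEntry and parse_dir_entry are
hypothetical names, and little endian byte order is assumed.

```python
import struct
from typing import NamedTuple


class DirEntry(NamedTuple):
    name: str
    entry_type: int  # 05h means root storage
    colour: int      # 0 means red, 1 means black
    left: int        # DirID of left child, -1 means none
    right: int       # DirID of right child, -1 means none


def parse_dir_entry(entry: bytes) -> DirEntry:
    """Decode the start of one directory entry (needs at least 76 bytes)."""
    # At offset 64: 2-byte name length, type byte, colour byte, two 4-byte
    # signed DirIDs.
    name_len, entry_type, colour, left, right = struct.unpack_from(
        "<HBBii", entry, 64
    )
    # The stored length includes the 2-byte terminator, so strip it off.
    name = entry[: max(name_len - 2, 0)].decode("utf-16-le")
    return DirEntry(name, entry_type, colour, left, right)


# The bytes of the long root entry pattern quoted earlier in the post.
ROOT_ENTRY = (
    bytes.fromhex("52006F006F007400200045006E00740072007900")
    + bytes(44)
    + bytes.fromhex("16000500FFFFFFFFFFFFFFFF")
)
root = parse_dir_entry(ROOT_ENTRY)
```

For these bytes root.name is "Root Entry", root.entry_type is 5 and both
child DirIDs are -1.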

The offset of the directory is calculated by a formula like:
   offset = sector_size x (1 + root_SecID) = 512 x (1 + 1) = 1024

The SecID of the first sector of the directory stream is stored at offset 48
as a 4-byte integer. Often this value is 1, but it can be higher. I am no
Microsoft expert, but I assume this happens when you add, remove and add
streams again. So when you create the file in one step this value is probably
always 1.
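
The offset formula can be checked with a few lines of Python. The sketch
assumes a little endian header and uses only the two header fields mentioned
in this post (sector size exponent at offset 30, directory SecID at offset
48); directory_offset is a hypothetical helper name.

```python
import struct


def directory_offset(header: bytes) -> int:
    """Return the file offset of the first directory sector."""
    (sector_shift,) = struct.unpack_from("<H", header, 30)
    (first_dir_secid,) = struct.unpack_from("<I", header, 48)
    sector_size = 1 << sector_shift
    # The header occupies the first sector, so sector 0 starts one sector in.
    return sector_size * (1 + first_dir_secid)


# Minimal fake header: only the two fields used above are filled in.
header = bytearray(512)
struct.pack_into("<H", header, 30, 9)  # exponent 9 means 512-byte sectors
struct.pack_into("<I", header, 48, 1)  # directory starts in sector 1
offset = directory_offset(bytes(header))
```

With exponent 9 and SecID 1 the result is 1024, the position used for the
root entry pattern.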

The block size (sector size) is stored as an exponent to base 2 at offset 30
as a 2-byte integer. The minimum value is 7, which means block size 128, but
I cannot remember having seen this. The most used value is 9, which means
block size 512. This probably applies to major version 3. The major version
is stored at offset 26 as a 2-byte integer. The second value I have seen is
Ch for block size 4096, which is used for major version 4. Most Compound
files use little endian. That is indicated by the 2-byte byte order
identifier at offset 28. The byte sequence FEFF means little endian whereas
FFFE means big endian. The byte order must be applied to all values
containing more than one byte. All these facts are expressed inside the first
two XML constructs. These look like:

 <Pattern>
   <Bytes>D0CF11E0A1B11AE100
   0000000000000000000000000000003E000300FEFF090006000000000000000000000
   0</Bytes>
   <Pos>0</Pos>
 </Pattern>
 <Pattern>
   <Bytes>000000010000000000000000100000</Bytes>
   <Pos>45</Pos>
 </Pattern>

So from the first construct I keep the characteristic for Compound files, as
in pub.trid.xml. This looks like:
 <Pattern>
   <Bytes>D0CF11E0A1B11AE100</Bytes>
   <Pos>0</Pos>
 </Pattern>

The remaining part with major version 3, byte order indicator and sector
exponent 9 is expressed by an XML construct like:
 <Pattern>
   <Bytes>0300FEFF0900</Bytes>
   <Pos>26</Pos>
 </Pattern>

The second XML construct with root SecID 1 is now expressed by a shortened
XML construct. That now looks like:
 <Pattern>
   <Bytes>01000000</Bytes>
   <Pos>48</Pos>
 </Pattern>
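
The header fields behind the pattern at offset 26 can be decoded with a short
Python sketch. It reads the values as little endian, which is an assumption
that only holds when the byte order bytes are FE FF; parse_header is a
hypothetical helper and not part of TrID or file.

```python
import struct


def parse_header(header: bytes) -> dict:
    """Decode version, byte order and sector size from a Compound header."""
    # Offset 24: minor version, 26: major version, 28: byte order, 30: exponent.
    minor, major, byte_order, sector_shift = struct.unpack_from(
        "<HH2sH", header, 24
    )
    return {
        "major_version": major,
        "little_endian": byte_order == b"\xfe\xff",
        "sector_size": 1 << sector_shift,
    }


# First 32 bytes as given by the long pattern at position 0 in the post.
HEADER_START = (
    bytes.fromhex("D0CF11E0A1B11AE1")
    + bytes(16)
    + bytes.fromhex("3E000300FEFF0900")
)
info = parse_header(HEADER_START)
```

For these bytes info holds major version 3, little endian and sector size
512.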

If the root directory entry is located at offset 1024, as in the examples,
then the second entry is located at offset 1152 (=1024+128). Apparently the
second entry is Objects. That is expressed by an XML construct like:
 <Bytes>0000000000004F0062006A006500630074007300
 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100001
 </Bytes>
 <ASCII> . . . . . . O . b . j . e . c . t . s</ASCII>
 <Pos>1146</Pos>
I do not know if this is always true, but it is not needed as a
characteristic pattern. So I delete it.

All other XML constructs after the second one, up to offset 1024, may be
constant only by lucky circumstances, or they may be always true. But they
are not relevant for recognition. So I delete them.

Then I got short nil sequences at higher offsets like:
   <Pattern>
      <Bytes>000000</Bytes>
      <Pos>1141</Pos>
   </Pattern>
   ...
   <Pattern>
      <Bytes>00</Bytes>
      <Pos>1543</Pos>
   </Pattern>
I assume that these are triggered by lucky circumstances. So I delete them.

Then I got longer nil or FF sequences like:
   <Pattern>
      <Bytes>01000000000000000000000000</Bytes>
      <Pos>1267</Pos>
   </Pattern>
   <Pattern>
      <Bytes>FFFFFFFFFFFFFFFFFFFFFFFF</Bytes>
      <Pos>1988</Pos>
   </Pattern>
Some are located in higher directory entries. So I assume that these come
from padding of fields or from unused fields. So I delete them.

After that only a few XML constructs are left. These look like:
   <Pattern>
      <Bytes>01</Bytes>
      <Pos>1139</Pos>
   </Pattern>
Apparently these are not relevant for unique recognition. So I delete them.

Finally only one XML construct is left. It looks like:

   <Bytes>000000011202000000000000C0000000000046000000000000000000000000</Bytes>
   <ASCII> . . . . . . . . . . . . . . . . . . F</ASCII>
   <Pos>1101</Pos>

At relative offset 80 inside the root directory entry a 16-byte unique
identifier (CLSID) can be stored. If it is zero then this feature is not
used, but if it is not zero then it is unique enough to do the sub
classification. For the "newest" Publisher versions it is
0x011202000000000000c0000000000046. When the root entry offset is constant,
like 1024, then this CLSID is found at a constant offset (1104=1024+80). So
the above construct can be shortened to 16 bytes and then looks like:

   <Bytes>011202000000000000C0000000000046</Bytes>
   <ASCII> . . . . . . . . . . . . . . . F</ASCII>
   <Pos>1104</Pos>

That is the main difference compared with the "middle old" versions 2.0-95
(estimated 2.0-3.0), which are described by pub-v2.trid.xml. There the first
byte of the CLSID is 0 instead of 1. When I apply the same steps there, the
construct looks like:
   <Bytes>001202000000000000C0000000000046</Bytes>
   <ASCII> . . . . . . . . . . . . . . . F</ASCII>
   <Pos>1104</Pos>
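
The difference between the three variants can be summarised in a small Python
sketch. It is not how TrID matches definitions. It only combines the byte
patterns quoted in this post, and it assumes that the root entry sits at
offset 1024. The function name classify_pub and the returned labels are made
up for this illustration.

```python
PUB_V1_MAGIC = bytes.fromhex("E7AC2C00")
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# CLSID byte patterns exactly as quoted in the post.
CLSID_V4 = bytes.fromhex("011202000000000000C0000000000046")
CLSID_V2 = bytes.fromhex("001202000000000000C0000000000046")
CLSID_OFFSET = 1024 + 80  # assumes the root entry is located at offset 1024


def classify_pub(data: bytes) -> str:
    """Guess which of the three definitions would describe the data."""
    if data.startswith(PUB_V1_MAGIC):
        return "pub-v1"
    if not data.startswith(OLE2_MAGIC):
        return "unknown"
    clsid = data[CLSID_OFFSET : CLSID_OFFSET + 16]
    if clsid == CLSID_V4:
        return "pub-v4"
    if clsid == CLSID_V2:
        return "pub-v2"
    return "ole2-other"
```

A sample with the directory stream in another sector would need the root
entry offset computed from the header first, so this fixed offset is a
simplification.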

Because pub-v2.trid.xml is based on fewer (3) examples, I got more patterns.
Inside the global strings section I get lines like:
   <String>NO STYLE</String>
   <String>SYMBOL</String>
   <String>TIMES NEW ROMAN</String>
These are apparently triggered by the fonts used in the samples. So I delete
these patterns.

Here a directory entry "\001CompObj" exists as well. When I look inside this
stream, for example with Michal Mutl's Structured Storage Viewer
(`SSView.exe MSPublisher95.pub`), I see two "version" strings, "Microsoft
Publisher" and "MSPublisher.2". So these observations are expressed inside
the global strings section by lines like:
   <String>MICROSOFT PUBLISHER</String>
   <String>MSPUBLISHER.2</String>

A few "older" samples (like MSPublisher95.PUB) are not recognized by the
"generic" pub.trid.xml. So I ran tridscan to update this definition. Then I
looked at what has changed. In the global strings section one line vanished.
It looked like:
   <String>S'U'M'M'A'R'Y'I'N'F'O'R'M'A'T'I'O'N</String>

Apparently this information stream is missing in older versions. Furthermore
this definition only applies to Compound-based samples, so version 1 is
definitely not matched. I mention these observations in the remark line. I
also do not know if the definition applies to the newest versions like
Microsoft Publisher 365. Furthermore I add here the mime type I used in the
other variants.

With the new and updated definitions all of my inspected Microsoft Publisher
document samples are still described, but now I also get a sub classification
(see appended trid-v-new.txt and trid-new.txt in output).

The TrID definitions and output are stored in the archive pub_trid.zip. I hope
that my definitions can be used in a future version of triddefs.

With best wishes
Jörg Jenderek

Mark0

Re: 3 variants pub-v?.trid.xml for Microsoft Publisher document
« Reply #1 on: June 03, 2024, 11:08:27 PM »
Thanks for the new/updated defs and the info.